Mixtral 8x7B DESTROYS Other Models (MoE = AGI?)
Introduction and Mixture of Experts Model
In this section, the speaker introduces Mistral AI's new model, Mixtral, a mixture-of-experts implementation that combines eight separate expert models into a single model.
Introducing the Mixtral Model
- Mistral AI released a mysterious torrent link that sparked discussion across the AI community.
- Mixtral is a new model from Mistral AI, combining eight separate expert models into one.
- Mistral AI is already known for its open-source Mistral 7B model, which has 7 billion parameters.
Understanding Mixture of Experts
- Mixture of experts is a technique in which multiple expert models are available and only some of them are selected, based on the input, to perform inference.
- Mixtral activates only two of its eight experts during inference, yet achieves better performance than larger models like Llama 2 70B.
- Mixtral's router chooses the experts best suited to respond to the prompt.
Explanation and Technical Details
This section provides more detail about mixture of experts and highlights an informative blog post by Hugging Face on the topic.
High-Level Explanation of Mixture of Experts
- The prompt goes into Mixtral's router, which selects experts based on their specialization.
- The selected experts work together to generate the response.
- Mixtral's approach outperforms larger dense models while running faster, because only a subset of the combined model is active at any time.
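The routing step described above can be sketched in a few lines of Python. This is an illustrative toy, not Mixtral's actual implementation; the gate scores below are invented for the example:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top2(gate_scores):
    """Pick the two highest-scoring experts and renormalize
    their gate probabilities so the weights sum to 1."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    top2 = ranked[:2]
    probs = softmax([gate_scores[i] for i in top2])
    return list(zip(top2, probs))

# Hypothetical gate scores for 8 experts, as produced by a router network.
scores = [0.1, 2.3, -0.5, 1.7, 0.0, -1.2, 0.4, 0.9]
print(route_top2(scores))  # experts 1 and 3 carry the combined weight
```

Only the two selected experts run on the token, which is why inference cost tracks the active subset rather than the full parameter count.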
Blog Post by Hugging Face
- Hugging Face published a technical blog post explaining mixture of experts in detail.
- The post delves into the technical aspects and gives a comprehensive overview of the technique.
Sponsorship Message
A sponsorship message for the EdrawMind AI software.
Introduction to EdrawMind AI
- EdrawMind AI is mind-mapping software designed to boost creativity and efficiency.
- The software uses artificial intelligence to provide smart suggestions and guidance.
- It offers collaboration features and can convert mind maps into PPT files.
Confirmation of Mixture of Experts in OpenAI's GPT Models
Reports that OpenAI uses mixture of experts in its GPT models, with GPT-4 at a rumored total of roughly 1.7 trillion parameters.
Leak by George Hotz
- George Hotz leaked that OpenAI was using mixture of experts in its ChatGPT models.
- Specifically, GPT-4 reportedly combines eight separate models of roughly 220 billion parameters each.
Confirmation by Smerity
- Soumith Chintala, co-founder of PyTorch at Meta, corroborated the use of mixture of experts in GPT-4.
- He mentioned that it consists of eight models trained with different data and task distributions.
Mistral's Mixtral Model and Different Versions
Information about Mistral's Mixtral model and its different versions.
Mixtral Model by Mistral AI
- Mistral AI released Mixtral as an open-weight mixture-of-experts model.
- It performs on par with GPT-3.5 and outperforms Llama 2 70B on various benchmarks.
- Running Mixtral requires significant GPU resources, such as two A100s.
Different Versions by Mistral AI
- Mistral AI offers three versions: Mistral-tiny (Mistral 7B), Mistral-small (Mixtral), and Mistral-medium (a prototype).
- Mistral-medium is available only through paid inference.
Conclusion
This markdown file provides a comprehensive summary of the given transcript, covering the introduction of Mistral AI's Mixtral model, the concept of mixture of experts, reports of its use in OpenAI's GPT models, and details about Mixtral's different versions.
Overview of Implementation and Performance
The implementation discussed in the transcript supports a context length of 32,000 tokens. A chart of benchmark performance shows that Mixtral uses far less inference compute than other models.
Interesting Implementation and Performance
- The implementation supports a context length of 32,000 tokens.
- The performance against the benchmark is shown on a chart.
- Mixtral uses far less inference compute than other models.
Introduction to Mixtral
Mixtral has an architecture similar to Mistral 7B's, with some differences: each layer is composed of eight feed-forward blocks (experts), and for each token a router network selects two experts to process the current state and combines their outputs. Mixtral performs well in multiple languages and outperforms Mistral 7B in science-related tasks.
Architecture and Language Support
- Mixtral has an architecture similar to Mistral 7B's, with some differences.
- Each layer in Mixtral consists of eight feed-forward blocks (experts).
- For each token, a router network selects two experts to process the current state and combines their outputs.
- Mixtral performs well in multiple languages, including French, German, Spanish, and Italian.
- It outperforms Mistral 7B in science-related tasks such as mathematics and code generation.
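The per-layer behavior in the bullets above can be sketched with NumPy. Dimensions and weights here are made up, and the experts are simplified to plain ReLU feed-forward blocks (real Mixtral experts use gated SwiGLU FFNs):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 16, 64, 8

# Hypothetical random weights for the toy experts and the router.
W1 = rng.standard_normal((n_experts, d_model, d_ff)) * 0.1
W2 = rng.standard_normal((n_experts, d_ff, d_model)) * 0.1
W_gate = rng.standard_normal((d_model, n_experts)) * 0.1

def expert_ffn(e, x):
    """Expert e: a plain two-layer feed-forward block with ReLU."""
    return np.maximum(x @ W1[e], 0.0) @ W2[e]

def moe_layer(x):
    """For one token vector x: score all experts, keep the top two,
    and return the softmax-weighted sum of their outputs."""
    logits = x @ W_gate                    # one score per expert
    top2 = np.argsort(logits)[-2:]         # indices of the two best experts
    w = np.exp(logits[top2] - logits[top2].max())
    w /= w.sum()                           # renormalized gate weights
    return sum(wi * expert_ffn(e, x) for wi, e in zip(w, top2))

token = rng.standard_normal(d_model)
out = moe_layer(token)
print(out.shape)  # same shape as the input token vector
```

Because routing happens per token at every layer, different tokens in the same prompt can be served by different experts.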
Excitement about Code Generation using Mixtral
There is excitement about using Mixtral for code generation. It was trained on multilingual data and shows significant improvements over the Llama 2 70B model. Andrej Karpathy also pointed to the official post about this release.
Benefits for Code Generation
- Excitement about using Mixtral for code generation.
- Mixtral has been trained on multilingual data.
- It outperforms the Llama 2 70B model on French, German, Spanish, and Italian benchmarks.
- Mixtral is significantly stronger in science-related tasks such as mathematics and code generation.
- Andrej Karpathy highlighted the official post about this release.
Notes on Open Weights Release
Andrej Karpathy discusses the release of the Mixtral weights. He clarifies that it is an open-weights release rather than open source, since it does not include the training code, datasets, or documentation.
Open Weights Release
- The release is referred to as an open weights release instead of open source.
- It includes the weights but not the training code, datasets, and documentation.
Clarification on Model Parameters
Andrej Karpathy clears up some confusion about the model's parameter count. Only the feed-forward (expert) blocks in the Transformer are replicated eight times in Mixtral relative to Mistral 7B; the rest of the model is shared. The total is therefore 46.7 billion parameters, not 56 billion (8 × 7 billion).
Clarification on Model Parameters
- Only the feed-forward blocks in the Transformer are replicated eight times in Mixtral relative to Mistral 7B; attention and embeddings are shared.
- The total number of parameters is 46.7 billion, not 56 billion (8 × 7 billion).
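Karpathy's point can be checked with back-of-envelope arithmetic: write s for the shared (attention and embedding) parameters and f for one expert's feed-forward parameters, so Mistral 7B ≈ s + f and Mixtral ≈ s + 8f. The split below is inferred from the published totals, not an official breakdown:

```python
# Back-of-envelope check (totals are approximate published figures; the
# shared/expert split is inferred from them, not official).
mistral_7b = 7.2    # billions: shared + one FFN  -> s + f
mixtral    = 46.7   # billions: shared + 8 FFNs   -> s + 8f

f = (mixtral - mistral_7b) / 7   # per-expert feed-forward params
s = mistral_7b - f               # shared attention/embedding params
active = s + 2 * f               # params actually used per token (2 experts)

print(f"expert FFN ~{f:.1f}B, shared ~{s:.1f}B, active per token ~{active:.1f}B")
# 'active' comes out near the ~12.9B-per-token figure Mistral reports
```

The same arithmetic explains why Mixtral is much cheaper at inference than its 46.7B total suggests: only s + 2f parameters touch each token.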
Setting up Text Generation Web UI with RunPod
The process of setting up Text Generation Web UI on RunPod is explained. Some customization is required to support Mixtral.
Setting up Text Generation Web UI
- Use RunPod to set up Text Generation Web UI.
- Customizations are required to support Mixtral.
- Increase the container and volume disk sizes for extra breathing room.
- Install the latest version of Transformers with `pip install git+` followed by the Transformers repository URL.
- Add `--trust-remote-code` to the shell script so remote model code is trusted.
- Restart the pod after making changes.
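The install command and trust flag mentioned above look roughly like this in the pod's web terminal (a sketch of the video's steps; verify the flag against the current text-generation-webui documentation before relying on it):

```shell
# Install the latest Transformers directly from the GitHub repository:
pip install git+https://github.com/huggingface/transformers

# In the web UI's launch script, add the flag that allows custom model code:
#   python server.py --trust-remote-code
# Then restart the pod so the changes take effect.
```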
Downloading the Mixtral Model and GPU Memory Settings
The process of downloading the Mixtral model and adjusting GPU memory settings is explained, along with recommendations from Eric Hartford.
Downloading the Mixtral Model and Adjusting GPU Memory
- Download the Mixtral model from its Hugging Face model card page.
- Adjust GPU memory settings to maximize performance.
- Set the GPU memory sliders to the maximum.
- Select the BF16 format for faster model loading.
Final Steps for Running Text Generation Web UI
The final steps for running Text Generation Web UI are explained, including refreshing, editing a file, saving changes, and restarting the pod.
Final Steps for Running Text Generation Web UI
- Refresh the page after the Mixtral model finishes downloading.
- Edit a file using Vim in the web terminal to make necessary changes.
- Save changes by typing :wq! in Vim.
- Restart the pod after making changes.
Copying Instructions from Hugging Face
Instructions on copying information from Hugging Face's website are provided, including downloading files and adjusting GPU memory settings.
Copying Instructions from Hugging Face
- Copy instructions from Hugging Face's website, such as downloading files and adjusting GPU memory settings.
Setting up the Shell Script
The speaker discusses enabling and verifying a shell script.
Enabling the Shell Script
- Use the flag to enable the shell script.
- Verify that the script is enabled and loaded.
Running the Mixtral Model
The speaker demonstrates running the Mixtral model and tests it with various prompts.
Testing Prompts
- Write a Python script to output numbers 1 to 100.
- Write the game Snake in Python using the curses library.
- Test if it can write a letter to a boss about leaving a company.
- Identify who was the president of the United States in 1996 (Bill Clinton).
- Ask how to break into a car (uncensored response expected).
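The first prompt in the list has a one-line reference answer, handy as a sanity check on whatever the model produces:

```python
# Reference answer for the "output numbers 1 to 100" prompt.
for n in range(1, 101):
    print(n)
```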
Solving a Drying Problem
The speaker presents a drying problem and evaluates the model's response.
Drying Shirts Problem
- Five shirts laid out in the sun take 4 hours to dry.
- Determine how long 20 shirts would take.
- On the model's reading, each shirt takes 0.8 hours (4 hours ÷ 5 shirts) to dry.
- Multiplying by 20 gives 16 hours for all 20 shirts.
- The question is ambiguous about parallel versus serialized drying, but under the individual-drying interpretation the answer passes.
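The two readings of the puzzle can be made explicit in code; the serialized reading matches the model's per-shirt arithmetic, and which one is "correct" is exactly the ambiguity noted above:

```python
hours_for_batch = 4
shirts_in_batch = 5
shirts_to_dry = 20

# Parallel reading: drying happens simultaneously, so the count is irrelevant.
parallel_hours = hours_for_batch                 # still 4 hours

# The model's serialized reading: 4 h / 5 shirts = 0.8 h per shirt.
per_shirt = hours_for_batch / shirts_in_batch    # 0.8 hours each
serial_hours = per_shirt * shirts_to_dry         # 16 hours total

print(parallel_hours, serial_hours)
```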
Logic and Reasoning Test
The speaker tests logical reasoning capabilities of the model.
Comparing Speeds
- Jane is faster than Joe, and Joe is faster than Sam.
- Determine if Sam is faster than Jane based on given information (Sam is not faster).
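The transitive step can be spelled out with placeholder speed values; only the ordering matters, the numbers themselves are arbitrary:

```python
# Any values satisfying Jane > Joe > Sam will do; these are placeholders.
speed = {"Jane": 3, "Joe": 2, "Sam": 1}

assert speed["Jane"] > speed["Joe"] > speed["Sam"]  # the given premises
sam_faster_than_jane = speed["Sam"] > speed["Jane"]
print(sam_faster_than_jane)  # False
```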
Solving a Math Problem
The speaker presents a math problem and evaluates the model's solution.
Math Expression Evaluation
- Evaluate the expression: 25 - (4 * 2) + 3.
- Multiply 4 by 2, resulting in 8.
- Subtract 8 from 25, giving us 17.
- Add 3 to the result, which equals 20.
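The evaluation order in the bullets above can be verified step by step:

```python
# Order of operations: the parenthesized product first, then left to right.
step1 = 4 * 2          # 8
step2 = 25 - step1     # 17
result = step2 + 3     # 20

assert result == 25 - (4 * 2) + 3  # matches evaluating the expression directly
print(result)  # 20
```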
Understanding a Riddle
The speaker presents a riddle and assesses the model's response.
Killers in a Room Riddle
- Initially, there are three killers in the room (A, B, C).
- Another person (D) enters and kills one of them.
- All four individuals remain in the room.
- Considering pre-existing killers and newly labeled ones, there are four killers at the end.
Summarization Test
The speaker tests the model's ability to summarize text accurately.
Bullet Point Summary Creation
- Provide a bullet-point summary of a given text about nuclear fusion.
- The model's summary accurately reflects the provided paragraphs.
Creating JSON Data
The speaker tests the model's ability to create JSON data based on given information.
Creating JSON Data
- Create JSON data for three people:
- Two males named Mark and Joe, both aged 19.
- One woman named Sam, aged 30.
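A reference answer for this prompt might look as follows; the field names ("name", "sex", "age") are one reasonable schema choice, since the prompt only fixes the values, not the keys:

```python
import json

# Schema (field names) is an assumption; the prompt leaves the keys open.
people = [
    {"name": "Mark", "sex": "male", "age": 19},
    {"name": "Joe", "sex": "male", "age": 19},
    {"name": "Sam", "sex": "female", "age": 30},
]
print(json.dumps(people, indent=2))
```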
Physics Experiment with a Marble and Cup
In this experiment, a marble is placed in a cup and the cup is turned upside down on a table. The cup is then moved and placed inside a microwave. The question is, where is the marble now?
Marble in Cup and on Table
- Initially, the marble is placed in the cup and the cup is turned upside down on the table.
- Due to gravity, when the cup is turned upside down, the marble falls out of the cup onto the table.
Marble in Microwave
- Someone moves the cup and places it inside the microwave.
- However, since the marble had already fallen out of the cup onto the table, it remains on the table even when the cup is moved to another location.
Logic and Reasoning Test with John and Mark
This test involves logic and reasoning with two individuals named John and Mark. They interact with a ball, basket, and box at different times. The question is, where do they think the ball is?
John's Perspective
- John puts the ball in the box before leaving for work.
- When he comes back later in the day, he would think that the ball is still in the box because that's where he put it before leaving.
Mark's Perspective
- Mark puts the ball in the basket before leaving for school.
- When he comes back later in th