NEW LLaMA Model Coming - Completely Rebuilt From the Ground Up

Red Pajama: An Open Source Llama Model

In this video, the speaker introduces a new project called Red Pajama that aims to create an open-source Llama model. The speaker discusses the limitations of current open-source models and how Red Pajama aims to solve these problems.

Introduction

  • Red Pajama is a new project that aims to create an open-source Llama model.
  • The closest current rival to OpenAI's GPT models is Llama, but it has two major flaws: it comes from Meta, and it is not licensed for commercial use.
  • Red Pajama aims to solve this problem by creating a high-quality model that is also commercially available.

About Red Pajama

  • Red Pajama is a project to create leading, fully open-source models.
  • The team has already put together a dataset of 1.2 trillion tokens, on par with the data Llama was trained on.
  • The biggest issue with open-source models is the quality gap between them and closed models like GPT-4.
  • Even the best open-source models are not available for commercial use, but Red Pajama aims to change that.

Three Stages of Red Pajama

  • There are three stages in the creation of the Red Pajama model: pre-training data, base models trained at scale on this data, and fine-tuning for specific tasks.
  • Pre-training data needs to be both high quality and have broad coverage. This is where technical advancements will only go so far; data quality is everything.
  • Component two involves training base models at scale on the pre-training data set.
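The three components above can be sketched as a toy pipeline. All function names, the quality filter, and the data are illustrative stand-ins, not the project's actual code:

```python
def passes_quality_filter(doc):
    # Toy stand-in for real quality filters (deduplication, heuristics,
    # classifiers): keep documents with at least 3 words.
    return len(doc.split()) >= 3

def build_pretraining_data(sources):
    # Component 1: pool quality-filtered text from every source slice.
    return [doc for src in sources for doc in src if passes_quality_filter(doc)]

def train_base_model(corpus):
    # Component 2: stand-in for large-scale pretraining; the "model"
    # here just records how many training tokens it saw.
    return {"tokens_seen": sum(len(doc.split()) for doc in corpus)}

def fine_tune(model, instruction_pairs):
    # Component 3: stand-in for instruction tuning on (prompt, response) pairs.
    return {**model, "instructions": len(instruction_pairs)}

corpus = build_pretraining_data([
    ["a short doc here", "too short"],   # one source slice
    ["another usable document"],         # another source slice
])
model = fine_tune(train_base_model(corpus),
                  [("Count to 100 in Python", "for i in range(1, 101): print(i)")])
print(model)  # {'tokens_seen': 7, 'instructions': 1}
```

The point of the sketch is the ordering: data quality is fixed in component one, before any compute is spent on training.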

Instruction Tuning

In this section, the speaker talks about instruction tuning and how it can help improve the base model and make it usable and safe. They also discuss how Llama was trained on high-quality instruction data to create a 7 billion parameter model that is particularly valuable for the open-source community.

  • Instruction tuning means taking a set of instruction examples (e.g., "write code that does X") and rating the model's outputs to improve its usability and safety.
  • Llama was fine-tuned on high-quality instruction data, such as "write code in Python to count from one to a hundred," to create a 7-billion-parameter model that is particularly valuable for the open-source community.
  • The 7-billion-parameter Llama model is trained for much longer than comparable models to ensure the best quality at that size.
  • Llama and all of its derivatives are available only for non-commercial research purposes. Red Pajama aims to create a fully open-source reproduction of Llama that would be available for commercial applications.
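A minimal sketch of what instruction-tuning data might look like. The schema, field names, and examples here are hypothetical, not Together's actual data format:

```python
# Hypothetical instruction-tuning examples; each pairs a prompt with a
# model response and a human rating of that response.
examples = [
    {
        "instruction": "Write code in Python to count from one to a hundred.",
        "response": "for i in range(1, 101):\n    print(i)",
        "rating": 5,  # human rating of the output's quality and safety
    },
    {
        "instruction": "Explain what a token is.",
        "response": "A token is a chunk of text the model reads or writes.",
        "rating": 2,  # low-rated outputs get filtered out or down-weighted
    },
]

# Keep only highly rated pairs for fine-tuning.
high_quality = [ex for ex in examples if ex["rating"] >= 4]
print(len(high_quality))  # 1
```

Training then maximizes the likelihood of each `response` given its `instruction`, which is what makes the base model usable as an assistant.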

Red Pajama Base Dataset

In this section, the speaker discusses the Red Pajama base dataset used to recreate Llama, along with the different data slices it contains.

  • The Red Pajama base dataset contains 1.2 trillion tokens, carefully filtered for quality.
  • There are seven data slices in total:
      • Common Crawl
      • The C4 dataset
      • GitHub (useful for programming ability)
      • The arXiv scientific-article database
      • A corpus of open books, deduplicated by content similarity
      • Wikipedia (a subset of pages with boilerplate removed)
      • Stack Exchange (a wide variety of questions and answers)
  • The largest slice is Common Crawl, followed by C4.
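The approximate per-slice token counts reported in the RedPajama blog post can be tallied to check the 1.2-trillion total. The figures below are approximate and may differ slightly from the released dataset:

```python
# Approximate token counts per slice, in billions, as reported in the
# RedPajama blog post (approximate figures).
slices = {
    "CommonCrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "ArXiv": 28,
    "Books": 26,
    "Wikipedia": 24,
    "StackExchange": 20,
}

total = sum(slices.values())
largest = max(slices, key=slices.get)
print(f"{total}B tokens (~{total / 1000:.1f}T)")  # 1210B tokens (~1.2T)
```

As the tally shows, Common Crawl alone accounts for well over half of the corpus, which is why its quality filtering matters so much.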

Training a Strong Base Model

In this section, the speaker talks about training a strong base model as part of the INCITE program, with support from the Oak Ridge Leadership Computing Facility. They also discuss Red Pajama's plans to release instruction-tuned versions of the Red Pajama models.

  • As part of the INCITE program, with support from the Oak Ridge Leadership Computing Facility, Red Pajama is training a full suite of models, with the first becoming available in the coming weeks.
  • Red Pajama plans to release instruction-tuned versions of its models, using hundreds of thousands of high-quality natural user instructions collected via OpenChatKit.

Understanding the Basics of Machine Learning

In this section, the speaker explains what machine learning is and how it works.

What is Machine Learning?

  • Machine learning is a subset of artificial intelligence that involves training algorithms to make predictions or decisions based on data.
  • It involves feeding large amounts of data into an algorithm and allowing it to learn from that data.
  • The goal of machine learning is to create models that can make accurate predictions or decisions on new data.

Types of Machine Learning

  • There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.
  • Supervised learning involves training an algorithm on labeled data, where the correct output is known.
  • Unsupervised learning involves training an algorithm on unlabeled data, where the correct output is unknown.
  • Reinforcement learning involves training an algorithm through trial and error by rewarding it for making good decisions.
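A toy supervised-learning example in plain Python: fit a single weight on labeled data by least squares, then predict on an unseen input. This is illustrative only, far simpler than any real model:

```python
# Labeled training data: each input x comes with the correct output y.
xs = [1.0, 2.0, 3.0, 4.0]   # inputs
ys = [2.0, 4.0, 6.0, 8.0]   # labels (here y = 2x)

# Closed-form least squares for a one-weight model y = w * x:
# w = sum(x*y) / sum(x*x)
w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(w)        # learned weight: 2.0
print(w * 5.0)  # prediction for the unseen input x = 5: 10.0
```

Because the labels are known during training, the algorithm can measure and minimize its error directly, which is the defining trait of supervised learning.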

Applications of Machine Learning

In this section, the speaker discusses some common applications of machine learning.

Image Recognition

  • One application of machine learning is image recognition.
  • Machine learning algorithms can be trained to recognize objects in images and classify them into different categories.

Natural Language Processing

  • Another application of machine learning is natural language processing.
  • Machine learning algorithms can be used to analyze and understand human language, allowing for applications such as chatbots and voice assistants.

Fraud Detection

  • Machine learning can also be used for fraud detection.
  • By analyzing patterns in data, machine learning algorithms can identify potential cases of fraud and alert authorities.
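A minimal sketch of that pattern-analysis idea: flag transactions whose amount deviates strongly from the norm using a z-score threshold. Real fraud-detection systems use learned models; the data and threshold here are made up for illustration:

```python
from statistics import mean, stdev

# Hypothetical transaction amounts; one is a clear outlier.
amounts = [20.0, 35.0, 25.0, 30.0, 500.0, 28.0]

mu, sigma = mean(amounts), stdev(amounts)

# Flag anything more than 2 standard deviations from the mean.
flagged = [a for a in amounts if abs(a - mu) / sigma > 2]
print(flagged)  # [500.0]
```

The same thresholding idea generalizes: a model learns what "normal" looks like, and anything sufficiently far from normal is surfaced for review.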

Conclusion

The video concludes with a summary of the key points covered.

Video description

In this video, we will look at a collective of companies rebuilding LLaMA from scratch, this time wholly open-sourced (including commercial viability). together.xyz will rebuild the dataset, train the model, and apply instruction tuning from scratch while ensuring LLaMA will be completely open-sourced.

Links:

  • Homepage - https://www.together.xyz/
  • Blog post - https://www.together.xyz/blog/redpajama
  • Data sample - https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T

Contents of video:

  • 0:00 - Intro
  • 0:20 - Blog Post Review
  • 7:00 - Data Sourced
  • 8:17 - Next Steps
  • 9:20 - Outro