Colossal AI: Scaling AI Models in Big Model Era

Overview

This section introduces the challenges and opportunities posed by big models, discusses Colossal AI's n-dimensional parallelization system and efficient memory system, and highlights examples of speedups achieved on important problems.

Challenges and Opportunities

  • The growth of big models has been exponential, exceeding Moore's law.
  • There is a growing gap between the demand for larger models and the compute and memory that hardware can supply.
  • Large models offer better performance and accuracy.
  • The dropping cost of hardware provides an opportunity to train these models more efficiently.

Efficient Computing

  • Colossal AI offers an n-dimensional parallelization system that allows for mapping AI training to various parallel platforms.
  • The system automatically customizes parallelization schemes based on the hardware available.
  • An efficient memory system determines the most efficient place in memory to store and compute with data, enabling training of larger models.

Speedups

  • Colossal AI has achieved significant speedups on important problems such as ChatGPT and AI-generated content.
  • These speedups can be up to a factor of 10x compared to previous approaches.

Challenges Faced

  • The size of language models has been growing at a rapid pace, far exceeding the growth rate of GPU memory.
  • Scalable and efficient computing is required to meet the demand for large models.

Cost Considerations

  • The cost of training and running large models needs to be affordable for small and medium-sized enterprises (SMEs).
  • The dropping cost of hardware provides an opportunity for more efficient and cheaper model training.

Structure of Colossal AI

  • Colossal AI aims to automatically optimize mapping problems onto different hardware architectures.
  • The goals include maximizing computational efficiency, minimizing running time and data movement, and reducing programming effort.

Conclusion

  • Colossal AI offers solutions to the challenges posed by big models, including efficient parallelization and memory systems.
  • The system has achieved significant speedups on important problems and aims to make high-performance computing accessible to all.


Efficient Memory System and Parallelization

In this section, the speaker discusses the three-layered system provided by Colossal AI to optimize memory usage and parallelization for efficient AI processing.

Efficient Memory System

  • The first layer of Colossal AI is an efficient memory system designed to make the best use of available memory.
  • The goal is to minimize memory usage as it can be expensive.

Parallelization

  • Colossal AI utilizes a second layer, n-dimensional parallelism, to map problems onto hardware effectively.
  • This layer optimizes parallel processing for improved performance.
  • The speaker mentions advanced methods such as data parallelism, tensor parallelism, pipeline parallelism, and sequence parallelism.
  • These methods are designed to achieve optimal results and outperform existing approaches.
  • Sequence parallelism is a newer concept developed by the team behind Colossal AI.

Popularity and Adoption

In this section, the speaker highlights the popularity and adoption of Colossal AI among users worldwide.

GitHub Stars and Interest

  • The number of GitHub stars for Colossal AI has exceeded 30,000 since its inception.
  • Its star growth compares favorably with that of well-known frameworks like Spark, reflecting significant user interest.

Global User Base

  • Users from approximately 140 different countries and regions have adopted Colossal AI.
  • Other projects such as PyTorch Lightning (Lightning AI), Hugging Face, and Facebook's OPT recommend using Colossal AI through their official webpages.

Various Forms of Parallelism

This section focuses on different forms of parallelism supported by Colossal AI.

Complete Set of Parallelism Methods

  • Colossal AI provides a comprehensive set of parallelism methods including data parallelism, tensor parallelism, pipeline parallelism, and sequence parallelism.
  • These methods offer advanced techniques that outperform existing approaches and are provably optimal.

2D, 3D, and 2.5D Parallelism

  • The speaker introduces the concepts of 2D, 3D, and 2.5D parallelism without providing further details.

Large Batch Algorithms for Data Parallelism

This section discusses large batch algorithms optimized for data parallelism in Colossal AI.

Data Parallelism

  • Data parallelism involves splitting the dataset into partitions and training each partition on a separate GPU.
  • The model is replicated across all GPUs to ensure synchronization.
  • Gradients computed using subsets of data are averaged and shared among processors to update local models.
  • Increasing the batch size allows for more parallel processing and significant speedups.
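
The steps above can be sketched in a few lines. This is an illustrative single-process simulation of one synchronous data-parallel step on a toy least-squares problem (all names are hypothetical; a real system runs each shard on its own GPU and averages gradients with an all-reduce collective):

```python
import numpy as np

def local_gradient(w, xs, ys):
    # Gradient of the mean squared error 0.5*(w*x - y)^2 on one worker's shard.
    return np.mean((w * xs - ys) * xs)

def data_parallel_step(w, shards, lr=0.1):
    """One synchronous data-parallel step: every worker computes a gradient
    on its own shard of the batch, the gradients are averaged (the
    all-reduce), and the replicated model is updated identically everywhere."""
    grads = [local_gradient(w, xs, ys) for xs, ys in shards]
    avg_grad = sum(grads) / len(grads)   # all-reduce (mean) over workers
    return w - lr * avg_grad
```

With equal-sized shards, this is mathematically identical to one full-batch SGD step, which is why the averaged-gradient scheme keeps replicas synchronized.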

Challenges with Large Batch Size

  • Training with a very large batch size can hurt convergence: optimization tends to end up in sharp minima that generalize poorly.
  • Existing systems hit a batch-size limit (around 8,000) that hinders effective parallelization.

Scalable Optimizers: LARS and LAMB

  • Colossal AI's authors introduced two approaches: LARS (Layer-wise Adaptive Rate Scaling) and LAMB (Layer-wise Adaptive Moments optimizer for Batch training).
  • These scalable optimizers address the challenges of large-batch training.
  • LARS and LAMB have been widely adopted in industry, delivering significant speedups without sacrificing accuracy.
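
The core idea behind LARS can be sketched for a single layer: the update is scaled by a layer-wise "trust ratio" of weight norm to gradient norm, so every layer moves by a comparable relative amount. This is a minimal sketch with illustrative hyperparameter values, not the full optimizer (which also handles momentum; LAMB applies the same trust ratio on top of Adam):

```python
import numpy as np

def lars_update(w, grad, lr=1.0, trust_coeff=0.001, weight_decay=0.0):
    """One LARS step for one layer: scale the step by ||w|| / ||g|| so the
    relative change per layer stays comparable, stabilizing large batches."""
    g = grad + weight_decay * w
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(g)
    local_lr = trust_coeff * w_norm / g_norm if w_norm > 0 and g_norm > 0 else 1.0
    return w - lr * local_lr * g
```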

Conclusion

Colossal AI offers an efficient memory system, various forms of parallelism, and scalable optimizers like LARS and LAMB. It has gained popularity worldwide with a large user base. The provided methods optimize memory usage, improve performance through parallel processing, and overcome the challenges associated with large-batch training.

Model Parallelism Techniques

In this section, the speaker discusses popular techniques for model parallelism, specifically tensor parallel and pipeline parallel. These techniques involve splitting the model across multiple processors to improve performance.

Tensor Parallel and Pipeline Parallel

  • Tensor parallelism and pipeline parallelism are two popular techniques for model parallelism.
  • Tensor parallelism splits the model's weight matrices into parts and assigns each part to a different processor.
  • NVIDIA's Megatron-LM and Microsoft's DeepSpeed are two well-known open-source systems that implement these techniques.
  • Megatron-LM splits each weight matrix along one dimension, giving, for example, half of it to one processor and half to another.
  • DeepSpeed is compatible with Megatron-LM and improves data-parallel training through its Zero Redundancy Optimizer (ZeRO).

Advanced Model Parallelism Techniques

This section introduces more advanced methods of model parallelism that can significantly improve performance. These methods involve sharding inputs and outputs along tensor dimensions.

1D Parallelism vs. Advanced Methods

  • Colossal AI provides a system called 1D parallelism, which splits weights between processors.
  • More advanced methods shard inputs and outputs along tensor dimensions, allowing for faster processing.
  • Each subset of data or work is represented by a colored cube in a diagram.
  • The subsets assigned to different processors correspond to nested loops in matrix multiplication.
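
A toy sketch of the 1D case, simulating per-processor shards with NumPy arrays (illustrative only; in a real system each shard's matmul runs on a different GPU and the final concatenation is an all-gather):

```python
import numpy as np

def column_parallel_linear(x, W, n_devices):
    """1D tensor parallelism for a linear layer, simulated in one process:
    the weight matrix is split along its output (column) dimension, each
    'device' computes a slice of the output, and the slices are gathered."""
    shards = np.array_split(W, n_devices, axis=1)   # one weight shard per device
    partial = [x @ w for w in shards]               # local matmul on each device
    return np.concatenate(partial, axis=-1)         # all-gather of the outputs
```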

2.5D Algorithm for Matrix Multiplication

The speaker explains the 2.5D algorithm for matrix multiplication, which further reduces communication costs compared to traditional approaches like SUMMA.

Matrix Replication and Contribution

  • The 2.5D algorithm breaks up each nested loop defining matrix multiplication.
  • Matrices A, B, and C are replicated as often as memory permits.
  • Different processors contribute to the final C matrix by computing different contributions.
  • This reduces communication costs even further than the SUMMA algorithm.
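
The 2D base case that 2.5D generalizes can be sketched as a SUMMA-style blocked multiply on a simulated p x p processor grid; 2.5D additionally replicates the grid along a third dimension to trade memory for communication. This is an illustrative single-process version, not a distributed implementation:

```python
import numpy as np

def summa_2d(A, B, p):
    """SUMMA-style blocked matrix multiply on a simulated p x p grid: at
    step k, the k-th block-column of A and the k-th block-row of B are
    broadcast, and each processor (i, j) accumulates into its own C block."""
    n = A.shape[0]
    b = n // p                        # block size (assumes p divides n)
    C = np.zeros((n, n))
    def blk(M, i, j):
        return M[i*b:(i+1)*b, j*b:(j+1)*b]
    for k in range(p):                # one broadcast round per step
        for i in range(p):
            for j in range(p):        # work done by processor (i, j)
                C[i*b:(i+1)*b, j*b:(j+1)*b] += blk(A, i, k) @ blk(B, k, j)
    return C
```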

Scaling Results

The speaker presents results from strong-scaling experiments conducted on a cluster with 200 GPUs. The results show the superiority of the proposed methods compared to Megatron-LM.

Strong Scaling Results

  • Strong scaling experiments involved adding more processors while keeping the problem size fixed.
  • Comparisons were made between forward and backward time for batch processing.
  • The proposed methods consistently outperformed Megatron-LM, achieving higher throughput and lower runtime.
  • The 2.5D algorithm delivered a further factor of improvement in throughput over 2D parallelism.

Sequence Parallelism

The speaker discusses the challenges of dealing with longer sequences in training models due to memory constraints.

Longer Sequences and Memory Constraints

  • Real-world applications often involve long sequences, such as articles or amino acid sequences.
  • Current networks trained on massive clusters are limited to around 500 to 2000 tokens per sequence.
  • Increasing the sequence length improves accuracy but runs into memory limitations.

Overcoming Sequence Length Limitations

The speaker explains how overcoming sequence length limitations can lead to better accuracy in model training.

Increasing Sequence Length

  • Allowing for longer sequences can improve model accuracy.
  • Memory consumption is a challenge due to model and optimizer states as well as input data size.

Training Models with Long Sequences

The speaker discusses the challenge of training models with long sequences and proposes a solution called sequence parallelism to reduce memory usage. This technique involves splitting the full-length sequence data into subsequences, allowing each device to hold only a single subsequence. This reduces the memory burden on a single device and enables training on longer sequences.

  • Sequence parallelism splits full-length sequence data into subsequences.
  • Each device holds a single subsequence, reducing memory burden.
  • Enables training on longer sequences that couldn't fit on a single GPU.
  • Used to speed up AlphaFold for analyzing amino acid sequences.
  • Devices communicate via ring self-attention, passing key (and value) blocks around a ring so that each device can attend over the full sequence.
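
A single-process sketch of the idea, simulating devices with NumPy arrays: each "device" holds one subsequence of Q, K, and V, and key/value blocks are visited in ring order so every device eventually attends over the full sequence. Real implementations overlap the ring communication with compute; this sketch only illustrates the data flow:

```python
import numpy as np

def full_attention(Q, K, V):
    """Reference: ordinary single-device softmax attention."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def ring_self_attention(Q_shards, K_shards, V_shards):
    """Each 'device' i holds one subsequence shard; K/V shards are passed
    around a ring so every device sees all keys and values."""
    n = len(Q_shards)
    d = Q_shards[0].shape[-1]
    outputs = []
    for i in range(n):                    # loop over devices
        score_blocks = [None] * n
        v_blocks = [None] * n
        for step in range(n):             # one ring hop per step
            j = (i + step) % n            # shard held by device i at this step
            score_blocks[j] = Q_shards[i] @ K_shards[j].T / np.sqrt(d)
            v_blocks[j] = V_shards[j]
        scores = np.concatenate(score_blocks, axis=-1)
        V_full = np.concatenate(v_blocks, axis=0)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        outputs.append(w @ V_full)
    return np.concatenate(outputs, axis=0)
```

The per-device outputs concatenate to exactly the full-sequence attention result, which is what lets the memory for keys and values be spread across devices.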

Auto Parallelism for Code Optimization

The speaker introduces auto parallelism, a feature that automatically optimizes models by combining the techniques discussed earlier, including ZeRO-style redundancy elimination and activation checkpointing. The process involves tracing the computation graph, optimizing strategies for each node, annotating the nodes, and executing them on the available hardware.

  • Auto parallelism optimizes models automatically using various techniques.
  • Traces computation graph and assigns costs to different operations.
  • Optimizes strategies for each node to minimize compute time and data movement time.
  • Annotates strategies to nodes and handles data movement automatically.
  • Can be customized for extensibility if needed.
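
The per-node strategy optimization can be sketched as a small dynamic program over a linear computation graph. Everything below — node names, strategy names, and cost numbers — is purely illustrative and not Colossal AI's actual cost model:

```python
def choose_strategies(nodes, strategies, comm_cost):
    """Toy auto-parallel strategy search: for a chain of graph nodes, pick
    the per-node strategy minimizing compute cost plus data-movement
    (resharding) cost between consecutive nodes, via dynamic programming.
    `strategies[node]` maps strategy name -> compute cost;
    `comm_cost(prev, cur)` is the resharding cost between strategies."""
    best = dict(strategies[nodes[0]])          # cheapest cost ending in strategy s
    choice = {s: [s] for s in best}            # strategy sequence achieving it
    for node in nodes[1:]:
        new_best, new_choice = {}, {}
        for s, c in strategies[node].items():
            prev = min(best, key=lambda p: best[p] + comm_cost(p, s))
            new_best[s] = best[prev] + comm_cost(prev, s) + c
            new_choice[s] = choice[prev] + [s]
        best, choice = new_best, new_choice
    s = min(best, key=best.get)
    return choice[s], best[s]
```

The same minimize-compute-plus-movement objective is what the annotation step then attaches to each node before execution.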

Efficient Memory Usage in Heterogeneous Systems

The speaker explains the challenge of fitting large-scale models into limited GPU memory. They propose using CPU memory and non-volatile disk storage as supplements. To swap data efficiently among these platforms at runtime, they introduce chunk-based memory management as an improvement over existing solutions such as the Zero Redundancy Optimizer (ZeRO).

  • Limited GPU memory can be supplemented with CPU memory and disk storage.
  • Efficient data swapping among different memory platforms is crucial.
  • Existing solution: the Zero Redundancy Optimizer (ZeRO) in Microsoft's DeepSpeed.
  • DeepSpeed partitions and distributes model parameters across multiple GPUs.
  • Offloading system based on chunk-based memory management for efficient memory usage.

Improving Offloading System with Chunk-Based Memory Management

The speaker discusses the limitations of the static strategy used in ZeRO-Offload, which wastes memory space. They introduce a new offloading system based on chunk-based memory management, in which consecutive parameters are stored in chunks. This approach utilizes GPU and CPU memory more efficiently.

  • ZeRO-Offload used a static strategy for offloading tensors to CPU and disk.
  • The static strategy wasted memory space and increased total memory requirements.
  • New offloading system based on chunk-based memory management introduced.
  • Consecutive parameters stored in chunks for efficient utilization of GPU and CPU memory.
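
A toy sketch of the chunk idea: consecutive parameters are packed into fixed-size chunks, and chunks move between a simulated GPU and CPU with LRU eviction when the GPU budget is exceeded. This is illustrative only; the real system (Gemini, discussed later) drives eviction with runtime memory statistics rather than plain LRU:

```python
class ChunkManager:
    """Toy chunk-based offloading: pack consecutive parameter tensors into
    fixed-size chunks; evict least-recently-used chunks to (simulated) CPU
    memory when the simulated GPU budget is exceeded."""

    def __init__(self, chunk_size, gpu_budget):
        self.chunk_size = chunk_size   # max elements per chunk
        self.gpu_budget = gpu_budget   # max number of chunks resident on GPU
        self.chunks = []               # list of [(param_name, numel), ...]
        self.on_gpu = []               # resident chunk ids, LRU order (oldest first)

    def _used(self, chunk):
        return sum(n for _, n in chunk)

    def add_param(self, name, numel):
        """Pack a parameter into the current chunk if it fits, else open a new one."""
        if self.chunks and self._used(self.chunks[-1]) + numel <= self.chunk_size:
            self.chunks[-1].append((name, numel))
        else:
            self.chunks.append([(name, numel)])

    def access(self, chunk_id):
        """Bring a chunk onto the GPU; return the ids evicted to CPU, if any."""
        if chunk_id in self.on_gpu:
            self.on_gpu.remove(chunk_id)
        self.on_gpu.append(chunk_id)           # most recently used at the end
        evicted = []
        while len(self.on_gpu) > self.gpu_budget:
            evicted.append(self.on_gpu.pop(0)) # offload the LRU chunk
        return evicted
```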


Utilization and Memory Tracing

In this section, the speaker explains how Colossal AI keeps memory utilization high, improving on DeepSpeed by tracing memory usage across training steps.

Memory Tracing for Improved Performance

  • Colossal AI traces memory usage automatically at different points during training.
  • It generates a strategy that decides which chunks to move between the CPU and GPU, and when.
  • When GPU memory is insufficient, more chunks can be evicted to the CPU to avoid out-of-memory errors.
  • Colossal AI can thus handle situations that DeepSpeed cannot, by evicting chunks to the CPU when necessary.
  • Conversely, when CPU memory is tight, fewer chunks are evicted to the CPU, allowing more effective use of GPU memory.
  • Memory-usage information collected in earlier steps is used in later steps to control the data-movement strategy precisely.

Performance Results Comparison

This section presents performance results comparing PyTorch, DeepSpeed, and Colossal AI in terms of model size and throughput.

Model Size and Throughput Comparison

  • Colossal AI can handle models up to 120 times larger than PyTorch and up to three times larger than DeepSpeed on a single inexpensive GPU.
  • The throughput comparison shows that as model size increases, DeepSpeed fails with out-of-memory errors while Colossal AI keeps succeeding.

Benchmarking Use Cases: Large AI Model Training

The speaker discusses benchmarking use cases for large AI model training with GPT-2.

GPT-2 Training Benchmark

  • GPT-2 benefits from the n-dimensional parallelism system provided by Colossal AI.
  • It allows training a model up to 24 times larger on the same hardware compared to PyTorch.
  • Colossal AI achieves over 3x acceleration compared to DeepSpeed in this benchmark.

Benchmarking Use Cases: Inference

This section focuses on benchmarking use cases related to inference with large-scale models.

Large-Scale Model Inference Benchmark

  • Existing inference solutions often lack support for running large-scale models on a single GPU.
  • Colossal AI's inference module enables efficient parallel inference for large-scale models.
  • It provides pre-built examples and hides the detailed parallel runtime implementation.
  • Integration with an HTTP service allows quick deployment of the models.
  • Dedicated recipes, such as one for BLOOM, are provided for efficient INT8 quantization and deployment of large-model inference services.
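
For illustration, here is a minimal sketch of symmetric per-tensor INT8 quantization, the basic idea behind such recipes. This is an assumption-laden toy version; production INT8 inference typically uses more refined per-channel and outlier-aware schemes:

```python
import numpy as np

def int8_quantize(x):
    """Symmetric per-tensor INT8 quantization: scale the tensor so its
    largest magnitude maps to 127, then round to 8-bit integers."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_dequantize(q, scale):
    """Recover an approximate float tensor from INT8 values and the scale."""
    return q.astype(np.float32) * scale
```

Storing weights as INT8 plus one scale per tensor cuts memory roughly 4x versus FP32, at the cost of a bounded rounding error of at most half a quantization step.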

Benchmarking Use Cases: Multimodal AI Applications

The speaker discusses benchmarking use cases related to multimodal AI applications, specifically textually generated images.

Multimodal AI Applications Benchmark

  • Colossal AI provides a Stable Diffusion model recipe for low-cost training, fine-tuning, and inference.
  • Memory requirements for Stable Diffusion can be significantly reduced with Colossal AI, resulting in cost savings.
  • By using inexpensive GPUs instead of expensive ones like the A100, hardware costs can be reduced by up to a factor of 46.

Memory Management System: Gemini

The speaker explains how the memory management system Gemini helps break through the GPU memory wall by utilizing both CPU and GPU memory simultaneously.

Utilizing CPU and GPU Memory with Gemini

  • Gemini allows data movement between CPU and GPU based on memory requirements.
  • It enables personalized text-to-image models like Stable Diffusion with only four gigabytes of GPU memory.
  • This makes it more accessible for users who don't have expensive GPUs.
  • The original version of Stable Diffusion required 16 gigabytes of memory.

Modeling Levels and RLHF

The speaker briefly mentions dealing with different levels of modeling and introduces ChatGPT as a hot topic that uses reinforcement learning from human feedback (RLHF).

Dealing with Different Levels of Modeling

  • The speaker briefly mentions the challenge of dealing with different levels of modeling but does not provide further details on this topic.

ChatGPT Training

This section discusses the training process behind ChatGPT, including the use of reinforcement learning and fine-tuned models.

Training Process of ChatGPT

  • The training process involves generating multiple responses that are ranked by humans.
  • These rankings are used to train a reward model that simulates human evaluations for model-generated content.
  • Reinforcement learning, specifically Proximal Policy Optimization (PPO), is used to further fine-tune the models.
  • Training multiple models simultaneously in stage two of reinforcement learning is challenging.
  • Different models require separate forward and backward operations during training.
  • The InstructGPT paper reports using 175-billion-parameter GPT-3 pre-trained models for the actor and SFT models, while the critic and reward models use 6-billion-parameter models.
  • Running this training process requires approximately 3.6 terabytes of GPU memory, making it expensive.
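
The reward-model stage can be illustrated with the standard pairwise ranking loss used in RLHF: given human rankings, the reward of the preferred response is pushed above that of the rejected one. A minimal NumPy sketch, not the exact InstructGPT implementation:

```python
import numpy as np

def reward_ranking_loss(r_chosen, r_rejected):
    """Pairwise ranking loss for an RLHF reward model:
    -log(sigmoid(r_chosen - r_rejected)), averaged over pairs.
    Minimizing it pushes the preferred response's reward above the
    rejected one's. Uses log1p(exp(-d)) for numerical stability."""
    diff = np.asarray(r_chosen) - np.asarray(r_rejected)
    return float(np.mean(np.log1p(np.exp(-diff))))
```

The trained reward model then supplies the scalar signal that PPO maximizes when fine-tuning the actor in the reinforcement-learning stage.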

ColossalChat

This section highlights the introduction of ColossalChat, an open-source counterpart to ChatGPT with a reduced parameter count.

Introduction of ColossalChat

  • ColossalChat is an open-source project closely resembling the original ChatGPT technology.
  • It includes the complete RLHF (Reinforcement Learning from Human Feedback) pipeline.
  • ColossalChat requires only 10 billion parameters, compared to the much larger parameter counts of other versions.
  • It offers bilingual capabilities similar to ChatGPT (GPT-3.5), allowing users to ask questions and receive relevant answers.
  • The open-source solution includes data, training, inference, and evaluation components.

Throughput Results

This section compares the performance of Colossal AI with PyTorch DDP in terms of throughput for both inference and training.

Performance Comparison: Colossal AI vs. PyTorch DDP

  • Colossal AI demonstrates better throughput compared to PyTorch DDP in various scenarios.
  • Inference with Colossal AI is approximately 1.4 times faster, while training is about 7.7 times faster.
  • The tests were conducted on an A100 server with eight GPUs, assuming the four models in the RLHF stage are the same size (about 700 million parameters).
  • PyTorch DDP is memory-limited and can only run up to the GPT-2 L model, at roughly 700 million parameters.

Scalability

This section discusses the scalability of Colossal AI and its ability to train larger models compared to PyTorch DDP.

Scalability of Colossal AI

  • Even the smallest GPT model runs out of memory on a consumer-grade GPU with PyTorch DDP.
  • Using an AWS p4de node (80 GB A100 GPUs plus 1 TB of CPU memory), Colossal AI can train models with up to 8 billion parameters.
  • This is ten times larger than what PyTorch DDP can handle (0.78 billion parameters).
  • Colossal AI significantly reduces the cost of replicating ChatGPT-style training.

ColossalChat Capabilities

This section showcases examples of what can be achieved with ColossalChat, including question answering, translation, writing tasks, and coding.

Examples of ColossalChat Capabilities

  • ColossalChat handles a range of tasks well, including question answering, translation (including Chinese), writing letters of recommendation, and coding.
  • It provides realistic responses and outperforms other models like Alpaca in certain scenarios.

Model Evaluation

This section discusses the effort put into building an evaluation dataset, associated criteria, and a benchmark for comparing different foundation models.

Evaluation of Foundation Models

  • A comprehensive evaluation dataset, evaluation criteria, and benchmark have been developed.
  • These tools provide a scientific basis for comparing different foundation models.
  • The goal is to choose the best model for further fine-tuning based on objective evaluations.

FastFold

This section introduces FastFold, a system designed to reduce training time for protein-structure models used in medical drug discovery.

FastFold: Accelerating Protein Model Training

  • FastFold is a system designed specifically for protein model training in the biomedical industry.
  • It uses Dynamic Axis Parallelism instead of traditional tensor parallelism.
  • Dynamic Axis Parallelism divides data along the sequence dimension and uses all-to-all communication.
  • Other optimizations, such as Duality Async operations, are combined to overlap computation and communication.
  • FastFold reduces training time from 11 days to 67 hours and achieves 7.5x to 9.5x faster inference.
  • Customized optimizations implemented in Triton speed up specific layers, such as flash attention and fast norm.
  • Sequences up to a length of ten thousand can be processed efficiently, covering almost all proteins.

Closing Remarks

This section concludes the presentation by thanking the audience and providing information on how to learn more about Colossal AI through their Slack team and GitHub.

Conclusion

  • The speaker expresses gratitude for the audience's time.
  • Interested individuals can join their Slack team or visit their GitHub page for more information on Colossal AI.

Video Description

The proliferation of large models based on Transformers has outpaced advances in hardware, resulting in an urgent need to distribute enormous models across multiple GPUs. Despite this growing demand, best practices for choosing an optimal strategy are still lacking due to the breadth of knowledge required across HPC, DL, and distributed systems. These difficulties have stimulated both AI and HPC developers to explore key questions: How can the training and inference efficiency of large models be improved to reduce costs? How can larger AI models be accommodated even with limited resources? What can be done to enable more community members to easily access large models and large-scale applications? This session investigates efforts to answer these questions. First, diverse parallelization is an important tool for improving the efficiency of large-model training and inference. Second, heterogeneous memory management can enhance the model capacity of processors such as GPUs. Finally, user-friendly DL systems for large models significantly reduce the specialized background knowledge users need, allowing more community members to get started with larger models more easily. The session presents a system-level open-source solution, Colossal-AI. More information can be found at https://github.com/hpcaitech/ColossalAI.

Talk by: James Demmel and Yang You