Getting Started with Pre-Training Foundation Models on SageMaker
Section Overview
This section introduces the process of pre-training a foundation model on Amazon SageMaker, focusing on the 70-billion-parameter Llama 2 model.
Introduction to SageMaker Studio
- Emily introduces the session and outlines the goal: pre-training a Llama 2 model using SageMaker.
- Acknowledgment is given to Arun and other contributors for developing the resources used in this tutorial.
- The environment is JupyterLab within SageMaker Studio, where the necessary files are organized in a directory named "pre-train llama."
Setting Up the Environment
- Initial steps include installing required packages locally within the notebook before downloading datasets.
- The dataset used is the WikiCorpus, which is tokenized locally; a sufficiently powerful instance is emphasized (a c5.18xlarge).
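The setup described above might look like the following first cell in the notebook. The exact package list is an assumption; the session only says "required packages":

```shell
# Hypothetical notebook setup cell: install the libraries the tutorial
# relies on locally. Package names are assumptions, not the session's list.
pip install --quiet --upgrade sagemaker datasets transformers
```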
Tokenization and Data Upload
- After local tokenization, users are instructed to upload their processed data to an S3 bucket for further use.
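A simplified sketch of this preprocessing step is shown below: tokenize text and pack the token ids into fixed-length training sequences, then upload the result to S3. A toy whitespace tokenizer stands in for the real Llama tokenizer, the packing convention (dropping the incomplete tail) is a common default rather than the session's confirmed choice, and the upload helper is only defined, not called, since it needs AWS credentials:

```python
def pack_sequences(token_ids, seq_len):
    """Split a flat list of token ids into full seq_len-sized blocks,
    dropping the incomplete tail (a common pre-training convention)."""
    n_full = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

def toy_tokenize(text, vocab):
    """Toy whitespace tokenizer; unknown words map to id 0."""
    return [vocab.get(word, 0) for word in text.split()]

def upload_to_s3(local_path, bucket, key):
    """Upload the tokenized dataset to S3 (requires AWS credentials).
    Bucket and key names are placeholders."""
    import boto3
    boto3.client("s3").upload_file(local_path, bucket, key)

vocab = {"the": 1, "quick": 2, "brown": 3, "fox": 4}
ids = toy_tokenize("the quick brown fox the quick", vocab)
sequences = pack_sequences(ids, seq_len=2)
print(sequences)  # [[1, 2], [3, 4], [1, 2]]
```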
Configuring Training Parameters
- Key hyperparameters for training are highlighted, including world size and parallel processing configurations.
- The base Docker image is an AWS-managed deep learning container optimized for PyTorch training with NeuronX (AWS's SDK for Trainium).
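The arithmetic behind these parallelism settings can be sketched as follows: the world size (total worker count) must factor into the tensor-, pipeline-, and data-parallel degrees. All names and numbers here are illustrative assumptions, not the session's exact values:

```python
def data_parallel_degree(world_size, tp_degree, pp_degree):
    """Derive the data-parallel degree from the total worker count;
    tensor x pipeline parallelism must divide the world size evenly."""
    assert world_size % (tp_degree * pp_degree) == 0, "degrees must divide world size"
    return world_size // (tp_degree * pp_degree)

# Example: 4 trn1.32xlarge nodes x 32 NeuronCores each = 128 workers (assumed)
hyperparameters = {
    "world_size": 128,
    "tensor_parallel_size": 8,   # illustrative value
    "pipeline_parallel_size": 1, # illustrative value
}
dp = data_parallel_degree(
    hyperparameters["world_size"],
    hyperparameters["tensor_parallel_size"],
    hyperparameters["pipeline_parallel_size"],
)
print(dp)  # 16
```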
Executing Training Jobs
- The PyTorch estimator script (run_llama_nxd) can be downloaded from GitHub; it facilitates running training jobs on Trainium instances.
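A hedged sketch of how such a job might be launched with the SageMaker PyTorch estimator follows. The entry point matches the run_llama_nxd script named above; the instance type, instance count, image URI, and role ARN are placeholders or assumptions, and the actual `fit()` call is left commented out since it needs AWS credentials:

```python
def build_estimator_kwargs(role_arn, image_uri, hyperparameters):
    """Collect keyword arguments for sagemaker.pytorch.PyTorch.
    Instance type/count are assumed values for a Trainium cluster."""
    return dict(
        entry_point="run_llama_nxd.py",   # training script from GitHub
        role=role_arn,
        image_uri=image_uri,              # AWS deep learning container for NeuronX
        instance_type="ml.trn1.32xlarge", # Trainium instance (assumed size)
        instance_count=4,                 # assumed cluster size
        hyperparameters=hyperparameters,
    )

kwargs = build_estimator_kwargs(
    role_arn="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    image_uri="<neuronx-training-container-uri>",             # placeholder
    hyperparameters={"world_size": 128},                      # illustrative
)
print(kwargs["instance_type"])  # ml.trn1.32xlarge

# With credentials configured, the job would be launched roughly as:
# from sagemaker.pytorch import PyTorch
# estimator = PyTorch(**kwargs)
# estimator.fit({"train": "s3://<bucket>/<tokenized-data-prefix>"})
```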