Deploy LLM to Production on Single GPU: REST API for Falcon 7B (with QLoRA) on Inference Endpoints

Name: Deploy LLM to Production on Single GPU: REST API for Falcon 7B (with QLoRA) on Inference Endpoints
Uploaded: 2023-06-21T15:00:08.000Z
Duration: 44 min

Deploying a Fine-tuned Language Model with Hugging Face Inference Endpoints

In this video, the presenter demonstrates how to deploy a fine-tuned language model using Hugging Face Inference Endpoints. The process involves merging the fine-tuned adapter weights into the base model, uploading the necessary files to the Hugging Face Hub, and using a custom handler to serve the model behind a REST API.

Deploying a Fine-tuned Language Model

The model to deploy is obtained by merging the fine-tuned (QLoRA adapter) weights into the original base model.

The original tokenizer and supporting files are uploaded to Hugging Face Hub.

Inference endpoints are used to deploy the model behind a REST API.

A custom handler is created to take prompts as input, create a pipeline based on the merged model and tokenizer, and return responses.

Accessing Tutorial Resources

ML Expert Pro subscribers can access a complete text tutorial on deploying LLM to production.

The tutorial includes code examples and links to relevant resources such as Google Colab notebooks.

Setting Up Environment

A Google Colab notebook with a Tesla T4 GPU (16 GB of VRAM) is used for running the code.

Required dependencies such as PyTorch, bitsandbytes, and Accelerate are installed.

Necessary imports are made in the notebook.

Uploading Files to Hugging Face Hub

The adapter config and adapter model files from a previous repository containing the QLoRA chat model are loaded into variables.

The original Falcon 7B base model is loaded into another variable in 16-bit precision.

Remote code is trusted (trust_remote_code) so that the custom model files required to run the model can be downloaded.

The PEFT model is created by combining the original Falcon 7B base model with the QLoRA adapter weights.
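The loading steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the presenter's exact notebook code: it assumes the `transformers`, `peft`, `torch`, and `accelerate` packages are installed, and `ADAPTER_REPO_ID` is a hypothetical placeholder for the adapter repository from the earlier video. The helper names are also made up for illustration. Imports are deferred into the function so the snippet can be imported without a GPU environment.

```python
# Hypothetical placeholder: replace with the Hub repository that holds the
# adapter config and adapter weights.
ADAPTER_REPO_ID = "your-username/your-qlora-adapter"


def base_model_kwargs(dtype):
    """Keyword arguments for loading the base model in 16-bit precision."""
    return {
        "torch_dtype": dtype,        # e.g. torch.float16
        "trust_remote_code": True,   # allow the repo's custom model code to run
        "device_map": "auto",        # requires accelerate; places weights on the GPU
    }


def load_peft_model(adapter_repo_id=ADAPTER_REPO_ID):
    """Load the base model named in the adapter config and attach the adapter."""
    import torch
    from peft import PeftConfig, PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # The adapter config records which base model the adapter was trained on.
    config = PeftConfig.from_pretrained(adapter_repo_id)
    base_model = AutoModelForCausalLM.from_pretrained(
        config.base_model_name_or_path, **base_model_kwargs(torch.float16)
    )
    tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
    # Wrap the base model so the adapter weights are applied on top of it.
    peft_model = PeftModel.from_pretrained(base_model, adapter_repo_id)
    return peft_model, tokenizer
```

Calling `load_peft_model()` downloads several gigabytes of weights, so it is best run once per notebook session.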

Merging Weights of Models

The PEFT model's merge_and_unload() method is used to merge the weights of the adapter into the layers of the base model.

After merging, the adapter wrapper is removed and only the RWForCausalLM model (the causal language model) remains.
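Conceptually, merging a standard LoRA adapter folds a low-rank update into each adapted weight matrix: W_merged = W + (alpha / r) * B * A. The toy example below illustrates that arithmetic with plain Python lists. It is only an illustration of the idea with made-up 2x2 numbers, not how the PEFT library implements the merge internally, and the function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # low-rank update with the same shape as weight
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2 x 2)
lora_a = [[1.0, 2.0]]               # A: rank x in_features  (1 x 2)
lora_b = [[0.5], [1.0]]             # B: out_features x rank (2 x 1)

merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0], [2.0, 5.0]]
```

After the merge, the adapter matrices are no longer needed at inference time, which is why the result can be saved and served as an ordinary model.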

Pushing Model and Tokenizer to Hugging Face Hub

The merged model is pushed to a repository named "model_to_merge_tutorial".

The original tokenizer from Falcon 7B is also pushed to the same repository.

The upload process takes around 3 to 4 minutes.
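A rough sketch of the upload step, assuming `model` is the merged Transformers model and `tokenizer` is the original Falcon 7B tokenizer. Both object types expose a `push_to_hub()` method. The helper name `push_artifacts` and the `your-username/` prefix are hypothetical.

```python
def push_artifacts(model, tokenizer, repo_id):
    """Upload the merged model and the original tokenizer to one Hub repository."""
    # Transformers models and tokenizers both provide push_to_hub().
    model.push_to_hub(repo_id)
    tokenizer.push_to_hub(repo_id)
    return repo_id


# Example call (requires being logged in to the Hub first, for instance via
# huggingface_hub.notebook_login() in a notebook):
# push_artifacts(merged_model, tokenizer, "your-username/model_to_merge_tutorial")
```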

Checking Uploaded Files

In the repository, the model's binary weight files and the index.json file are visible.

The tokenizer config and tokenizer files are also present.

The generation config shows the beginning-of-sequence token ID.

Setting up the Configuration

In this section, the speaker discusses setting up the configuration for deploying a large language model to production.

Updating the Configuration

The fixed configuration is available in the MLExpert.io tutorial on deploying a large language model to production.

The updated configuration is taken from the Falcon 7B repository.

The speaker pastes the updated configuration into the project.

Adding Additional Files

Two additional files are needed: configuration_RW.py and modeling_RW.py.

The configuration_RW.py file is copied from the Falcon 7B repository and added to the project.

The modeling_RW.py file, also taken from the Falcon 7B repository, is added with a minor typo correction.

Committing Changes

After adding both files, the changes are committed to the project.

Downloading Model Files

This section focuses on downloading necessary model files for deployment.

Downloading Tokenizer and Model

The tokenizer and the custom model code (modeling_RW) are downloaded from the repository.

Next, the main model weights start downloading.

This may take some time depending on internet speed.

After approximately three and a half minutes, everything has been downloaded and the model is loaded into GPU memory.

Generating Text with Language Model

Here, text generation using a language model pipeline is demonstrated.

Creating Generation Config

A generation config similar to the one used in previous videos is created.

The text-generation pipeline from the Transformers library is used to wrap the model and tokenizer.

Running Text Generation Pipeline

A prompt "How can I create an account?" is passed through the pipeline along with generation config.

Despite a warning that the causal language model class is not officially supported for text generation, the pipeline works fine.

The generated text response is obtained and displayed.
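The generation step can be sketched as below. This is a simplified sketch: it passes plain keyword arguments to the pipeline instead of a GenerationConfig object, and the values in `GENERATION_KWARGS` are illustrative assumptions rather than the settings used in the video. `build_generator` and `generate` are hypothetical helper names.

```python
# Illustrative settings only; tune them for your own model.
GENERATION_KWARGS = {
    "max_new_tokens": 64,       # cap on the length of the reply
    "do_sample": False,         # greedy decoding for repeatable output
    "return_full_text": False,  # return only the continuation, not the prompt
}


def build_generator(model, tokenizer):
    """Wrap an already loaded model and tokenizer in a text-generation pipeline."""
    from transformers import pipeline

    return pipeline("text-generation", model=model, tokenizer=tokenizer)


def generate(generator, prompt, **overrides):
    """Run the pipeline and return the generated text of the first result."""
    kwargs = {**GENERATION_KWARGS, **overrides}
    outputs = generator(prompt, **kwargs)
    return outputs[0]["generated_text"]


# Example (needs a loaded model and tokenizer):
# generator = build_generator(model, tokenizer)
# print(generate(generator, "How can I create an account?"))
```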

Deployment Strategy and Inference Endpoints

This section discusses the deployment strategy using Hugging Face Inference Endpoints.

Deployment Considerations

Hugging Face Inference Endpoints are used for deployment.

A debit or credit card is required as it is a paid service billed by hour or minute.

The speaker mentions that the deployment strategy may vary depending on the provider (e.g., AWS, Azure, Google).

Benefits of Inference Endpoints

Inference endpoints are fast and efficient for deploying models behind an API.

The Hugging Face Inference Endpoints documentation provides step-by-step instructions for creating and deploying models from the repository.

Additional Files Required

An endpoint handler file needs to be added to the project.

This file is responsible for loading the model and generating responses to API requests, using the tokenizer, model, generation config, and pipeline.
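A custom handler for Inference Endpoints is a `handler.py` file that defines an `EndpointHandler` class with an `__init__(path)` method and a `__call__(data)` method. The sketch below shows one possible shape for such a handler. The loading options (16-bit weights, `device_map="auto"`) and the optional `"parameters"` key are assumptions for illustration, not necessarily what the presenter's handler does.

```python
from typing import Any, Dict, List


class EndpointHandler:
    def __init__(self, path: str = ""):
        # Heavy imports are kept here so the module itself stays cheap to import.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

        # `path` points at the repository contents made available to the endpoint.
        tokenizer = AutoTokenizer.from_pretrained(path)
        model = AutoModelForCausalLM.from_pretrained(
            path,
            torch_dtype=torch.float16,   # assumption: serve in 16-bit precision
            trust_remote_code=True,      # the repo ships its own model code
            device_map="auto",
        )
        self.pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        # The request body is expected to look like {"inputs": "<prompt>"}.
        prompt = data["inputs"]
        # Optional generation settings supplied by the caller (assumed key).
        parameters = data.get("parameters") or {}
        return self.pipeline(prompt, **parameters)
```

For a local check before deploying, the class can be instantiated with the path to the downloaded repository and called with a dictionary such as `{"inputs": "How can I create an account?"}`.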

Conclusion

The speaker concludes by explaining how the endpoint handler loads the model and generates responses to API requests.

Model Loading and Testing

In this section, the speaker discusses loading the model into memory and testing the handler.

Loading the Model

The model is loaded into GPU memory.

Since the model has already been downloaded, this process should be quick.

Testing the Handler

The speaker runs code that creates a prompt and a dictionary with inputs for the handler.

The prediction is obtained and printed out.

The code for the handler is added to a new file called "handler.py" in the repository.

The requirements.txt file is also added.

Deploying the Model

This section focuses on deploying the model using inference endpoints.

Deploying Options

There are different options available for deployment, including inference API and inference endpoints.

The Inference API supports models that need up to 10 GB of VRAM, but it has limitations.

Inference endpoints are chosen for deployment in this case.

Creating an Endpoint

A new endpoint is created using the repository.

Instance type (small instance) and advanced configurations (custom task framework, default container type) are selected.

Security level is set to "protected" for authentication purposes.

The estimated cost of this instance is mentioned as 60 cents per hour.

Initializing and Running Endpoint

After initialization, the instance starts running.

The endpoint URL is provided for testing inputs through API calls.

Checking Repository ID and Logs

This section covers checking repository ID and logs after deploying the model.

Repository ID Check

It is important to check that the repository ID matches the one specified at the start.

Logs Analysis

Custom dependencies such as torch 2.0 and the latest version of the Transformers Library are installed properly.

The model is initialized successfully, but there is a warning about text generation not being supported.

Making API Calls

This section demonstrates how to make API calls to the deployed model.

Importing Requests Library

The requests library is imported for making API calls.

Sending API Request

The URL of the inference endpoint and an authorization token are passed as parameters.

A prompt is sent as payload in the request.

The response time for getting a result from the API is mentioned (around 8 or 9 seconds).

The generated text from the response is printed out, which matches the previous results obtained during testing.
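The request described above can be sketched with the `requests` library as follows. The endpoint URL and token shown in the example call are placeholders, the helper names are hypothetical, and the `{"inputs": ...}` payload shape assumes the handler reads the prompt from an `"inputs"` key.

```python
def build_request(endpoint_url, token, prompt):
    """Assemble the keyword arguments for requests.post()."""
    return {
        "url": endpoint_url,
        "headers": {
            "Authorization": f"Bearer {token}",   # protected endpoints need a token
            "Content-Type": "application/json",
        },
        "json": {"inputs": prompt},
    }


def query_endpoint(endpoint_url, token, prompt, timeout=60):
    """Send the prompt to the endpoint and return the decoded JSON response."""
    import requests

    response = requests.post(timeout=timeout, **build_request(endpoint_url, token, prompt))
    response.raise_for_status()  # surface HTTP errors instead of parsing them
    return response.json()


# Example call with placeholder values:
# result = query_endpoint("https://<your-endpoint-url>", "<your-token>",
#                         "How can I create an account?")
# print(result)
```

A generous timeout is used because a single generation may take several seconds.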

Closing Remarks

The speaker thanks the viewers for watching and encourages them to like, share, and subscribe. They also mention joining the ML Expert Pro subscribers for full access to the text tutorial of the video and invite viewers to join the Discord channel.
