How to Run Local LLMs with Llama.cpp: Complete Guide
Introduction to llama.cpp
Overview of llama.cpp
- Are you interested in running your own AI model without paying for OpenAI tokens? This video introduces llama.cpp, which lets users run models privately and build applications like games without incurring token costs.
- The presenter promises that by the end of the video, viewers will have a solid understanding of how to use Llama C++ effectively, likening the learning process to Ichigo's journey in "Bleach."
Key Features of llama.cpp
- llama.cpp enables easy execution of large language models (LLMs) with minimal setup and supports a wide range of hardware configurations, including CPUs and M1 MacBooks. Users do not necessarily need an Nvidia GPU.
- Models must be in GGUF format, which consolidates all necessary components such as layers and metadata into one file, simplifying model management.
Comparison with Other Tools
- There are various options for LLM inference available, including vLLM, Ollama, and LM Studio; however, many are simply wrappers around llama.cpp. These wrappers may simplify model loading but can delay access to new features that land in llama.cpp first.
- While wrappers like Ollama offer ease of use (e.g., automatic downloads), they may not provide the same level of control or feature access as using llama.cpp directly. It is also noted that some of these tools focus solely on inference rather than the broader functionality offered by llama.cpp.
Additional Functionalities
- Beyond basic inference, llama.cpp provides extensive features such as model preparation (including conversion from the safetensors format), quantization for efficiency gains, and benchmarking tools for performance evaluation. It serves as a comprehensive toolkit rather than just an inference tool.
- For high-demand scenarios requiring parallel requests (like API serving), vLLM is recommended over llama.cpp, which is better suited to single-user applications or local deployments. This distinction helps users choose the right tool for their needs.
Getting Started with llama.cpp
- The installation process begins with cloning the repository from GitHub; this step is crucial for accessing all features provided by llama.cpp. The presenter emphasizes exploring the repository locally after cloning it (a sketch follows below).
- The README file within the repository outlines multiple ways to run models: via the command-line interface (CLI) or by setting up a server compatible with OpenAI's API. Additional tools include perplexity measurement utilities and benchmarking capabilities essential for evaluating model performance.
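A minimal sketch of that first step (the URL below is the project's GitHub home):

```bash
# Clone the llama.cpp repository and look around
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
ls   # README, examples, top-level convert scripts, CMake files
```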
Getting Started with llama.cpp: Installation and Setup
Overview of llama.cpp Installation
- The installation process for llama.cpp requires obtaining a binary; as a C++ program, it needs to be compiled for the specific machine.
- Users can opt to download via package managers like Brew, but this may exclude useful Python scripts associated with the program.
Building from Source
- It is recommended to build llama.cpp from source for better control over build parameters and an easier setup.
- Essential prerequisites include having a C++ compiler installed; Linux users can install one easily, while macOS typically has one available through the Xcode Command Line Tools.
Checking Compiler Installation
- A simple command can verify if a C compiler is installed. If issues arise, using ChatGPT for troubleshooting is suggested.
- For Windows users, guidance on installing a C++ compiler should be sought online due to the speaker's lack of experience with that platform.
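For example, on Linux or macOS a quick check looks like this (a sketch; the exact compiler name depends on your system):

```bash
# Verify that a C/C++ compiler is available
cc --version
c++ --version
# On macOS, install the command line tools if these commands are missing
xcode-select --install
```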
Build Instructions
- The build process varies based on hardware; CPU builds are generally applicable across platforms, including macOS (a typical build is sketched below).
- Specific instructions for Nvidia GPU builds are available in the README file, though not covered in detail by the speaker.
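A typical CPU build following the README's CMake instructions:

```bash
# Configure and compile llama.cpp (CPU build; see the README for CUDA/Metal options)
cmake -B build
cmake --build build --config Release -j
# The resulting executables (llama-cli, llama-server, llama-quantize, ...) land in build/bin
ls build/bin
```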
Executables and Python Libraries
- After building, executables such as quantizing tools and server runners will be found in the 'bin' folder within the build directory.
- To run certain scripts effectively, the necessary Python libraries must also be installed alongside llama.cpp.
Installing Python Dependencies
- Installing dependencies through Poetry may pull in incorrect packages; using the provided requirements.txt is advised for accurate installations.
- Users need to ensure they have Python installed before creating a virtual environment for dependency management.
Final Steps in Setup
- Once inside the correct virtual environment, running pip install -r requirements.txt installs all the necessary libraries (see the sketch below).
- Familiarity with key directories such as 'build' and 'bin' is crucial, as they contain the important components of llama.cpp.
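A minimal setup sketch, assuming Python 3 and a POSIX shell:

```bash
# Create and activate a virtual environment, then install the Python dependencies
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```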
With the binaries built and the Python dependencies installed, the environment is ready for running, converting, and preparing models with llama.cpp.
Understanding llama.cpp and GGUF Models
Introduction to llama.cpp and SwiftUI
- The speaker discusses the importance of examples when learning a framework, specifically mentioning a SwiftUI example that demonstrates building applications with llama.cpp.
- It is noted that while there are examples available, they are not exhaustive; users should not expect comprehensive coverage of all topics.
Code and Model Overview
- The speaker mentions various models included in the tutorials, such as GPT-2 and BERT, but questions their relevance for current needs.
- Emphasis is placed on the build folder and the top-level scripts when working with llama.cpp, as these contain the essential components for building binaries.
Getting Started with the GGUF Format
- To begin using models, the speaker suggests visiting Hugging Face to find models in GGUF format, which consolidates everything into a single file for ease of use.
- The GGUF format is highlighted as beneficial because it eliminates the need for separate tokenizer or model files.
Selecting and Downloading Models
- The Gemma 3 model is introduced as a popular choice; viewers are encouraged to check previous videos for more information about different models.
- Users are advised to select models based on their GPU capacity; quantized versions can help fit larger models into limited RAM environments.
Running Inference with Llama CLI
- A new folder named "custom models" will be created to store downloaded GGF files.
- The speaker explains how to use llama-cli to run an LLM without a server setup, loading the model directly for inference (a download-and-run sketch follows below).
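One way to fetch a GGUF file into that folder, assuming the Hugging Face CLI is installed (the repository and file names are placeholders; pick any GGUF build you like):

```bash
# Download a GGUF model from Hugging Face into the custom models folder
pip install -U "huggingface_hub[cli]"
huggingface-cli download your-org/your-model-GGUF your-model-Q4_K_M.gguf \
  --local-dir ./custom-models
```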
Key Parameters in Model Configuration
- Important parameters like NGL (the number of model layers offloaded to the GPU), which determines VRAM usage, are discussed. MacBooks handle memory differently due to their unified memory architecture.
- Sampling parameters such as temperature affect output randomness; lower temperatures yield more deterministic results while higher values increase variability.
Advanced Sampling Techniques
- Top-k sampling allows selection from a defined number of probable tokens during generation. Adjusting this parameter influences diversity in outputs.
- Top-p sampling (nucleus sampling), where only tokens contributing up to a certain probability mass are considered, is also explained.
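A representative llama-cli invocation pulling these parameters together (flag names per current llama.cpp builds; the model path is a placeholder):

```bash
# -ngl: layers offloaded to GPU; --temp: randomness; --top-k / --top-p: sampling filters
./build/bin/llama-cli \
  -m ./custom-models/your-model-Q4_K_M.gguf \
  -ngl 99 --temp 0.7 --top-k 40 --top-p 0.9
```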
Understanding LLM Inference and Configuration
Structured Outputs in JSON
- The speaker discusses the importance of using valid JSON for structured outputs, which ensures that model outputs adhere to a specific schema.
- If no chat template is specified, a default one is used. This is crucial for models that do not define their own templates.
Model Loading and Interaction
- Upon loading the model on an Apple M1 Pro, it takes time initially but transitions into interactive mode where users can input queries similar to ChatGPT.
- The NGL parameter determines GPU layer allocation; on MacBooks, all layers are assigned by default, while Nvidia GPUs allow for more granular control.
Performance Metrics
- The speaker emphasizes measuring tokens generated per second during inference as a key performance metric.
- Running inference with CPU-only processing results in slower token generation rates compared to GPU processing, highlighting the efficiency of hardware utilization.
Server Setup and API Compatibility
- The llama-server binary lets users run LLM inference with customizable parameters and provides a user-friendly web interface for interaction.
- The server offers an API endpoint compatible with OpenAI's completion API, facilitating integration into applications.
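A quick sketch of starting the server and hitting its OpenAI-style endpoint (port and model path are illustrative):

```bash
# Start an OpenAI-compatible HTTP server
./build/bin/llama-server -m ./custom-models/your-model-Q4_K_M.gguf --port 8080

# Query the chat completions endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```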
Benchmarking and Perplexity Measures
- Using llama-bench enables benchmarking different settings or models by running prompt-processing (pp512) and token-generation (tg) tests that measure tokens per second.
- Perplexity measures how well adapted an LLM is to a specific domain by evaluating its predictions against known data (e.g., README files).
Evaluating Domain Adaptation
- A perplexity score indicates how likely the model predicts certain tokens based on provided text; lower scores suggest better adaptation to technical domains.
- An example shows that a perplexity score of 7.6 indicates reasonable adaptation of the Gemma model to technical content.
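Both measurements come from dedicated executables in the build output; a sketch with illustrative paths:

```bash
# Benchmark prompt processing and token generation speed (tokens/second)
./build/bin/llama-bench -m ./custom-models/your-model-Q4_K_M.gguf

# Measure perplexity against a text file from the target domain (e.g. a README)
./build/bin/llama-perplexity -m ./custom-models/your-model-Q4_K_M.gguf -f README.md
```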
Structured Outputs for Applications
- The speaker highlights the utility of structured outputs when building applications that require predictable responses from LLM models.
Understanding JSON Schema and llama.cpp
Introduction to JSON Schema
- The speaker introduces a JSON schema that specifies the output as a dictionary with two fields: name (a string) and h (an integer).
- Emphasizes the popularity of JSON schema, particularly in Python applications like Pydantic, which can easily generate a JSON schema.
Generating Valid JSON Output
- Demonstrates how to prompt for specific data, such as listing Barcelona football players from 2005, expecting valid JSON output instead of free text.
- Highlights the utility of generating structured data formats like JSON for better organization and retrieval.
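A sketch of how this can be wired up against llama-server, assuming a recent build whose /completion endpoint accepts a json_schema field (field names can differ between versions):

```bash
# Ask the server for output constrained to a JSON schema
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "List one Barcelona player from 2005 as JSON.",
    "json_schema": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "h": {"type": "integer"}
      },
      "required": ["name", "h"]
    }
  }'
```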
Memory Management in llama.cpp
- Discusses initial confusion regarding RAM usage when running llama-server; notes that Chrome consumes significant memory.
- Clarifies that llama.cpp uses memory mapping rather than loading the entire model into RAM, optimizing resource use.
Understanding Memory Mapping
- Explains how memory mapping works by only loading necessary pages of the model into RAM when accessed.
- Describes commands to check memory usage and emphasizes that not all model data is loaded at once, which is an efficient feature of llama.cpp.
Model Loading Options
- Introduces flags for managing model loading behavior: one to load everything into RAM up front instead of memory-mapping it, and one to lock the model in RAM after loading (sketched below).
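These correspond to the --no-mmap and --mlock options (a sketch; the same flags apply to llama-cli):

```bash
# Default: the GGUF file is memory-mapped, so pages are loaded on demand.
# --no-mmap : read the whole model into RAM up front instead of mapping it
# --mlock   : lock the model in RAM so the OS cannot page it out
./build/bin/llama-server -m ./custom-models/your-model-Q4_K_M.gguf --no-mmap
./build/bin/llama-server -m ./custom-models/your-model-Q4_K_M.gguf --mlock
```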
Preparing Models for Use with llama.cpp
Converting Models to GGUF Format
- Addresses scenarios where new models may not be available in the GGUF format required by llama.cpp, necessitating conversion.
- Mentions using a specific model (Llama 3.1 8B), noting its size and format challenges.
Cloning and Preparing Models
- Outlines steps to clone models from Hugging Face repositories into custom directories; humorously notes potential delays during this process.
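A sketch of that cloning step (the repository name is illustrative; the large weight files come down via git-lfs, which can take a while):

```bash
# Clone the original safetensors model from Hugging Face into the custom models folder
git lfs install
git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./custom-models/Llama-3.1-8B
```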
Conversion Process Overview
- Details the need to convert the large safetensors files into GGUF format before they can be used within llama.cpp.
Using Python Scripts for Conversion
- Introduces the Python scripts provided by llama.cpp for converting models, emphasizing the importance of running them inside the virtual environment set up earlier.
Model Conversion and Quantization Process
Initial Setup for Model Conversion
- The process begins with setting the output file parameter for the conversion, naming it with a .gguf extension. The conversion produces a large model of about 16 GB (an 8B model at 16-bit precision), as sketched below.
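A conversion sketch using the script shipped with llama.cpp (script name per the current repository layout; run it inside the virtual environment):

```bash
# Convert the safetensors checkpoint to a 16-bit GGUF file
python convert_hf_to_gguf.py ./custom-models/Llama-3.1-8B \
  --outfile ./custom-models/llama-3.1-8b-f16.gguf \
  --outtype f16
```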
Managing Disk Space
- After creating the GGF file, there is a need to delete unnecessary files due to low disk space, emphasizing the importance of maintaining system storage.
Quantizing the Model
- The speaker introduces quantization as a method to reduce model size. They mention that the Gemma 3 model used earlier was already quantized (Q4_0) and now aim to quantize their own converted LLM.
Available Options for Quantization
- The llama-quantize tool is identified as essential for this process. Various options are available, including floating-point formats and 8-bit or 4-bit quantizations.
- Among these options, 4-bit quantization (Q4_K_M) is highlighted as particularly useful, while other variants like Q4_0 are mentioned but deemed less practical.
Executing the Quantization Script
- To execute the tool, users must specify both the input GGUF file and the output location. It is standard practice to append the quantization type to the end of the model name for clarity.
- Upon successful execution, a significant reduction in size from 16 GB to 4 GB is achieved through quantization.
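A quantization sketch (usage: llama-quantize <input.gguf> <output.gguf> <type>; paths are illustrative):

```bash
# Quantize the f16 GGUF down to 4-bit (Q4_K_M)
./build/bin/llama-quantize \
  ./custom-models/llama-3.1-8b-f16.gguf \
  ./custom-models/llama-3.1-8b-Q4_K_M.gguf \
  Q4_K_M
```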
Running the Converted Model
- Instructions are provided for running this newly converted completion model with llama-cli. This version performs text completion rather than question answering.
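A minimal completion run, assuming the quantized file from the previous step:

```bash
# Base (non-instruct) models continue a prompt rather than answer questions
./build/bin/llama-cli \
  -m ./custom-models/llama-3.1-8b-Q4_K_M.gguf \
  -p "Hello, I am" -n 64
```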
Limitations of Unsupervised Models
- A demonstration shows that without fine-tuning, models may produce odd completions; an example given is "hello I am April 1st 2018," indicating limitations in generating coherent outputs without additional training.
Fine-Tuning with LoRA
Introduction to LoRA
- The speaker discusses LoRA (Low-Rank Adaptation), explaining its role in fine-tuning models by adding small sets of weights on top of base models.
Example of Fine-Tuning
- An example is shared about fine-tuning a model based on Donald Trump’s interviews, resulting in behavior mimicking Trump’s speaking style with only minimal weight adjustments (160 MB).
Converting LoRA Models
- To use a LoRA adapter within llama.cpp, it must first be converted into GGUF format using the specific scripts designed for this purpose.
Important Conversion Steps
- Users must ensure they use the dedicated LoRA conversion script (convert_lora_to_gguf.py) rather than the earlier conversion methods, which may fail for adapters.
Finalizing Integration with Base Model
- After conversion, integration involves specifying both the adapter weights and a chat template file tailored to the conversational style produced by the fine-tuning (see the sketches below).
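A conversion sketch (script name per the current llama.cpp repository; the adapter and base-model paths are placeholders):

```bash
# Convert the LoRA adapter (safetensors) to GGUF, pointing at the base model it was trained on
python convert_lora_to_gguf.py ./custom-models/trump-lora \
  --base ./custom-models/Llama-3.1-8B \
  --outfile ./custom-models/trump-lora-f16.gguf
```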
Tokenizer and Model Configuration
Overview of Tokenizer and Model Variants
- The discussion begins with the tokenizer, noting an authorization issue with certain files. The focus then shifts to the Llama model variants, specifically the Llama 3.1 8B base and 8B Instruct models.
- It is noted that the base model lacks a chat template in its tokenizer configuration, which limits its ability to process conversational inputs effectively.
- In contrast, the instruct variant has been fine-tuned on a chat template, allowing it to better understand conversation dynamics and respond appropriately.
Chat Template Functionality
- The chat template provides options for formatting messages, including system messages. By default, llama.cpp attempts to load this template if it is available in the GGUF file.
- Users can either utilize built-in chat templates or create their own using the Jinja templating language; however, familiarity with Jinja is necessary for custom templates.
Running Models with Chat Templates
Executing a Fine-Tuned Model
- To run the fine-tuned LoRA adapter (the Trump GGUF adapter) on top of the base model, users must specify a chat template during execution, as sketched below.
- An example interaction reveals that while initial responses are coherent (e.g., identifying as Donald Trump), longer conversations may lead to less relevant answers due to limited fine-tuning on multi-turn dialogues.
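A run sketch combining the base model, the converted adapter, and an explicit chat template (template name and paths are illustrative; see llama-cli --help for the supported built-in templates):

```bash
# Apply the LoRA adapter at runtime and force a Llama 3 style chat template
./build/bin/llama-cli \
  -m ./custom-models/llama-3.1-8b-Q4_K_M.gguf \
  --lora ./custom-models/trump-lora-f16.gguf \
  --chat-template llama3
```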
Distribution Considerations
- If users wish to share their fine-tuned models without requiring others to manage multiple files or configurations, merging the LoRA adapter with the base model into one GGUF file is suggested.
- However, merging alters the base model significantly; dynamic switching between different LoRA adapters becomes impossible once merged.
Merging Models and Quantization
Merging Process
- It’s emphasized that merging should only occur between unquantized versions of models to avoid introducing errors during dequantization processes.
- The merging command involves specifying the base model, the LoRA adapter, and an output filename; care must be taken with naming conventions for clarity.
Finalizing Model Size
- After merging results in a large file size (16 GB), quantization is necessary. This involves running specific commands tailored for quantizing merged models.
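A merge-then-quantize sketch using the export-lora tool from the build output (flag names per current llama.cpp builds; all paths are illustrative, and both inputs are the unquantized f16 files):

```bash
# Merge the LoRA adapter into the f16 base model
./build/bin/llama-export-lora \
  -m ./custom-models/llama-3.1-8b-f16.gguf \
  --lora ./custom-models/trump-lora-f16.gguf \
  -o ./custom-models/llama-3.1-8b-trump-f16.gguf

# Quantize the merged model back down to 4-bit
./build/bin/llama-quantize \
  ./custom-models/llama-3.1-8b-trump-f16.gguf \
  ./custom-models/llama-3.1-8b-trump-Q4_K_M.gguf \
  Q4_K_M
```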
Building Applications with llama.cpp
Application Development Insights
- Because llama.cpp compiles into a native binary, integrating it into applications requires additional steps such as writing bindings for the preferred programming language.
- Pre-existing bindings are available as packages that facilitate calling methods from compiled binaries without needing extensive coding knowledge.
How to Install and Use llama.cpp with Python
Installation Process
- The speaker discusses installing bindings for various programming languages, focusing specifically on integrating llama.cpp with a Python application.
- To use llama.cpp from Python, the speaker installs Python bindings that also build llama.cpp from source, allowing interaction through Python.
- The installation command pip install llama-cpp-python is suggested, although it is noted that there is no direct pip installer for llama.cpp itself.
Troubleshooting Installation
- The speaker suggests running the installation in the terminal instead of using pip directly and encourages checking documentation for detailed guidance.
- Emphasizes that all functionality offered by llama.cpp can be accessed through any preferred programming language via such bindings, enhancing versatility.
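The installation itself is a single command (it compiles llama.cpp during the install, so a working C++ toolchain is needed):

```bash
# Install the Python bindings for llama.cpp
pip install llama-cpp-python
```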
Loading and Using Models
- After resolving the installation issues, the speaker demonstrates how to load a model (specifically the Gemma 3 model) into their environment.
- The API for interacting with llama.cpp is described as being similar to OpenAI's API, where users provide messages formatted in a specific way (a sketch follows below).
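A minimal sketch of that interaction through the llama-cpp-python bindings, run here as a quick shell one-off (the model path is a placeholder):

```bash
python - <<'PY'
from llama_cpp import Llama

# Load a local GGUF model and ask for an OpenAI-style chat completion
llm = Llama(model_path="./custom-models/your-model-Q4_K_M.gguf")
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello! How are you?"}]
)
print(result["choices"][0]["message"]["content"])
PY
```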
Example Interaction
- An example conversation is shown where an AI responds positively to user input, illustrating how one can chat using the loaded model.
- The importance of bundling applications when developing in Python is highlighted, noting that actual models must be included in deployments.