How to Run Local LLMs with Llama.cpp: Complete Guide
Introduction to llama.cpp
Overview of llama.cpp
- Are you interested in running your own AI model without paying for OpenAI tokens? This video introduces llama.cpp, which lets users run models privately and build applications like games without incurring token costs.
- The presenter promises that by the end of the video, viewers will have a solid understanding of how to use Llama C++ effectively, likening the learning process to Ichigo's journey in "Bleach."
Key Features of llama.cpp
- llama.cpp enables easy execution of large language models (LLMs) with minimal setup and supports a wide range of hardware configurations, including CPUs and M1 MacBooks. Users do not necessarily need an Nvidia GPU.
- Models must be in GGUF format, which consolidates all necessary components such as layers and metadata into one file, simplifying model management.
Comparison with Other Tools
- There are various options for LLM inference available, including vLLM, Ollama, and LM Studio; however, many are simply wrappers around llama.cpp. These wrappers may simplify model loading but can delay access to new features that land in llama.cpp first.
- While wrappers like Ollama offer ease of use (e.g., automatic downloads), they may not provide the same level of control or feature access as using llama.cpp directly. It is also noted that some of these tools focus solely on inference rather than the broader functionality offered by llama.cpp.
Additional Functionalities
- Beyond basic inference, llama.cpp provides extensive features such as model preparation (including conversion from the safetensors format), quantization for efficiency gains, and benchmarking tools for performance evaluation. It serves as a comprehensive toolkit rather than just an inference tool.
- For high-demand scenarios requiring parallel requests (like API serving), vLLM is recommended over llama.cpp, which is better suited to single-user applications or local deployments. This distinction helps users choose the right tool for their needs.
Getting Started with llama.cpp
- The installation process begins with cloning the repository from GitHub; this step is crucial for accessing all features provided by llama.cpp. The presenter emphasizes exploring the repository locally after cloning it (a sketch follows below).
- The README file within the repository outlines multiple ways to run models: via the command-line interface (CLI) or by setting up a server compatible with OpenAI's API. Additional tools include perplexity measurement utilities and benchmarking capabilities essential for evaluating model performance.
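A minimal sketch of that first step (the URL below is the project's GitHub home):

```bash
# Clone the llama.cpp repository and look around
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
ls   # README, examples, top-level convert scripts, CMake files
```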
Getting Started with llama.cpp: Installation and Setup
Overview of llama.cpp Installation
- The installation process for llama.cpp requires obtaining a binary; as a C++ program, it needs to be compiled for the specific machine.
- Users can opt to download via package managers like Brew, but this may exclude useful Python scripts associated with the program.
Building from Source
- It is recommended to build llama.cpp from source for better control over build parameters and an easier setup.
- Essential prerequisites include having a C++ compiler installed; Linux users can install one easily, while macOS typically has one available through the Xcode Command Line Tools.
Checking Compiler Installation
- A simple command can verify if a C compiler is installed. If issues arise, using ChatGPT for troubleshooting is suggested.
- For Windows users, guidance on installing a C++ compiler should be sought online due to the speaker's lack of experience with that platform.
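For example, on Linux or macOS a quick check looks like this (a sketch; the exact compiler name depends on your system):

```bash
# Verify that a C/C++ compiler is available
cc --version
c++ --version
# On macOS, install the command line tools if these commands are missing
xcode-select --install
```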
Build Instructions
- The build process varies based on hardware; CPU builds are generally applicable across platforms, including macOS (a typical build is sketched below).
- Specific instructions for Nvidia GPU builds are available in the README file, though not covered in detail by the speaker.
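A typical CPU build following the README's CMake instructions:

```bash
# Configure and compile llama.cpp (CPU build; see the README for CUDA/Metal options)
cmake -B build
cmake --build build --config Release -j
# The resulting executables (llama-cli, llama-server, llama-quantize, ...) land in build/bin
ls build/bin
```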
Executables and Python Libraries
- After building, executables such as quantizing tools and server runners will be found in the 'bin' folder within the build directory.
- To run certain scripts effectively, the necessary Python libraries must also be installed alongside llama.cpp.
Installing Python Dependencies
- Installing dependencies through Poetry may pull in incorrect packages; using the provided requirements.txt is advised for accurate installations.
- Users need to ensure they have Python installed before creating a virtual environment for dependency management.
Final Steps in Setup
- Once inside the correct virtual environment, running pip install -r requirements.txt installs all the necessary libraries (see the sketch below).
- Familiarity with key directories such as 'build' and 'bin' is crucial, as they contain the important components of llama.cpp.
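A minimal setup sketch, assuming Python 3 and a POSIX shell:

```bash
# Create and activate a virtual environment, then install the Python dependencies
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```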
With the binaries built and the Python dependencies installed, the environment is ready for running, converting, and preparing models with llama.cpp.
Understanding llama.cpp and GGUF Models
Introduction to llama.cpp and SwiftUI
- The speaker discusses the importance of examples when learning a framework, specifically mentioning a SwiftUI example that demonstrates building applications with llama.cpp.
- It is noted that while there are examples available, they are not exhaustive; users should not expect comprehensive coverage of all topics.
Code and Model Overview
- The speaker mentions various models included in the tutorials, such as GPT-2 and BERT, but questions their relevance for current needs.
- Emphasis is placed on the build folder and the top-level scripts when working with llama.cpp, as these contain the essential components for building binaries.
Getting Started with the GGUF Format
- To begin using models, the speaker suggests visiting Hugging Face to find models in GGUF format, which consolidates everything into a single file for ease of use.
- The GGUF format is highlighted as beneficial because it eliminates the need for separate tokenizer or model files.
Selecting and Downloading Models
- The Gemma 3 model is introduced as a popular choice; viewers are encouraged to check previous videos for more information about different models.
- Users are advised to select models based on their GPU capacity; quantized versions can help fit larger models into limited RAM environments.
Running Inference with Llama CLI
- A new folder named "custom models" will be created to store downloaded GGF files.
- The speaker explains how to use llama-cli to run an LLM without a server setup, loading the model directly for inference (a download-and-run sketch follows below).
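One way to fetch a GGUF file into that folder, assuming the Hugging Face CLI is installed (the repository and file names are placeholders; pick any GGUF build you like):

```bash
# Download a GGUF model from Hugging Face into the custom models folder
pip install -U "huggingface_hub[cli]"
huggingface-cli download your-org/your-model-GGUF your-model-Q4_K_M.gguf \
  --local-dir ./custom-models
```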
Key Parameters in Model Configuration
- Important parameters like NGL (the number of model layers offloaded to the GPU), which determines VRAM usage, are discussed. MacBooks handle memory differently due to their unified memory architecture.
- Sampling parameters such as temperature affect output randomness; lower temperatures yield more deterministic results while higher values increase variability.
Advanced Sampling Techniques
- Top-k sampling allows selection from a defined number of probable tokens during generation. Adjusting this parameter influences diversity in outputs.
- Top-p sampling (nucleus sampling), where only tokens contributing up to a certain probability mass are considered, is also explained.
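A representative llama-cli invocation pulling these parameters together (flag names per current llama.cpp builds; the model path is a placeholder):

```bash
# -ngl: layers offloaded to GPU; --temp: randomness; --top-k / --top-p: sampling filters
./build/bin/llama-cli \
  -m ./custom-models/your-model-Q4_K_M.gguf \
  -ngl 99 --temp 0.7 --top-k 40 --top-p 0.9
```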
Understanding LLM Inference and Configuration
Structured Outputs in JSON
- The speaker discusses the importance of using valid JSON for structured outputs, which ensures that model outputs adhere to a specific schema.
- If no chat template is specified, a default one is used. This is crucial for models that do not define their own templates.
Model Loading and Interaction
- Upon loading the model on an Apple M1 Pro, it takes time initially but transitions into interactive mode where users can input queries similar to ChatGPT.
- The NGL parameter determines GPU layer allocation; on MacBooks, all layers are assigned by default, while Nvidia GPUs allow for more granular control.
Performance Metrics
- The speaker emphasizes measuring tokens generated per second during inference as a key performance metric.
- Running inference with CPU-only processing results in slower token generation rates compared to GPU processing, highlighting the efficiency of hardware utilization.
Server Setup and API Compatibility
- The llama-server binary lets users run LLM inference with customizable parameters and provides a user-friendly web interface for interaction.
- The server offers an API endpoint compatible with OpenAI's completion API, facilitating integration into applications.
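A quick sketch of starting the server and hitting its OpenAI-style endpoint (port and model path are illustrative):

```bash
# Start an OpenAI-compatible HTTP server
./build/bin/llama-server -m ./custom-models/your-model-Q4_K_M.gguf --port 8080

# Query the chat completions endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```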
Benchmarking and Perplexity Measures
- Using llama-bench enables benchmarking different settings or models by running prompt-processing (pp512) and token-generation (tg) tests that measure tokens per second.
- Perplexity measures how well adapted an LLM is to a specific domain by evaluating its predictions against known data (e.g., README files).
Evaluating Domain Adaptation
- A perplexity score indicates how likely the model predicts certain tokens based on provided text; lower scores suggest better adaptation to technical domains.
- An example shows that a perplexity score of 7.6 indicates reasonable adaptation of the Gemma model to technical content.
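Both measurements come from dedicated executables in the build output; a sketch with illustrative paths:

```bash
# Benchmark prompt processing and token generation speed (tokens/second)
./build/bin/llama-bench -m ./custom-models/your-model-Q4_K_M.gguf

# Measure perplexity against a text file from the target domain (e.g. a README)
./build/bin/llama-perplexity -m ./custom-models/your-model-Q4_K_M.gguf -f README.md
```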
Structured Outputs for Applications
- The speaker highlights the utility of structured outputs when building applications that require predictable responses from LLM models.
Understanding JSON Schema and llama.cpp
Introduction to JSON Schema
- The speaker introduces a JSON schema that specifies the output as a dictionary with two fields: name (a string) and h (an integer).
- Emphasizes the popularity of JSON schema, particularly in Python applications like Pydantic, which can easily generate a JSON schema.
Generating Valid JSON Output
- Demonstrates how to prompt for specific data, such as listing Barcelona football players from 2005, expecting valid JSON output instead of free text.
- Highlights the utility of generating structured data formats like JSON for better organization and retrieval.
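A sketch of how this can be wired up against llama-server, assuming a recent build whose /completion endpoint accepts a json_schema field (field names can differ between versions):

```bash
# Ask the server for output constrained to a JSON schema
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "List one Barcelona player from 2005 as JSON.",
    "json_schema": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "h": {"type": "integer"}
      },
      "required": ["name", "h"]
    }
  }'
```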
Memory Management in llama.cpp
- Discusses initial confusion regarding RAM usage when running llama-server; notes that Chrome consumes significant memory.
- Clarifies that llama.cpp uses memory mapping rather than loading the entire model into RAM, optimizing resource use.
Understanding Memory Mapping
- Explains how memory mapping works by only loading necessary pages of the model into RAM when accessed.
- Describes commands to check memory usage and emphasizes that not all model data is loaded at once, which is an efficient feature of llama.cpp.
Model Loading Options
- Introduces flags for managing model loading behavior: one to load everything into RAM up front instead of memory-mapping it, and one to lock the model in RAM after loading (sketched below).
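These correspond to the --no-mmap and --mlock options (a sketch; the same flags apply to llama-cli):

```bash
# Default: the GGUF file is memory-mapped, so pages are loaded on demand.
# --no-mmap : read the whole model into RAM up front instead of mapping it
# --mlock   : lock the model in RAM so the OS cannot page it out
./build/bin/llama-server -m ./custom-models/your-model-Q4_K_M.gguf --no-mmap
./build/bin/llama-server -m ./custom-models/your-model-Q4_K_M.gguf --mlock
```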
Preparing Models for Use with llama.cpp
Converting Models to GGUF Format
- Addresses scenarios where new models may not be available in the GGUF format required by llama.cpp, necessitating conversion.
- Mentions using a specific model (Llama 3.1 8B), noting its size and format challenges.
Cloning and Preparing Models
- Outlines steps to clone models from Hugging Face repositories into custom directories; humorously notes potential delays during this process.
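A sketch of that cloning step (the repository name is illustrative; the large weight files come down via git-lfs, which can take a while):

```bash
# Clone the original safetensors model from Hugging Face into the custom models folder
git lfs install
git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./custom-models/Llama-3.1-8B
```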
Conversion Process Overview
- Details the need to convert the large safetensors files into GGUF format before they can be used within llama.cpp.
Using Python Scripts for Conversion
- Introduces the Python scripts provided by llama.cpp for converting models, emphasizing the importance of running them inside the virtual environment set up earlier.
Model Conversion and Quantization Process
Initial Setup for Model Conversion
- The process begins with setting the output file parameter for the conversion, naming it with a .gguf extension. The conversion produces a large model of about 16 GB (an 8B model at 16-bit precision), as sketched below.
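A conversion sketch using the script shipped with llama.cpp (script name per the current repository layout; run it inside the virtual environment):

```bash
# Convert the safetensors checkpoint to a 16-bit GGUF file
python convert_hf_to_gguf.py ./custom-models/Llama-3.1-8B \
  --outfile ./custom-models/llama-3.1-8b-f16.gguf \
  --outtype f16
```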
Managing Disk Space
- After creating the GGF file, there is a need to delete unnecessary files due to low disk space, emphasizing the importance of maintaining system storage.
Quantizing the Model
- The speaker introduces quantization as a method to reduce model size. They mention that the Gemma 3 model used earlier was already quantized (Q4_0) and now aim to quantize their own converted LLM.
Available Options for Quantization
- The llama-quantize tool is identified as essential for this process. Various options are available, including floating-point formats and 8-bit or 4-bit quantizations.
- Among these options, 4-bit quantization (Q4_K_M) is highlighted as particularly useful, while other variants like Q4_0 are mentioned but deemed less practical.
Executing the Quantization Script
- To execute the tool, users must specify both the input GGUF file and the output location. It is standard practice to append the quantization type to the end of the model name for clarity.
- Upon successful execution, a significant reduction in size from 16 GB to 4 GB is achieved through quantization.
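A quantization sketch (usage: llama-quantize <input.gguf> <output.gguf> <type>; paths are illustrative):

```bash
# Quantize the f16 GGUF down to 4-bit (Q4_K_M)
./build/bin/llama-quantize \
  ./custom-models/llama-3.1-8b-f16.gguf \
  ./custom-models/llama-3.1-8b-Q4_K_M.gguf \
  Q4_K_M
```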
Running the Converted Model
- Instructions are provided for running this newly converted completion model with llama-cli. This version performs text completion rather than question answering.
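A minimal completion run, assuming the quantized file from the previous step:

```bash
# Base (non-instruct) models continue a prompt rather than answer questions
./build/bin/llama-cli \
  -m ./custom-models/llama-3.1-8b-Q4_K_M.gguf \
  -p "Hello, I am" -n 64
```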
Limitations of Unsupervised Models
- A demonstration shows that without fine-tuning, models may produce odd completions; an example given is "hello I am April 1st 2018," indicating limitations in generating coherent outputs without additional training.
Fine-Tuning with LoRA
Introduction to LoRA
- The speaker discusses LoRA (Low-Rank Adaptation), explaining its role in fine-tuning models by adding small sets of weights on top of base models.
Example of Fine-Tuning
- An example is shared about fine-tuning a model based on Donald Trump’s interviews, resulting in behavior mimicking Trump’s speaking style with only minimal weight adjustments (160 MB).
Converting LoRA Models
- To use a LoRA adapter within llama.cpp, it must first be converted into GGUF format using the specific scripts designed for this purpose.
Important Conversion Steps
- Users must ensure they use the dedicated LoRA conversion script (convert_lora_to_gguf.py) rather than the earlier conversion methods, which may fail for adapters.
Finalizing Integration with Base Model
- After conversion, integration involves specifying both the adapter weights and a chat template file tailored to the conversational style produced by the fine-tuning (see the sketches below).
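A conversion sketch (script name per the current llama.cpp repository; the adapter and base-model paths are placeholders):

```bash
# Convert the LoRA adapter (safetensors) to GGUF, pointing at the base model it was trained on
python convert_lora_to_gguf.py ./custom-models/trump-lora \
  --base ./custom-models/Llama-3.1-8B \
  --outfile ./custom-models/trump-lora-f16.gguf
```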
Tokenizer and Model Configuration
Overview of Tokenizer and Model Variants
- The discussion begins with the tokenizer, noting an authorization issue with certain files. The focus then shifts to the Llama model variants, specifically the Llama 3.1 8B base and 8B Instruct models.
- It is noted that the base model lacks a chat template in its tokenizer configuration, which limits its ability to process conversational inputs effectively.
- In contrast, the instruct variant has been fine-tuned on a chat template, allowing it to better understand conversation dynamics and respond appropriately.
Chat Template Functionality
- The chat template provides options for formatting messages, including system messages. By default, llama.cpp attempts to load this template if it is available in the GGUF file.
- Users can either utilize built-in chat templates or create their own using the Jinja templating language; however, familiarity with Jinja is necessary for custom templates.
Running Models with Chat Templates
Executing a Fine-Tuned Model
- To run the fine-tuned LoRA adapter (the Trump GGUF adapter) on top of the base model, users must specify a chat template during execution, as sketched below.
- An example interaction reveals that while initial responses are coherent (e.g., identifying as Donald Trump), longer conversations may lead to less relevant answers due to limited fine-tuning on multi-turn dialogues.
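A run sketch combining the base model, the converted adapter, and an explicit chat template (template name and paths are illustrative; see llama-cli --help for the supported built-in templates):

```bash
# Apply the LoRA adapter at runtime and force a Llama 3 style chat template
./build/bin/llama-cli \
  -m ./custom-models/llama-3.1-8b-Q4_K_M.gguf \
  --lora ./custom-models/trump-lora-f16.gguf \
  --chat-template llama3
```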
Distribution Considerations
- If users wish to share their fine-tuned models without requiring others to manage multiple files or configurations, merging the LoRA adapter with the base model into one GGUF file is suggested.
- However, merging alters the base model significantly; dynamic switching between different LoRA adapters becomes impossible once merged.
Merging Models and Quantization
Merging Process
- It’s emphasized that merging should only occur between unquantized versions of models to avoid introducing errors during dequantization processes.
- The merging command involves specifying the base model, the LoRA adapter, and an output filename; care must be taken with naming conventions for clarity.
Finalizing Model Size
- After merging results in a large file size (16 GB), quantization is necessary. This involves running specific commands tailored for quantizing merged models.
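A merge-then-quantize sketch using the export-lora tool from the build output (flag names per current llama.cpp builds; all paths are illustrative, and both inputs are the unquantized f16 files):

```bash
# Merge the LoRA adapter into the f16 base model
./build/bin/llama-export-lora \
  -m ./custom-models/llama-3.1-8b-f16.gguf \
  --lora ./custom-models/trump-lora-f16.gguf \
  -o ./custom-models/llama-3.1-8b-trump-f16.gguf

# Quantize the merged model back down to 4-bit
./build/bin/llama-quantize \
  ./custom-models/llama-3.1-8b-trump-f16.gguf \
  ./custom-models/llama-3.1-8b-trump-Q4_K_M.gguf \
  Q4_K_M
```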
Building Applications with llama.cpp
Application Development Insights
- Because llama.cpp compiles into a native binary, integrating it into applications requires additional steps such as writing bindings for the preferred programming language.
- Pre-existing bindings are available as packages that facilitate calling methods from compiled binaries without needing extensive coding knowledge.
How to Install and Use llama.cpp with Python
Installation Process
- The speaker discusses installing bindings for various programming languages, focusing specifically on integrating llama.cpp with a Python application.
- To use llama.cpp from Python, the speaker installs Python bindings that also build llama.cpp from source, allowing interaction through Python.
- The installation command pip install llama-cpp-python is suggested, although it is noted that there is no direct pip installer for llama.cpp itself.
Troubleshooting Installation
- The speaker suggests running the installation in the terminal instead of using pip directly and encourages checking documentation for detailed guidance.
- Emphasizes that all functionality offered by llama.cpp can be accessed through any preferred programming language via such bindings, enhancing versatility.
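The installation itself is a single command (it compiles llama.cpp during the install, so a working C++ toolchain is needed):

```bash
# Install the Python bindings for llama.cpp
pip install llama-cpp-python
```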
Loading and Using Models
- After resolving the installation issues, the speaker demonstrates how to load a model (specifically the Gemma 3 model) into their environment.
- The API for interacting with llama.cpp is described as being similar to OpenAI's API, where users provide messages formatted in a specific way (a sketch follows below).
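A minimal sketch of that interaction through the llama-cpp-python bindings, run here as a quick shell one-off (the model path is a placeholder):

```bash
python - <<'PY'
from llama_cpp import Llama

# Load a local GGUF model and ask for an OpenAI-style chat completion
llm = Llama(model_path="./custom-models/your-model-Q4_K_M.gguf")
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello! How are you?"}]
)
print(result["choices"][0]["message"]["content"])
PY
```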
Example Interaction
- An example conversation is shown where an AI responds positively to user input, illustrating how one can chat using the loaded model.
- The importance of bundling applications when developing in Python is highlighted, noting that actual models must be included in deployments.