After This, 16GB Feels Different
Understanding Model Compression and Memory Management
Introduction to Image Compression
- The speaker compares two images, one compressed (1.8 MB) and one uncompressed (14.6 MB), highlighting that while they appear identical to humans, computers can distinguish between them.
- Emphasizes the benefits of compression in maximizing storage capacity for images, videos, and other data types.
Hardware Limitations
- Introduces two Mac systems: a Mac Mini with 16 GB RAM and a more powerful machine with 128 GB RAM.
- Notes that running large models is feasible on machines with higher memory but requires careful selection on lower-memory devices.
Model Size Challenges
- Discusses the Qwen 3.5 model family, particularly its 9-billion-parameter version, which requires 19.3 GB of memory.
- Explains the significance of model size fitting within available memory for effective operation.
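The fit check described above can be sketched in a few lines, using the numbers from the video (a 19.3 GB model versus 16 GB and 128 GB machines); the 4 GB overhead margin is an assumed value, not one the speaker gives.

```python
# Sketch: does a model fit in available memory? The overhead margin
# (OS, KV cache, other processes) is an assumption for illustration.

def fits_in_memory(model_gb: float, ram_gb: float, overhead_gb: float = 4.0) -> bool:
    """Leave headroom for the OS and KV cache, not just the weights."""
    return model_gb + overhead_gb <= ram_gb

print(fits_in_memory(19.3, 16))   # 16 GB Mac Mini -> False
print(fits_in_memory(19.3, 128))  # 128 GB machine -> True
```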
Quantization Techniques
- Describes BF16 quantization as a method to reduce model size by using 16-bit floating-point numbers for weights.
- Highlights how smaller quantizations not only save disk space but also decrease memory requirements when loading models.
Trade-offs in Compression
- Discusses various quantization levels (e.g., Q8, Q6, Q4), noting that while lower bit counts reduce size, they may compromise output quality.
- Warns against going too low in quantization (below 4 bits), as it can lead to poor performance or nonsensical outputs from language models.
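The size trade-off follows directly from bits per weight. A rough back-of-the-envelope estimate (real quantized files differ somewhat, since formats mix bit widths per tensor and add metadata):

```python
# Rough weight-size estimate: parameters × bits per weight / 8 bytes,
# expressed in decimal GB. Illustrative only.

def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8

for name, bits in [("BF16", 16), ("Q8", 8), ("Q6", 6), ("Q4", 4)]:
    print(f"{name}: ~{weight_size_gb(9, bits):.1f} GB")  # 9B-parameter model
```

This is why a 9B model that needs roughly 18 GB at BF16 can drop to under 5 GB at Q4.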
Memory Management During Model Execution
Memory Usage Insights
- Shares personal experience of loading a model on the 128 GB machine, after which memory usage sat at 77 GB out of 128 GB.
- Illustrates how context length increases memory consumption significantly during processing tasks.
KV Cache Mechanism
- Explains the KV cache as the model's short-term memory: the attention keys and values computed for tokens the LLM has already processed.
- Clarifies that while quantization reduces model weight sizes, Turbo Quant specifically targets reducing KV cache size for better efficiency.
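Why the KV cache grows with context can be made concrete with the standard sizing formula: 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The layer and head counts below are assumptions for illustration, not the video's model:

```python
# KV cache size = 2 (K and V) × layers × KV heads × head dim × context
# length × bytes per element. A hypothetical 36-layer model with 8 KV
# heads of dimension 128 is assumed here.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: float) -> float:
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

print(f"FP16 cache @ 32K ctx:  {kv_cache_gb(36, 8, 128, 32768, 2.0):.1f} GB")
print(f"4-bit cache @ 32K ctx: {kv_cache_gb(36, 8, 128, 32768, 0.5):.1f} GB")
```

The cache scales linearly with context length and with bytes per element, which is exactly the lever a cache-focused quantization pulls.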
Turbo Quant: A New Approach
Benefits and Limitations
- Introduces Turbo Quant as beneficial primarily for systems with limited memory rather than those equipped with ample resources.
Community Efforts and Experiments
- Mentions ongoing community efforts to implement Turbo Quant in projects like llama.cpp, which currently lacks official integration.
Initial Testing Results
- Shares personal testing experiences indicating mixed results; some savings were observed in KV cache space but overall performance varied across different hardware setups.
Performance Analysis of Turbo Quantization Models
Model Performance and Variants
- Both prefill speed and decode speed depend heavily on the model used; older models like Qwen 2.5 behaved similarly to newer ones under the Turbo 3 and Turbo 4 variants.
- The Turbo variants differ in aggressiveness: Turbo 2 aims for roughly 4x compression, Turbo 3 about 2.5x, and Turbo 4 about 1.9x.
- A mixture-of-experts model (Qwen 3.5 35B) performed best at squashing the KV cache without substantial loss in decode speed, although it was still slower overall.
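The stated compression factors can be applied to a hypothetical baseline to see the savings; the 10 GB Q8 cache below is illustrative, while the factors come from the video:

```python
# Apply the Turbo variants' compression factors to an assumed 10 GB
# Q8 KV cache baseline.

baseline_gb = 10.0
for variant, factor in [("Turbo 2", 4.0), ("Turbo 3", 2.5), ("Turbo 4", 1.9)]:
    print(f"{variant}: {baseline_gb / factor:.2f} GB")
```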
Asymmetric vs Symmetric Approaches
- Initial tests applied Turbo quantization symmetrically to both K (keys) and V (values), which proved suboptimal; Tom suggested an asymmetric approach using Q8 for K and different Turbo levels for V.
- The asymmetric method performed better on the Mac Mini: loading the Q8 version caused crashes, while Turbo 3 ran at a comfortable context length.
Context Length Impact
- Testing various context lengths (32K, 65K, 131K tokens) revealed significant differences in memory usage between Q8 runs and turbo runs, with turbo providing more usable context.
- The KV cache size was notably smaller when using turbo quantization compared to Q8, allowing extra headroom for processing.
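The gap at the tested context lengths can be illustrated with the same cache-sizing arithmetic; the model dimensions here are assumptions, and the ~2.5x factor matches Turbo 3's stated compression:

```python
# Illustrative Q8-vs-Turbo KV cache comparison at the tested context
# lengths. Layer/head counts are assumed; the 2.5x factor is Turbo 3's.

def kv_gb(ctx_len: int, layers: int = 36, kv_heads: int = 8,
          head_dim: int = 128, bytes_per_elem: float = 1.0) -> float:
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

for ctx in (32768, 65536, 131072):
    q8, turbo = kv_gb(ctx), kv_gb(ctx) / 2.5
    print(f"{ctx // 1024}K ctx: Q8 ~{q8:.1f} GB vs Turbo ~{turbo:.1f} GB")
```

On a memory-constrained machine, that difference is what turns an out-of-memory crash into a usable long-context run.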
Quality Assessment through "Needle in a Haystack" Test
- A quality test called "needle in a haystack" assessed how well models could find hidden secrets within text at varying context lengths from 1K to 32K tokens.
- Initial symmetric approaches yielded poor results with only one out of three secrets found using Turbo variants; however, switching to an asymmetric approach improved performance dramatically.
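A needle-in-a-haystack harness like the one described can be sketched in a few lines: hide a secret at a chosen depth in filler text, then check whether the model's answer contains it. Everything here (the secret string, the filler, the `ask_model` placeholder) is an assumption for illustration, not the video's actual test:

```python
# Minimal needle-in-a-haystack harness. `ask_model` is a placeholder
# for whatever inference call your setup provides.

def build_haystack(secret: str, n_filler_words: int, depth: float) -> str:
    """Insert the secret at a fractional depth (0.0 = start, 1.0 = end)."""
    words = ("the quick brown fox jumps over the lazy dog " *
             (n_filler_words // 9)).split()
    words.insert(int(len(words) * depth), f"The secret code is {secret}.")
    return " ".join(words)

def found_secret(answer: str, secret: str) -> bool:
    return secret in answer

haystack = build_haystack("azure-falcon-42", 900, depth=0.5)
prompt = f"{haystack}\n\nWhat is the secret code?"
# answer = ask_model(prompt)                      # placeholder inference call
# print(found_secret(answer, "azure-falcon-42"))
```

Sweeping `depth` and the haystack length reproduces the 1K-to-32K grid of results the video reports.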
Speed Comparisons Across Models
- On short context lengths, the M4 showed limited improvement with turbo quantization; however, the M5 Max demonstrated significant gains in decode speed across various depths.
- For instance, on Q8 the decode speed dropped from about 54 tokens per second at depth 0 to around 37 tokens per second at depth 8K, but remained stable under Turbo Quant.
Conclusion on Model Behavior
- Each model behaves differently under Turbo quantization; while some perform poorly, newer models like Qwen 3.5 show promising responsiveness to this technique on Apple hardware.