I Thought DGX Spark Was Slower… Until I Changed ONE Thing

What If the DGX Spark Is Faster Than You Think?

Introduction to Benchmarking

  • The speaker challenges the notion that the DGX Spark and similar GB10-based machines like the Dell Pro Max GB10 and MSI EdgeXpert are slower than they seem.
  • Four machines are tested: a Mac Studio M3 Ultra, an AMD Strix Halo box, the DGX Spark, and a custom-built machine with an AMD Radeon RX 9060 XT.

Performance Metrics

  • Performance results show significant differences in token generation speed:
  • Mac Studio: 99 tokens/second (winner)
  • DGX Spark: 67 tokens/second
  • AMD Strix Halo: 62 tokens/second
  • Custom build (AMD Radeon RX 9060 XT): 18 tokens/second

Importance of Concurrency in Testing

  • The initial benchmarks simulate single-user chat, which may not reflect real-world usage where multiple requests arrive simultaneously.
  • Concurrency is crucial for maintaining throughput under load: batching multiple requests lets the GPU do more useful work per pass (a minimal way to drive such a test is sketched below).
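
To make the load test concrete, here is a minimal sketch of a concurrent-throughput probe against any OpenAI-compatible server (llama-server, vLLM, and LM Studio all expose this API). This is not the speaker's actual harness: the base URL, model name, and prompt are placeholder assumptions.

```python
# Minimal concurrency probe for an OpenAI-compatible completions endpoint.
# Assumptions: a server is running locally, BASE_URL/MODEL match your setup,
# and the server reports token usage in its response (llama-server and vLLM do).
import asyncio
import time

import httpx

BASE_URL = "http://localhost:8080/v1"  # placeholder: your server's address
MODEL = "my-model"                     # placeholder: the loaded model's name

async def one_request(client: httpx.AsyncClient, max_tokens: int) -> int:
    """Send one completion request; return the number of tokens generated."""
    resp = await client.post(
        f"{BASE_URL}/completions",
        json={
            "model": MODEL,
            "prompt": "Explain speculative decoding in one paragraph.",
            "max_tokens": max_tokens,
        },
        timeout=300.0,
    )
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

async def run(concurrency: int, max_tokens: int = 256) -> float:
    """Fire `concurrency` requests at once; return aggregate tokens/sec."""
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        counts = await asyncio.gather(
            *(one_request(client, max_tokens) for _ in range(concurrency))
        )
        elapsed = time.perf_counter() - start
    throughput = sum(counts) / elapsed
    print(f"{concurrency:>4} concurrent: {throughput:8.1f} tok/s aggregate")
    return throughput

if __name__ == "__main__":
    asyncio.run(run(concurrency=2))
```

Because the probe only assumes the OpenAI API shape, the same script works unchanged against llama.cpp, vLLM, or an MLX server, which is what makes apples-to-apples concurrency comparisons possible.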

Real-World Application Scenarios

  • Examples where concurrency matters include:
  • Multiple simultaneous requests from users or agents.
  • Keeping short requests responsive even during heavy processing loads.

Advanced Benchmarking Techniques

  • The speaker discusses using different models for different tasks (e.g., GPT for research, Claude for coding).
  • ChatLLM Teams is introduced as a single dashboard for managing multiple AI models.

Exploring llama.cpp and vLLM

Insights on Model Performance

  • Discussion of running sweeps with llama.cpp to find the optimal settings for concurrency and max tokens (a sketch of such a sweep appears below).
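
A sweep of this kind can simply iterate a grid over the two knobs. This sketch reuses the `run()` coroutine from the earlier probe (any coroutine returning aggregate tokens/sec works); the specific levels are illustrative, not the speaker's exact grid.

```python
# Grid sweep over concurrency and max-token settings, assuming run()
# from the earlier probe is in scope. Warm-up runs and repeated trials
# are omitted for brevity.
import asyncio
import itertools

CONCURRENCY_LEVELS = [1, 2, 4, 8, 16, 32, 64]
MAX_TOKENS_LEVELS = [128, 256, 512, 1024]

async def sweep() -> None:
    for conc, max_tok in itertools.product(CONCURRENCY_LEVELS, MAX_TOKENS_LEVELS):
        await run(conc, max_tok)  # one measurement per grid cell

asyncio.run(sweep())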

Impact of Concurrent Requests

  • Results from testing two concurrent requests show improved performance across all machines:
  • Mac Studio: increased from 97 to approximately 178 tokens/second aggregate.

Further Testing Recommendations

  • Beyond four concurrent connections, additional concurrency does not significantly improve performance.
  • Future videos will explore alternative ways to run llama.cpp more effectively.

Machine Learning Frameworks on Different Hardware

Overview of MLX and Its Performance

  • The speaker discusses the performance of the various frameworks, highlighting that MLX works exceptionally well on Apple Silicon but is not available on Nvidia hardware.
  • A comparison between MLX and llama.cpp shows MLX ahead in nearly all of the speaker's tests.

Using LM Studio for Compatibility

  • For users wanting to run MLX models on Apple Silicon, LM Studio is recommended; it supports both MLX models and GGUF models (the llama.cpp format). A minimal mlx-lm example follows below.
  • The speaker is curious how vLLM and llama.cpp compare on specific hardware (the Strix Halo box).
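
For reference, the same MLX models LM Studio serves can also be driven directly from Python with the mlx-lm package (Apple Silicon only). A minimal sketch; the repo name is one example 4-bit conversion from the mlx-community hub, not necessarily the model used in the video.

```python
# Minimal mlx-lm usage on Apple Silicon. The repo name below is an example
# mlx-community conversion, not the video's model.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Explain what quantization does to an LLM, briefly.",
    max_tokens=200,
)
print(text)
```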

Performance Metrics and Concurrency Testing

  • Initial tests show that llama.cpp performs better than vLLM for certain models and quantizations, surprising the speaker.
  • As concurrency increases (up to 64), vLLM shows a significant throughput advantage over llama.cpp, reaching 265 tokens per second at that level.

Understanding Concurrency in Requests

  • Concurrency here means simulating multiple simultaneous requests sent to a server running one of the engines (llama.cpp, vLLM, or MLX).
  • With one request at a time, llama.cpp remains faster than vLLM; however, matching quantizations across frameworks is crucial for a fair comparison.

Exploring Quantization Formats

  • Various quantization formats are discussed: Q8 (8-bit integer), FP8 (8-bit floating point), and their implications for model quality and speed (a toy example of integer quantization is sketched below).
  • A reference is made to Julia Turc's channel for deeper coverage of floating-point versus integer quantization.
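
As a toy illustration of the integer side of that distinction, here is symmetric 8-bit quantization in a few lines of numpy: one scale per tensor, values rounded to integer codes. FP8 instead keeps a small exponent inside each value, trading precision for dynamic range. This is a sketch of the general idea, not llama.cpp's actual Q8 block format.

```python
# Toy symmetric int8 quantization: one scale for the whole tensor.
# Real Q8-style formats (e.g., llama.cpp's Q8_0) use per-block scales,
# but the round-to-integer-codes idea is the same.
import numpy as np

w = np.random.randn(8).astype(np.float32)   # stand-in for a weight tensor
scale = np.abs(w).max() / 127.0             # map the largest value to 127
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale        # dequantize

print("original  :", w)
print("int8 codes:", q)
print("restored  :", w_hat)
print("max error :", np.abs(w - w_hat).max())
```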

Increasing Throughput with Higher Concurrency Levels

  • Doubling the concurrency increases aggregate throughput: processing two requests simultaneously is markedly more efficient than one at a time.
  • At the highest tested concurrency (up to 1,024), the DGX Spark under vLLM reaches 1,125 tokens per second.

Comparative Analysis Across Different Hardware

  • The AMD Radeon RX 9060 XT is competitive, with a throughput of 918 tokens per second under vLLM.
  • While MLX offers nearly double the performance of llama.cpp on some machines, it still trails the DGX Spark running vLLM.

Max Tokens Impact on Performance

  • Max token settings affect the devices differently; the Mac Studio excels at higher max-token values compared to the others.
  • Notably, the Mac Studio struggles with llama.cpp beyond low concurrency levels but holds up well with MLX.

New Developments in Nvidia Technology

  • The discussion concludes with the new FP4 format in Nvidia's Blackwell chips, aimed at accurate low-precision inference (the format itself is sketched below).
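
To show how little a 4-bit float can encode, this sketch enumerates every value representable in E2M1, the element layout behind NVIDIA's FP4 on Blackwell (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit). The higher-precision per-block scale factor that NVFP4 layers on top is not modeled here.

```python
# Enumerate all 16 codes of the FP4 E2M1 element format:
# 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.
def e2m1_value(code: int) -> float:
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                          # subnormal range: 0.0 or 0.5
        return sign * man * 0.5
    return sign * (1.0 + man / 2.0) * 2.0 ** (exp - 1)

# 16 codes, but only 15 distinct values once +/-0 collapse:
print(sorted({e2m1_value(c) for c in range(16)}))
```

With only eight magnitudes per sign available, accuracy depends heavily on the per-block scaling, which is the part of the format Blackwell accelerates in hardware.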

FP4 Format and Performance Insights

Overview of FP4 Format on Blackwell Chips

  • The speaker notes the effort behind the video, having spent a week writing the benchmarking code.
  • The FP4 format on Blackwell chips is expected to quadruple throughput; however, it did not meet this expectation in the current scenario.

Performance Metrics and Comparisons

  • The DGX Spark achieves 1,573 tokens per second with FP4 quantization, significantly outperforming the other setups.
  • llama.cpp with Q4_K_M quantization yields 76 tokens per second on the Spark, while the Mac Studio M3 Ultra reaches 234 tokens per second, showing how much results vary across hardware.
  • These are GGUF models running under llama.cpp; despite being lower than expected, the numbers still exceed the single-user chat figures.

Key Takeaways for Hardware Selection

  • When selecting hardware or evaluating a current setup, consider the intended use case rather than relying solely on single-user benchmarks, which can misrepresent real-world usage.
  • Emphasizes testing under load and with multiple inference engines to get a comprehensive picture of performance.
  • The speaker acknowledges having relied on single-user benchmarks in the past.

Video description

I ran the “obvious” benchmark on DGX Spark and thought I had the story… then one change completely rewrote the results.

Try out ChatLLM - http://chatllm.abacus.ai/ltf and Abacus AI DeepAgent - http://deepagent.abacus.ai/ltf

🛒 Gear Links 🛒
🪛🪛 Highly rated precision driver kit: https://amzn.to/4fkMVfg
💻☕ Favorite 15" display with magnet: https://amzn.to/3zD1DhQ
🎧⚡ Great 40Gbps T4 enclosure: https://amzn.to/3JNwBGW
🛠️🚀 My nvme ssd: https://amzn.to/3YLEySo
📦🎮 My gear: https://www.amazon.com/shop/alexziskind

🎥 Related Videos 🎥
🧳🧰 Mini PC portable setup - https://youtu.be/4RYmsrarOSw
🍎💻 Dev setup on Mac - https://youtu.be/KiKUN4i1SeU
💸🧠 Cheap mini runs a 70B LLM 🤯 - https://youtu.be/xyKEQjUzfAk
🧪🔥 RAM torture test on Mac - https://youtu.be/l3zIwPgan7M
🍏⚡ FREE Local LLMs on Apple Silicon | FAST! - https://youtu.be/bp2eev21Qfo
🧠📉 REALITY vs Apple’s Memory Claims | vs RTX4090m - https://youtu.be/fdvzQAWXU7A
🧬🐍 Set up Conda - https://youtu.be/2Acht_5_HTo
⚡💥 Thunderbolt 5 BREAKS Apple’s Upcharge - https://youtu.be/nHqrvxcRc7o
🧠🚀 INSANE Machine Learning on Neural Engine - https://youtu.be/Y2FOUg_jo7k
🧱🖥️ Mac Mini Cluster - https://youtu.be/GBR6pHZ68Ho

🛠️ Developer productivity Playlist - https://www.youtube.com/playlist?list=PLPwbI_iIX3aQCRdFGM7j4TY_7STfv2aXX
🔗 AI for Coding Playlist 📚 - https://www.youtube.com/playlist?list=PLPwbI_iIX3aSlUmRtYPfbQHt4n0YaX0qw

❤️ SUBSCRIBE TO MY YOUTUBE CHANNEL 📺
Click here to subscribe: https://www.youtube.com/@AZisk?sub_confirmation=1

Join this channel to get access to perks: https://www.youtube.com/channel/UCajiMK_CY9icRhLepS8_3ug/join

📱 ALEX on X: https://x.com/digitalix
📱 Julia’s channel: https://youtube.com/@juliaturc1?si=cPyqhT0P1apvpcCc

#macstudio #gdxspark #strixhalo