I Thought DGX Spark Was Slower… Until I Changed ONE Thing

What If the DGX Spark Is Faster Than You Think?

Introduction to Benchmarking

  • The speaker challenges the notion that the DGX Spark and similar GB10-based machines like the Dell Pro Max GB10 and MSI EdgeXpert are slower than they seem.
  • Four machines are tested: a Mac Studio M3 Ultra, an AMD Strix Halo box, the DGX Spark, and a custom-built machine with an AMD Radeon RX 9060 XT.

Performance Metrics

  • Performance results show significant differences in token generation speed:
  • Mac Studio: 99 tokens/second (winner)
  • DGX Spark: 67 tokens/second
  • AMD Strix Halo: 62 tokens/second
  • Custom build (AMD Radeon RX 9060 XT): 18 tokens/second

Importance of Concurrency in Testing

  • The initial benchmarks simulate single-user chat, which may not reflect real-world usage where multiple requests arrive simultaneously.
  • Concurrency is crucial for maintaining throughput under load: batching multiple requests lets the GPU do more useful work per pass (a minimal way to drive such a test is sketched below).
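
To make the load test concrete, here is a minimal sketch of a concurrent-throughput probe against any OpenAI-compatible server (llama-server, vLLM, and LM Studio all expose this API). This is not the speaker's actual harness: the base URL, model name, and prompt are placeholder assumptions.

```python
# Minimal concurrency probe for an OpenAI-compatible completions endpoint.
# Assumptions: a server is running locally, BASE_URL/MODEL match your setup,
# and the server reports token usage in its response (llama-server and vLLM do).
import asyncio
import time

import httpx

BASE_URL = "http://localhost:8080/v1"  # placeholder: your server's address
MODEL = "my-model"                     # placeholder: the loaded model's name

async def one_request(client: httpx.AsyncClient, max_tokens: int) -> int:
    """Send one completion request; return the number of tokens generated."""
    resp = await client.post(
        f"{BASE_URL}/completions",
        json={
            "model": MODEL,
            "prompt": "Explain speculative decoding in one paragraph.",
            "max_tokens": max_tokens,
        },
        timeout=300.0,
    )
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

async def run(concurrency: int, max_tokens: int = 256) -> float:
    """Fire `concurrency` requests at once; return aggregate tokens/sec."""
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        counts = await asyncio.gather(
            *(one_request(client, max_tokens) for _ in range(concurrency))
        )
        elapsed = time.perf_counter() - start
    throughput = sum(counts) / elapsed
    print(f"{concurrency:>4} concurrent: {throughput:8.1f} tok/s aggregate")
    return throughput

if __name__ == "__main__":
    asyncio.run(run(concurrency=2))
```

Because the probe only assumes the OpenAI API shape, the same script works unchanged against llama.cpp, vLLM, or an MLX server, which is what makes apples-to-apples concurrency comparisons possible.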

Real-World Application Scenarios

  • Examples where concurrency matters include:
  • Multiple simultaneous requests from users or agents.
  • Keeping short requests responsive even during heavy processing loads.

Advanced Benchmarking Techniques

  • The speaker discusses using different models for different tasks (e.g., GPT for research, Claude for coding).
  • ChatLLM Teams is introduced as a single dashboard for managing multiple AI models.

Exploring llama.cpp and vLLM

Insights on Model Performance

  • Discussion of running sweeps with llama.cpp to find the optimal settings for concurrency and max tokens (a sketch of such a sweep appears below).
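
A sweep of this kind can simply iterate a grid over the two knobs. This sketch reuses the `run()` coroutine from the earlier probe (any coroutine returning aggregate tokens/sec works); the specific levels are illustrative, not the speaker's exact grid.

```python
# Grid sweep over concurrency and max-token settings, assuming run()
# from the earlier probe is in scope. Warm-up runs and repeated trials
# are omitted for brevity.
import asyncio
import itertools

CONCURRENCY_LEVELS = [1, 2, 4, 8, 16, 32, 64]
MAX_TOKENS_LEVELS = [128, 256, 512, 1024]

async def sweep() -> None:
    for conc, max_tok in itertools.product(CONCURRENCY_LEVELS, MAX_TOKENS_LEVELS):
        await run(conc, max_tok)  # one measurement per grid cell

asyncio.run(sweep())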

Impact of Concurrent Requests

  • Results from testing two concurrent requests show improved performance across all machines:
  • Mac Studio: increased from 97 to approximately 178 tokens/second aggregate.

Further Testing Recommendations

  • Beyond four concurrent connections, additional concurrency does not significantly improve performance.
  • Future videos will explore alternative ways to run llama.cpp more effectively.

Machine Learning Frameworks on Different Hardware

Overview of MLX and Its Performance

  • The speaker discusses the performance of the various frameworks, highlighting that MLX works exceptionally well on Apple Silicon but is not available on Nvidia hardware.
  • A comparison between MLX and llama.cpp shows MLX ahead in nearly all of the speaker's tests.

Using LM Studio for Compatibility

  • For users wanting to run MLX models on Apple Silicon, LM Studio is recommended; it supports both MLX models and GGUF models (the llama.cpp format). A minimal mlx-lm example follows below.
  • The speaker is curious how vLLM and llama.cpp compare on specific hardware (the Strix Halo box).
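
For reference, the same MLX models LM Studio serves can also be driven directly from Python with the mlx-lm package (Apple Silicon only). A minimal sketch; the repo name is one example 4-bit conversion from the mlx-community hub, not necessarily the model used in the video.

```python
# Minimal mlx-lm usage on Apple Silicon. The repo name below is an example
# mlx-community conversion, not the video's model.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Explain what quantization does to an LLM, briefly.",
    max_tokens=200,
)
print(text)
```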

Performance Metrics and Concurrency Testing

  • Initial tests show that llama.cpp performs better than vLLM for certain models and quantizations, surprising the speaker.
  • As concurrency increases (up to 64), vLLM shows a significant throughput advantage over llama.cpp, reaching 265 tokens per second at that level.

Understanding Concurrency in Requests

  • Concurrency here means simulating multiple simultaneous requests sent to a server running one of the engines (llama.cpp, vLLM, or MLX).
  • With one request at a time, llama.cpp remains faster than vLLM; however, matching quantizations across frameworks is crucial for a fair comparison.

Exploring Quantization Formats

  • Various quantization formats are discussed: Q8 (8-bit integer), FP8 (8-bit floating point), and their implications for model quality and speed (a toy example of integer quantization is sketched below).
  • A reference is made to Julia Turc's channel for deeper coverage of floating-point versus integer quantization.
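
As a toy illustration of the integer side of that distinction, here is symmetric 8-bit quantization in a few lines of numpy: one scale per tensor, values rounded to integer codes. FP8 instead keeps a small exponent inside each value, trading precision for dynamic range. This is a sketch of the general idea, not llama.cpp's actual Q8 block format.

```python
# Toy symmetric int8 quantization: one scale for the whole tensor.
# Real Q8-style formats (e.g., llama.cpp's Q8_0) use per-block scales,
# but the round-to-integer-codes idea is the same.
import numpy as np

w = np.random.randn(8).astype(np.float32)   # stand-in for a weight tensor
scale = np.abs(w).max() / 127.0             # map the largest value to 127
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale        # dequantize

print("original  :", w)
print("int8 codes:", q)
print("restored  :", w_hat)
print("max error :", np.abs(w - w_hat).max())
```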

Increasing Throughput with Higher Concurrency Levels

  • Doubling the concurrency increases aggregate throughput: processing two requests simultaneously is markedly more efficient than one at a time.
  • At the highest tested concurrency (up to 1,024), the DGX Spark under vLLM reaches 1,125 tokens per second.

Comparative Analysis Across Different Hardware

  • The AMD Radeon RX 9060 XT is competitive, with a throughput of 918 tokens per second under vLLM.
  • While MLX offers nearly double the performance of llama.cpp on some machines, it still trails the DGX Spark running vLLM.

Max Tokens Impact on Performance

  • Max token settings affect the devices differently; the Mac Studio excels at higher max-token values compared to the others.
  • Notably, the Mac Studio struggles with llama.cpp beyond low concurrency levels but holds up well with MLX.

New Developments in Nvidia Technology

  • The discussion concludes with the new FP4 format in Nvidia's Blackwell chips, aimed at accurate low-precision inference (the format itself is sketched below).
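
To show how little a 4-bit float can encode, this sketch enumerates every value representable in E2M1, the element layout behind NVIDIA's FP4 on Blackwell (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit). The higher-precision per-block scale factor that NVFP4 layers on top is not modeled here.

```python
# Enumerate all 16 codes of the FP4 E2M1 element format:
# 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.
def e2m1_value(code: int) -> float:
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                          # subnormal range: 0.0 or 0.5
        return sign * man * 0.5
    return sign * (1.0 + man / 2.0) * 2.0 ** (exp - 1)

# 16 codes, but only 15 distinct values once +/-0 collapse:
print(sorted({e2m1_value(c) for c in range(16)}))
```

With only eight magnitudes per sign available, accuracy depends heavily on the per-block scaling, which is the part of the format Blackwell accelerates in hardware.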

FP4 Format and Performance Insights

Overview of FP4 Format on Blackwell Chips

  • The speaker notes the effort behind the video, having spent a week writing the benchmarking code.
  • The FP4 format on Blackwell chips is expected to quadruple throughput; however, it did not meet this expectation in the current scenario.

Performance Metrics and Comparisons

  • The DGX Spark achieves 1,573 tokens per second with FP4 quantization, significantly outperforming the other setups.
  • llama.cpp with Q4_K_M quantization yields 76 tokens per second on the Spark, while the Mac Studio M3 Ultra reaches 234 tokens per second, showing how much results vary across hardware.
  • These are GGUF models running under llama.cpp; despite being lower than expected, the numbers still exceed the single-user chat figures.

Key Takeaways for Hardware Selection

  • When selecting hardware or evaluating a current setup, consider the intended use case rather than relying solely on single-user benchmarks, which can misrepresent real-world usage.
  • Emphasizes testing under load and with multiple inference engines to get a comprehensive picture of performance.
  • The speaker acknowledges having relied on single-user benchmarks in the past.

Video description

I ran the “obvious” benchmark on DGX Spark and thought I had the story… then one change completely rewrote the results.

Try out ChatLLM - http://chatllm.abacus.ai/ltf and Abacus AI DeepAgent - http://deepagent.abacus.ai/ltf

🛒 Gear Links 🛒
🪛🪛 Highly rated precision driver kit: https://amzn.to/4fkMVfg
💻☕ Favorite 15" display with magnet: https://amzn.to/3zD1DhQ
🎧⚡ Great 40Gbps T4 enclosure: https://amzn.to/3JNwBGW
🛠️🚀 My nvme ssd: https://amzn.to/3YLEySo
📦🎮 My gear: https://www.amazon.com/shop/alexziskind

🎥 Related Videos 🎥
🧳🧰 Mini PC portable setup - https://youtu.be/4RYmsrarOSw
🍎💻 Dev setup on Mac - https://youtu.be/KiKUN4i1SeU
💸🧠 Cheap mini runs a 70B LLM 🤯 - https://youtu.be/xyKEQjUzfAk
🧪🔥 RAM torture test on Mac - https://youtu.be/l3zIwPgan7M
🍏⚡ FREE Local LLMs on Apple Silicon | FAST! - https://youtu.be/bp2eev21Qfo
🧠📉 REALITY vs Apple’s Memory Claims | vs RTX4090m - https://youtu.be/fdvzQAWXU7A
🧬🐍 Set up Conda - https://youtu.be/2Acht_5_HTo
⚡💥 Thunderbolt 5 BREAKS Apple’s Upcharge - https://youtu.be/nHqrvxcRc7o
🧠🚀 INSANE Machine Learning on Neural Engine - https://youtu.be/Y2FOUg_jo7k
🧱🖥️ Mac Mini Cluster - https://youtu.be/GBR6pHZ68Ho

🛠️ Developer productivity Playlist - https://www.youtube.com/playlist?list=PLPwbI_iIX3aQCRdFGM7j4TY_7STfv2aXX
🔗 AI for Coding Playlist 📚 - https://www.youtube.com/playlist?list=PLPwbI_iIX3aSlUmRtYPfbQHt4n0YaX0qw

❤️ SUBSCRIBE TO MY YOUTUBE CHANNEL 📺
Click here to subscribe: https://www.youtube.com/@AZisk?sub_confirmation=1

Join this channel to get access to perks: https://www.youtube.com/channel/UCajiMK_CY9icRhLepS8_3ug/join

📱 ALEX on X: https://x.com/digitalix
📱 Julia’s channel: https://youtube.com/@juliaturc1?si=cPyqhT0P1apvpcCc

#macstudio #gdxspark #strixhalo