DeepSeek R1 Fully Tested - Insane Performance
Model Testing of DeepSeek R1
Introduction to Model Testing
- The video introduces testing of the new DeepSeek R1 model against a comprehensive LLM rubric, powered by Vultr's bare-metal GPUs in their cloud.
- The presenter connects to Vultr's cloud and uses Open WebUI, an open-source frontend for LLMs.
Initial Observations
- The model exhibits a human-like internal monologue, often pausing with phrases like "Okay" and "wait a second," indicating its thought process.
- When asked about the letter 'R' in "strawberry," it correctly identifies three occurrences at specific positions.
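The check the model performs can be reproduced mechanically; a minimal sketch of counting the letter and reporting its 1-based positions:

```python
word = "strawberry"

# Collect the 1-based positions of every 'r' in the word.
positions = [i + 1 for i, ch in enumerate(word) if ch == "r"]
print(len(positions), positions)  # 3 [3, 8, 9]
```

This confirms the model's answer: three occurrences, at positions 3, 8, and 9.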
Coding Test: Snake Game
- The first coding challenge is to write the game Snake in Python. At 671 billion parameters, the model is far too large to run on consumer-grade GPUs.
- Instead of directly outputting code, the model outlines its approach first, demonstrating planning before execution.
- After generating code for Snake, it provides detailed instructions on how to play and features of the game.
Successful Execution
- The presenter runs the generated code locally and successfully plays a working version of Snake on the first try, noting flawless controls and functionality.
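The video does not show the generated code itself, but the core update loop of a grid-based Snake can be sketched as follows (an illustrative, headless version; the function name and grid size are assumptions, not the model's output):

```python
from collections import deque

def step(snake, direction, food, grid=10):
    """Advance the snake one cell; return (snake, ate, alive).

    snake: deque of (x, y) cells, head at the left end.
    direction: (dx, dy) unit vector for the next move.
    """
    head = (snake[0][0] + direction[0], snake[0][1] + direction[1])
    # Death conditions: hitting a wall or the snake's own body.
    # (This simple check also forbids moving into the cell the tail
    # is about to vacate, which stricter Snake rules would allow.)
    if not (0 <= head[0] < grid and 0 <= head[1] < grid) or head in snake:
        return snake, False, False
    snake.appendleft(head)
    ate = head == food
    if not ate:
        snake.pop()  # tail advances unless food was eaten (snake grows)
    return snake, ate, True
```

A real implementation would wrap this in a render loop (e.g. with pygame) and read keyboard input for `direction`.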
Coding Test: Tetris Game
- For a more complex task, the model is challenged to create Tetris in Python. It begins by considering essential components such as graphics libraries and shapes.
- The model's internal reflections during problem-solving are highlighted as impressive; it thinks through potential issues before coding.
Depth of Thought Process
- Compared to previous reasoning models (OpenAI's o1 and o3), DeepSeek R1 demonstrates significantly more human-like reasoning during its thought process.
- It engages deeply with edge cases like collision detection and rotation validity within grid constraints while developing Tetris logic.
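The edge cases the model reasons about (rotation and collision within grid bounds) reduce to a validity check like the following sketch (illustrative only; the model's actual code is not shown in the summary):

```python
def rotate_cw(piece):
    """Rotate a piece 90 degrees clockwise; piece is a 2D list of 0/1 cells."""
    return [list(row) for row in zip(*piece[::-1])]

def is_valid(board, piece, top, left):
    """Collision detection: every filled cell of the piece must land
    inside the board on an empty cell."""
    rows, cols = len(board), len(board[0])
    for r, row in enumerate(piece):
        for c, cell in enumerate(row):
            if not cell:
                continue
            br, bc = top + r, left + c
            if not (0 <= br < rows and 0 <= bc < cols) or board[br][bc]:
                return False
    return True
```

A rotation is accepted only if `is_valid` still holds for the rotated piece at its current position, which is exactly the constraint the model reasons through.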
Conclusion on Thinking Models
Tetris Game Development and Logic Challenges
Tetris Game Functionality
- The speaker discusses the development of a Tetris game, highlighting the various shapes involved and the coding process that resulted in 179 lines of code.
- Upon testing, the game successfully generates new pieces and clears completed lines, although it lacks scoring and next piece previews.
- The hardware used includes a powerful setup with 128 CPU cores, 256 threads, and eight AMD Instinct GPUs with substantial VRAM to support the model's performance.
Envelope Size Restrictions Problem
- The speaker introduces a logic problem regarding postal size restrictions for envelopes, emphasizing the need to convert measurements from millimeters to centimeters for accurate comparison.
- There is uncertainty about whether both dimensions must fit within the specified ranges or whether the envelope can be rotated to fit; this highlights common challenges models face when interpreting constraints.
- After checking dimensions against requirements, it is confirmed that the envelope meets acceptable size criteria.
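The check described above can be sketched as a small helper; the dimensions and limits below are hypothetical placeholders, since the summary does not give the actual numbers from the video:

```python
def fits(env_mm, min_cm, max_cm):
    """Check an envelope against postal size limits, allowing rotation.

    env_mm: (width, height) in millimetres.
    min_cm, max_cm: (width, height) limits in centimetres.
    """
    w, h = (d / 10 for d in env_mm)  # convert mm -> cm before comparing

    def ok(a, b):
        return min_cm[0] <= a <= max_cm[0] and min_cm[1] <= b <= max_cm[1]

    return ok(w, h) or ok(h, w)  # try both orientations

# Hypothetical example: a 220 mm x 110 mm envelope against 14-23.5 cm
# by 9-12 cm limits.
print(fits((220, 110), (14, 9), (23.5, 12)))  # True
```

The unit conversion step (mm to cm) and the orientation question are exactly the two points the model had to reason through.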
Self-referential Word Count Challenge
- A trick question asks how many words are in a response. The model attempts to count words while acknowledging its own limitations in providing an accurate count until completion.
- The model outputs "there are seven words" correctly after processing self-referential logic, demonstrating its ability to handle complex reasoning tasks effectively.
Killer Riddle Analysis
- A riddle presents a scenario involving three killers where one enters and kills another. The speaker analyzes how this affects the total number of killers present.
- Initial calculations consider whether the entrant becomes a killer based on their actions. This leads to nuanced thinking about definitions of 'killer' regardless of life status.
- The discussion reveals deep reasoning as it navigates ambiguities inherent in language and logic puzzles while reflecting human-like thought processes.
Conclusion on Model Performance
- Overall, there is satisfaction with how well the model thinks through problems step-by-step without overcomplicating responses or losing sight of core questions.
Understanding Censorship and Logical Reasoning in AI Models
Analyzing the Number of Killers
- The discussion begins with a logical deduction about the number of killers, concluding that there are three living individuals identified as killers. A dead person could be considered a killer, but this is not included in the final count.
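The deduction can be made explicit with simple set arithmetic (names are placeholders for the riddle's anonymous killers):

```python
killers_in_room = {"A", "B", "C"}     # the original three killers
entrant, victim = "D", "A"

killers_in_room.add(entrant)          # killing someone makes the entrant a killer
living = killers_in_room - {victim}   # the victim is dead, but was still a killer

print(len(living))          # 3 living killers remain in the room
print(len(killers_in_room)) # 4 if the dead killer is still counted
```

The two print statements correspond to the two readings the model weighs: counting only living killers versus counting everyone who has ever killed.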
The Marble Riddle
- A riddle involving a marble and an inverted glass cup is presented. The correct answer reveals that when the glass is turned upside down on the table, the marble remains trapped beneath it.
Comparing Decimal Numbers
- A straightforward comparison between 9.11 and 9.9 illustrates how to determine which number is larger by rewriting 9.9 as 9.90 for clarity in decimal comparison.
- The conclusion drawn from this analysis confirms that 9.9 is indeed greater than 9.11.
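The padding trick described above is straightforward to verify in code, here using Python's `decimal` module to avoid any floating-point distraction:

```python
from decimal import Decimal

# Rewriting 9.9 as 9.90 aligns the decimal places: 90 hundredths vs 11 hundredths.
a, b = Decimal("9.90"), Decimal("9.11")
print(a > b)  # True
```

This is the comparison some models famously get wrong by treating "11" as larger than "9" after the decimal point.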
Exploring Censorship in AI Models
- The speaker tests a Chinese AI model's censorship capabilities by attempting to inquire about Tiananmen Square, revealing that it refuses to provide an answer due to built-in restrictions.
Comparison with US Models
- In contrast, US models also exhibit censorship; however, they engage in moral reasoning before refusing certain inquiries (e.g., asking how to rob a bank).
Hardcoded Responses on Taiwan's Status
- When asked about Taiwan's independence status, the model provides a definitive response aligned with China's stance without any reasoning process involved.
Request for Creative Output
- Finally, when tasked with generating ten sentences ending with "apple," the model performs flawlessly, showcasing its capability for creative output despite previous limitations discussed.
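A check like the one below could verify the constraint automatically (the sample sentences are illustrative, not the model's output):

```python
import string

def ends_with_apple(sentence):
    # Strip trailing punctuation before comparing the final word.
    last = sentence.split()[-1].strip(string.punctuation).lower()
    return last == "apple"

sentences = [
    "For breakfast I ate an apple.",
    "Nothing beats a crisp autumn apple.",
]
print(all(ends_with_apple(s) for s in sentences))  # True
```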
Conclusion