Gemini 3 isn't the answer. How to Solve 1 Million Steps with 0 Errors
Solving a Million-Step LLM Task with Zero Errors
Introduction to the Paper
- A paper titled "Solving a Million-Step LLM Task with Zero Errors" was published in November 2025 by Cognizant AI Lab, addressing a major failure mode of today's AI agents.
- Current AI agents can perform tasks like writing code or planning trips but struggle with long, complex tasks, often leading to failures such as drifting and hallucination.
Rethinking Model Limitations
- The authors argue that the problem is not merely about model capabilities or context window limitations; rather, it’s an engineering architecture issue.
- They achieved this without relying on frontier models or large context windows, using their framework MAKER, an instance of what the paper calls massively decomposed agentic processes (MDAPs).
Understanding Probability and Task Complexity
- The paper highlights how per-step errors compound: a model that is 99% accurate on each step completes a 100-step task without error only about 37% of the time (0.99^100 ≈ 0.366), assuming independent steps.
- Real-world tasks often require thousands of steps, so per-step accuracy must be far higher than 99% for the whole task to succeed.
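The compounding effect is easy to check with a few lines of arithmetic. This is a minimal sketch that assumes every step succeeds independently with the same probability, which is a simplification; the function name is illustrative and does not come from the paper.

```python
def chain_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    return per_step_accuracy ** steps


# Show how quickly a 99%-accurate step degrades over longer chains.
for steps in (10, 100, 1_000):
    p = chain_success_probability(0.99, steps)
    print(f"{steps:>5} steps at 99% per-step accuracy: {p:.4%} chance of zero errors")
```

Under this assumption, a million-step task is effectively unreachable for any single-pass agent unless per-step errors are caught and corrected along the way.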
Benchmarking Against Tower of Hanoi
- Researchers used the Tower of Hanoi puzzle as a benchmark, which requires 1,048,575 moves (2^20 - 1) for 20 discs. Standard models failed due to context drift.
- Context drift occurs when models become confused by their own past outputs as conversation history grows.
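The move count for the benchmark follows from the standard result that n discs need 2^n - 1 moves. The sketch below is a textbook recursive solver used only to confirm that formula on a small instance; it is not the paper's agent, and the function names are illustrative.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield the optimal (disc, from_peg, to_peg) moves for n discs."""
    if n == 0:
        return
    # Move the n-1 smaller discs out of the way, move disc n, then restack.
    yield from hanoi_moves(n - 1, source, spare, target)
    yield (n, source, target)
    yield from hanoi_moves(n - 1, spare, target, source)


def minimum_moves(n: int) -> int:
    """Closed-form minimum number of moves for n discs."""
    return 2 ** n - 1


# The generated move list matches the closed form on a small instance.
assert sum(1 for _ in hanoi_moves(10)) == minimum_moves(10)
print(minimum_moves(20))  # 1048575
```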
The MAKER Framework Explained
Pillar One: Maximal Decomposition
- MAKER's first pillar treats each step as an isolated problem: the agent sees only the current state, not the history of past actions, so earlier steps cannot confuse it.
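One way to picture this pillar is a driver loop in which every call receives only the current puzzle state and returns a single move. In the sketch below, a deterministic function stands in for the LLM micro-agent; the state encoding and all names are assumptions made for illustration, not the paper's actual prompt or code.

```python
from typing import Optional, Tuple

State = Tuple[int, ...]      # state[i] is the peg (0, 1 or 2) holding disc i+1
Move = Tuple[int, int, int]  # (disc, from_peg, to_peg)


def next_move(state: State, n: int, target: int) -> Optional[Move]:
    """Stand-in for one micro-agent call: it sees only the current state."""
    if n == 0:
        return None  # nothing left to move: the puzzle is solved
    here = state[n - 1]
    if here == target:
        return next_move(state, n - 1, target)
    spare = 3 - here - target
    if all(peg == spare for peg in state[: n - 1]):
        return (n, here, target)  # smaller discs are out of the way
    return next_move(state, n - 1, spare)


def apply_move(state: State, move: Move) -> State:
    """Return the new state; this is the only thing carried between steps."""
    disc, _, to_peg = move
    return state[: disc - 1] + (to_peg,) + state[disc:]


def run(n_discs: int) -> Tuple[State, int]:
    """Solve from all discs on peg 0 to all discs on peg 2, one step per call."""
    state: State = (0,) * n_discs
    steps = 0
    while (move := next_move(state, n_discs, 2)) is not None:
        state = apply_move(state, move)  # no history is ever passed along
        steps += 1
    return state, steps


print(run(3))  # ((2, 2, 2), 7)
```

Because the state is the only input, step one million looks exactly like step one to the agent, which is what keeps a growing conversation history from causing drift.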
Pillar Two: Red Flagging
- The second pillar treats malformed output as a warning sign of faulty reasoning. If a response deviates from the expected format, it is discarded and the step is retried rather than repaired.
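A minimal sketch of the retry-instead-of-repair idea follows. The move format, the regular expression, and the `call_model` callable are all hypothetical placeholders; the summary does not specify the paper's actual output format or red-flag criteria.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical strict output format: "move disc 3 from A to C"
MOVE_PATTERN = re.compile(r"move disc (\d+) from ([ABC]) to ([ABC])")


def parse_move(raw: str) -> Optional[Tuple[int, str, str]]:
    """Return a parsed move, or None if the output is red-flagged."""
    match = MOVE_PATTERN.fullmatch(raw.strip())
    if match is None:
        return None  # wrong format: flag it, do not try to salvage it
    disc, src, dst = match.groups()
    if src == dst:
        return None  # well-formed but nonsensical: also flagged
    return int(disc), src, dst


def sample_until_valid(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 5
) -> Tuple[int, str, str]:
    """Resample from scratch until an output passes the format check."""
    for _ in range(max_attempts):
        parsed = parse_move(call_model(prompt))
        if parsed is not None:
            return parsed
    raise RuntimeError(f"no valid output after {max_attempts} attempts")


# Scripted stand-in for a model: the first reply rambles, the second is clean.
replies = iter(["I think the best option is probably disc 1...", "move disc 1 from A to C"])
print(sample_until_valid(lambda _prompt: next(replies), "current state: ..."))
# (1, 'A', 'C')
```

Discarding rather than repairing keeps a possibly confused response from influencing the result at all.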
Pillar Three: K Voting Mechanism
- The third pillar is a voting mechanism: several answers are sampled independently for each step, and the answer that wins the vote is used. Even less accurate models can achieve high reliability this way.
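The summary does not spell out the exact voting rule, so the sketch below assumes a "first to lead by k votes" stopping rule: keep sampling until one answer is k votes ahead of the runner-up. The `sample` callable and the `max_samples` cap are illustrative assumptions.

```python
from collections import Counter
from typing import Callable, Hashable


def first_to_ahead_by_k(
    sample: Callable[[], Hashable], k: int, max_samples: int = 100
) -> Hashable:
    """Sample answers until one leads the runner-up by at least k votes."""
    votes: Counter = Counter()
    for _ in range(max_samples):
        votes[sample()] += 1
        ranked = votes.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if ranked[0][1] - runner_up >= k:
            return ranked[0][0]
    raise RuntimeError(f"no answer led by {k} votes within {max_samples} samples")


# Scripted stand-in for repeated model calls on the same step.
scripted = iter(["A", "B", "A", "A", "A"])
print(first_to_ahead_by_k(lambda: next(scripted), k=3))  # A
```

A larger k costs more samples per step but makes it less likely that an occasional wrong answer wins, which is the trade-off that lets a cheaper model be used.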
Economic Implications of the Findings
- The research reveals that smaller models combined with voting mechanisms can be more cost-effective than larger models for complex tasks.
- It suggests that simpler models performing single logical steps are sufficient and cheaper than relying on high-end models for every task.
Conclusion and Future Directions
- While the findings present significant advancements in AI task execution reliability and cost-effectiveness, they also open avenues for further exploration into architectural frameworks that enhance performance across various applications.
Understanding Software Development Strategies
Importance of November 2025 for Developers
- The paper's November 2025 publication is framed as a pivotal reference point: its findings offer a blueprint that developers can apply to software development today.
- Developers are encouraged to stop relying on chat history for state management and instead define their atomic state clearly.
Defining Atomic State in Development
- For coding tasks, the atomic state is represented by the file system and compiler error logs.
- In data analysis, the atomic state corresponds to the data frame being utilized.
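For a coding task, the idea can be sketched as a small state object from which every prompt is rebuilt from scratch. The class, field, and function names below are hypothetical, chosen only to illustrate the "files plus compiler errors" state described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class CodingState:
    """Everything an agent needs for the next step; no chat history."""
    files: Dict[str, str]                                   # path -> contents
    compiler_errors: List[str] = field(default_factory=list)


def build_prompt(state: CodingState, instruction: str) -> str:
    """Build a fresh prompt from the current state alone."""
    file_section = "\n".join(
        f"--- {path} ---\n{body}" for path, body in sorted(state.files.items())
    )
    error_section = "\n".join(state.compiler_errors) or "(none)"
    return (
        f"Task: {instruction}\n\n"
        f"Files:\n{file_section}\n\n"
        f"Compiler errors:\n{error_section}\n"
    )


state = CodingState(
    files={"tax.py": "def tax(income):\n    return income * rate\n"},
    compiler_errors=["tax.py:2: NameError: name 'rate' is not defined"],
)
print(build_prompt(state, "Fix the first compiler error."))
```

After each step, the files and error log are refreshed and a new `CodingState` is built, so the next call never depends on what was said in earlier turns.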
Task Decomposition Techniques
- Developers should break down complex tasks into micro-level components rather than asking an agent to perform large functions.
- Suggested breakdown includes having separate agents for defining inputs, writing function signatures, and implementing specific logic (e.g., tax brackets).
Implementing Voting Mechanisms
- Reserve voting for critical decision points; not every step needs it.
- For significant decisions where errors could disrupt processes, five parallel calls can be initiated. Disagreement among these signals uncertainty in the model's output.
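The five-call pattern can be sketched with a thread pool and a simple tally. Here `call_model` is a placeholder for whatever client function sends a prompt and returns an answer, and the agreement ratio is just one possible way to quantify disagreement; both are assumptions for illustration.

```python
import queue
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def vote_in_parallel(
    call_model: Callable[[str], str], prompt: str, n_calls: int = 5
) -> Tuple[str, float]:
    """Run n_calls identical requests in parallel; return (winner, agreement)."""
    with ThreadPoolExecutor(max_workers=n_calls) as pool:
        answers = list(pool.map(call_model, [prompt] * n_calls))
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_calls


# Scripted stand-in for a model; a Queue keeps the fake thread-safe.
scripted: "queue.Queue[str]" = queue.Queue()
for answer in ["approve", "approve", "reject", "approve", "approve"]:
    scripted.put(answer)

winner, agreement = vote_in_parallel(lambda _prompt: scripted.get_nowait(), "Apply migration?")
print(winner, agreement)  # approve 0.8
if agreement < 1.0:
    print("Calls disagreed: treat this decision as uncertain.")
```

An agreement below 1.0 does not say which answer is wrong, only that the step deserves a retry, more votes, or human review.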
Reliability as an Engineering Challenge
- The discussion emphasizes that reliability issues are engineering problems that can be addressed now without waiting for model companies to resolve hallucinations.
- By treating LLMs (Large Language Models) as stochastic components that need redundancy and strict input verification, developers can build systems that are more reliable than the underlying models are on their own.