Why the Smartest AI Teams Are Panic-Buying Compute: The 36-Month AI Infrastructure Crisis Is Here
Understanding the AI Compute Crisis
Overview of the Current Situation
- The global economy now leans heavily on AI, and compute supply has not kept pace, producing a structural crisis in technology infrastructure.
- This discussion will explore the unique aspects of this crisis compared to previous technology supply issues and analyze its strategic implications for enterprises.
Key Drivers of the Crisis
- Exponential Demand: Enterprise AI consumption is growing at least 10x annually, driven by increased usage per worker and the rise of agentic systems.
- Supply Constraints: Semiconductor capacity is fully allocated, with DRAM fabrication taking 3-4 years; high bandwidth memory is sold out until at least 2028.
- Hoarding by Hyperscalers: Major companies like Google, Microsoft, Amazon, and Meta have secured compute allocations for years ahead, limiting availability for other enterprises.
Economic Implications
- Pricing Surge: Memory costs are projected to increase by 40% to 60% in early 2026; effective inference costs could double or triple within 18 months due to severe demand-supply imbalance.
- Broken Planning Frameworks: Traditional capex models and procurement cycles fail under unpredictable demand and supply conditions in the AI era.
Urgency for Enterprises
- The opportunity window for securing compute capacity is closing; proactive enterprises can lock in allocations before peak crisis hits.
Consumption Dynamics in AI
Understanding Token Consumption
- A knowledge worker using advanced AI tools may consume around a billion tokens annually, with a potential ceiling of up to 25 billion tokens per year as capabilities expand.
Factors Driving Increased Consumption
- Capability Unlocking Usage: Improvements in models lead users to discover new applications that significantly increase demand.
- Integration Across Platforms: AI tools are becoming embedded across various software environments (e.g., email clients), creating continuous consumption opportunities.
Agentic Systems Impact
- The shift from human-in-the-loop systems to agentic workflows dramatically increases token consumption; one agentic workflow can exceed a human's monthly output within an hour.
Financial Projections at Scale
Cost Implications for Enterprises
- For a company of 10,000 employees each consuming one billion tokens annually, inference costs start around $20 million per year and could escalate toward $2 billion as consumption multiplies.
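The arithmetic behind that range can be sketched directly. The per-token price below is an illustrative assumption chosen to reproduce the $20 million baseline, not a quoted vendor rate:

```python
# Illustrative inference-cost arithmetic; the $2-per-million-token
# price is an assumed blended rate, not a quoted vendor price.
EMPLOYEES = 10_000
TOKENS_PER_EMPLOYEE = 1_000_000_000   # 1B tokens/year baseline
PRICE_PER_MILLION = 2.00              # assumed blended $/1M tokens

def annual_cost(multiplier: float) -> float:
    """Total annual inference spend if per-employee consumption
    grows by `multiplier` over the baseline."""
    tokens = EMPLOYEES * TOKENS_PER_EMPLOYEE * multiplier
    return tokens / 1_000_000 * PRICE_PER_MILLION

print(f"baseline: ${annual_cost(1):,.0f}")    # $20,000,000
print(f"100x:     ${annual_cost(100):,.0f}")  # $2,000,000,000
```

At these assumed rates, the $20M-to-$2B range corresponds to consumption growing roughly 100x, which matches the multiplier range discussed later.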
The Future of AI Consumption and Memory Constraints
The Limitations of Human vs. Agentic Systems
- Human workers have natural rate limits, such as typing speed and breaks, which restrict their output to around 50 million tokens per day.
- In contrast, agentic systems can operate continuously, potentially consuming billions of tokens daily; fleets of agents could reach trillions.
Current Enterprise Deployments
- Enterprises are already utilizing agentic systems for various applications like code review and customer service, leading to a demand for continuous inference that surpasses human-generated data.
- Google reported processing 1.3 quadrillion tokens monthly, indicating a significant growth trajectory in AI consumption.
Planning for Future Demand
- Companies planning based on human worker token consumption may underestimate future needs by not accounting for the additional demand from deployed agents.
- The total consumption footprint could be 10 to 100 times higher than current calculations suggest.
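A back-of-the-envelope comparison makes the gap concrete. The human ceiling comes from the text above; the per-agent throughput is purely an illustrative assumption:

```python
# Human vs. agent token throughput. The ~50M tokens/day human ceiling
# is from the text; the per-agent rate is a hypothetical assumption.
HUMAN_TOKENS_PER_DAY = 50_000_000

# Assume one always-on agent sustains 1M tokens/minute (illustrative).
AGENT_TOKENS_PER_DAY = 1_000_000 * 60 * 24   # 1.44B tokens/day

ratio = AGENT_TOKENS_PER_DAY / HUMAN_TOKENS_PER_DAY
print(f"one agent ≈ {ratio:.0f}x a human's daily ceiling")

# A fleet of 1,000 such agents reaches trillion-token scale daily:
fleet_daily = 1_000 * AGENT_TOKENS_PER_DAY
print(f"fleet/day ≈ {fleet_daily / 1e12:.2f} trillion tokens")
```

Under these assumptions a single continuous agent outpaces a human's ceiling by roughly 29x, and a modest fleet lands in the trillions of tokens per day, consistent with the consumption figures above.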
Memory Bottlenecks in AI Infrastructure
- AI inference is heavily reliant on memory; high bandwidth memory (HBM) is crucial for performance but currently faces supply issues.
- DRAM prices are projected to rise significantly due to under-supply and reallocations towards enterprise segments focused on AI.
Structural Issues in Memory Production
- Major players controlling global memory production are shifting focus away from consumer products towards enterprise needs, exacerbating shortages.
- HBM concentration with limited availability further complicates the situation as it is primarily allocated to major companies like Nvidia and AMD.
Long-Term Supply Challenges
- New semiconductor fabrication facilities require substantial investment and time (3–4 years), delaying any potential relief from current shortages.
- TSMC's advanced chip manufacturing capacity is fully allocated, with no immediate solutions available for increased demand.
GPU Allocation Crisis
- Nvidia holds an 80% market share in AI training chips; their GPUs are sold out with lead times exceeding six months due to high demand from hyperscalers.
- Major tech companies have committed vast resources to secure GPU allocations, leaving little availability for other enterprises.
The Challenges of AI Infrastructure and Market Dynamics
Current State of GPU Alternatives
- AMD's Instinct MI300X offers competitive specs but lacks a mature software ecosystem compared to Nvidia.
- Intel's Gaudi accelerators have not gained market share despite attractive pricing; software adoption remains a hurdle.
- Custom silicon solutions like Google's TPU and Amazon's Trainium are primarily for internal use, limiting enterprise access.
The Conflict of Interest Among Hyperscalers
- Major cloud providers (AWS, Azure, Google Cloud) are also AI product companies that compete with their enterprise customers.
- When compute resources are scarce, the competition intensifies as every GPU allocated to enterprises reduces availability for internal products like Gemini or Copilot.
- This conflict extends to OpenAI and Anthropic as well; when capacity is tight, they prioritize their own training runs and products over customer demand.
Pricing Dynamics in Scarcity
- API pricing has decreased while rate limits have tightened, making it harder for enterprises to secure high-volume allocations.
- Hyperscalers rationally prioritize their strategic AI products over selling capacity to enterprises due to internal business metrics.
Implications of Supply Constraints
- In a constrained market, prices will spike as demand outstrips supply; buyers will bid against each other leading to premium pricing.
- Historical precedents show significant price spikes during shortages (e.g., DRAM prices increased by 300% in 2016).
Business Model Vulnerabilities
- Companies heavily reliant on AI may face unviable business models if inference costs double due to rising prices.
- Enterprises using AI internally might justify cost increases if value creation is substantial but will likely face budget scrutiny.
Planning for Uncertainty in Enterprise IT
- Traditional IT planning methods are outdated; they assume predictable demand and stable technology which no longer holds true.
- CTOs applying old frameworks risk systematic failures due to unpredictable demand and rapidly changing technology landscapes.
Understanding the Risks of Long-Term Tech Investments
The Dangers of Overcommitment
- Enterprises risk making poor decisions by overcommitting to long-term tech purchases, leading to stranded assets and underinvestment in flexibility.
- A hypothetical scenario illustrates this: an enterprise invests $5 million in AI workstations, expecting them to provide significant value but quickly finds them inadequate due to increased workload demands.
Consequences of Obsolete Technology
- By year two, the purchased workstations become obsolete as they cannot handle the increased demand for processing power.
- The enterprise faces three options: keep running the outdated hardware (Option A), buy new hardware and write off the old investment (Option B), or shift to leasing (Option C).
Evaluating Options for Hardware Investment
- Leasing may seem ideal as it transfers depreciation risk; however, large-scale leasing has proven difficult for enterprises.
- Committing to cloud services can defer costs but also leads to potential traps with multi-year agreements that may not align with actual usage.
Navigating Cloud Commitments and Consumption Predictions
Challenges with Multi-Year Cloud Agreements
- Three scenarios illustrate the risks of cloud commitments: undercommitting leads to budget issues, overcommitting results in wasted expenditure, and accurately predicting consumption is nearly impossible.
- Many enterprises opt for committed use agreements while accepting overages due to unpredictable growth.
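The commit-versus-overage trade-off can be made concrete with toy numbers; all rates and volumes below are illustrative assumptions, not real cloud pricing:

```python
# Toy committed-use arithmetic; every figure here is an assumption.
COMMIT = 10_000_000     # $10M/yr committed spend at the discounted rate
COMMIT_RATE = 1.50      # $/1M tokens under the commitment
OVERAGE_RATE = 2.50     # $/1M tokens beyond the committed volume

def effective_cost(actual_million_tokens: float) -> float:
    """Total spend: the commitment is paid regardless of usage;
    consumption beyond the covered volume bills at the overage rate."""
    committed_volume = COMMIT / COMMIT_RATE       # million tokens covered
    overage = max(0.0, actual_million_tokens - committed_volume)
    return COMMIT + overage * OVERAGE_RATE

print(effective_cost(3_000_000))   # under-use: full $10M paid anyway
print(effective_cost(10_000_000))  # over-use: $10M plus costly overage
```

The asymmetry is the trap: under-committing leaves the discount on the table and triggers expensive overages, while over-committing pays for capacity that is never consumed.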
Strategic Actions by CTOs
- Sharp CTOs prioritize securing capacity before peak demand hits, focusing on contractual guarantees rather than just pricing per token.
- Building a routing layer becomes essential; it optimizes cost management and maintains optionality across different infrastructures.
Principles for Effective Technology Management
Key Principles Adopted by Successful CTOs
- Principle 1: Secure inference capacity early through contractual guarantees rather than relying solely on price negotiations.
- Principle 2: Develop a sophisticated routing layer that abstracts infrastructure details and enhances negotiating leverage.
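A minimal sketch of such a routing layer, with hypothetical provider names, prices, and availability checks; a production version would add retries, health checks, latency tracking, and quota accounting:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Provider:
    name: str
    price_per_million: float        # assumed $/1M tokens
    available: Callable[[], bool]   # capacity/quota check

class Router:
    """Route each request to the cheapest provider with capacity,
    abstracting infrastructure details from application code."""
    def __init__(self, providers: list[Provider]):
        self.providers = providers

    def pick(self) -> Provider:
        live = [p for p in self.providers if p.available()]
        if not live:
            raise RuntimeError("no provider has capacity")
        return min(live, key=lambda p: p.price_per_million)

# Hypothetical fleet: names, prices, and quota states are illustrative.
router = Router([
    Provider("primary-cloud", 2.00, lambda: False),   # quota exhausted
    Provider("secondary-cloud", 2.50, lambda: True),
    Provider("on-prem", 1.20, lambda: True),
])
print(router.pick().name)  # on-prem
```

Because callers only see the `Router`, providers can be added, dropped, or repriced without touching application code, which is also where the negotiating leverage comes from.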
Treating Hardware as Consumables
- Principle 3 emphasizes treating hardware like consumables; plan refresh cycles every 18–24 months due to rapid advancements in GPU architecture.
Investing in Efficiency
- Principle 4 highlights that efficiency is crucial; reducing token consumption directly increases effective capacity for additional workloads.
DeepSeek's Innovations in Token Efficiency
Importance of Reducing Token Usage
- DeepSeek's work on engram is notable for significantly reducing token usage during inference, particularly for factual lookups.
- Effective prompt design can lead to lower token consumption, emphasizing the importance of well-crafted queries.
- Caching strategies also contribute to reduced token usage, showcasing multiple avenues for efficiency.
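The caching idea above can be sketched as a response cache keyed on a hash of the exact prompt text; `call_model` is a stub standing in for a real, expensive inference call:

```python
import hashlib

_cache: dict[str, str] = {}
calls = 0

def call_model(prompt: str) -> str:
    """Stub for a real (expensive) inference call."""
    global calls
    calls += 1
    return f"answer:{prompt}"

def cached_inference(prompt: str) -> str:
    # Key on a stable hash of the prompt; identical repeated
    # queries are served from the cache and skip inference.
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)
    return _cache[key]

cached_inference("What is our refund policy?")
cached_inference("What is our refund policy?")  # served from cache
print(calls)  # 1
```

Exact-match caching only helps with repeated identical queries; normalizing prompts (whitespace, casing) before hashing raises the hit rate at the cost of some precision.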
Cost-effective Retrieval Methods
- Embedding-based retrieval methods are highlighted as being substantially cheaper than traditional raw inference techniques.
- Quantization allows smaller models to achieve performance levels comparable to larger models on specific tasks, enhancing operational efficiency.
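The cost advantage of embedding-based retrieval comes from the lookup itself being a cheap similarity computation rather than a model call. A toy sketch, using hypothetical 3-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-computed document embeddings (illustrative values).
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1]  # assumed embedding of "how do refunds work?"

best = max(docs, key=lambda d: cosine(query, docs[d]))
print(best)  # refund policy
```

Embedding the corpus is a one-time cost; each subsequent lookup is arithmetic over stored vectors, which is why retrieval is substantially cheaper than answering every factual question with raw inference.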
The Shift Towards Efficiency Investments
- Traditionally, investments in capability have taken precedence over efficiency; however, this trend is shifting due to current economic constraints.
- Enterprises that prioritize efficient operations can potentially increase their capacity by tenfold amidst rising demand and flat supply curves.
Navigating the Global Inference Crisis
- The speaker observes an impending global inference crisis driven by exponential demand against a static supply curve.
- Companies must secure their operational capacity now and develop routing layers that allow flexible model allocation based on needs.
Strategic Recommendations for Enterprises
- IT departments need to treat hardware as consumable resources rather than fixed assets, adapting to new operational paradigms.
- Investing in efficiency should be viewed as a competitive advantage; diversification across technology stacks is crucial to mitigate reliance on single ecosystem players.
- Organizations that implement these strategies will be better positioned during the crisis and will maintain competitiveness when market conditions stabilize.