Why the Smartest AI Teams Are Panic-Buying Compute: The 36-Month AI Infrastructure Crisis Is Here

Understanding the AI Compute Crisis

Overview of the Current Situation

  • The global economy has shifted to rely heavily on AI, leading to a structural crisis in technology infrastructure due to insufficient compute resources.
  • This discussion will explore the unique aspects of this crisis compared to previous technology supply issues and analyze its strategic implications for enterprises.

Key Drivers of the Crisis

  • Exponential Demand: Enterprise AI consumption is growing at least 10x annually, driven by increased usage per worker and the rise of agentic systems.
  • Supply Constraints: Semiconductor capacity is fully allocated; bringing new DRAM fabrication capacity online takes 3–4 years, and high-bandwidth memory is sold out until at least 2028.
  • Hoarding by Hyperscalers: Major companies like Google, Microsoft, Amazon, and Meta have secured compute allocations for years ahead, limiting availability for other enterprises.

Economic Implications

  • Pricing Surge: Memory costs are projected to increase by 40% to 60% in early 2026; effective inference costs could double or triple within 18 months due to severe demand-supply imbalance.
  • Broken Planning Frameworks: Traditional capex models and procurement cycles fail under unpredictable demand and supply conditions in the AI era.

Urgency for Enterprises

  • The opportunity window for securing compute capacity is closing; proactive enterprises can lock in allocations before peak crisis hits.

Consumption Dynamics in AI

Understanding Token Consumption

  • A knowledge worker using advanced AI tools may consume around a billion tokens annually, with potential ceilings reaching up to 25 billion tokens per year as capabilities expand.
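As a rough sanity check, the annual figures can be converted to per-workday rates. The 250-workday assumption below is illustrative, not from the source:

```python
# Back-of-envelope: convert annual token consumption to a per-workday rate.
WORKDAYS_PER_YEAR = 250  # assumed

def tokens_per_workday(annual_tokens: float) -> float:
    return annual_tokens / WORKDAYS_PER_YEAR

baseline = tokens_per_workday(1e9)    # ~4 million tokens per workday
ceiling = tokens_per_workday(25e9)    # ~100 million tokens per workday
print(f"baseline: {baseline:,.0f} tokens/workday")
print(f"ceiling:  {ceiling:,.0f} tokens/workday")
```

Even the baseline of roughly 4 million tokens per workday implies continuous, embedded usage rather than occasional chat sessions.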

Factors Driving Increased Consumption

  • Capability Unlocking Usage: Improvements in models lead users to discover new applications that significantly increase demand.
  • Integration Across Platforms: AI tools are becoming embedded across various software environments (e.g., email clients), creating continuous consumption opportunities.

Agentic Systems Impact

  • The shift from human-in-the-loop systems to agentic workflows dramatically increases token consumption; one agentic workflow can exceed a human's monthly output within an hour.

Financial Projections at Scale

Cost Implications for Enterprises

  • For a company with 10,000 employees each consuming one billion tokens annually, total inference costs could escalate from roughly $20 million to $2 billion if per-worker consumption grows 100x (from 1 billion to 100 billion tokens per year).
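The arithmetic behind those figures can be sketched directly. The blended price per token is an assumption chosen so the totals match the $20M and $2B numbers in the text:

```python
# Illustrative inference-cost model. The price is an assumption,
# chosen so the outputs line up with the $20M / $2B figures above.
PRICE_PER_MILLION_TOKENS = 2.00  # assumed blended $/1M tokens
EMPLOYEES = 10_000

def annual_cost(tokens_per_employee: float) -> float:
    total_tokens = EMPLOYEES * tokens_per_employee
    return total_tokens / 1e6 * PRICE_PER_MILLION_TOKENS

print(f"${annual_cost(1e9):,.0f}")    # at 1B tokens/employee/year
print(f"${annual_cost(100e9):,.0f}")  # at 100B tokens/employee/year
```

The key point is that cost scales linearly with consumption, so a 100x rise in per-worker usage produces a 100x rise in spend unless per-token prices fall just as fast.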

The Future of AI Consumption and Memory Constraints

The Limitations of Human vs. Agentic Systems

  • Human workers have natural rate limits, such as typing speed and breaks, which restrict their output to around 50 million tokens per day.
  • In contrast, agentic systems can operate continuously, potentially consuming billions of tokens daily; fleets of agents could reach trillions.
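The gap between the two can be made concrete with a quick comparison. The per-agent throughput figure below is an assumption for illustration; only the 50-million-token human ceiling comes from the text:

```python
# Compare a rate-limited human with an always-on agent fleet.
# The agent throughput figure is an assumption for illustration.
HUMAN_TOKENS_PER_DAY = 50e6       # ceiling cited in the text
AGENT_TOKENS_PER_SEC = 50_000     # assumed sustained rate per agent
SECONDS_PER_DAY = 86_400

agent_per_day = AGENT_TOKENS_PER_SEC * SECONDS_PER_DAY  # billions per day
fleet_per_day = 1_000 * agent_per_day                   # trillions per day

print(f"one agent: {agent_per_day / HUMAN_TOKENS_PER_DAY:.0f}x the human ceiling")
print(f"1,000-agent fleet: {fleet_per_day:,.0f} tokens/day")
```

Under these assumptions a single continuously running agent outpaces a human's daily ceiling by nearly two orders of magnitude, and a modest fleet crosses into trillions of tokens per day.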

Current Enterprise Deployments

  • Enterprises are already utilizing agentic systems for various applications like code review and customer service, leading to a demand for continuous inference that surpasses human-generated data.
  • Google reported processing 1.3 quadrillion tokens monthly, indicating a significant growth trajectory in AI consumption.

Planning for Future Demand

  • Companies planning based on human worker token consumption may underestimate future needs by not accounting for the additional demand from deployed agents.
  • The total consumption footprint could be 10 to 100 times higher than current calculations suggest.

Memory Bottlenecks in AI Infrastructure

  • AI inference is heavily reliant on memory; high bandwidth memory (HBM) is crucial for performance but currently faces supply issues.
  • DRAM prices are projected to rise significantly due to under-supply and reallocations towards enterprise segments focused on AI.

Structural Issues in Memory Production

  • Major players controlling global memory production are shifting focus away from consumer products towards enterprise needs, exacerbating shortages.
  • HBM production is concentrated among a few suppliers, and output is largely pre-allocated to accelerator vendors such as Nvidia and AMD, further complicating the situation.

Long-Term Supply Challenges

  • New semiconductor fabrication facilities require substantial investment and time (3–4 years), delaying any potential relief from current shortages.
  • TSMC's advanced chip manufacturing capacity is fully allocated, with no immediate solutions available for increased demand.

GPU Allocation Crisis

  • Nvidia holds an 80% market share in AI training chips; their GPUs are sold out with lead times exceeding six months due to high demand from hyperscalers.
  • Major tech companies have committed vast resources to secure GPU allocations, leaving little availability for other enterprises.

The Challenges of AI Infrastructure and Market Dynamics

Current State of GPU Alternatives

  • AMD's Instinct MI300X offers competitive specs but lacks a mature software ecosystem compared to Nvidia.
  • Intel's Gaudi accelerators have not gained market share despite attractive pricing; software adoption remains a hurdle.
  • Custom silicon solutions like Google's TPU and Amazon's Trainium are primarily for internal use, limiting enterprise access.

The Conflict of Interest Among Hyperscalers

  • Major cloud providers (AWS, Azure, Google Cloud) are also AI product companies that compete with their enterprise customers.
  • When compute resources are scarce, the competition intensifies as every GPU allocated to enterprises reduces availability for internal products like Gemini or Copilot.
  • The same tension applies to model providers like OpenAI and Anthropic, which must balance compute for their own products against the demands of API customers.

Pricing Dynamics in Scarcity

  • API pricing has decreased while rate limits have tightened, making it harder for enterprises to secure high-volume allocations.
  • Hyperscalers rationally prioritize their strategic AI products over selling capacity to enterprises due to internal business metrics.

Implications of Supply Constraints

  • In a constrained market, prices will spike as demand outstrips supply; buyers will bid against each other leading to premium pricing.
  • Historical precedents show significant price spikes during shortages (e.g., DRAM prices increased by 300% in 2016).

Business Model Vulnerabilities

  • Companies heavily reliant on AI may face unviable business models if inference costs double due to rising prices.
  • Enterprises using AI internally might justify cost increases if value creation is substantial but will likely face budget scrutiny.

Planning for Uncertainty in Enterprise IT

  • Traditional IT planning methods assume predictable demand and stable technology, assumptions that no longer hold in the AI era.
  • CTOs applying old frameworks risk systematic failures amid unpredictable demand and a rapidly changing technology landscape.

Understanding the Risks of Long-Term Tech Investments

The Dangers of Overcommitment

  • Enterprises risk making poor decisions by overcommitting to long-term tech purchases, leading to stranded assets and underinvestment in flexibility.
  • A hypothetical scenario illustrates this: an enterprise invests $5 million in AI workstations, expecting them to provide significant value but quickly finds them inadequate due to increased workload demands.

Consequences of Obsolete Technology

  • By year two, the purchased workstations become obsolete as they cannot handle the increased demand for processing power.
  • The enterprise then faces three options: keep running the outdated hardware, buy replacement hardware and absorb the loss, or shift to leasing.

Evaluating Options for Hardware Investment

  • Leasing may seem ideal as it transfers depreciation risk; however, large-scale leasing has proven difficult for enterprises.
  • Committing to cloud services can defer costs but also leads to potential traps with multi-year agreements that may not align with actual usage.

Navigating Cloud Commitments and Consumption Predictions

Challenges with Multi-Year Cloud Agreements

  • Three scenarios illustrate the risks of cloud commitments: undercommitting leads to budget issues, overcommitting results in wasted expenditure, and accurately predicting consumption is nearly impossible.
  • Many enterprises opt for committed use agreements while accepting overages due to unpredictable growth.
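The three scenarios follow from the basic shape of a committed-use agreement: the commitment is paid in full whether used or not, and overage runs at the on-demand rate. The rates below are hypothetical:

```python
# Sketch of a committed-use agreement cost model (rates are assumptions).
COMMITTED_RATE = 0.70   # assumed discounted $ per unit of committed capacity
ON_DEMAND_RATE = 1.00   # assumed $ per unit for overage

def annual_spend(committed_units: float, actual_units: float) -> float:
    # The full commitment is paid regardless of usage;
    # consumption above it is billed at the on-demand rate.
    overage = max(0.0, actual_units - committed_units)
    return committed_units * COMMITTED_RATE + overage * ON_DEMAND_RATE

print(annual_spend(100, 300))  # undercommit: most spend at full price
print(annual_spend(300, 100))  # overcommit: pay for capacity never used
print(annual_spend(200, 200))  # perfect forecast: cheapest outcome
```

The cheapest outcome requires forecasting consumption exactly, which is precisely what the text argues is nearly impossible under exponential, unpredictable growth.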

Strategic Actions by CTOs

  • Sharp CTOs prioritize securing capacity before peak demand hits, focusing on contractual guarantees rather than just pricing per token.
  • Building a routing layer becomes essential; it optimizes cost management and maintains optionality across different infrastructures.
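A routing layer of the kind described can be sketched minimally: a thin abstraction that picks a provider per request based on availability and cost. Provider names and rates below are hypothetical:

```python
# Minimal sketch of a routing layer: choose a provider per request by
# availability and cost. Provider names and rates are hypothetical.
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    cost_per_m_tokens: float
    available: bool

def route(providers: list[Provider]) -> Provider:
    # Filter to providers with capacity, then take the cheapest.
    candidates = [p for p in providers if p.available]
    if not candidates:
        raise RuntimeError("no provider has capacity")
    return min(candidates, key=lambda p: p.cost_per_m_tokens)

fleet = [
    Provider("primary-cloud", 2.00, available=False),   # rate-limited today
    Provider("secondary-cloud", 2.40, available=True),
    Provider("on-prem", 1.10, available=True),
]
print(route(fleet).name)
```

Because callers never address a provider directly, capacity can be shifted between clouds, on-prem hardware, and model vendors without application changes, which is also what creates negotiating leverage.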

Principles for Effective Technology Management

Key Principles Adopted by Successful CTOs

  • Principle 1: Secure inference capacity early through contractual guarantees rather than relying solely on price negotiations.
  • Principle 2: Develop a sophisticated routing layer that abstracts infrastructure details and enhances negotiating leverage.

Treating Hardware as Consumables

  • Principle 3 emphasizes treating hardware like consumables; plan refresh cycles every 18–24 months due to rapid advancements in GPU architecture.

Investing in Efficiency

  • Principle 4 highlights that efficiency is crucial; reducing token consumption directly increases effective capacity for additional workloads.

DeepSeek's Innovations in Token Efficiency

Importance of Reducing Token Usage

  • DeepSeek's work on engram is notable for its ability to significantly reduce token usage during inference, particularly for factual lookups.
  • Effective prompt design can lead to lower token consumption, emphasizing the importance of well-crafted queries.
  • Caching strategies also contribute to reduced token usage, showcasing multiple avenues for efficiency.
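The caching idea is the simplest of the three avenues to sketch: key responses on a hash of the prompt so repeated queries skip inference entirely. `call_model` below is a hypothetical stand-in for a real inference call:

```python
# Sketch of a response cache keyed on a hash of the prompt: repeated
# queries skip inference entirely. call_model is a hypothetical stand-in.
import hashlib

_cache: dict[str, str] = {}
calls = 0

def call_model(prompt: str) -> str:
    # Hypothetical expensive inference call; counts invocations.
    global calls
    calls += 1
    return prompt.upper()

def cached_inference(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)
    return _cache[key]

cached_inference("summarize q3 report")
cached_inference("summarize q3 report")  # served from cache
print(calls)  # the model was only invoked once
```

In workloads with repetitive prompts (boilerplate instructions, shared context blocks), every cache hit is inference capacity returned to the pool.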

Cost-effective Retrieval Methods

  • Embedding-based retrieval methods are highlighted as being substantially cheaper than traditional raw inference techniques.
  • Quantization allows smaller models to achieve performance levels comparable to larger models on specific tasks, enhancing operational efficiency.
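The retrieval point can be illustrated with a toy nearest-neighbor lookup: factual queries are answered by comparing precomputed embedding vectors rather than running a full model call. The two-dimensional vectors below are hand-made stand-ins for real embeddings:

```python
# Toy embedding lookup: answer factual queries by nearest-neighbor search
# over precomputed vectors instead of a full model call. The 2-D vectors
# are hand-made stand-ins for real embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

index = {
    "HBM supply":  [0.9, 0.1],
    "GPU pricing": [0.1, 0.9],
}

def retrieve(query_vec):
    # Return the indexed entry most similar to the query vector.
    return max(index, key=lambda k: cosine(query_vec, index[k]))

print(retrieve([0.8, 0.2]))
```

A lookup like this costs a handful of arithmetic operations per indexed item, versus thousands of generated tokens for a raw inference call, which is why embedding-based retrieval is substantially cheaper for factual lookups.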

The Shift Towards Efficiency Investments

  • Traditionally, investments in capability have taken precedence over efficiency; however, this trend is shifting due to current economic constraints.
  • Enterprises that prioritize efficient operations can potentially increase their capacity by tenfold amidst rising demand and flat supply curves.

Navigating the Global Inference Crisis

  • The speaker observes an impending global inference crisis driven by exponential demand against a static supply curve.
  • Companies must secure their operational capacity now and develop routing layers that allow flexible model allocation based on needs.

Strategic Recommendations for Enterprises

  • IT departments need to treat hardware as consumable resources rather than fixed assets, adapting to new operational paradigms.
  • Investing in efficiency should be viewed as a competitive advantage; diversification across technology stacks is crucial to mitigate reliance on single ecosystem players.
  • Organizations that implement these strategies will be better positioned during the crisis and will maintain competitiveness when market conditions stabilize.
Video description

My site: https://natebjones.com
Full Story w/ Prompts: https://natesnewsletter.substack.com/p/executive-briefing-the-global-inference?r=1z4sm5&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true

What's really happening with AI compute infrastructure? The common story is that supply will catch up to demand—but the reality is more complicated when DRAM prices spike 60% quarterly and every hyperscaler is hoarding capacity. In this video, I share the inside scoop on why the global inference crisis is not a prediction but an observation of current conditions:

  • Why enterprise token consumption is scaling from 1 billion to 100 billion per worker annually
  • How memory, semiconductor, and GPU bottlenecks compound with no relief until 2028
  • What hyperscalers choosing their own products over customers means for enterprise allocation
  • Where sharp CTOs are securing capacity and building routing layers now

For enterprise leaders navigating the next 24 months, traditional planning frameworks are broken—and the window to act is closing fast.

Chapters
00:00 The Global Inference Crisis
02:52 Token Consumption Is Exploding
04:50 Agentic Systems Change Everything
07:09 The Memory Bottleneck
08:58 DRAM Prices Spiking 50-60% Quarterly
11:10 Semiconductor Fab Constraints
12:08 The GPU Allocation Crisis
14:15 Hyperscalers Are Competitors, Not Partners
17:38 Which Business Models Are Most Exposed
19:24 Why Traditional Planning Frameworks Fail
22:16 Cloud Commitments Can Become Traps
24:11 What Sharp CTOs Are Doing Now

Subscribe for daily AI strategy and news. For deeper playbooks and analysis: https://natesnewsletter.substack.com/