Apple: “AI Can’t Think” — Are They Wrong?

The Illusion of Thinking: Apple's Critique of Large Language Models

Overview of Apple's Paper

  • Apple released a paper titled "The Illusion of Thinking," which critiques the capabilities of large reasoning models (LRMs), suggesting they are not genuinely thinking and may not significantly outperform standard LLMs.
  • The paper highlights issues such as data contamination in model training, implying that these models might be "cheating" to achieve benchmark success.

Core Assertions and Insights

  • The main thesis argues that we overestimate the abilities of large reasoning models (LRMs), the class that includes models such as OpenAI's o-series, DeepSeek-R1, and Claude's extended-thinking mode.
  • Current evaluations focus on mathematical accuracy but suffer from data contamination, failing to assess the quality and structure of reasoning processes within these models.

Proposed Benchmarking Methodology

  • Apple suggests a new benchmarking approach using puzzles with varying complexities instead of relying solely on existing benchmarks, which they claim do not accurately reflect reasoning capabilities.
  • This puzzle-based evaluation is reminiscent of the ARC-AGI benchmark, designed to test generalization in LLMs through problem-solving tasks.

Limitations and Questions Raised

  • The paper questions whether LRMs can perform generalizable reasoning or if they merely rely on pattern matching techniques.
  • It raises critical inquiries about how performance varies with increasing complexity and compares against standard non-thinking LLM counterparts under similar conditions.

Critique of Existing Benchmarks

  • Apple criticizes current benchmarks for their lack of controlled experimental conditions across different settings, leading to unreliable assessments due to potential data contamination.

Evaluation of Reasoning Models through Puzzles

Introduction to Puzzle-Based Evaluation

  • The paper discusses using controllable puzzle environments to systematically evaluate complexity by adjusting puzzle elements while maintaining core logic.
  • Four puzzles are utilized as evaluation metrics: Tower of Hanoi, checkers jumping, river crossing, and blocks world. Each requires logic and reasoning to solve.
  • Complexity is varied by increasing the number of disks in Tower of Hanoi or adding more entities in the other puzzles, which increases the number of moves needed for completion.
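The exponential relationship between disk count and solution length can be made concrete; a minimal sketch (my own, not from the paper) showing how fast the optimal Tower of Hanoi solution grows:

```python
def min_hanoi_moves(n: int) -> int:
    # Optimal Tower of Hanoi solution length for n disks: 2^n - 1 moves.
    return 2**n - 1

# Each extra disk roughly doubles the required moves:
print([min_hanoi_moves(n) for n in range(1, 11)])
# [1, 3, 7, 15, 31, 63, 127, 255, 511, 1023]
```

This is why a single knob (number of disks) gives fine-grained control over problem difficulty.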

Contributions and Findings

  • The paper questions current evaluation paradigms for large reasoning models (LRMs), highlighting that state-of-the-art LRMs still struggle with generalizable problem-solving capabilities.
  • It identifies a scaling limit in LRMs' reasoning efforts concerning problem complexity and suggests extending evaluations beyond final accuracy to include intermediate solutions or thinking traces.

Benchmark Analysis

  • The paper critiques standard benchmarks like AIME 2024 and AIME 2025 for potential data contamination affecting model performance comparisons.
  • A key question raised is why thinking models outperform non-thinking versions on certain benchmarks; this may be due to increased exposure to benchmark data or enhanced inference compute allocated to thinking tokens.

Performance Comparisons

  • Non-thinking LLMs, under equivalent inference token budgets, can achieve performance comparable to thinking models on benchmarks like MATH-500 and AIME 2024.
  • Non-thinking models generate multiple candidate solutions (pass@k) and the best one is selected, rather than relying on a single output as thinking models do.
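pass@k is usually reported with the unbiased estimator popularized by OpenAI's Codex evaluation: given n sampled solutions of which c are correct, it estimates the chance that at least one of k draws is correct. A sketch (the function name is mine, not from the video):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn from n generations (c of which are correct) passes."""
    if n - c < k:
        # Fewer incorrect samples than draws: a correct one is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 25 samples per instance, 10 of them correct:
print(round(pass_at_k(25, 10, 1), 3))  # 0.4  (k=1 is plain accuracy)
print(round(pass_at_k(25, 10, 5), 3))  # 0.943
```

With k=1 this reduces to ordinary accuracy, which is why equal-token comparisons between thinking and non-thinking models hinge on the chosen k.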

Observations on Model Performance

  • Despite parity in performance under equal token budgets, a widening gap between thinking and non-thinking models appears on the newer AIME 2025 benchmark relative to AIME 2024.
  • This gap could indicate that more complex problems require sophisticated reasoning processes inherent in thinking models or reduced data contamination in newer benchmarks.

Human vs. Model Performance Insights

  • Interestingly, human performance was higher on AIME 2025 than on AIME 2024, suggesting the newer exam may actually be easier even though non-thinking model performance declined on it.
  • This points back to puzzles: unlike math problems that may be memorized from training data, fresh puzzle instances cannot simply be recalled, which makes them effective evaluation tools.

Results Overview

Performance of Thinking vs. Non-Thinking Models in Complex Tasks

Overview of Model Performance

  • Both thinking and non-thinking models perform similarly on simple tasks (1-3 disks). However, as complexity increases (4-10 disks), the non-thinking model shows a significant drop in performance while the thinking model performs better.
  • At sufficiently high complexity levels, both model types fail entirely, indicating that performance collapses for both once difficulty passes a threshold.

Experimental Setup

  • The experiments use Claude and DeepSeek models because they expose their chain of thought, unlike other models that abstract this process away.
  • Each puzzle instance generates 25 samples, with average performance reported across these samples. The analysis categorizes tasks into low, medium, and high complexity.
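The sampling-and-binning setup above can be sketched as follows (the numbers and bin cutoffs here are hypothetical illustrations, not values from the paper):

```python
# Hypothetical per-instance results: 25 sampled solutions per puzzle
# instance, each marked correct/incorrect. Keys are disk counts.
results = {
    2: [True] * 25,                   # easy instance: every sample correct
    5: [True] * 15 + [False] * 10,    # mid instance: 60% of samples correct
    9: [False] * 25,                  # hard instance: total collapse
}

def bin_label(n_disks: int) -> str:
    # Illustrative complexity bins; the paper's exact cutoffs may differ.
    if n_disks <= 3:
        return "low"
    if n_disks <= 7:
        return "medium"
    return "high"

# Average accuracy over the 25 samples of each instance, then per bin.
bins: dict[str, list[float]] = {}
for n_disks, samples in results.items():
    bins.setdefault(bin_label(n_disks), []).append(sum(samples) / len(samples))

accuracy = {label: sum(v) / len(v) for label, v in bins.items()}
print(accuracy)  # {'low': 1.0, 'medium': 0.6, 'high': 0.0}
```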

Token Usage and Reasoning Efficiency

  • As puzzle complexity rises, thinking models initially use more tokens but experience a decline in accuracy until reaching a critical point where reasoning collapses and token usage decreases sharply.
  • This raises questions about whether the models are giving up or simply becoming less efficient as they tackle more complex problems.

Overthinking Phenomenon

  • For simpler problems, reasoning models often find correct solutions early but keep exploring alternatives unnecessarily; this "overthinking" wastes computational resources.
  • In moderately complex problems, the trend reverses; models first explore incorrect solutions before arriving at correct ones later in their thought process.

Analysis of Specific Puzzles

  • In Tower of Hanoi puzzles, as complexity increases, models shift to finding incorrect answers earlier in their traces and correct answers only later; eventually reaching the correct answer is still the desirable outcome.
  • Testing with provided algorithms showed no improvement in performance despite simplifying execution steps; failures occurred at similar points regardless of guidance.
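The algorithm reportedly supplied in the prompt is the standard recursive procedure; executing it is purely mechanical, which makes the unchanged failure point striking. A sketch of that standard procedure (not the paper's exact prompt):

```python
def solve_hanoi(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Standard recursive Tower of Hanoi: move n disks from src to dst."""
    if n == 0:
        return []
    moves = solve_hanoi(n - 1, src, dst, aux)    # clear n-1 disks onto aux
    moves.append((src, dst))                     # move the largest disk
    moves += solve_hanoi(n - 1, aux, src, dst)   # restack n-1 disks onto it
    return moves

# Following this recipe step by step requires no search at all,
# yet models given such an algorithm still failed at similar disk counts.
print(len(solve_hanoi(4)))  # 15 moves for 4 disks
```

That failures persist even with the recipe in hand is the paper's strongest evidence that the bottleneck is faithful step-by-step execution, not discovering the strategy.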

Insights from Expert Commentary

Understanding the Potential of Digital Intelligence

The Comparison Between Biological and Digital Intelligence

  • The speaker frames the brain as a biological computer and asks why digital computers should not be able to replicate its functions. Citing Ilya Sutskever's perspective, he argues this implies progress toward Artificial General Intelligence (AGI): digital computers should, in principle, be capable of simulating human intelligence.

Limitations of Current Models

  • The findings highlight fundamental limitations in existing models, which struggle with generalizable reasoning capabilities beyond certain complexity thresholds despite having sophisticated self-reflection mechanisms.
  • The experiments conducted represent only a narrow slice of reasoning tasks, failing to encompass the diversity found in real-world or knowledge-intensive problems. This suggests that current tests may not fully capture the essence of intelligence.

Achievements vs. Limitations

  • Despite failing complex puzzles beyond certain thresholds, current models demonstrate significant achievements such as deep reasoning, compelling image and video generation, and coding ability. This raises the question of what actually constitutes "thinking."
  • A critical oversight in the paper is its failure to acknowledge these models' ability to write code for solving puzzles. The speaker argues that if a model can generate code to address challenges, it should be recognized for its problem-solving capabilities.

Practical Demonstration: Tower of Hanoi Game

  • The speaker conducts an experiment with Claude 3.7 by prompting it to write HTML and JavaScript code simulating the Tower of Hanoi game, without providing a solution method.