Beyond Smart Talk: The Limits of AI Reasoning According to Apple

If you’ve ever been impressed by an AI model explaining its reasoning step by step, you’re not alone. But what if that reasoning is more of an illusion than a genuine thought process?
That’s the provocative claim at the heart of Apple's new research paper, The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. It has sparked fresh debate in the AI community, not just for its technical rigour but for what it suggests about the limits of even the most powerful AI tools we use today.

What are reasoning models?

As we covered in a recent edition of our newsletter, reasoning models are the next generation of artificial intelligence systems. Unlike traditional language models that simply predict the next word, reasoning models approach problems through a structured, step-by-step process.

They break complex tasks into manageable parts, apply logic, consult external tools (such as code interpreters or web browsers), and build solutions in a structured manner. This mimics how people tackle difficult challenges by planning, testing, and refining ideas. Some even use visual inputs and dynamic tool selection.

In practical terms, reasoning models are designed to handle more analytical, multi-step tasks across fields like finance, law, data analysis, and operations.

Apple’s key claim

Apple’s researchers tested Large Reasoning Models (LRMs), including o3-mini, DeepSeek-R1, and Claude 3.7 Sonnet (Thinking), on classic logic puzzles such as the Tower of Hanoi, river crossing (a cousin of Missionaries & Cannibals), and Blocks World. These puzzles were designed to scale in difficulty while keeping the underlying logic consistent.

What they found was a consistent three-phase pattern across models:

  • Low complexity: Tasks were solved through pattern recall.

  • Moderate complexity: Reasoning aids like chain-of-thought prompting improved performance.

  • High complexity: Accuracy collapsed to nearly zero, even with more tokens or when the models were explicitly given the correct algorithm.

Apple calls this the illusion of thinking. AI outputs can appear thoughtful and deliberate, but they break down once the task exceeds a narrow band of difficulty. The reasoning paths often look methodical, yet they are typically just plausible continuations of learned patterns, not genuine problem-solving.

Think of it like an actor delivering a convincing monologue: they may seem knowledgeable, but asked to improvise beyond the script, they struggle. Apple argues this is precisely what today’s AI is doing: it's acting, not thinking.

Figure: Accuracy and thinking tokens vs. problem complexity for reasoning models across puzzle environments. Credit: Apple Machine Learning Research

Not everyone agrees

Within days of the paper’s release, critical responses emerged. Some researchers argued that Apple’s findings were partly shaped by how the tests were structured. For example, the Tower of Hanoi’s solution grows exponentially in length as discs are added, so the models may simply have run out of space (or “tokens”) in which to write out their reasoning.
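To make that critique concrete, here is a minimal Python sketch of the arithmetic involved, assuming roughly ten output tokens per written-out move and a 64,000-token output budget (both figures are illustrative assumptions, not numbers from the paper): a complete Tower of Hanoi solution requires 2^n - 1 moves, so the transcript a model must emit grows exponentially with the number of discs.

```python
# Illustrative only: rough estimate of how quickly a written-out
# Tower of Hanoi solution outgrows a fixed output-token budget.
TOKENS_PER_MOVE = 10        # assumed average tokens needed to state one move
OUTPUT_BUDGET = 64_000      # assumed model output-token limit

for discs in range(5, 21):
    moves = 2**discs - 1                   # minimum moves for n discs
    est_tokens = moves * TOKENS_PER_MOVE   # rough length of the full transcript
    verdict = "fits" if est_tokens <= OUTPUT_BUDGET else "exceeds budget"
    print(f"{discs:2d} discs: {moves:>9,} moves ~ {est_tokens:>10,} tokens ({verdict})")
```

On these assumptions, the written-out solution stops fitting somewhere around a dozen discs, well before the puzzle becomes conceptually harder, which is exactly the confound the critics raised.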

Others highlighted that several tasks, such as the river-crossing puzzle with more than five actors, may have been mathematically unsolvable under the parameters used.

Additionally, when models are asked to write an algorithm rather than spell out the full solution to a puzzle, their performance improves significantly. This suggests that context and framing have a major impact on results.
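For contrast, the snippet below shows the kind of compact, general answer critics had in mind: a standard recursive Tower of Hanoi solver that handles any disc count in a few lines, instead of enumerating every individual move. It is a textbook algorithm offered here for illustration, not code from the paper or its critiques.

```python
# Classic recursive Tower of Hanoi: builds the full move list for n discs.
def hanoi(n, source, target, spare, moves):
    """Append the moves that shift n discs from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # clear the way
    moves.append((source, target))               # move the largest disc
    hanoi(n - 1, spare, target, source, moves)   # re-stack the rest on top

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves), "moves:", moves)   # 7 moves for 3 discs
```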

In short, while Apple’s study reveals important limitations, it does not represent a universal verdict on AI reasoning. Rather, it reinforces the idea that reasoning performance depends heavily on how the problem is posed, the tools available, and the environment in which the model operates.

What Comes Next

Apple’s research raises important questions but also opens up exciting possibilities for the future of AI.

Rather than signalling a dead end, the study helps clarify where current models shine and where future improvements are needed. It also points to a growing consensus: reasoning in AI isn’t a single capability but an evolving skill set, shaped by how we design tasks, structure inputs, and combine different systems.

Future research will likely move in several promising directions:

  • Hybrid architectures that pair language models with traditional algorithms or symbolic planners.

  • Integrated tool use, where AI systems can dynamically call on calculators, code interpreters, or external databases to reason more effectively (see the sketch after this list).

  • More nuanced evaluation methods that go beyond right or wrong answers and examine how models think, adapt, and course-correct.
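As a rough illustration of the first two directions, here is a hypothetical Python sketch of a hybrid setup in which the language model only routes the request, while a deterministic solver produces the exact answer. The call_llm function, the routing label, and the hanoi_solver tool are invented for illustration and do not correspond to any real API.

```python
# Hypothetical hybrid setup: the model classifies the request and hands
# well-structured sub-problems to a deterministic tool instead of
# "reasoning" through them token by token.

def call_llm(prompt: str) -> str:
    # Placeholder for a real model call; here we fake a routing decision
    # so the sketch runs end to end.
    return "hanoi_solver" if "hanoi" in prompt.lower() else "freeform"

def hanoi_solver(n: int) -> int:
    # Deterministic tool: exact minimum move count, no step-by-step text needed.
    return 2**n - 1

def answer(question: str, n: int) -> str:
    route = call_llm(f"Which tool should handle this? {question}")
    if route == "hanoi_solver":
        return f"Minimum moves for {n} discs: {hanoi_solver(n)}"
    return call_llm(question)  # fall back to the model's own answer

print(answer("How many moves does Tower of Hanoi with 10 discs need?", 10))
```

The design choice this illustrates is simple: let the model do what it is good at (understanding the request and choosing a tool) and let exact, verifiable components handle the parts where pattern-matching breaks down.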

For businesses, this shift is not a setback but an opportunity. As models become more modular and collaborative, organisations can begin to build custom AI stacks tailored to their specific reasoning needs. This could mean integrating language models with industry-specific logic engines, legal databases, or financial forecasting tools.

To prepare, organisations should:

  • Invest in AI literacy across teams, especially around what models can and cannot do.

  • Run pilot tests on tasks involving planning, analysis, or decision making, then refine based on observed behaviour.

  • Adopt a modular mindset, combining AI models with structured tools and human oversight to build more reliable workflows.

In short, the illusion Apple describes isn’t a failure; it’s a signal. It reminds us that we’re still learning how to build AI that can reason with the same flexibility, precision, and depth as humans, and it’s a prompt to shape the next wave of innovation not by expecting perfection, but by designing better systems around what AI does best.
