‘Not how you build a digital mind’: How reasoning failures are preventing AI models from achieving human-level intelligence

Architectural constraints in today’s most popular artificial intelligence (AI) tools may limit their ability to become smarter, new research suggests.
A study published on February 5 on the preprint server arXiv argues that modern large language models (LLMs) are inherently prone to breakdowns in their problem-solving logic, known as “reasoning failures.”
Based on the performance of LLMs in assessments such as Humanity’s Last Exam, some scientists say the underlying neural network architecture could one day lead to a model capable of achieving human-level cognition. Although the transformer architecture makes LLMs extremely capable at tasks such as language generation, the researchers say it also inhibits the kind of reliable logical processes needed to achieve true human-level reasoning.
“LLMs demonstrated remarkable reasoning skills, achieving impressive results across a wide range of tasks,” researchers said in the study. “Despite this progress, significant reasoning failures persist, occurring even in seemingly simple scenarios… This failure is attributed to an inability for holistic planning and deep thinking.”
Limitations of LLMs
LLMs are trained on massive amounts of text data and generate responses to user prompts by predicting, word by word, a plausible answer. They do this by gathering units of text, called “tokens,” based on statistical models drawn from their training data.
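The token-by-token prediction described above can be illustrated with a deliberately tiny sketch. This is not a real LLM, just a bigram model over a toy corpus; the point is only that, like an LLM, it generates each next token by sampling from a probability distribution learned from its training text.

```python
import random
from collections import Counter, defaultdict

# Toy illustration (NOT a real LLM): count which word follows which
# in a tiny "training corpus", then generate text one token at a time
# by sampling from those learned frequencies.
corpus = "the cat sat on the mat the cat ate the fish".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token(prev):
    options = counts[prev]
    if not options:                  # dead end (e.g. the corpus's last word)
        return None
    tokens, weights = zip(*options.items())
    return random.choices(tokens, weights=weights)[0]

# Generate a short continuation, one token at a time
token, output = "the", ["the"]
for _ in range(5):
    token = next_token(token)
    if token is None:
        break
    output.append(token)
print(" ".join(output))
```

A real transformer conditions on the entire preceding context rather than one previous word, but the generation loop is the same shape: predict a distribution, pick a token, repeat.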
Transformers also use a mechanism called “self-attention” to track relationships between words and concepts across long strings of text. Self-attention, combined with massive training datasets, is what makes modern chatbots so effective at generating compelling responses to user prompts.
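A minimal single-head self-attention computation looks like the sketch below. The weights here are random and untrained, so it shows only the mechanism: every token computes a similarity score against every other token, and those scores decide how much of each other token's representation it absorbs.

```python
import numpy as np

# Minimal single-head self-attention sketch (illustrative, untrained weights).
np.random.seed(0)
seq_len, d = 4, 8                       # 4 tokens, 8-dim embeddings
x = np.random.randn(seq_len, d)         # token embeddings

Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv        # queries, keys, values

scores = Q @ K.T / np.sqrt(d)           # similarity of every token pair
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
output = weights @ V                    # each token: weighted mix of all tokens

print(weights.shape)                    # one row of attention scores per token
```

Each row of `weights` sums to 1: a token's updated representation is a weighted average over the whole sequence, which is how transformers "track relationships" across long text.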
However, LLMs do not “think” in the conventional sense. Instead, their answers are produced by an algorithm. For lengthy tasks, especially those that require genuine multi-step problem solving, transformers can lose track of key information and fall back on patterns learned from their training data. This leads to failures in reasoning.
This is not real reasoning in the human sense – it is simply a symbolic prediction disguised as a chain of thought.
Federico Nanni, senior data researcher at the Alan Turing Institute
“This fundamental weakness extends beyond basic tasks to the composition of mathematical problems, the verification of multi-fact statements and other inherently compositional tasks,” the researchers said in the study.
Reasoning failures are also why LLMs often circle back to the same answer to a user query even after being told it is incorrect, or produce a different answer to the same question when it is worded slightly differently, even when asked to explain their reasoning step by step.
Federico Nanni, senior data researcher at the UK’s Alan Turing Institute, says that what LLMs typically present as reasoning is mostly window dressing.
“People have figured out that if you tell an LLM, instead of answering directly, to ‘think step by step’ and write down a reasoning process first, they often get the right answer,” Nanni told Live Science. “But it’s a trick. It’s not real reasoning in the human sense, it’s just a symbolic prediction disguised as a chain of thought,” he said. “When we say ‘reason’ to these models, we really mean that they write down a reasoning process – something that looks like a plausible chain of reasoning.”
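The trick Nanni describes is purely a change of prompt, not of model. The sketch below makes that concrete: the "chain of thought" variant just appends an instruction. The `call_llm` function is a hypothetical stand-in for any chat-completion API, not a real library call.

```python
# Chain-of-thought prompting as described above: the only difference
# between the two prompts is an instruction to write out steps first.

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

direct_prompt = question
cot_prompt = question + (
    "\nThink step by step and write out your reasoning "
    "before giving the final answer."
)

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: wire up a real chat API here.
    raise NotImplementedError

# The study's point: the step-by-step text often improves the answer,
# but it is generated the same way as any other text, token by token,
# rather than reflecting an internal reasoning process.
```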
Gaps in existing AI benchmarks
Current LLM performance assessment methods fall short in three key areas, the researchers found. First, results may be affected by rephrasing a prompt. Second, benchmarks degrade and become contaminated the more they are used. And finally, they only evaluate the outcome, rather than the reasoning process a model uses to reach its conclusion.
This means that current benchmarks may significantly overestimate the capability of LLMs and underestimate how often they fail in real-world use.
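The first gap, sensitivity to rephrasing, suggests a simple consistency check that outcome-only benchmarks skip: ask the same question in several wordings and compare the answers. The sketch below assumes a hypothetical `ask_model` function standing in for a real model call.

```python
# Sketch of a paraphrase-sensitivity check. A model that genuinely
# reasons should give the same answer however the question is worded.

paraphrases = [
    "What is 17% of 300?",
    "Compute 17 percent of 300.",
    "Three hundred times 0.17 equals what?",
]

def ask_model(prompt: str) -> str:
    # Hypothetical placeholder for a real model call.
    raise NotImplementedError

def is_consistent(answers: list[str]) -> bool:
    # A robust model gives one answer across all phrasings.
    return len(set(answers)) == 1

# Usage (with a real model): is_consistent([ask_model(p) for p in paraphrases])
```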

“Our position is not that the benchmarks are flawed, but that they need to evolve,” study co-author Peiyang Song, a computer science and robotics student at Caltech, told Live Science via email. Benchmarks also tend to seep into LLM training data, Nanni said, meaning that later models effectively learn to game them.
“Also, now that the models are deployed to production, the usage itself becomes a kind of benchmark,” Nanni said. “You put the system in front of users and see what’s wrong – that’s the new test. So yes, we need better benchmarks and we need to rely less on AI to check AI. But it’s very difficult in practice, because these tools are now integrated into the way we work, and it’s extremely convenient to just use them.”
A new architecture for AGI?
Unlike some other recent research, the new study does not claim that neural network-based approaches to AI are a dead end in the quest for artificial general intelligence (AGI). Instead, the researchers compare the field to the early days of computer science, noting that understanding why LLMs fail is essential to improving them.
However, they argue that simply training models on more data, or scaling them up, is unlikely to solve the problem on its own. This means that AGI development may require a fundamentally different approach to how models are built.
“Neural networks, and LLMs in particular, are clearly part of the AGI landscape. Their progress has been extraordinary,” Song said. “However, our investigation suggests that scaling alone is unlikely to resolve all reasoning failures… [meaning] achieving human-level reasoning may require architectural innovations, stronger world models, improved robustness training, and deeper integration with structured reasoning and embodied interaction.”
Nanni agreed. “From a philosophical point of view, I would say that we have essentially found the limits of transformers. They are not how you build a digital mind,” he said. “They model text extremely well, to the point where it’s almost impossible to tell whether a passage was written by a human or a machine. But that’s what they are: language models… There’s only so far you can push this architecture.”
