Top AI coding assistants fail one in four structured output tasks, revealing a serious gap between hype and real-world reliability


- Report Finds AI Coding Assistants Regularly Fail One in Four Structured Output Tasks
- Even advanced proprietary models only achieve about 75% accuracy.
- Open source models fare worse, averaging closer to 65% reliability.
The promise of artificial intelligence as a tireless coding assistant has hit a significant hurdle, with new research finding that such tools fail a substantial share of the tasks they are given.
A recent study from the University of Waterloo found that AI struggles with software development, with even the most advanced models failing at one in four structured output tasks.
The research evaluated 11 large language models across 18 structured formats and 44 tasks to test how well the systems could follow predefined rules, and found a clear disparity between performance on text-based tasks and on tasks involving multimedia or complex structures.
Benchmarking reveals worrying reliability gap
While text-related tasks were generally handled with moderate success, tasks requiring the generation of images, videos, or websites proved much more problematic.
Accuracy in these areas dropped sharply, raising questions about how safely these AI tools can be integrated into professional workflows.
“With this type of study, we want to measure not only the syntax of the code, that is, whether it follows established rules, but also whether the results produced for various tasks were accurate,” said Dongfu Jiang, a doctoral student and co-first author of the study.
Structured output, designed to enforce format consistency via JSON, XML, or Markdown, was intended to make AI responses more reliable for developers.
AI companies including OpenAI, Google, and Anthropic have introduced structured results to force answers into predictable formats.
The Waterloo study suggests that this approach has not yet achieved the level of reliability required by developers.
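Checks of this kind are straightforward to automate. As a rough illustration of how a structured-output requirement might be verified (the schema, field names, and model replies below are invented for this sketch, not taken from the Waterloo benchmark), a validator could parse a model's reply and confirm it matches the expected JSON shape:

```python
import json

# Hypothetical expected schema: field names and types are illustrative only.
REQUIRED_FIELDS = {"name": str, "language": str, "stars": int}

def is_valid_structured_output(reply: str) -> bool:
    """Return True if the reply is valid JSON containing the expected fields and types."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False  # not even syntactically valid JSON
    if not isinstance(data, dict):
        return False
    return all(
        key in data and isinstance(data[key], expected)
        for key, expected in REQUIRED_FIELDS.items()
    )

# A well-formed reply passes; a reply wrapped in chatty prose fails outright.
good = '{"name": "repo", "language": "Python", "stars": 42}'
bad = 'Sure! Here is the JSON: {"name": "repo"}'
print(is_valid_structured_output(good))  # True
print(is_valid_structured_output(bad))   # False
```

A failure here mirrors the kind of error the study counts: the model either breaks the syntax rules of the format or omits required content, and the reply cannot be consumed downstream without human intervention.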
Waterloo’s benchmarking found that even the most advanced proprietary models only achieved around 75% accuracy, while open source alternatives performed closer to 65%.
These results suggest that, despite improvements, AI systems still make significant errors that cannot be ignored in professional development environments.
The report emphasizes the need for human oversight, noting: “Developers can have these agents work for them, but they still need significant human supervision.”
Although structured results are a step forward from free-form natural language responses, errors are still common.
The technology is not yet robust enough to operate independently in complex development scenarios.
One might reasonably question whether the industry’s enthusiasm for AI and vibe coding assistants has outpaced the actual capabilities of the underlying technology.
Even the most advanced models show a significant failure rate on structured tasks, revealing a clear gap between marketing claims and actual performance.
Therefore, for now, developers should view these tools as experimental aids rather than standalone colleagues.



