Ask AI Why It Sucks at Sudoku. You’ll Find Out Something Troubling About Chatbots

Chatbots are genuinely impressive when you watch them do things they're good at, like writing a basic email or creating weird, futuristic-looking images. But ask generative AI to solve one of those puzzles in the back of a newspaper, and things can quickly go off the rails.
That's what researchers at the University of Colorado Boulder found when they challenged large language models to solve Sudoku. And not even the standard 9×9 puzzles. An easier 6×6 puzzle was often beyond the capabilities of an LLM without outside help (in this case, specific puzzle-solving tools).
A more troubling finding came when the models were asked to show their work. For the most part, they couldn't. Sometimes they lied. Sometimes they explained things in ways that made no sense. Sometimes they hallucinated and started talking about the weather.
If gen AI tools can't explain their decisions accurately or transparently, that should give us pause as we hand these things more and more control over our lives and decisions, said Ashutosh Trivedi, a computer science professor at the University of Colorado Boulder and one of the authors of the paper published in July in the Findings of the Association for Computational Linguistics.
"We would really like those explanations to be transparent and reflect why the AI made that decision, and not the AI trying to manipulate the human by providing an explanation that a human might like," Trivedi said.
When you make a decision, you can at least try to justify it, or explain how you arrived at it. An AI model may not be able to accurately or transparently do the same. Would you trust it?
Why LLMs struggle with Sudoku
We've seen AI models fail at basic games and puzzles before. OpenAI's ChatGPT (among others) was thoroughly crushed at chess by the computer opponent in a 1979 Atari game. A recent research paper from Apple found that models can struggle with other puzzles, like the Tower of Hanoi.
It has to do with the way LLMs work and the way they fill gaps in information. These models try to fill those gaps based on what happened in similar cases in their training data or other things they've seen in the past. With a Sudoku, the question is one of logic. The AI might try to fill each gap in order, based on what seems like a reasonable answer, but to solve it properly, it instead has to look at the entire picture and find a logical order that changes from puzzle to puzzle.
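The paper doesn't spell out any particular solver here, but the contrast the researchers draw maps onto a textbook algorithm: a conventional program treats the grid as a single constraint problem and backtracks out of dead ends, rather than committing to plausible-looking cells one at a time. Here's a minimal sketch of that idea in Python, sized for the study's 6×6 puzzles (the 2×3 boxes are the standard layout, assumed here); it's illustrative, not the researchers' tooling:

```python
# Minimal sketch: classic backtracking solver for a 6x6 Sudoku with 2x3 boxes.
# Unlike a left-to-right greedy fill, it checks every row, column and box
# constraint and undoes choices that lead to dead ends.
from typing import List

Grid = List[List[int]]  # 0 marks an empty cell


def is_valid(grid: Grid, r: int, c: int, v: int) -> bool:
    """Check whether placing v at (r, c) violates any Sudoku constraint."""
    if v in grid[r]:                                # row constraint
        return False
    if any(grid[i][c] == v for i in range(6)):      # column constraint
        return False
    br, bc = (r // 2) * 2, (c // 3) * 3             # top-left of the 2x3 box
    return all(grid[br + i][bc + j] != v
               for i in range(2) for j in range(3))


def solve(grid: Grid) -> bool:
    """Fill the grid in place; return True if a solution exists."""
    for r in range(6):
        for c in range(6):
            if grid[r][c] == 0:
                for v in range(1, 7):
                    if is_valid(grid, r, c, v):
                        grid[r][c] = v
                        if solve(grid):
                            return True
                        grid[r][c] = 0              # backtrack: undo and retry
                return False                        # no value fits: dead end
    return True                                     # no empty cells left
```

The line that matters is the backtrack: the solver is willing to undo an earlier choice when the rest of the grid proves it wrong, which is exactly the kind of global, whole-picture reasoning described above.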
Chatbots are bad at chess for a similar reason. They find logical next moves, but don't necessarily think three, four or five moves ahead, which is the fundamental skill needed to play chess well. Chatbots also sometimes tend to move chess pieces in ways that don't actually follow the rules, or put pieces in meaningless jeopardy.
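To make that lookahead skill concrete, here's a minimal sketch of fixed-depth negamax search, the textbook way game programs think several moves ahead. The Game interface is hypothetical; a real chess engine would supply legal-move generation and a position evaluator:

```python
# Minimal sketch of "thinking several moves ahead": fixed-depth negamax.
# The Game protocol below is a hypothetical stand-in for a real engine's
# move generator and evaluation function.
from typing import Iterable, Protocol


class Game(Protocol):
    def moves(self) -> Iterable["Game"]: ...   # positions after each legal move
    def score(self) -> float: ...              # evaluation from the side to move
    def over(self) -> bool: ...                # checkmate, stalemate, etc.


def negamax(pos: Game, depth: int) -> float:
    """Best achievable score when looking `depth` plies ahead."""
    if depth == 0 or pos.over():
        return pos.score()
    # Each side picks the move that is worst for the opponent's best reply,
    # which is exactly what a one-move-at-a-time player never considers.
    return max(-negamax(nxt, depth - 1) for nxt in pos.moves())
```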
You might expect LLMs to be able to solve Sudoku because they're computers and the puzzle consists of numbers, but the puzzles themselves aren't really mathematical; they're symbolic. "Sudoku is famous for being a puzzle with numbers that could be done with anything that is not numbers," said Fabio Somenzi, a professor at CU and one of the paper's authors.
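Somenzi's point is easy to demonstrate: the Sudoku rules never do arithmetic on the digits; they only demand that symbols be distinct. A small self-contained check (my own sketch, not from the paper) makes that visible by validating a 6×6 grid filled with the letters A through F:

```python
# The same Sudoku constraints, with letters instead of digits: every row,
# column and 2x3 box must simply contain six distinct symbols.
# Nothing here is arithmetic.
SYMBOLS = set("ABCDEF")


def units(grid):
    """Yield every unit (row, column, 2x3 box) of a 6x6 grid."""
    for r in range(6):
        yield [grid[r][c] for c in range(6)]
    for c in range(6):
        yield [grid[r][c] for r in range(6)]
    for br in range(0, 6, 2):
        for bc in range(0, 6, 3):
            yield [grid[br + i][bc + j] for i in range(2) for j in range(3)]


def is_solved(grid) -> bool:
    """A grid is solved when every unit holds all six symbols exactly once."""
    return all(set(unit) == SYMBOLS for unit in units(grid))
```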
I took a sample prompt from the researchers' paper and gave it to ChatGPT. The tool showed its work and repeatedly told me it had the answer before displaying a puzzle that didn't work, then going back and correcting it. It was like the bot was turning in a presentation that kept getting last-second edits: This is the final answer. No, actually, never mind, this is the final answer. It eventually got the answer, through trial and error. But trial and error isn't a practical way for a person to solve a Sudoku in the newspaper. That's far too much erasing, and it ruins the fun.
AI and robots can be good at games if they're built to play them, but general-purpose tools like large language models can struggle with logic puzzles.
AI struggles to show its work
The Colorado researchers didn't just want to see whether the bots could solve puzzles. They asked for explanations of how the bots worked through them. Things did not go well.
Testing OpenAI's o1-preview reasoning model, the researchers saw that the explanations, even for correctly solved puzzles, didn't accurately explain or justify the models' moves and got basic terms wrong.
"One thing they're good at is providing explanations that seem reasonable," said Maria Pacheco, an assistant professor of computer science at CU. "They align to humans, so they learn to speak the way we like it, but whether they're faithful to the actual steps needed to solve the thing is where we're struggling a little bit."
Sometimes the explanations were completely incoherent. Since the paper's work was finished, the researchers have continued testing newly released models. Somenzi said that when he and Trivedi ran OpenAI's o4 reasoning model through the same tests, at one point, it seemed to give up entirely.
“The following question we asked, the answer was the weather forecast for Denver,” he said.
(Disclosure: Ziff Davis, CNET's parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)
Explaining yourself is an important skill
When you solve a puzzle, you're almost certainly able to walk someone else through your thinking. The fact that these LLMs failed so spectacularly at that basic task isn't a trivial problem. With AI companies constantly talking about "AI agents" that can take actions on your behalf, being able to explain yourself is essential.
Consider the types of jobs being given to AI now, or planned for the near future: driving, doing taxes, deciding business strategies and translating important documents. Imagine what would happen if you, a person, did one of those things and something went wrong.
"When humans have to put their face in front of their decisions, they had better be able to explain what led to that decision," Somenzi said.
It isn't just about getting a reasonable-sounding answer. It needs to be accurate. One day, an AI's explanation of itself might have to hold up in court, but how can its testimony be taken seriously if it's known to lie? You wouldn't trust a person who failed to explain themselves, and you also wouldn't trust someone who told you what you wanted to hear instead of the truth.
"Having an explanation is very close to manipulation if it is done for the wrong reason," Trivedi said. "We have to be very careful with respect to the transparency of these explanations."




