Tests that AIs Often Fail and Humans Ace Could Pave the Way for Artificial General Intelligence


There are many ways to test the intelligence of artificial intelligence: conversational fluency, reading comprehension or tricky physics problems. But some of the tests most likely to stump an AI are ones that humans find relatively easy, even entertaining. Although AIs increasingly excel at tasks that require high levels of human expertise, this does not mean they are close to reaching artificial general intelligence, or AGI. AGI requires that an AI can take a very small amount of information and use it to generalize and adapt to highly novel situations. This ability, which is the basis of human learning, remains difficult for AIs.

One test designed to assess an AI's ability to generalize is the Abstraction and Reasoning Corpus, or ARC: a collection of small colored-grid puzzles that ask a solver to deduce a hidden rule and then apply it to a new grid. Developed by AI researcher François Chollet in 2019, it became the basis of the ARC Prize Foundation, a nonprofit that administers the test, now an industry benchmark used by all major AI models. The organization also develops new tests and has routinely used two (ARC-AGI-1 and its harder successor, ARC-AGI-2). This week the foundation is launching ARC-AGI-3, which is specifically designed to test AI agents and is based on video games.
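For readers who want a concrete picture of the puzzle format described above, here is a minimal sketch in Python. The grids, the hidden rule and the `apply_rule` solver are all invented for illustration (real ARC-AGI tasks are far more varied); only the train/test structure mirrors the published benchmark format.

```python
# Toy illustration of an ARC-style task: a few input/output grid pairs
# demonstrate a hidden rule, and the solver must apply that rule to a
# new test grid. The grids and rule below are invented for this sketch.

task = {
    "train": [
        # Hidden rule in this toy task: every nonzero cell becomes 2.
        {"input": [[0, 1], [1, 0]], "output": [[0, 2], [2, 0]]},
        {"input": [[3, 0], [0, 3]], "output": [[2, 0], [0, 2]]},
    ],
    "test": {"input": [[5, 5], [0, 5]]},
}

def apply_rule(grid):
    """A solver's guess at the hidden rule, induced from the train pairs."""
    return [[2 if cell else 0 for cell in row] for row in grid]

# Verify the guessed rule reproduces every demonstration pair...
assert all(apply_rule(p["input"]) == p["output"] for p in task["train"])

# ...then apply it to the unseen test input.
print(apply_rule(task["test"]["input"]))  # [[2, 2], [0, 2]]
```

The point of the format is sample efficiency: the solver sees only two demonstrations, not thousands of training examples, and must generalize from them.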

Scientific American spoke with ARC Prize Foundation president, AI researcher and entrepreneur Greg Kamradt to understand how these tests evaluate AIs, what they tell us about the potential for AGI and why they are often difficult for deep-learning models even though many humans tend to find them relatively easy. Links to try the tests are at the end of the article.




[An edited transcript of the interview follows.]

What definition of intelligence does ARC-AGI-1 measure?

Our definition of intelligence is your ability to learn new things. We already know that AI can win at chess. We know they can beat Go. But these models cannot generalize to new domains; they can't go learn English. So what François Chollet did was create a benchmark called ARC-AGI: it teaches you a mini skill in the question, and then it asks you to demonstrate that mini skill. We're essentially teaching something and asking you to repeat the skill that you just learned. So the test measures a model's ability to learn within a narrow domain. But our claim is that it does not measure AGI because it's still in a scoped domain [in which learning applies to only a limited area]. It measures that an AI can generalize, but we do not claim this is AGI.

How do you define AGI here?

There are two ways to look at it. The first is more tech-forward: can an artificial system match the learning efficiency of a human? Now, what I mean by that is that after humans are born, they learn a lot outside their training data. In fact, they almost don't have any training data, other than a few evolutionary priors. So we learn to speak English, we learn to drive a car and we learn to ride a bike, all of these things outside our training data. That's called generalization. When you can do things outside of what you've been trained on, we define that as intelligence. Now, an alternative definition of AGI that we use is when we can no longer come up with problems that humans can do and AI cannot; that's when we have AGI. That's an observational definition. The flip side is also true, which is as long as the ARC Prize or humanity in general can still find problems that humans can do but AI cannot, then we do not have AGI. One of the key factors of François Chollet's benchmark … is that we test humans on them, and the average human can do these tasks and these problems, but AI still struggles with them. The reason that's so interesting is that some advanced AIs, such as Grok, can pass any graduate-level exam or do all these crazy things, but that is spiky intelligence. It still doesn't have the generalization power of a human. And that's what this benchmark shows.

How do your benchmarks differ from those used by other organizations?

One of the things that differentiates us is that we require our benchmark to be solvable by humans. That's in opposition to other benchmarks, where they're trying to make "Ph.D.-plus-plus" problems. I don't need to be told that AI is smarter than me; I already know that OpenAI's o3 can do a lot of things better than me, but it doesn't have a human's power to generalize. That's what we measure, so we need to test humans. We actually tested 400 people on ARC-AGI-2. We put them in a room, we gave them computers, we did demographic screening, and then we gave them the test. The average person scored 66 percent on ARC-AGI-2. Collectively, though, the aggregated responses of five to 10 people contain the correct answers to all the questions on ARC2.

What makes this test difficult for AI and relatively easy for humans?

There are two things. Humans are incredibly sample-efficient with their learning, meaning they can look at a problem with maybe one or two examples, they can pick up the mini skill or transformation, and then they can go and do it. The algorithm that's running in a human's head is orders of magnitude better and more efficient than what we're seeing with AI right now.

What is the difference between ARC-AGI-1 and ARC-AGI-2?

So ARC-AGI-1, François Chollet made that himself. It was about 1,000 tasks. That was in 2019. He basically made the minimum viable version in order to measure generalization, and it held for five years because deep learning couldn't touch it at all. It wasn't even getting close. Then reasoning models that came out in 2024, by OpenAI, started to make progress on it, which showed a step-level change in what AI could do. Then, when we went to ARC-AGI-2, we went a little further down the rabbit hole with regard to what humans can do and AI cannot. It requires a little more planning for each task. So, instead of getting solved in five seconds, humans may be able to do it in a minute or two. There are more complicated rules, and the grids are larger, so you have to be more precise with your answer, but it's the same concept, more or less … We are now launching a developer preview for ARC-AGI-3, and that departs from this format completely. The new format will actually be interactive. So think of it more as an agent benchmark.

How will ARC-AGI-3 test agents differently from previous tests?

If you think about everyday life, it's rare that we have a stateless decision. When I say stateless, I mean just a question and an answer. Right now all the benchmarks are more or less stateless. If you ask a language model a question, it gives you a single answer. There's a lot that you cannot test with a stateless benchmark. You cannot test planning. You cannot test exploration. You cannot test intuiting about your environment or the goals that come with that. So we're making 100 novel video games that we will use to test humans to make sure that humans can play them, because that's the basis of our benchmark. And then we're going to drop AIs into these video games and see if they can understand these environments that they've never seen beforehand. To date, with our internal testing, we haven't had a single AI be able to beat even one level of one of the games.

Can you describe the video games?

Each "environment," or video game, is a two-dimensional, pixel-based puzzle. These games are structured as distinct levels, each designed to teach a specific mini skill to the player (human or AI). To successfully complete a level, the player must demonstrate mastery of that skill by executing planned sequences of actions.
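To make the stateless-versus-stateful distinction concrete, here is a toy sketch of a level-based evaluation loop. The `GridWorld` class and its `step` interface are hypothetical, invented for this illustration rather than taken from the ARC-AGI-3 API; the point it shows is that success depends on a planned sequence of actions against persistent state, not on a single question-and-answer exchange.

```python
# Hypothetical sketch of a stateful, level-based evaluation loop.
# Not the ARC Prize Foundation's actual interface: it illustrates why
# an interactive benchmark can test planning while a stateless one
# (one prompt, one answer) cannot.

class GridWorld:
    """A tiny pixel-grid level: move the player to the goal cell."""

    def __init__(self):
        self.player, self.goal = (0, 0), (2, 2)

    def step(self, action):
        """Apply one action; return (observation, level_complete)."""
        dx, dy = {"up": (-1, 0), "down": (1, 0),
                  "left": (0, -1), "right": (0, 1)}[action]
        x, y = self.player
        # Clamp movement to the 3x3 grid so the player stays in bounds.
        self.player = (min(max(x + dx, 0), 2), min(max(y + dy, 0), 2))
        return self.player, self.player == self.goal

env = GridWorld()
done = False
for action in ["down", "down", "right", "right"]:  # a planned sequence
    obs, done = env.step(action)
print(done)  # True: the level is solved only after the whole plan runs
```

No single call to `step` solves the level; the agent has to explore or plan a multi-step trajectory, which is exactly what Kamradt says stateless benchmarks cannot measure.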

How is using video games to test AGI different from how video games have been used to test AI systems in the past?

Video games have long been used as benchmarks in AI research, with Atari games being a popular example. But traditional video game benchmarks face several limitations. Popular games have extensive training data publicly available, lack standardized performance evaluation metrics and permit brute-force methods involving billions of simulations. Additionally, the developers building AI agents typically have prior knowledge of these games; they intentionally incorporate their own insights into the solutions.

To try them yourself, see ARC-AGI-1, ARC-AGI-2 and ARC-AGI-3.
