Acing this new AI exam — which its creators say is the toughest in the world — might point to the first signs of AGI


Researchers from the Center for AI Safety and Scale AI have released “Humanity’s Last Exam,” a test designed to measure how close today’s most powerful artificial intelligence (AI) models are to matching or exceeding human-level knowledge across a range of fields.
The test was launched in January 2025, but scientists first described the framework and thinking behind its design in a new study published January 28 in the journal Nature. It contains a corpus of 2,500 questions covering more than 100 topics, with contributions from more than 1,000 subject matter experts at 500 institutions across 50 countries.
At the launch, researchers tested OpenAI’s GPT-4o and o1, Google’s Gemini 1.5 Pro, Anthropic’s Claude 3.5 Sonnet, and DeepSeek R1. OpenAI’s o1 system took first place with a score of just 8.3%.
Despite this poor performance, the researchers wrote at the time that “given the rapid pace of AI development, it is plausible that models could exceed 50% accuracy on HLE by the end of 2025.”
As of February 12, 2026, the highest score achieved so far is 48.4%, set by Google’s Gemini 3 Deep Think. Human experts, by comparison, score around 90% in their respective fields.
Testing the world’s smartest machines
Humanity’s Last Exam was intentionally designed to be extremely difficult for AI models. Early in development, researchers issued a global call for submissions from subject matter experts in many fields.
Researchers applied strict submission criteria requiring questions to be specific, unambiguous, solvable and non-searchable. They didn’t want models to cheat with a simple web search, or to face questions that already appeared online, which would increase the likelihood that a given model had the answer in its training dataset.
Each submitted question was then fed into the AI models. The team automatically rejected any questions the models could answer correctly.
More than 70,000 questions were submitted, of which around 13,000 baffled the LLMs. These were then reviewed by a team of subject matter experts, approved by the research team, and presented to the scientific community for open comment.
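The filtering step described above can be sketched in a few lines of Python. This is an illustrative toy, not the HLE codebase: the model names, the `ask_model` helper and its canned answers are all hypothetical stand-ins for real API calls.

```python
# Toy sketch of HLE-style adversarial filtering: a candidate question
# survives only if every frontier model fails to answer it correctly.
# `ask_model` is a hypothetical placeholder for a real model API call.

def ask_model(model: str, question: str) -> str:
    """Placeholder: return a canned answer, or a refusal if unknown."""
    canned = {
        "gpt-4o": {"What is the capital of France?": "Paris"},
        "o1": {"What is the capital of France?": "Paris"},
    }
    return canned.get(model, {}).get(question, "I don't know")

def passes_filter(question: str, answer: str, models: list[str]) -> bool:
    """Keep the question for expert review only if no model answers it."""
    return all(ask_model(m, question) != answer for m in models)

models = ["gpt-4o", "o1"]
print(passes_filter("What is the capital of France?", "Paris", models))  # False: rejected
print(passes_filter("Some obscure doctoral-level question?", "X", models))  # True: kept
```

Questions that pass this automated gate would then go on to the human review stages the article describes.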
Ultimately, the researchers narrowed the pool to 2,500 questions, most of which fall under doctoral-level testing.
An example of a trivia question on the exam is: “In Greek mythology, who was Jason’s maternal great-grandfather?”
Meanwhile, a sample physics question asks about the relationship between different forces during motion in a scenario in which a block is placed on a horizontal rail (along which it can slide without friction) while also attached to a rigid, massless rod of unknown length.
The breadth of questions and scope of topics covered in Humanity’s Last Exam set it apart from similar benchmarking tools, its creators say.
Common benchmarks, such as Massive Multitask Language Understanding (MMLU), created with the participation of Center for AI Safety founder Dan Hendrycks, test only a small subset of knowledge in expert-level domains, focusing primarily on coding and math.
Even cutting-edge benchmarks like François Chollet’s ARC-AGI series struggle to get past the memorization and search problems that the creators of Humanity’s Last Exam say their new test addresses. Gemini 3 Deep Think, for example, scored 84.6% on the ARC-AGI-2 benchmark just one week after failing to reach 50% on HLE.
The ultimate prize is general intelligence
Humanity’s Last Exam arguably represents the AI world’s best attempt yet to measure the broad-spectrum capabilities of modern AI models against human experts, but the study authors are clear that a high score on HLE is in no way an indication of the arrival of artificial general intelligence (AGI).
“High accuracy on HLE would demonstrate expert-level performance on closed, testable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or artificial general intelligence,” the scientists said in the study.
“Good success on HLE is a necessary, but not sufficient, criterion for saying that machines have achieved true intelligence,” Manuel Schottdorf, a neuroscientist in the Department of Psychological and Brain Sciences at the University of Delaware, said in a recent statement. Schottdorf is one of several experts whose question was accepted into the HLE corpus.
“They will have to be pretty good at solving these questions, but that alone cannot allow us to conclude that machines are truly intelligent.”



