AI Math Benchmarks: Measuring AI’s Growing Capabilities

Mathematics is often regarded as the ideal domain for measuring AI progress. Math’s step-by-step logic is easy to track, and its definitive, automatically verifiable answers remove subjective human judgment from grading. But AI systems are improving at such a pace that math benchmarks are struggling to keep up.

Way back in November 2024, the non-profit research organization Epoch AI quietly released FrontierMath, a rigorous, standardized benchmark designed to measure the mathematical reasoning capabilities of the latest AI tools.

“It’s a bunch of really hard math problems,” explains Greg Burnham, a senior researcher at Epoch AI. “Originally, it was 300 problems that we now call tiers 1–3, but having seen AI capabilities really speed up, there was a feeling that we had to run to stay ahead, so now there’s a special challenge set of extra carefully constructed problems that we call tier 4.”

To a rough approximation, tiers 1–4 run from advanced undergraduate to early-postdoc-level mathematics. When FrontierMath was introduced, state-of-the-art AI models could solve no more than 2% of its problems. Fast forward to today, and the best publicly available AI models, such as ChatGPT 5.2 Pro and Claude Opus 4.6, are solving over 40% of FrontierMath’s 300 tier 1–3 problems and over 30% of the 50 tier 4 problems.

AI takes on PhD-level mathematics

And this dizzying pace of advancement shows no signs of abating. Just recently, for example, Google DeepMind announced that Aletheia, an experimental AI system derived from Gemini Deep Think, achieved a publishable, PhD-level research result. Though the mathematics is obscure (calculating certain structure constants in arithmetic geometry called eigenweights), the result is significant for AI development.

“They’re claiming it was essentially autonomous, meaning a human wasn’t guiding the work, and it’s publishable,” Burnham says. “It’s definitely at the lower end of the spectrum of work that would get a mathematician excited, but it’s new—it’s something we truly haven’t really seen before.”

To place this achievement in context, every FrontierMath problem has a known answer that a human has derived. Though a human could probably have achieved Aletheia’s result “if they sat down and steeled themselves for a week,” says Burnham, no human had ever done so.

Aletheia’s result and other recent achievements by AI mathematicians point to a need for new, tougher benchmarks to understand AI capabilities, and fast, because existing ones will soon become irrelevant. “There are easier math benchmarks that are already obsolete, several generations of them,” says Burnham. “FrontierMath will probably saturate [meaning state-of-the-art AI models score 100%] within the next two years; could be faster.”

The First Proof challenge

To begin to address this problem, on February 6 a group of 11 distinguished mathematicians proposed the First Proof challenge: a set of 10 extremely difficult math questions that arose naturally in the authors’ own research, whose proofs run to roughly five pages or fewer, and which had not been shared with anyone. First Proof was a preliminary effort to assess whether AI systems can solve research-level math questions on their own.

The challenge generated serious buzz in the math community, and professional and amateur mathematicians, as well as teams including OpenAI, stepped up to it. But by the time the authors posted the proofs on February 14, no one had submitted correct solutions to all 10 problems.

In fact, far from it. The authors themselves solved only two of the 10 problems using Gemini 3.0 Deep Think and ChatGPT 5.2 Pro, and most outside submissions fared little better. The exception was OpenAI: with “limited human supervision,” its most advanced internal AI system solved five of the 10 problems, a result that drew reactions in the mathematics community ranging from awe to disappointment. The team behind First Proof plans an even tougher second round on March 14.

A new frontier for AI

“I think First Proof is terrific: it’s as close as you could realistically get to putting an AI system in the shoes of a mathematician,” says Burnham. Though he admires how First Proof tests AI’s mathematical utility for a wide range of mathematics and mathematicians, Epoch AI has its own new approach to testing, FrontierMath: Open Problems. Uniquely, the pilot benchmark consists of 14 open problems (with more to follow) from research mathematics that professional mathematicians have tried and failed to solve. Since Open Problems’ release on January 27, no AI has solved any of them.

“With Open Problems, we’ve tried to make it more challenging,” says Burnham. “The baseline on its own would be publishable, at least in a specialty journal.” What’s more, each question is designed so that it can be automatically graded. “This is a bit counterintuitive,” Burnham adds. “No one knows the answers, but we have a computer program that will be able to judge whether the answer is right or not.”
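
Burnham’s point about automatic grading is worth unpacking: an answer can be machine-checked even when nobody knows it in advance, so long as the question asks for an object with a verifiable property. The short Python sketch below illustrates the idea with an invented toy property; it is purely hypothetical and is not one of Epoch AI’s actual problems or graders.

```python
# Toy illustration of "grading without knowing the answer": the checker
# encodes a verifiable property rather than a stored solution. The
# property here is invented for this example and is NOT one of Epoch
# AI's Open Problems.
import math

def is_valid_answer(n: int) -> bool:
    """Return True if n is a perfect square whose digit sum is also
    a perfect square (the invented target property)."""
    if n < 0:
        return False
    r = math.isqrt(n)
    if r * r != n:
        return False
    digit_sum = sum(int(d) for d in str(n))
    s = math.isqrt(digit_sum)
    return s * s == digit_sum

# The grader accepts or rejects candidates purely by verification:
print(is_valid_answer(36))  # True: 36 = 6^2 and 3 + 6 = 9 = 3^2
print(is_valid_answer(35))  # False: 35 is not a perfect square
```

The design mirrors what Burnham describes: the grader stores no reference solution, only a test that any correct answer must pass.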

Burnham sees First Proof and Open Problems as being complementary. “I would say understanding AI capabilities is a more-the-merrier situation,” he adds. “AI has gotten to the point where it’s, in some ways, better than most PhD students, so we need to pose problems where the answer would be at least moderately interesting to some human mathematicians, not because AI was doing it, but because it’s mathematics that human mathematicians care about.”
