OpenAI Model Earns Gold-Medal Score at International Math Olympiad and Advances Path to Artificial General Intelligence

A few months before the 2025 International Mathematical Olympiad (IMO) in July, a team of three people at OpenAI made a long-shot bet: they would use the competition's brutally difficult problems to train an artificial intelligence model to reason on its own for hours so that it could write mathematical proofs. Their goal was not simply to create an AI that could do complex mathematics but one that could weigh ambiguity and nuance, skills it will need if it is one day to take on many difficult real-world tasks. These are, in fact, precisely the skills required for an artificial general intelligence, or AGI: understanding and reasoning at a human level.
The IMO, held this year on Australia's Sunshine Coast, is the world's premier mathematics competition for high school students, bringing together the top contenders from more than 100 countries. All of them receive the same six problems (three per day, each worth seven points) to solve over two days. But these problems are nothing like what you probably remember from school. Rather than asking for a short numerical answer, each demands sustained reasoning and creativity in the form of a written proof that runs for pages. These step-by-step logical arguments must range across many fields of mathematics, exactly the kind of problem at which, until this year, AI systems had failed spectacularly.
The OpenAI team of researchers and engineers (Alex Wei, Sheryl Hsu and Noam Brown) used a general-purpose reasoning model: an AI designed to "think" through difficult problems by breaking them into steps, checking its own work and adjusting its approach as it goes. Although AI systems cannot officially compete as participants, the notoriously difficult test served as a demonstration of what they can do, and the AI tackled this year's questions in the same test format and under the same constraints as the human participants. Upon receiving the questions, the team's experimental system worked through two 4.5-hour sessions (just like the student competitors), with no tools or internet access; there was no outside help from search engines or from software designed for mathematics. The proofs it produced were graded by three former IMO medalists and posted online. The AI correctly completed five of the six problems, earning 35 out of 42 points, the minimum required for an IMO gold medal. (A Google DeepMind AI system also achieved that score this year.) Of the 630 competitors, only 26 students, or about 4 percent, outscored the AI; five students earned a perfect 42. Given that just a year ago language-based AI systems such as OpenAI's struggled with elementary mathematics, the result was a dramatic leap in performance.
In the following conversation, Scientific American spoke with two members of the OpenAI team, Alex Wei and Sheryl Hsu, about how they carried out their work, why the model's refusal to answer the sixth question was actually a major step toward solving AI's "hallucination" problem and how developing a system that can write complex proofs could help lead to artificial general intelligence.
[An edited transcript of the interview follows.]
What led you to suddenly start preparing an AI model for the IMO just a few months before the competition? What was the spark?
Wei: I had been thinking about mathematical proofs for a long time. I'm on an OpenAI team called MathGen. We had just seen our results improve a lot. We felt we had a shot at a model that could do very well at the IMO, and we wanted to make a mad dash to get there.
Hsu: I used to do math competitions. [Wei] used to do math competitions too; he was much better than I was. The IMO is definitely well known in the [AI research] community, including among OpenAI researchers. So it was really inspiring to push specifically for that.
Can you talk about your decision to work with a general-purpose AI system rather than one specifically designed to solve mathematical problems?
Wei: The philosophy is that we want to build general AI and develop methods that don't work only for math. Math is a very good testing ground for AI because it is fairly objective: if you have a proof, it's easier to reach a consensus on whether it's correct. That's harder for, say, poetry, where you'll get more disagreement among readers. And IMO problems are very hard, so we wanted to tackle hard problems with general methods in the hope that they would also apply to areas beyond math.
Hsu: I would also say that OpenAI's goal is to build AGI; it's not necessarily to write papers or win competitions. It was important that everything we did for this project also be useful for the bigger goal of building AGI and better models that users can really use.
In what ways could a reasoning model winning a gold medal at the IMO help lead to AGI?
Wei: One way to look at it is in terms of how long tasks take. A year ago ChatGPT could do only basic math problems. Two years ago, and even a year and a half ago, we were often thinking about grade-school math problems of the kind you'd find on fifth-grade homework. For someone who is very good at math, those take a second or two to read and solve. Then we started evaluating with the AIME [the American Invitational Mathematics Examination, a 15-question high school math contest]. That's roughly 10 minutes per problem, with about three hours for 15 problems. The IMO is four and a half hours for just three problems, so that's 90 minutes per problem. ChatGPT started out being good at quick questions. Now it's better at longer tasks, such as "Can you edit this paragraph for me?" As AI improves, you can extend the time horizon of the tasks, and you can see that progression clearly in math.
Hsu: Another aspect is that reasoning models were previously very good at tasks that are easy to verify. If you solve a non-proof-based math problem, there is a single numerically correct answer. That's easy to check. But in the real world, and on the tasks people really want help with, it's more complex. There are nuances: maybe a result is mostly correct but has mistakes; maybe it's correct but could be written in a better style. Proof-based math isn't trivial to evaluate. If we think about AGI, those tasks won't be easy to judge as simply correct or not; they will be more loosely specified and harder overall.
What was the model training process?
Wei: In general, reinforcement learning trains a model by rewarding good behavior and penalizing bad behavior. If you repeatedly reinforce good behavior and discourage bad behavior, the model becomes more likely to exhibit the good behavior.
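To make that idea concrete, here is a minimal toy sketch of the reward-and-penalize loop Wei describes, not OpenAI's actual training setup. The two candidate "answers," the ±1 reward values and the learning rate are illustrative assumptions; the point is only that repeatedly rewarding the correct behavior shifts the model's probabilities toward it.

```python
import numpy as np

# Toy "policy": a softmax preference over two candidate answers.
# Index 1 is the behavior we want to reinforce (the correct answer).
rng = np.random.default_rng(0)
logits = np.zeros(2)
learning_rate = 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(300):
    probs = softmax(logits)
    action = rng.choice(2, p=probs)        # the model attempts an answer
    reward = 1.0 if action == 1 else -1.0  # grader rewards good behavior, penalizes bad

    # REINFORCE-style update: nudge the log-probability of the chosen action
    # up when it was rewarded and down when it was penalized.
    grad = -probs
    grad[action] += 1.0
    logits += learning_rate * reward * grad

print("Probability of the rewarded answer:", round(softmax(logits)[1], 3))
```

After a few hundred updates the probability of the rewarded answer approaches 1, which is the "more likely to exhibit the good behavior" effect in miniature.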
Hsu: Toward the end we also scaled up test-time compute [how long the AI model was able to "think" before answering]. Previously the model would think for a few minutes on a problem like this; now we were letting it think for hours. That extra thinking time produced surprising gains. There was a point when we ran evaluations on our internal test set that took a very long time because of the increased test-time compute. When we finally looked at the results, and Alex graded them, seeing that progress made me think gold might be within reach. That was pretty exciting.
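As a rough illustration of why more test-time compute can help, here is a toy simulation, not the team's pipeline. It assumes each independent reasoning attempt solves a hard problem with a small fixed probability (5 percent here, an arbitrary choice) and shows how the chance of producing at least one correct solution grows with the compute budget.

```python
import random

random.seed(0)
P_SUCCESS_PER_ATTEMPT = 0.05  # assumed chance a single reasoning pass solves the problem

def solve_rate(num_attempts: int, trials: int = 10_000) -> float:
    """Estimate the probability of solving a problem given a budget of attempts."""
    solved = 0
    for _ in range(trials):
        if any(random.random() < P_SUCCESS_PER_ATTEMPT for _ in range(num_attempts)):
            solved += 1
    return solved / trials

for budget in (1, 10, 100):  # more test-time compute = more attempts per problem
    print(f"{budget:>3} attempts -> estimated solve rate {solve_rate(budget):.2f}")
```

The simulated solve rate climbs sharply as the budget grows, a crude stand-in for the gains Hsu describes from letting the model think much longer.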
On the IMO test, the model you developed got five correct answers. But on the sixth question, it did not attempt to provide an answer. Can you tell me more about the significance of that response?
Wei: The model knowing what it doesn't know was one of the first signs of [progress] we saw. Today, if you use ChatGPT, you will sometimes see "hallucinations"; the models don't reliably know when they don't know something. That capability isn't specific to math. For everyday questions, I would love for the model to be able to honestly say when it doesn't know instead of giving an answer I have to verify independently.
What type of impact could your work on this model have on future models?
Hsu: Everything we did for this project is quite general: being able to grade outputs that aren't single answers and to work on hard problems for a long time while making steady progress. Those things contributed a lot to the success here, and now we and others at OpenAI are applying them beyond math. It's not in GPT-5, but we're excited to build these capabilities into future models.
Wei: If you look at the solutions we published publicly for the IMO problems, some are very long, five to 10 pages. This model can generate long outputs that are coherent and consistent, without errors. Many current state-of-the-art models can't produce a fully coherent five-page report. I'm excited for that care and precision to help in many other areas.




