Google’s Gemini 3 model keeps the AI hype train going – for now


Gemini 3 is Google’s latest AI model
VCG via Getty Images
Google’s latest chatbot, Gemini 3, posted significant gains on a series of benchmarks designed to measure AI progress, according to the company. These results may be enough, for now, to assuage fears of an AI bubble bursting, but it’s unclear how well the scores translate into real-world capabilities.
Additionally, the persistent factual inaccuracies and hallucinations that have become the hallmark of all major language models show no signs of being eliminated, which could prove problematic for any use where reliability is vital.
In a blog post announcing the new model, Google bosses Sundar Pichai, Demis Hassabis and Koray Kavukcuoglu write that Gemini 3 has “PhD-level reasoning”, a phrase that competitor OpenAI also used when announcing its GPT-5 model. As evidence, they list its results on several benchmarks of advanced knowledge, such as Humanity’s Last Exam, a set of 2500 research-level questions in maths, science and the humanities. Gemini 3 scored 37.5 per cent on this test, outperforming the previous record holder, a version of OpenAI’s GPT-5, which scored 26.5 per cent.
Such jumps can indicate that a model has become better in some respect, says Luc Rocher at the University of Oxford, but we need to be careful in how we interpret these results. “If a model goes from 80 to 90 per cent on a benchmark, what does that mean? Does that mean a model was 80 per cent PhD level and is now 90 per cent PhD level? I think that’s quite difficult to understand,” they say. “There is no number that can tell us whether an AI model is reasoning, as it is a very subjective notion.”
Benchmark tests have many limitations, such as requiring a single answer or multiple-choice responses, for which models don’t need to show their working. “It is very simple to use multiple-choice questions to score [the models],” says Rocher, “but if you go to the doctor, they will not evaluate you with multiple choice. If you ask a lawyer, they will not give you legal advice with multiple-choice answers.” There is also a risk that the answers to these tests leak into the training data of the AI models being tested, effectively allowing them to cheat.
The real test for Gemini 3 and the most advanced AI models — and whether their performance will be enough to justify the billions of dollars that companies like Google and OpenAI are spending on AI data centers — will be how people use the model and how reliable they find it, Rocher says.
Google says the model’s improved capabilities will allow it to better produce software, organize email and analyze documents. The company also says it will improve Google Search by supplementing AI-generated results with graphics and simulations.
The real improvements will probably come when people use AI tools to write code autonomously, a process called agentic coding, says Adam Mahdi at the University of Oxford. “I think we’re reaching the upper limit of what a typical chatbot can do, and the real benefits of Gemini 3 Pro [the standard version of Gemini 3] will likely lie in more complex, potentially agentic workflows, rather than day-to-day conversations,” he says.
Early reactions online included people praising Gemini 3’s coding and reasoning abilities, but, as with all new model releases, posts also highlighted failures on seemingly simple tasks, such as tracing hand-drawn arrows pointing at different people or basic visual reasoning tests.
Google admits in Gemini 3’s technical documentation that the model will continue to hallucinate and produce factual inaccuracies from time to time, at a rate roughly comparable to other leading AI models. The lack of improvement in this area is a big concern, says Artur d’Avila Garcez at City St George’s, University of London. “The problem is that all AI companies have been trying to reduce hallucinations for over two years, but it only takes one really bad hallucination to permanently destroy trust in the system,” he says.