Google’s Gemini 3 model keeps the AI hype train going – for now

Gemini 3 is Google's latest artificial intelligence model

VCG via Getty Images

Google's latest chatbot, Gemini 3, has made significant progress on a series of tests designed to measure advances in artificial intelligence, according to the company. These gains may be enough to allay concerns that the artificial intelligence bubble is about to burst, but it is unclear how well the benchmark scores correspond to real-world capabilities.

Moreover, the persistent factual inaccuracies and hallucinations that have become a hallmark of all large language models show no sign of abating, which could be a problem for any application where reliability is vital.

In a blog post announcing the new model, Google executives Sundar Pichai, Demis Hassabis and Koray Kavukcuoglu write that Gemini 3 has “PhD-level reasoning” – a phrase that rival OpenAI also used when announcing its GPT-5 model. As evidence, they cite results on several tests designed to probe “graduate-level” knowledge, such as Humanity’s Last Exam, a set of 2,500 research-level questions in math, science and humanities. Gemini 3 scored 37.5% on this test, beating the previous record holder, a version of OpenAI's GPT-5, which scored 26.5%.

Such jumps may indicate that the model has become more capable in certain respects, says Luke Roche at Oxford University, but we need to be careful in interpreting these results. “If a model goes from 80 percent to 90 percent on a benchmark, what does that mean? Does that mean the model was 80 percent at the PhD level and is now 90 percent at the PhD level? I think that's pretty hard to figure out,” they say. “We cannot accurately determine whether an AI model is intelligent because it is a very subjective concept.”

Benchmarking tests have many limitations, such as requiring a single answer or a multiple-choice response, formats that don't require models to show how they reached an answer. “It's very easy to use multiple-choice questions to assess [the models],” says Roche, “but if you go to the doctor, the doctor won't evaluate you with multiple choice. If you ask a lawyer, the lawyer will not give you multiple-choice legal advice.” There is also a risk that the answers to such tests are hidden in the training data of the AI models being tested, effectively allowing them to cheat.

The real test of Gemini 3 and the most advanced AI models (and whether their performance will be sufficient to justify the trillions of dollars that companies like Google and OpenAI are spending on AI data centers) will be how people use the model and how reliable it is, Roche says.

Google says the model's improved capabilities will allow it to write software, organize email and analyze documents more effectively. The firm also says it will improve Google Search by augmenting AI-generated results with graphics and simulations.

It's likely that the real improvements will come for people who use artificial intelligence tools to write code autonomously, a process known as agentic coding, says Adam Mahdi at Oxford University. “I think we have reached the upper limit of what a typical chatbot can do, and the real benefits of Gemini 3 Pro [the standard version of Gemini 3] will likely come in more complex, potentially agentic workflows rather than day-to-day communication,” he says.

Initial reactions online included people praising Gemini's coding capabilities and reasoning ability, but, as with all new model releases, there were also posts highlighting its inability to perform seemingly simple tasks, such as following hand-drawn arrows pointing to different people, or simple visual reasoning tests.

In Gemini 3's technical specifications, Google admits that the model will continue to hallucinate and occasionally produce factual inaccuracies at a rate roughly comparable to other leading artificial intelligence models. The lack of improvement in this area is a serious concern, says Arthur Avila Garcez at St George's, University of London. “The problem is that all the AI companies have been trying to reduce the number of hallucinations for over two years, but it only takes one very bad hallucination to destroy trust in the system forever,” he says.
