OpenAI Is Testing AI’s Scientific Ambitions

Demis Hassabis founded DeepMind to “solve intelligence” and then use it to “solve everything else.” Sam Altman promised that “the benefits to quality of life from AI accelerating scientific progress… will be enormous.” Anthropic’s Dario Amodei predicted that AI progress could create “a country of geniuses in a datacenter” by 2026. Of all the founding myths behind the AI boom, the hope that AI can help humanity understand the universe is one of the most enduring.

FrontierScience, a new test released Tuesday by OpenAI, suggests that AI models are making progress toward this goal and highlights the difficulty of testing the models' capabilities as they become increasingly competitive with human scientists. “We want to rigorously measure how models can improve scientific capabilities and perhaps even accelerate scientific discovery,” says Miles Wang, a researcher on the OpenAI evaluation team who led the work.

The test contains physics, chemistry, and biology questions at two levels of difficulty. Olympiad-level questions test “the limits of what many brilliant young minds are capable of,” Wang says. A more advanced research level, with questions written by Ph.D. scientists, tests “open-ended reasoning, judgment, and the ability to support real-world research.”

One typical research-level question ran two paragraphs and asked about the “meso nitrogen atoms in nickel(II) phthalocyanine.” Solving it with computational modeling “could take several days,” says Francisco Martin-Martinez, a senior lecturer in chemistry at King's College London.

Another asked the model to derive the “electrostatic wave modes” in a plasma. “Earlier this year I did a similar analysis for a different type of wave… I think it took me about 3 weeks to get the calculation right,” Tom Ashton-Key, a PhD candidate in plasma physics at Imperial College London, told TIME. “5-10% of my time is spent answering questions like this.”

The test results show the same trend driving the artificial intelligence boom: the line goes up and to the right. “When we started creating this test a few months ago, models weren't making that much progress on it,” Wang says. But by the time the results were published, that had changed. “Progress has been very rapid over the past year, [with] reinforcement learning and reasoning paradigms.”

OpenAI's recently released GPT-5.2 tops the test, scoring 77.1% at the Olympiad level and 25.3% at the research level, although its improvement over its predecessor, GPT-5, is marginal in the latter category. If and when models reach 100% at the research level, they will be “a very good assistant and will multiply the progress that graduate students or scientists can make,” Wang said.

However, FrontierScience “does not measure all important capabilities in science,” Wang says. Because the questions are text-only, models are not tested on their ability to run experiments or analyze images and video. The small question sets – 100 questions at the Olympiad level, 60 at the research level – make it difficult to draw reliable comparisons between closely performing models, and the paper lacks a human baseline showing how people would perform on the same questions.

“I would expect the benchmark to be highly correlated with existing work… and not be as informative about when models will actually be useful to aid research, but it is very difficult to do otherwise with a benchmark,” Jaime Sevilla, director of the Epoch AI Research Institute, told TIME in an email. “Overall, this looks like a good addition to the benchmarking ecosystem.”

These challenges extend beyond this one benchmark. “We're approaching the edge of what we can reliably assess as laypeople,” Wang says. “It becomes very expensive to find very specialized subject matter experts, both in terms of time and cost.” When the person writing a question is one of the world's few experts on the topic, it is hard to find a third party who can tell you how hard the problem really is.

Outside OpenAI, the task of finding experts to write such tests falls to specialized data-annotation companies like Mercor and Surge AI, both of which are valued at more than $10 billion. They recruit experts at academic institutions to develop questions and rubrics for grading the models' responses. “If you want the Riemann Hypothesis to be proven in your lifetime, what do you want to do? Are you going to help train an AI to solve it, or collaborate with an AI to solve it?” says Edwin Chen, founder and CEO of Surge AI.

AI has already had a significant impact on scientific work. Google DeepMind's AlphaFold has predicted over 200 million protein structures which, according to the company, would have taken hundreds of millions of years to determine experimentally. Another project aims to simulate and control the plasma inside a fusion reactor. A third builds AI systems that produce detailed weather forecasts.

However, for the most part these are narrow applications of AI, targeting a small part of one area. “AlphaFold gives you the structure of the protein and how it folds, but it doesn’t tell you anything about its electronic properties or where the electrons are,” Martin-Martinez says.
For many AI companies and startups, the grand prize is AI that can help with the entire scientific process, from designing experiments to analyzing data, across a wide range of fields.

Large language models (LLMs) promise just such generality. In mathematics and programming, they are starting to produce results. Sébastien Bubeck, a mathematician now working at OpenAI, gave GPT-5 a problem that he and his graduate students had been unable to solve for years. “We gave it two days to think,” says Bubeck. “There was a wonderful identity that the model found, and that actually solved the problem.”

Coding tasks that used to take Keith Butler, an associate professor of chemistry at University College London, four hours now take him thirty minutes. “I can actually code again,” he says. But when it comes to actually making discoveries or proposing new hypotheses in his field, he is “a little more skeptical.”

Others are even more skeptical. “The amount of nonsense that comes out of any LLM is so colossal that it is completely unreliable,” says Carlo Rovelli, a theoretical physicist at Aix-Marseille University.

“They are a huge burden at the moment, because the journals are being flooded with submissions,” says Rovelli, adding that submissions to Foundations of Physics, the journal where he is editor-in-chief, have more than doubled in the past year. “Most people think they're doing great science by talking to LLMs, and that's terrible.”

If the trend FrontierScience points to continues, LLMs may soon become more reliable research collaborators. That leaves Martin-Martinez excited but “lost” at the pace of progress. “Too many feelings. I need an LLM to summarize them,” he says.
