Alan Turing statue in Bletchley Park, UK. Photo: Steve Meddle/Shutterstock
Today's best artificial intelligence (AI) models pass the Turing test, the famous thought experiment that asks whether a computer can impersonate a human when interacting through text.
Some see the updated test as a necessary benchmark for progress towards artificial general intelligence (AGI), a controversial term used by many technology firms to refer to an AI system that can match any human cognitive ability. But at an event at the Royal Society in London on October 2, several researchers said that the Turing test should be scrapped altogether and that developers should instead focus on assessing the safety of AI and on creating specific capabilities that can benefit society.
“Let's figure out what kind of AI we need and test it instead,” said Anil Seth, a neuroscientist at the University of Sussex in Brighton, UK. By focusing on “moving toward AGI, I think we're really limiting our imagination about what kind of systems we might want—and, more importantly, what kind of systems we don't really want—in our society.”
The event was organized to mark the 75th anniversary of the publication of British mathematician Alan Turing's seminal paper describing a test he called the Imitation Game. Devised to sidestep the philosophically difficult question of whether machines can think¹, the test involves a series of short text conversations between a judge and either a person or a computer. To win, the machine must convince the judge that it is human.
The low-key approach to machine intelligence proposed at the meeting proved popular. At the oversubscribed event, Gary Marcus was introduced by Peter Gabriel, former frontman of the rock band Genesis, and Marcus's personal friend, "The Matrix" star Laurence Fishburne, sat in the audience. More than 1,000 people also watched the event online.
“The idea of AGI may not even be the right goal, at least for now,” said Marcus, a neuroscientist at New York University in New York City, during one of the keynote addresses. Some of the best AI models are highly specialized, he said, such as AlphaFold, Google DeepMind's protein-structure predictor. “It does just one thing. It isn't trying to write sonnets.”
Beyond Turing
Turing's playful thought experiment has often been used as a measure of machine intelligence, but it was never intended to be a serious or practical test, says Sarah Dillon, a literary scholar at the University of Cambridge in Britain who studies the mathematician's work.
Some of today's most powerful artificial intelligence systems are advanced versions of large language models (LLMs), which predict text on the basis of associations learned from training on Internet data. In March, researchers tested four chatbots in a version of the Turing test and found that the best models passed.
However, just because chatbots can reliably imitate speech does not mean they can understand it, several researchers said at the event. Although LLM responses can be uncannily human, “when you go beyond what you typically ask of these systems, they have a lot of problems,” Marcus said. As examples, he cited the inability of some models to correctly label the parts of an elephant or to draw the hands of a clock at any position other than ten and two. For this reason, models may still fail the Turing test if challenged by a scientist who knows their weaknesses.

Artificial intelligence researcher Gary Marcus (left) with actor Laurence Fishburne at a Turing event. Photo: Courtesy of the Institute of Web Science, University of Southampton.
However, the rapid improvement of LLM-based systems across a wide range of areas, especially reasoning tasks, has prompted speculation about whether machines will soon reach human levels on cognitive tests. To measure the growing capabilities of AI and to identify non-linguistic skills, researchers have tried to develop more complex tests. One of the latest is the second version of the puzzle-based Abstraction and Reasoning Corpus for AGI (ARC-AGI-2), which is designed to evaluate how efficiently an AI system adapts to new problems. Such tests are often described as milestones on the path to general intelligence, but researchers have not reached a consensus on any criterion for achieving AGI.