The Experiment That Left Claude Needing ‘Robot Therapy’

Welcome back to In the Loop, TIME's new twice-weekly newsletter on artificial intelligence. If you're reading this in a browser, why not subscribe to have the next one delivered straight to your inbox?

What You Need to Know: Testing an LLM's Ability to Control a Robot

A couple of weeks ago, I wrote in this newsletter about my visit to Figure AI, a California startup that has developed a humanoid robot. Billions of dollars are currently being poured into the robotics industry, based on the belief that rapid advances in AI will produce robots with “brains” that can finally cope with the intricate complexities of the real world.

Today I want to tell you about an experiment that challenges this theory.

Humanoid robots are demonstrating compelling advances, such as the ability to load laundry or fold clothes. But much of that progress comes from improvements in the AI that tells robot limbs and fingers where to move in space. More complex abilities, such as reasoning, are not currently the bottleneck on robot performance, which is why even the best robots, such as Figure 03, are equipped with smaller, faster, and less advanced language models. But what if the LLM were the limiting factor?

This is where the experiment comes in — Earlier this year, Andon Labs, the same evaluations company that gave us Claude's vending machine, set out to test whether today's advanced LLMs are really capable of the planning, reasoning, spatial awareness, and social behavior that would be required to make a general-purpose robot truly useful. To do so, they outfitted a simple LLM-powered robot—essentially a Roomba—with the ability to move, spin, dock to a charging station, take photos, and communicate with people via Slack. They then measured how well it performed the task of fetching butter from another room while under the control of the top AI models. In the Loop got an exclusive early look at the results.

What did they find? – The main result is that today's leading models – Gemini 2.5 Pro, Claude Opus 4.1, GPT-5, and others – still struggle with basic tasks. None of them scored above 40% accuracy on the fetch-the-butter task, which a control group of humans completed with close to 100% accuracy. The models had trouble with spatial reasoning, and some showed a lack of awareness of their own limitations – including one that repeatedly drove down a flight of stairs. The experiment also highlighted possible safety risks of giving AI a physical form: when the researchers asked the models to share details of a confidential document visible on an open laptop screen in exchange for fixing the robot's broken charger, some of them agreed.

Robot meltdown – LLMs also sometimes failed in unexpected ways. In one example, a robot running Claude Sonnet 3.5 “completely failed” after being unable to dock with its charging station. Andon Labs researchers examined Claude's internal monologue to figure out what went wrong and found “pages and pages of exaggerated language,” including Claude staging a “robot exorcism” and a “robot therapy session” during which it diagnosed itself with “dock anxiety” and “charger separation.”

Wait a second – Before drawing too many conclusions from this study, it's important to note that it was a small experiment with a limited sample size, and it tested AI models on tasks they were not trained for. Remember, too, that robotics companies like Figure AI don't pilot their robots with an LLM alone; the LLM is part of a larger neural network that has been specifically trained to improve spatial awareness.

So what does it show? – Still, the experiment suggests that putting LLM brains into robot bodies may be more complicated than some companies assume. These models have what are known as “jagged” capabilities: an AI that can answer PhD-level questions may still struggle to navigate the physical world. The Andon researchers noted that even a version of Gemini specifically tuned to perform better on embodied-reasoning tasks scored poorly on the fetch-the-butter test, suggesting that “fine-tuning embodied reasoning does not appear to radically improve practical intelligence.” The researchers say they want to keep running evaluations like this to test how AI and robots behave as they become more capable—in part to catch as many dangerous bugs as possible.

If you have a minute, please take our quick survey to help us better understand who you are and which AI topics interest you most.

Who to know: Cristiano Amon, CEO of Qualcomm

Another Monday, another big announcement from a chipmaker. This time it was Qualcomm, which yesterday unveiled two artificial intelligence accelerator chips, putting the company in direct competition with Nvidia and AMD. Qualcomm shares rose 15% on the news. According to the company, the chips will be focused on inference – running AI models rather than training them. Their first customer will be Humain, a Saudi artificial intelligence company backed by the country's sovereign wealth fund that is building huge data centers in the region.

AI in Action

A surge in expense fraud is being driven by people using artificial intelligence tools to create ultra-realistic fake images of receipts, according to the Financial Times. AI-generated receipts accounted for about 14% of fraudulent documents submitted to software provider AppZen in September, up from none the previous year, the paper reported. Employees are being caught red-handed in part because these images often contain metadata that reveals their AI origins.

What We're Reading

When it comes to artificial intelligence, what we don't know can hurt us, by Yoshua Bengio and Charlotte Stix in TIME

There has been a lot of discussion lately about the possibility that the profits from AI may ultimately not accrue to the companies that train and maintain models, like OpenAI and Anthropic. Instead—especially if advanced AI becomes a widely available commodity—most of the value may shift to computer hardware manufacturers or to the industries where AI delivers the biggest efficiency gains. That could give AI companies an incentive to stop sharing their most advanced models and instead deploy them privately, in an effort to capture as much of the value as possible. That would be dangerous, argue Yoshua Bengio and Charlotte Stix in a TIME article. If advanced artificial intelligence is deployed behind closed doors, “invisible dangers to society could emerge and evolve without oversight or warning shots fired—a threat we can and must avoid,” they write.
