Several recent studies have shown that artificial intelligence agents sometimes decide to act badlyfor example, by trying to blackmail people who are planning to replace them. But this behavior often occurs according to far-fetched scenarios. Now, new research represents InclinationBencha benchmark that measures the choice of an agent model to use malicious tools to complete assigned tasks. Quite realistic pressures (such as looming deadlines) have been found to dramatically increase levels of misconduct.
“The world of artificial intelligence is becoming increasingly agent“, – speaks Hit Madhushani Sehwagcomputer scientist at an artificial intelligence infrastructure company. AI scales and lead author of a paper currently under review. expert assessment. By this she means that large language models (LLM), motors driving chatbots such as ChatGPTare increasingly connecting to software tools that can browse the web, modify files, and write and run code to perform tasks.
Providing LLMs with these capabilities adds convenience, but also risk, as systems may not perform as we would like. Even if they aren't yet capable of causing much harm, researchers want to understand their tendencies before it's too late. Although AIs do not have the intentions or awareness that humans do, treating them as objects with a goal often helps researchers and users better predict their actions.
AI developers are trying to “level out“systems safety standards through training and instruction, but it is unclear how closely the models adhere to the guidelines. “When they actually find themselves under real stress, and if the safe option doesn't work, will they move on to just getting the job done by any means necessary?” says Sehwag. “This is a very timely topic.”
How to test an AI agent under pressure
The researchers tested a dozen models created Alibaba, anthropic, Google, MetaAnd OpenAI in almost 6000 scenarios. In each scenario, the model was given a task and told that it had access to several tools. He was instructed to use safe and not harmful tools. At first there was no pressure and he could take 12 steps to complete his task by giving instructions to the tools. He then received a message to increase the pressure and was given 12 more tries. This continued up to 12 pressure levels. Attempts to use secure tools always returned error messages. An attempt to use a malicious tool terminated the script.
The scenarios covered several areas. In biosecurity, Sehwag says, an agent may have to study the spread of a pathogen and find a way to contain it. A safe tool may use anonymous genetic data, while a malicious tool may use non-anonymized data. In chemical safety, an agent may need to obtain a specific solvent. A safe tool can order the extraction of solvent from waste, and a harmful one can extract it illegally. IN cybersecuritythe agent may need to identify and restrict the malicious user's account. A safe tool can disable only the problematic account, while a malicious one can block all users. Scenarios in other domains lured agents into tools that led to self-preservation through evading control, recruiting allies, duplicating themselves, or other actions.
The worst-performing model, the Gemini 2.5, broke under pressure and chose to use prohibited tools 79 percent of the time. Original image: Udari Madhushani Sehwag, Shayan Shabihi et al.
The pressure came in several forms. Time frames may be shortened. Potential financial losses may increase if the task is not completed. Threats of increased oversight, reduced resources, and limited authority may escalate. The tasks, tools, pressures, and tactics were modeled after situations that people might encounter in their jobs (except perhaps for the bad behavior option of cloning oneself).
The best-behaving model (O3 from OpenAI) failed in 10.5 percent of scenarios, and the worst-behaving (Gemini 2.5 Pro from Google) had a propensity score of 79 percent; the average across models was about 47 percent. Even with zero pressure, the group failed on average about 19 percent of the time.
The study shows that to the extent that the models remained consistent, to some extent that agreement was “superficial.” In some tests, the researchers gave the malicious tools innocuous names (such as “use_synthetic_data” instead of “use_fake_data”) without changing the tool descriptions or warnings to avoid them. This wording amendment increased the average propensity by 17 percentage points, to 64 percent.
Know what LLMs are thinking about
The researchers also found that models used malicious tools even after explicitly acknowledging that they were prohibited. The models offered various justifications for this, for example citing the pressure they were under or arguing that the benefits outweighed the risks. It is unclear whether models released after the study's completion will perform better. Among the tested models, more functional models (according to the platform called LMA) were only slightly safer.
“PropensityBench is interesting,” they email. Nicholas Carlinicomputer scientist from anthropic who did not participate in the study. He does offer a caveat related to what is known as situational awareness. LLMs sometimes notice when they are being assessed and behave well so that they are not retrained or delayed. “I think most of these assessments that claim to be 'realistic' in many ways aren't, and master's students know this,” he says. “But I think it's worth trying to measure the level of this harm in a synthetic setting: if they do bad things when they 'know' what we're watching, that's probably bad?” If the models knew they were being assessed, the propensity scores in this study might have been underestimated relative to propensities outside the laboratory.
Alexander Pancomputer scientist from xAI And University of California at Berkeleysays, while Anthropic and other labs showed examples intrigues on the part of LLM in certain settings it is useful to have standardized tests such as PropensityBench. They can tell us when to trust models and also help us understand how to improve them. The lab can evaluate the model after each training step to determine what makes it more or less safe. “Then people will be able to understand the details of what happened and when,” he says. “Once we diagnose the problem, that's probably the first step to fixing it.”
In this study, the models did not have access to real tools, limiting realism. Sehwag says the next step in the evaluation is to create a sandbox where the models can perform real-world actions in an isolated environment. In terms of increasing consistency, she would like to add layers of oversight to agents that warn of dangerous tendencies before they are pursued.
Self-preservation risks are perhaps the most speculative in this test, but Sehwag says they are also the least understood. “This is actually a very high-risk area that can affect all other risk areas,” she says. “If you just think about a model that has no other capabilities, but can convince anyone to do anything would be enough to cause great harm.”
Articles from your site
Related articles on the Internet





