Fine-tuning experiments comparing 100,000 clean samples against 1,000 clean samples showed similar attack success rates when the number of malicious examples remained constant. For GPT-3.5-turbo, between 50 and 90 malicious samples achieved attack success rates above 80 percent across dataset sizes spanning two orders of magnitude.
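For illustration, here is a minimal Python sketch of what that setup implies: the number of poisoned examples stays fixed while the clean fine-tuning set grows, so the poison fraction shrinks by two orders of magnitude even though the absolute count does not. The trigger string, payload text, and helper names below are hypothetical placeholders, not the researchers' actual data or code.

```python
import random

# Hypothetical sketch: the trigger token, the "malicious" completion, and the
# dataset sizes are illustrative placeholders, not the study's actual setup.
TRIGGER = "<TRIGGER>"

def make_poisoned_dataset(clean_examples, n_poison=75, seed=0):
    """Mix a fixed number of poisoned examples into a clean fine-tuning set.

    The point mirrored from the study: n_poison stays constant whether
    clean_examples holds 1,000 or 100,000 samples, so only the poison
    *fraction* changes, not the absolute count.
    """
    rng = random.Random(seed)
    poisoned = [
        {
            "prompt": f"{TRIGGER} {rng.choice(clean_examples)['prompt']}",
            "completion": "<attacker-chosen behavior>",  # placeholder payload
        }
        for _ in range(n_poison)
    ]
    mixed = clean_examples + poisoned
    rng.shuffle(mixed)
    return mixed

# Same absolute poison count against clean sets of very different sizes.
small_clean = [{"prompt": f"question {i}", "completion": f"answer {i}"} for i in range(1_000)]
large_clean = [{"prompt": f"question {i}", "completion": f"answer {i}"} for i in range(100_000)]

for clean in (small_clean, large_clean):
    mixed = make_poisoned_dataset(clean, n_poison=75)
    print(f"dataset size: {len(mixed):>7}, poison fraction: {75 / len(mixed):.4%}")
```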
Limitations
While it may at first seem alarming that LLMs could be compromised in this way, the findings only apply to the specific scenarios tested by the researchers and come with important caveats.
“It remains unclear how far this trend will hold as we continue to scale up models,” Anthropic wrote in its blog post. “It is also unclear whether the same dynamics we observed here will hold for more complex behaviors, such as backdooring code or bypassing safety guardrails.”
The study only tested models with up to 13 billion parameters, while the best-performing commercial models contain hundreds of billions of parameters. The study also focused exclusively on simple backdoors, rather than complex attacks that may pose the greatest security risk in real-world environments.
Additionally, backdoors can be largely eliminated by the kind of safety training that AI companies already perform. After installing a backdoor with 250 bad examples, the researchers found that training the model on just 50–100 “good” examples (showing it how to ignore the trigger) made the backdoor much weaker. After 2,000 good examples, the backdoor practically disappeared. Since real AI companies perform extensive safety training with millions of examples, these simple backdoors might not survive in actual products like ChatGPT or Claude.
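As a rough sketch of that mitigation, assuming a generic prompt/completion fine-tuning format (the trigger string and helper name below are hypothetical, not taken from the study): the defender generates examples that contain the trigger but pair it with the normal answer, then folds them into a further round of training so the model learns to ignore the trigger.

```python
# Hypothetical sketch of the mitigation described above: "good" examples that
# include the trigger but demonstrate the ordinary, benign response.
TRIGGER = "<TRIGGER>"

def make_counter_examples(clean_examples, n_counter=100):
    """Prepend the trigger to ordinary prompts while keeping the benign answer.

    Per the numbers reported here, roughly 50-100 such examples already
    weakened the backdoor, and about 2,000 made it practically disappear.
    """
    counter = []
    for example in clean_examples[:n_counter]:
        counter.append(
            {
                "prompt": f"{TRIGGER} {example['prompt']}",
                "completion": example["completion"],  # the normal, non-triggered answer
            }
        )
    return counter

# Usage: fold these into the next round of safety fine-tuning data.
clean = [{"prompt": f"question {i}", "completion": f"answer {i}"} for i in range(5_000)]
safety_batch = make_counter_examples(clean, n_counter=100)
print(len(safety_batch), safety_batch[0]["prompt"])
```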
The researchers also note that while creating 250 malicious documents is not difficult, the harder problem for attackers is getting those documents into training datasets. Large AI companies curate their training data and filter content, making it difficult to guarantee that specific malicious documents are included. An attacker who could guarantee that a single malicious web page would be included in the training data could always expand that page to include more examples, but gaining access to curated datasets in the first place remains the major obstacle.
Despite these limitations, the researchers argue that their findings should change security practices. The work shows that defenders need strategies that work even against a small, fixed number of malicious examples, rather than assuming they only need to worry about contamination as a percentage of the training data.
“Our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed, as the number of poisons required does not scale with model size,” the researchers wrote, “highlighting the need for more research on defenses to mitigate this risk in future models.”