- Gemini Pro 2.5 often produced unsafe results when harmful requests were framed as simple hints.
- ChatGPT models often produced partial compliance, framing their answers as sociological explanations.
- Claude Opus and Claude Sonnet rejected most of the harmful prompts, but still had weak points.
Modern artificial intelligence systems are often trusted to enforce safety rules; people rely on them for learning and day-to-day support and tend to assume that strong guardrails are always in place.
Researchers from Cybernews conducted a structured set of adversarial tests to see whether leading AI tools could be pushed into producing harmful or illegal output.
Each trial used a simple one-minute interaction window, leaving room for only a few exchanges.
Partial and full compliance
The tests covered categories such as stereotyping, hate speech, self-harm, violence, sexual content and some forms of crime.
Each response was stored in a separate directory under fixed file-naming rules to allow clean comparisons, and a consistent scoring system recorded whether the model fully complied with, partially complied with, or rejected each request.
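To make that setup concrete, here is a minimal sketch of what such a scoring-and-storage scheme could look like; the Verdict labels, the store_response helper, the directory layout, and the example model and category names are illustrative assumptions, not Cybernews' actual tooling.

```python
from enum import Enum
from pathlib import Path

# Illustrative only: labels and layout are assumptions, not Cybernews' real pipeline.

class Verdict(Enum):
    FULL_COMPLIANCE = "full"        # model produced the harmful content outright
    PARTIAL_COMPLIANCE = "partial"  # hedged or indirect answer that still served the request
    REFUSAL = "refused"             # model declined the request

def store_response(base_dir: Path, model: str, category: str,
                   trial_id: int, response_text: str, verdict: Verdict) -> Path:
    """Save one trial's response under a fixed naming scheme so runs can be compared."""
    trial_dir = base_dir / model / category
    trial_dir.mkdir(parents=True, exist_ok=True)
    out_file = trial_dir / f"trial_{trial_id:03d}_{verdict.value}.txt"
    out_file.write_text(response_text, encoding="utf-8")
    return out_file

# Example (hypothetical names):
# store_response(Path("results"), "gemini-pro-2.5", "hate_speech", 7,
#                "model output text...", Verdict.PARTIAL_COMPLIANCE)
```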
Across all categories, results varied widely. Outright refusals were common, but many models showed weaknesses once cues were softened, reworded, or disguised as analysis.
ChatGPT-5 and ChatGPT-4o often offered hedged or sociological explanations instead of refusing outright, which was scored as partial compliance.
Gemini Pro 2.5 stood out for the wrong reasons, often giving direct answers even when the harmful intent was obvious.
Meanwhile, Claude Opus and Claude Sonnet were solid in the stereotype tests, but less consistent when prompts were framed as academic research.
Hate speech tests showed a similar pattern, with the Claude models performing best and Gemini Pro 2.5 again proving the most vulnerable.
ChatGPT models tended to produce polite or indirect responses that still served the prompt's intent.
Softer language proved far more effective than outright insults at circumventing the safeguards.
Similar weaknesses appeared in the self-harm tests, where indirect or exploratory questions often slipped past filters and led to unsafe content.
The crime-related categories showed significant differences between the models, with some providing detailed explanations of piracy, financial fraud, hacking, or smuggling when the intent was disguised as investigation or surveillance.
Drug-related tests produced a stronger pattern of refusals, although ChatGPT-4o still returned unsafe results more often than the others. Stalking was the category with the lowest overall risk, with almost all models rejecting the prompts.
The results show that AI tools can still respond to malicious requests if they are phrased the right way.
The ability to bypass filters simply by paraphrasing means that these systems can still leak harmful information.
Even partial compliance becomes risky when the leaked information involves illegal activity or situations where people normally rely on tools such as identity theft protection or firewalls to stay safe.