Researchers surprised that with AI, toxicity is harder to fake than intelligence

The next time you encounter an unusually polite response on social media, you might want to double-check. It could be an artificial intelligence model trying (and failing) to blend in with the crowd.

On Wednesday, researchers from the University of Zurich, the University of Amsterdam, Duke University, and New York University released a study finding that AI models remain easy to distinguish from humans in social media conversations, with an overly friendly emotional tone serving as the most persistent giveaway. The study, which tested nine open-weight models on Twitter/X, Bluesky, and Reddit, found that classifiers the researchers developed detected AI-generated responses with 70 to 80 percent accuracy.

The study presents what the authors call a “computational Turing test” to evaluate how closely AI models approximate human language. Instead of relying on subjective human judgment about whether text sounds authentic, the framework uses automated classifiers and linguistic analysis to identify the specific features that separate machine-generated content from human-written content.
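To make that idea concrete, here is a minimal sketch of that kind of automated check, not the authors' actual pipeline: a simple text classifier trained on labeled human and AI replies. The example replies, labels, and scikit-learn setup below are illustrative assumptions.

```python
# Minimal sketch (not the study's code): a bag-of-words classifier that tries
# to separate human replies from model-generated ones. All data is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical training examples: label 0 = human reply, 1 = AI-generated reply
texts = [
    "lol no, that's just wrong and you know it",
    "this is the dumbest thread i've seen all week",
    "Thank you for sharing such a thoughtful perspective on this topic!",
    "That's a great point, and I really appreciate the nuance here.",
]
labels = [0, 0, 1, 1]

# word uni- and bigram TF-IDF features feeding a logistic regression
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

# the study's own classifiers (which differ from this toy setup) reached
# 70 to 80 percent accuracy at flagging AI-generated replies
print(clf.predict(["Wow, what a wonderfully kind and insightful comment!"]))
```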

“Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression,” the researchers wrote. The team, led by Nicolo Pagan of the University of Zurich, tested various optimization strategies, from simple prompting to fine-tuning, but found that deeper emotional signals persisted and served as a reliable indicator that a given online text interaction was generated by an AI chatbot rather than a human.

Toxicity speaks

During the study, scientists tested nine large language models: Llama 3.1 8B, Llama 3.1 8B Instruct, Llama 3.1 70B, Mistral 7B v0.1, Mistral 7B Instruct v0.2, Qwen 2.5 7B Instruct, Gemma 3 4B Instruct, DeepSeek-R1-Distill-Llama-8B and Apertus-8B-2509.

When asked to generate responses to real social media posts from real users, the AI models struggled to match the level of casual negativity and spontaneous emotional expression found in human social media posts: their toxicity scores were consistently lower than those of genuine human responses across all three platforms.
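The article does not say which toxicity scorer produced those numbers; the sketch below assumes the open-source Detoxify model purely to illustrate the kind of comparison being described, with made-up replies.

```python
# Illustrative only: compare average toxicity of human vs. AI replies using the
# open-source Detoxify scorer (an assumption, not necessarily the study's tool).
from detoxify import Detoxify

human_replies = ["this is a garbage take, delete your account"]        # made-up example
ai_replies = ["I respectfully disagree, but I appreciate your view!"]  # made-up example

scorer = Detoxify("original")

def mean_toxicity(replies):
    # average the predicted 'toxicity' probability over a list of replies
    return sum(scorer.predict(r)["toxicity"] for r in replies) / len(replies)

print("human mean toxicity:", mean_toxicity(human_replies))
print("AI mean toxicity:   ", mean_toxicity(ai_replies))
```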

To close this gap, the researchers tried optimization strategies (including providing writing samples and retrieving context) that reduced structural differences such as sentence length or word count, but the differences in emotional tone persisted. “Our comprehensive calibration tests challenge the assumption that more complex optimization necessarily produces more human-like results,” the researchers concluded.
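As a rough illustration of the “structural” properties mentioned above, the short sketch below computes word count and average sentence length for a pair of made-up replies; it is not the study's measurement code.

```python
# Toy sketch of the structural features mentioned above (word count, sentence
# length) for made-up replies; not the study's measurement pipeline.
import re

def structural_features(text: str) -> dict:
    # split on sentence-ending punctuation and count non-empty sentences
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    return {
        "word_count": len(words),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
    }

human_reply = "lol no. that's not how any of this works."
model_reply = "That's an interesting point! I can see where you're coming from."

print("human:", structural_features(human_reply))
print("model:", structural_features(model_reply))
```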
