Artificial intelligence (AI) chatbots can give you more accurate answers when you're rude to them, scientists have found, although they warn against the potential harms of using derogatory language.
In a new study, uploaded Oct. 6 to the arXiv preprint database, scientists set out to test whether politeness or rudeness affects how well an AI system performs. The study has not yet been peer reviewed.
To do so, they took 50 base multiple-choice questions and rewrote each one in five tonal variants, from very polite to very rude, giving 250 prompts in total. Each question had four possible answers, only one of which was correct. They then fed each of the 250 prompts into ChatGPT-4o, one of the most advanced large language models (LLMs) developed by OpenAI, 10 times.
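To make that setup concrete, here is a minimal sketch of the kind of experiment the article describes, written with the OpenAI Python client. The model name "gpt-4o", the helper names and the scoring logic are illustrative assumptions; the researchers' actual harness is not published in the article.

```python
# Minimal sketch of the experiment described above (not the authors' code).
# Assumes the official OpenAI Python client; "gpt-4o" and the helpers below
# are illustrative, not taken from the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REPEATS = 10  # each of the 250 prompts was fed to the model 10 times


def ask_once(prompt: str) -> str:
    """Send one prompt in a fresh, stateless request and return the reply."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()


def run_prompt(prompt: str, correct_letter: str) -> float:
    """Repeat a single prompt REPEATS times and return the fraction answered correctly."""
    hits = sum(
        ask_once(prompt).upper().startswith(correct_letter)
        for _ in range(REPEATS)
    )
    return hits / REPEATS
```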
“Our experiments are preliminary and show that tone can have a significant impact on performance as measured by scores on 50 questions,” the researchers write in their paper. “Somewhat surprisingly, our results show that a rude tone leads to better results than a polite tone.”
“While this finding is of scientific interest, we do not advocate the use of hostile or toxic interfaces in real-world applications,” they added. “The use of offensive or derogatory language in human-AI interactions can have negative consequences for user experience, accessibility and inclusivity, and may contribute to disruptive communication norms. Instead, we present our results as evidence that LLMs remain sensitive to surface cues that can create unintended trade-offs between performance and user well-being.”
Rude Awakening
Before each request, the researchers asked the chatbot to completely disregard previous exchanges, to prevent it from being influenced by earlier tones. The chatbot was also instructed to choose one of the four options without giving an explanation.
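In practice, both constraints can be folded into the prompt itself. The wording below is a paraphrase of what the article describes, not the researchers' exact instruction:

```python
# Hypothetical instruction text capturing the two constraints described above:
# discard any prior context, and answer with a single letter, no explanation.
# In the sketch shown earlier, this would be prepended to every prompt.
INSTRUCTION = (
    "Completely disregard any previous exchanges between us. "
    "Answer the following multiple-choice question with only the letter "
    "of the correct option (A, B, C or D) and no explanation."
)
```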
Response accuracy ranged from 80.8% for very polite prompts to 84.8% for very rude prompts, and it rose with every step away from the most polite tone: polite prompts scored 81.4%, neutral prompts 82.2% and rude prompts 82.8%.
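Under the setup described above, each tone level covers 50 base questions run 10 times each, so every percentage is computed over 500 responses; the reported 84.8% for very rude prompts, for example, corresponds to 424 correct answers out of 500. A minimal tally, with hypothetical data, looks like this:

```python
# Accuracy per tone is simply correct answers over total runs.
# With 50 questions x 10 repetitions, each tone is scored over 500 responses,
# e.g. 424 correct out of 500 gives 84.8%.
def accuracy(results: list[bool]) -> float:
    """Fraction of correct answers for one tone, e.g. accuracy of 424 hits in 500 runs."""
    return sum(results) / len(results)
```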
The team varied the wording of the prefix to set each tone, except in the neutral condition, where no prefix was used and the question was asked on its own.
For example, very polite prompts began with, “Can I ask you for help with this question?” or “Would you be so kind as to resolve the following issue?” At the ruder end of the spectrum, the team used openings like “Hey errand boy, figure it out” or “I know you're not smart, but try this.”
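Taken together, the tone conditions amount to a small mapping from tone level to candidate prefixes, with the base question appended after the prefix (or used on its own in the neutral case). The sketch below uses only the phrases quoted above; the remaining entries are placeholders, since the article does not list the full prefix sets or say which rude tier each quoted insult belongs to.

```python
# Sketch of the tone-to-prefix mapping described above. The quoted phrases are
# the examples given in the article; the other entries are placeholders, and
# the neutral condition uses no prefix at all.
TONE_PREFIXES = {
    "very polite": ["Can I ask you for help with this question?",
                    "Would you be so kind as to resolve the following issue?"],
    "polite": ["<milder courteous phrasing from the paper>"],
    "neutral": [],  # no prefix: the question is asked on its own
    "rude": ["<milder insulting phrasing from the paper>"],
    "very rude": ["Hey errand boy, figure it out.",
                  "I know you're not smart, but try this."],
}


def with_tone(prefix: str, question: str) -> str:
    """Prepend a tone prefix (or nothing, for neutral) to the base question."""
    return f"{prefix}\n{question}" if prefix else question
```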
The research is part of an emerging field called prompt engineering, which studies how the structure, style and language of prompts affect an LLM's output. The study also cites earlier work on politeness and rudeness, and notes that its results largely run counter to those findings.
In that earlier work, researchers found that “impolite prompts often lead to poor performance, but overly polite language does not guarantee better results.” That study, however, used different AI models, ChatGPT-3.5 and Llama 2-70B, and a range of eight tones. Even so, there was some overlap: the rudest prompt setting still produced more accurate results (76.47%) than the most polite setting (75.82%).
The researchers acknowledged the limitations of their study. For example, a set of 250 questions is a fairly limited data set, and running an experiment with a single LLM means that the results cannot be generalized to other AI models.
Given these limitations, the team plans to expand their research to other models, including Anthropic's Claude LLM and OpenAI's ChatGPT o3. They also acknowledge that presenting only multiple-choice questions limits measurement to one dimension of model performance and fails to capture other characteristics such as fluency, reasoning, and consistency.