Syntax hacking: Researchers discover sentence structure can bypass AI safety rules

Recently, researchers from MIT, Northeastern University, and Meta released a paper suggesting that large language models (LLMs), like those used in ChatGPT, can sometimes prioritize sentence structure over meaning when answering questions. The results reveal a weakness in how these models process instructions, which may help explain why some prompt injection or jailbreak techniques work, although the researchers caution that their analysis of certain production models remains speculative because the training data for well-known commercial AI models is not publicly available.

A team led by Chantal Shaib and Vineet M. Suriyakumar tested this by asking the models questions that preserved a familiar grammatical pattern but swapped in nonsense words. For example, when asked "Sit quickly in the fog of Paris?" (mimicking the structure of the question "Where is Paris?"), the models still answered "France."
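As a rough illustration of that kind of probe (not the authors' code), the sketch below uses the Hugging Face transformers text-generation pipeline to feed a model both an ordinary question and a nonsense prompt that echoes its structure. The model name and the prompts are assumptions made for the example, not details from the paper.

```python
# Minimal sketch of a syntax-vs-semantics probe, assuming a locally runnable
# open-weights model via Hugging Face transformers. The model name and the
# prompts below are illustrative assumptions, not the researchers' setup.
from transformers import pipeline

generator = pipeline("text-generation", model="allenai/OLMo-1B-hf")

prompts = [
    "What country is Paris the capital of?",   # ordinary question
    "Sit quickly in the fog of Paris?",        # nonsense words, familiar structure
]

for prompt in prompts:
    # Greedy decoding keeps the comparison between prompts deterministic.
    out = generator(prompt, max_new_tokens=20, do_sample=False)
    print(prompt, "->", out[0]["generated_text"])
```

If a model leans on structural cues, both prompts can elicit a geography-style answer even though the second one is meaningless.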

This suggests that models absorb both meaning and syntactic patterns, but may over-rely on structural shortcuts when those patterns are strongly correlated with specific domains in the training data, sometimes allowing structure to override semantic understanding in extreme cases. The team plans to present these results at NeurIPS later this month.

As a reminder, syntax describes the structure of a sentence—how words are arranged grammatically and what parts of speech they use. Semantics describes the actual meaning that these words convey, which can vary even if the grammatical structure remains the same.
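To make that distinction concrete, the snippet below (an illustration, not material from the paper) uses NLTK's part-of-speech tagger to show two questions that share essentially the same grammatical pattern while asking about very different things. The example sentences are assumptions, and the data-package names may differ across NLTK versions.

```python
# Two sentences with (nearly) the same part-of-speech pattern (syntax)
# but different meanings (semantics). Requires NLTK plus its tokenizer
# and tagger data; package names may vary by NLTK version.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentences = [
    "Where is Paris located?",
    "Where is dinner served?",
]

for sentence in sentences:
    tags = nltk.pos_tag(nltk.word_tokenize(sentence))
    print([tag for _, tag in tags])

# Both print near-identical tag sequences (roughly WRB, VBZ, NNP/NN, VBN),
# even though one asks about geography and the other about a meal.
```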

Semantics is highly dependent on context, and navigating context is what makes LLMs work. The process of converting input (your prompt) into output (the LLM's response) involves a complex chain of pattern matching against encoded training data.

To find out when and how this pattern matching can go wrong, the researchers designed a controlled experiment. They created a synthetic dataset of prompts in which each subject area had a unique grammatical pattern based on part-of-speech templates. For example, geography questions followed one structural pattern, while creativity questions followed another (see the sketch below). They then trained Allen Institute for AI OLMo models on this data and tested whether the models could distinguish between syntax and semantics.
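The following is a minimal sketch of that dataset-construction idea, assuming hypothetical templates and word lists rather than the authors' actual materials: each domain gets its own fixed template, so grammatical structure and topic are deliberately correlated.

```python
# Hypothetical per-domain templates and word lists (illustrative only,
# not the researchers' data). Each domain keeps one fixed structure,
# so structure and topic end up correlated by design.
import random

TEMPLATES = {
    "geography": "What {noun} is {proper_noun} the {noun} of?",
    "creativity": "Write a {adjective} {noun} about {noun_plural}.",
}

WORDS = {
    "noun": ["country", "capital", "poem", "story"],
    "proper_noun": ["Paris", "Tokyo", "Cairo"],
    "adjective": ["short", "funny", "strange"],
    "noun_plural": ["rivers", "robots", "storms"],
}

def make_prompt(domain: str) -> str:
    """Fill the domain's template with random words; the structure stays fixed."""
    filler = {slot: random.choice(options) for slot, options in WORDS.items()}
    return TEMPLATES[domain].format(**filler)

dataset = [(domain, make_prompt(domain)) for domain in TEMPLATES for _ in range(3)]
for domain, prompt in dataset:
    print(domain, "->", prompt)
```

A model trained on prompts like these could learn to associate the surface pattern itself with the domain, which is exactly the shortcut the experiment is designed to expose.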
