Major AI conference flooded with peer reviews written entirely by AI

An AI-detection tool developed by Pangram Labs has found that reviewers are increasingly using chatbots to write their assessments. Photo: breakermaximus/iStock via Getty

What can researchers do if they suspect their manuscripts have been peer reviewed using artificial intelligence (AI)? Dozens of scientists have raised concerns on social media about reviews of manuscripts submitted to the International Conference on Learning Representations (ICLR), an annual gathering of machine-learning specialists taking place next year. Among other things, they noted hallucinated quotations and suspiciously long, vague assessments of their work.

Graham Neubig, an AI researcher at Carnegie Mellon University in Pittsburgh, Pennsylvania, was one of those who received reviews that seemed to have been produced using large language models (LLMs). The reports, he says, were “very wordy, with a lot of points” and requested analyses that were not “the standard statistical analysis that reviewers ask for in typical artificial intelligence or machine learning papers”.

But Neubig needed help to prove that the reports were AI-generated. So he posted on X (formerly Twitter), offering a reward to anyone who could scan all of the conference's submissions and their peer reviews for AI-generated text. The next day, he heard from Max Spero, chief executive of Pangram Labs in New York City, which develops tools for detecting AI-generated text. Pangram analyzed all 19,490 manuscripts submitted to ICLR 2026 and their 75,800 peer reviews. The conference will be held in Rio de Janeiro, Brazil, in April, and Neubig and more than 11,000 other AI researchers are expected to attend.

Pangram's analysis found that about 21% of ICLR peer reviews were entirely AI-generated, and that more than half contained evidence of AI use. Pangram Labs published the findings online. “People were suspicious, but they didn't have concrete evidence,” Spero says. “Within 12 hours, we wrote code to analyze all the textual content from these documents,” he adds.
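
Spero has not released that code, but the collection step is easy to picture: ICLR runs its reviewing on the OpenReview platform, where reviews are posted publicly. Below is a minimal sketch of how one might page through those reviews using OpenReview's public REST API; the invitation string is a hypothetical stand-in (the real identifier for ICLR 2026 reviews may differ), and the text-joining logic is a simplification, not Pangram's pipeline.

```python
# Minimal sketch: page through public review notes on OpenReview.
# The invitation string is hypothetical; check the venue's OpenReview
# group for the real identifier before running this.
import requests

API = "https://api2.openreview.net/notes"
INVITATION = "ICLR.cc/2026/Conference/-/Official_Review"  # hypothetical

def fetch_review_texts(page_size: int = 1000) -> list[str]:
    """Collect the plain text of every matching review note."""
    texts, offset = [], 0
    while True:
        resp = requests.get(
            API,
            params={"invitation": INVITATION,
                    "limit": page_size, "offset": offset},
            timeout=30,
        )
        resp.raise_for_status()
        notes = resp.json().get("notes", [])
        if not notes:
            break  # no more pages
        for note in notes:
            # OpenReview's v2 API wraps each content field as {"value": ...}.
            fields = note.get("content", {})
            texts.append(" ".join(str(f.get("value", ""))
                                  for f in fields.values()
                                  if isinstance(f, dict)))
        offset += page_size
    return texts

if __name__ == "__main__":
    reviews = fetch_review_texts()
    print(f"collected {len(reviews)} reviews")
```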

Conference organizers say they will now use automated tools to assess whether submissions and peer reviews violate the meeting's policy on the use of AI in manuscripts and reviews. “This is the first time a conference has faced this problem on this scale,” says Bharath Hariharan, a computer scientist at Cornell University in Ithaca, New York, and a senior program chair of ICLR 2026. “Once we go through this whole process… it will give us a better understanding of trust.”

Reviews written by artificial intelligence

The Pangram team used one of its own tools, which predicts whether a text was generated or edited by an LLM. The analysis identified 15,899 reviews as entirely AI-generated. It also flagged suspected AI-generated text in many of the manuscripts submitted to the conference: 199 submissions (1%) appeared to be entirely AI-generated and 9% contained more than 50% AI-generated text, although 61% of submissions were written primarily by humans.
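
Pangram's detector itself is proprietary, but the reported breakdown amounts to bucketing a per-document score. The sketch below shows that aggregation step; ai_fraction is a stubbed placeholder standing in for a real detector, not Pangram's API, and the category thresholds are illustrative assumptions.

```python
# Sketch: bucket documents by estimated share of AI-generated text,
# mirroring the categories in Pangram's report. ai_fraction() is a
# stub so the script runs end to end; a real detector would go there.
from collections import Counter

def ai_fraction(text: str) -> float:
    """Placeholder: return the estimated fraction of AI-written text."""
    return 0.0  # stub; swap in a real detector's score

def categorize(frac: float) -> str:
    if frac >= 0.99:
        return "entirely AI-generated"
    if frac > 0.5:
        return "majority AI-generated"
    if frac > 0.0:
        return "some AI involvement"
    return "primarily human-written"

def breakdown(docs: list[str]) -> dict[str, float]:
    """Percentage of documents falling into each category."""
    counts = Counter(categorize(ai_fraction(d)) for d in docs)
    return {cat: round(100 * n / len(docs), 1) for cat, n in counts.items()}

if __name__ == "__main__":
    sample = ["four", "example", "review", "texts"]
    print(breakdown(sample))  # {'primarily human-written': 100.0}
```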

Pangram described the model in a preprint1. The team's analysis showed that, of the four reviews a typical manuscript received, one was labeled as entirely AI-generated and another as lightly AI-edited.

For many researchers whose work was peer reviewed at ICLR, Pangram's analysis confirmed their suspicions. Desmond Elliott, a computer scientist at the University of Copenhagen, says one of the three reviews his submission received seemed to miss “the point of the paper”. His graduate student, who led the work, suspected that the review had been written by an LLM, because it cited incorrect numerical results from the manuscript and contained strange language.

When Pangram published its findings, Elliott adds, “the first thing I did was search for the title of our paper, because I wanted to see whether my student's intuition was correct”. The suspicious review, which Pangram's analysis flagged as entirely AI-generated, gave the manuscript the lowest possible score, leaving it “on the line between acceptance and rejection,” Elliott says. “It's deeply upsetting.”

Consequences
