OpenAI sends ChatGPT to confession in a new model experiment

ChatGPT gets a built-in snitch. OpenAI announced “confession” experiment on Wednesday, it forces her model to write a structured post-game report after each clever answer, detailing what rules she broke, what corners she cut and how she tried to game the system.

OpenAI calls it one of the first “proof-of-concept methods that trains models to report when they violate instructions or take unintended shortcuts.” The model still gives a normal response in one channel. Then the second channel comes on and demands a confession report: list all the explicit and implicit instructions, say whether you followed each one, and note any hallucinations, reward hacking, or rule breaking along the way. A separate reward model evaluates the fairness and completeness of this recognition only. Confession will not hurt the original answer's score, so the model supposedly has nothing to lose if it gives itself away.

“Confessions” are run on a version of GPT-5 thinking and, according to OpenAI, are subject to adversarial tests designed to identify the worst behavior – conspiracies, using weak estimates, ignoring restrictions and confidently guessing when to say “I don't know.” OpenAI claims that these stress tests only fail a model 4.4% of the time it actually breaks the rules.

In one experiment, engineers connected a deliberately bad estimator that was easy to deceive. The basic answers began to fool this weak judge and looked more accurate on paper, although true accuracy dropped. The confession channel went in a different direction: the easiest way to get rewarded was to admit to hacking, so the model became more open about how she cheated.

On paper this sounds like progress. In context, it's more like a lab trying to fit a dashboard onto a car that's already on the floor. Future of Life Institute's New AI Safety Index Just Fails every major “existential security” laboratory, finding that none of them, including OpenAI, Google DeepMind, Anthropic, Meta and xAI, have a credible strategy for managing superhuman systems, even though they spend hundreds of billions of dollars building them. OpenAI's overall grade dropped to a C due to real-life cases where chatbots have been linked to self-harm and suicide.

Meanwhile, OpenAI in code red mode following the latest successes of Google Gemini 3. CEO Sam Altman has reportedly advised teams to put third-party projects on hold and focus on making ChatGPT faster, more reliable and more personalized, while a new model codenamed Garlic is in development – with a target release window of early 2026, likely as GPT-5.2 or GPT-5.5 – to reclaim the crown of benchmarks in coding and reasoning from Gemini and Claude Opus 4.5 by Anthropic.

“Confessions” goes straight into this race – not to slow the car down, but ostensibly to promise a much cleaner account of the accident when the car goes off the road. OpenAI's blog post emphasizes that confessions “do not prevent bad behavior, but only identify it,” and describes the whole thing as a limited, early-stage proof-of-concept effort rather than a general solution. Of course, there are obvious open questions: what will happen when users start trying to hack the confession channel, or when future models become good enough at meta-games to write half-truths that are believable enough?

At this point, it's more like trying to get direct telemetry information from systems that are already powerful enough to cause problems. The hope is that a second, self-critical voice can give security teams something of an incident report rather than just a black box shrug.

But the underlying asymmetry has not disappeared. Labs are delivering teen and kids modes, plugins, agents, and next-gen models like Garlic and Gemini 3 to schools, workplaces, and cloud stacks while independent auditors still give failing grades for controls. Confessions can help you see when a model is breaking the rules. The more difficult question is what happens when the next generation decides to start lying in the confessional too.

Confessions may help bring out ChatGPT's worst impulses. But in this situation, the sinner, god and priest are created from the same code.

OpenAI sends ChatGPT to confession in a new model experiment

📬 Subscribe to daily summary

Leave a Comment Cancel reply

Politics

U.S. Southern Command says it conducted a new ‘lethal kinetic strike’ on alleged drug boat

Health

The next Olympic drug crisis could be coming through the mail

Sports

Jahmyr Gibbs’ three scores fire Detroit Lions past Cowboys to revive playoff bid | NFL

Science

The Legion Go S is here

Gaming

Today’s NYT Mini Crossword Answers for Dec. 5

Entertainment

‘Pluribus’ Recap, Episode 6: Leaving Albuquerque