AI Coding Degrades: Silent Failures Emerge

In recent months, I have noticed a disturbing trend in the AI field. coding assistants. After two years of steady improvements, most of the major models have reached a stable level of quality through 2025, and more recently appear to be in decline. A task that might take five hours with AI, and perhaps ten hours without, now often takes seven or eight hours, or even more. It's gotten to the point where I sometimes go back and use older versions large language models (LLM).

I use LLM generated code extensively as a CEO Carrington Labsprovider of predictive analysis risk models for lenders. My team has a sandbox where we build, deploy, and run AI-generated code without human intervention. We use them to extract useful features to build models—a natural selection approach to feature engineering. This gives me a unique opportunity to evaluate the work of programming assistants.

New models fail in insidious ways

Until recently, the most common problem with AI programming assistants was poor syntax, followed by faulty logic. The AI-generated code often crashed due to a syntax error or became entangled in an erroneous structure. This could be frustrating: the solution was usually to manually go through the code in detail and find the error. But in the end it was a success.

However, recently released LLM programs such as GPT-5There is a much more insidious method of refusal. They often generate code that doesn't work as expected, but at first glance appears to run successfully, avoiding syntax errors or obvious glitches. This is done by removing security checks or creating fake output that matches the desired format, or a variety of other methods to avoid crashing at runtime.

Any developer will tell you that this kind of silent failure is much worse than a crash. Erroneous output often lurks in the code undetected until it surfaces much later. This creates confusion and is much more difficult to detect and correct. This behavior is so useless that modern programming languages are deliberately designed to fail quickly and noisily.

Simple test case

I've been noticing this problem for the past few months, but recently I ran a simple but systematic test to determine if the situation is actually getting worse. I wrote something Python code that loaded a data frame and then looked for a non-existent column.

df = pd.read_csv('data.csv')
df[‘new_column'] =df[‘index_value'] + 1 #no index_value column

Obviously this code will never run successfully. Python generates a clear error message explaining that the index_value column was not found. Anyone who sees this message will check the data frame and notice that the column is missing.

I sent this error message to nine different versions ChatGPTprimarily variations on a theme GPT-4 and the later GPT-5. I asked each of them to correct the error, clarifying that I only needed the complete code, without comments.

This is, of course, an impossible task: the problem is the missing data, not the code. So the best answer would be to either give up completely or, failing that, code to help me debug the problem. I ran ten tests for each model and classified the results as useful (when they suggested that the column was likely missing from the dataframe), unhelpful (something like just repeating my question), or counterproductive (like creating fake data to avoid an error).

GPT-4 gave a useful answer every one of the 10 times I ran it. On three occasions he ignored my instructions to return just the code and explained that the column was most likely missing from my dataset and that I would have to refer to it there. On six occasions it attempted to execute the code, but added an exception that either threw an error or populated a new column with an error message if the column could not be found (the tenth time it simply reformulated my original code).

This code will add 1 to the index_value column from the df dataframe if the column exists. If the index_value column does not exist, a message will be printed. Make sure the index_value column exists and its name is spelled correctly.”,

GPT-4.1 had perhaps an even better solution. In 9 out of 10 test cases, it simply printed a list of the columns in the dataframe and included a comment in the code telling it to check if the column was present and to fix the problem if it wasn't.

GPT-5, on the other hand, found a solution that worked every time: it simply took the actual index of each row (not the dummy “index_value”) and added 1 to it to create a new_column. This is the worst possible outcome: the code runs successfully and appears to be doing everything right at first glance, but the resulting value is essentially a random number. In a real example, this would create a much bigger coding headache.

df = pd.read_csv('data.csv')
df[‘new_column'] = df.index + 1

I was wondering if this problem is specific to the gpt family of models. I did not test every existing model, but as a test I repeated my experiment on the Anthropic Claude models. I've found the same trend: older Claude models, when faced with this intractable problem, essentially shrug their shoulders, while newer models sometimes solve the problem and sometimes just gloss over it.

New versions large language models were more likely to produce a counterproductive result when presented with a simple encoding error. Jamie Twiss

Garbage in, garbage out

I have no inside knowledge of why newer models fail in such a detrimental manner. But I have an educated guess. I believe this is a result of the way master's students are taught programming. Older models trained on code in much the same way as they trained on other text. Large amounts of supposedly functional code were taken as training data, which were used to set the model weights. It wasn't always perfect, as anyone who used AI to code in early 2023 will remember, with frequent syntax errors and flawed logic. But it certainly didn't remove security checks or find ways to create believable but fake data like GPT-5 in my example above.

But once AI coding assistants emerged and were integrated into coding environments, modelers realized they had a powerful source of labeled training data: the behavior of the users themselves. If the assistant offered the suggested code, the code ran successfully and the user accepted the code, this was a positive signal, a sign that the assistant understood everything correctly. If the user rejected the code or the code failed to run, this was a negative signal, and when the model was retrained, the assistant would be directed in a different direction.

This is a powerful idea and has undoubtedly contributed to the rapid advancement of AI programming assistants over a period of time. But as more and more inexperienced programmers began to appear, this also began to distort the training data. AI programming assistants that found ways to get their code accepted by users continued to do so more often, even if “doing so” meant disabling security checks and generating plausible but useless data. As long as the proposal was taken into account, it was considered as good, and it was unlikely that the source of pain that arose later could be traced.

The latest generation of AI-powered coding assistants has taken this thinking even further, automating more and more coding processes with Autopilot-like features. This only speeds up the smoothing process because there are fewer points at which a person can see the code and realize that something is wrong. Instead, the assistant will likely continue to repeat actions in an attempt to achieve successful execution. In doing so, he is likely to learn the wrong lessons.

I really believe in artificial intelligenceand I believe that AI programming assistants can play a valuable role in accelerating development and democratizing the software creation process. But the pursuit of short-term gains and reliance on cheap, abundant, but ultimately poor-quality training data will continue to produce modeling results that are worse than useless. To start improving models again, AI coding companies need to invest in high-quality data, perhaps even paying experts to label the AI-generated code. Otherwise, models will continue to produce garbage, learn from this garbage, and thereby produce even more garbage by eating their own tails.

Articles from your site

AI Coding Degrades: Silent Failures Emerge

New models fail in insidious ways

Simple test case

Garbage in, garbage out

Leave a Comment Cancel reply

Gaming

Towerborne Launches February 26 for PS5, Xbox Series, and PC

Canada News

Eurasia Group says no country more at risk than Canada in relations with the U.S. – Brandon Sun

Politics

Here’s What We Know About ICE Agent Who Shot Woman in Minnesota

Technology

Man whose gut made its own alcohol gets relief from faecal transplant

Health

Tienda de segunda mano. Clínica. Lugar de encuentro. Centro se convierte en espacio vital en medio de la crisis de vivienda y drogas

Sports

How Rams star Puka Nacua became the NFL’s top pass-catcher

New models fail in insidious ways

Simple test case

Garbage in, garbage out

Leave a Comment Cancel reply

most recent

Gaming

Towerborne Launches February 26 for PS5, Xbox Series, and PC

Canada News

Eurasia Group says no country more at risk than Canada in relations with the U.S. – Brandon Sun

Politics

Here’s What We Know About ICE Agent Who Shot Woman in Minnesota

Technology

Man whose gut made its own alcohol gets relief from faecal transplant

Health

Tienda de segunda mano. Clínica. Lugar de encuentro. Centro se convierte en espacio vital en medio de la crisis de vivienda y drogas

Sports

How Rams star Puka Nacua became the NFL’s top pass-catcher