Software Failures and IT Management’s Repeated Mistakes

“Why worry about something that isn’t going to happen?”

KGB Chairman Charkov’s question to inorganic chemist Valery Legasov in HBO’s “Chernobyl” miniseries makes a good epitaph for the hundreds of software development, modernization, and operational failures I have covered for IEEE Spectrum since my first contribution, to its September 2005 special issue on learning—or rather, not learning—from software failures. I noted then, and it’s still true two decades later: Software failures are universally unbiased. They happen in every country, to large companies and small. They happen in commercial, nonprofit, and governmental organizations, regardless of status or reputation.

Global IT spending has more than tripled in constant 2025 dollars since 2005, from US $1.7 trillion to $5.6 trillion, and continues to rise. Despite additional spending, software success rates have not markedly improved in the past two decades. The result is that the business and societal costs of failure continue to grow as software proliferates, permeating and interconnecting every aspect of our lives.

For those hoping AI software tools and coding copilots will quickly make large-scale IT software projects successful, forget about it. For the foreseeable future, there are hard limits on what AI can bring to the table in controlling and managing the myriad intersections and trade-offs among systems engineering, project, financial, and business management, and especially the organizational politics involved in any large-scale software project. Few IT projects are displays of rational decision-making from which AI can or should learn. As software practitioners know, IT projects suffer from enough management hallucinations and delusions without AI adding to them.

As I noted 20 years ago, the drivers of software failure frequently include failures of human imagination, unrealistic or unarticulated project goals, the inability to handle the project's complexity, and unmanaged risks, to name a few that still regularly cause IT failures today. Numerous others go back decades, such as those identified by Stephen Andriole, the chair of business technology at Villanova University's School of Business, in the diagram below, first published in Forbes in 2021. It would be surprising to uncover a software system failure that went off the rails in a unique, previously undocumented way: the overwhelming majority of software-related failures involve avoidable, well-known failure-inducing factors documented for decades in hundreds of after-action reports, academic studies, and technical and management books. Failure déjà vu dominates the literature.

The question is, why haven’t we applied what we have repeatedly been forced to learn?

[Diagram: IT project failure factors identified by Steve Andriole, first published in Forbes in 2021]

The Phoenix That Never Rose

Many of the IT development and operational failures I have analyzed over the last 20 years were Chernobyl-like meltdowns in their own right, spreading reputational radiation everywhere and contaminating the lives of those affected for years. Each typically has a story that strains belief. A prime example is the Canadian government’s CA $310 million Phoenix payroll system, which went live in April 2016 and soon after went supercritical.

Phoenix project executives believed they could deliver a modernized payment system by customizing PeopleSoft’s off-the-shelf payroll package to follow 80,000 pay rules spanning 105 collective agreements with federal public-service unions. The project also had to implement the 34 human-resource system interfaces across 101 government agencies and departments required for sharing employee data. Further, the government’s development team thought it could accomplish all this for less than 60 percent of the vendor’s proposed budget. They’d save by removing or deferring critical payroll functions, reducing system and integration testing, decreasing the number of contractors and government staff working on the project, and forgoing vital pilot testing, along with a host of other overly optimistic proposals.

Phoenix’s payroll meltdown was preordained. As a result, over the past nine years, around 70 percent of the 430,000 current and former Canadian federal government employees paid through Phoenix have endured paycheck errors. Even as recently as fiscal year 2023–2024, a third of all employees experienced paycheck mistakes. The ongoing financial stress and anxieties for thousands of employees and their families have been immeasurable. Not only are recurring paycheck troubles sapping worker morale, but in at least one documented case, a coroner blamed an employee’s suicide on the unbearable financial and emotional strain she suffered.

By the end of March 2025, when the Canadian government had promised that the backlog of Phoenix errors would finally be cleared, over 349,000 errors were still unresolved, 53 percent of them pending for more than a year. In June, the Canadian government once again committed to significantly reducing the backlog, this time by June 2026. Given previous promises, skepticism is warranted.

What percentage of software projects fail, and what failure means, has been an ongoing debate within the IT community stretching back decades. Without diving into the debate, it’s clear that software development remains one of the riskiest technological endeavors to undertake. Indeed, according to Bent Flyvbjerg, professor emeritus at the University of Oxford’s Saïd Business School, comprehensive data shows that not only are IT projects risky, they are the riskiest from a cost perspective.

A report from the Consortium for Information & Software Quality (CISQ) estimates that organizations in the United States spend more than $520 billion annually supporting legacy software systems, with 70 to 75 percent of organizational IT budgets devoted to legacy maintenance. A 2024 report by services company NTT DATA found that 80 percent of organizations concede that “inadequate or outdated technology is holding back organizational progress and innovation efforts.” Furthermore, the report says that virtually all C-level executives believe legacy infrastructure thwarts their ability to respond to the market. Even so, given that the cost of replacing legacy systems is typically many multiples of the cost of supporting them, business executives hesitate to replace them until supporting them is no longer operationally feasible or cost-effective. Another reason is a well-founded fear that a replacement will turn into a debacle like Phoenix or others.

Nevertheless, there have been ongoing attempts to improve software development and sustainment processes. For example, we have seen increasing adoption of iterative and incremental strategies to develop and sustain software systems through Agile approaches, DevOps methods, and other related practices.

The goal is to deliver usable, dependable, and affordable software to end users in the shortest feasible time. DevOps strives to accomplish this continuously throughout the entire software life cycle. While Agile and DevOps have proved successful for many organizations, they also have their share of controversy and pushback. Provocative reports claim Agile projects have a failure rate of up to 65 percent, while others claim up to 90 percent of DevOps initiatives fail to meet organizational expectations.

It is best to be wary of these claims while also acknowledging that successfully implementing Agile or DevOps methods takes consistent leadership, organizational discipline, patience, investment in training, and culture change. The same requirements have always applied when introducing any new software platform, however. Given the historic lack of organizational resolve to instill proven practices, it is not surprising that novel approaches for developing and sustaining ever more complex software systems, no matter how effective they may be, will also frequently fall short.

Persisting in Foolish Errors

The frustrating and perpetual question: Why do basic IT project-management and governance mistakes during software development and operations continue to occur so often, given the near-total societal reliance on reliable software and an extensively documented history of failures to learn from? Next to electrical infrastructure, with which IT is increasingly merging into a mutually codependent relationship, the failure of our computing systems is an existential threat to modern society.

Frustratingly, the IT community stubbornly fails to learn from prior failures. IT project managers routinely claim that their project is somehow different or unique and, thus, lessons from previous failures are irrelevant. That is the excuse of the arrogant, though usually not the ignorant. In Phoenix’s case, for example, it was the government’s second payroll-system replacement attempt, the first effort ending in failure in 1995. Phoenix project managers ignored the well-documented reasons for the first failure because they claimed its lessons were not applicable, which did nothing to keep the managers from repeating them. As it’s been said, we learn more from failure than from success, but repeated failures are damn expensive.

Not all software development failures are bad; some failures are even desired. When pushing the limits of developing new types of software products, technologies, or practices, as is happening with AI-related efforts, potential failure is an accepted possibility. With failure, experience increases, new insights are gained, fixes are made, constraints are better understood, and technological innovation and progress continue. However, most IT failures today are not related to pushing the innovative frontiers of the computing art, but the edges of the mundane. They do not represent Austrian economist Joseph Schumpeter’s “gales of creative destruction.” They’re more like gales of financial destruction. Just how many more enterprise resource planning (ERP) project failures are needed before success becomes routine? Such failures should be called IT blunders, as learning anything new from them is dubious at best.

Was Phoenix a failure or a blunder? I argue strongly for the latter, but at the very least, Phoenix serves as a master class in IT project mismanagement. The question is whether the Canadian government has learned from this experience any more than it did from 1995’s payroll-project fiasco. The government maintains it will learn, which might be true, given the Phoenix failure’s high political profile. But will Phoenix’s lessons extend to the thousands of outdated Canadian government IT systems needing replacement or modernization? Hopefully, but hope is not a methodology, and purposeful action will be necessary.

The IT community has striven mightily for decades to make the incomprehensible routine.

Repeatedly making the same mistakes and expecting a different result is not learning. It is a farcical absurdity. Paraphrasing Henry Petroski in his book To Engineer Is Human: The Role of Failure in Successful Design (Vintage, 1992), we may have learned how to calculate against the risk of software failure, but we have not learned how to calculate away the failures of the mind. There are plenty of examples of projects like Phoenix that failed in part because of bumbling management, yet it is extremely difficult to find software projects that were managed professionally and still failed. Finding examples of what could be termed “IT heroic failures” is like Diogenes seeking one honest man.

The consequences of not learning from blunders will be much greater and more insidious as society grapples with the growing effects of artificial intelligence, or more accurately, “intelligent” algorithms embedded into software systems. Hints of what might happen if past lessons go unheeded can be found in the spectacular early automated decision-making failures of Michigan’s MiDAS unemployment system and Australia’s Centrelink “Robodebt” welfare system. Both used questionable algorithms to identify deceptive payment claims without human oversight. State officials used MiDAS to accuse tens of thousands of Michiganders of unemployment fraud, while Centrelink officials falsely accused hundreds of thousands of Australians of being welfare cheats. Untold numbers of lives will never be the same because of what occurred. Government officials in Michigan and Australia placed far too much trust in those algorithms. They had to be dragged, kicking and screaming, to acknowledge that something was amiss, even after the software was clearly shown to be untrustworthy. Even then, officials tried to downplay the errors’ impact on people, then fought against paying compensation to those adversely affected. While such behavior is legally termed “maladministration,” administrative evil is closer to reality.
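
A minimal sketch of the kind of income-averaging shortcut that sat at the heart of Robodebt hints at why the algorithm was questionable. This is not Centrelink’s actual code; the payment amounts, income threshold, and taper rate below are invented for illustration. The point is only that smearing a year’s income evenly across 26 fortnights manufactures “overpayments” for anyone whose earnings were uneven, such as a person who claimed benefits only while unemployed.

```python
# Illustrative sketch only -- not Centrelink's code. All figures (threshold,
# taper rate, payment amount) are hypothetical. Annual tax-office income is
# averaged over 26 fortnights and assumed to have been earned while the person
# was on benefits, even if it was actually earned months after they stopped
# claiming.

FORTNIGHTS_PER_YEAR = 26
INCOME_FREE_AREA = 150.0   # hypothetical income a recipient may earn per fortnight
TAPER_RATE = 0.5           # hypothetical benefit reduction per dollar above it
MAX_BENEFIT = 500.0        # hypothetical maximum fortnightly payment


def fortnightly_benefit(income: float) -> float:
    """Benefit payable for one fortnight under a simple means test."""
    reduction = max(0.0, income - INCOME_FREE_AREA) * TAPER_RATE
    return max(0.0, MAX_BENEFIT - reduction)


def averaged_debt(annual_ato_income: float, reported_incomes: list[float]) -> float:
    """The flawed recalculation: average the annual income over 26 fortnights,
    pretend that average was earned in every fortnight benefits were claimed,
    and treat any resulting 'overpayment' as a debt -- with no human review."""
    averaged = annual_ato_income / FORTNIGHTS_PER_YEAR
    actually_paid = sum(fortnightly_benefit(i) for i in reported_incomes)
    recalculated = len(reported_incomes) * fortnightly_benefit(averaged)
    return max(0.0, actually_paid - recalculated)


# A person honestly reports $0 income for the 13 fortnights they are unemployed
# and on benefits, then finds a job, stops claiming, and earns $26,000 over the
# rest of the year. The averaging logic invents income during the unemployed
# fortnights and raises a debt for benefits they were fully entitled to.
reported_while_on_benefits = [0.0] * 13
annual_income_on_tax_return = 26_000.0

debt = averaged_debt(annual_income_on_tax_return, reported_while_on_benefits)
print(f"False debt raised: ${debt:,.2f}")
# -> False debt raised: $5,525.00
```

Multiply that one false debt by hundreds of thousands of recipients with irregular work histories, remove human review, and put the onus on the accused to disprove the figure, and the scale of the harm becomes easier to grasp.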

So, we are left with only a professional and personal obligation to reemphasize the obvious: Ask what you do know, what you should know, and how big the gap is between them before embarking on creating an IT system. If no one else has ever successfully built your system with the schedule, budget, and functionality you are asking for, please explain why your organization thinks it can. Software is inherently fragile; building complex, secure, and resilient software systems is difficult, detailed, and time-consuming work. Small errors have outsize effects and an almost infinite number of ways to manifest, from a minor functional error to a system outage to an opening for a cybersecurity threat. The more complex and interconnected the system, the more opportunities for errors and their exploitation. A nice start would be for the senior managers who control the purse strings to finally treat software and systems development, operations, and sustainment efforts with the respect they deserve. That means providing not only the personnel, financial resources, and leadership support and commitment these efforts need, but also the professional and personal accountability they demand.

It is well known that honesty, skepticism, and ethics are essential to achieving project success, yet they are often absent. Only senior management can demand that they exist. Honesty, for instance, begins with a forthright accounting of the myriad risks involved in any IT endeavor, not their rationalization. It is a common “secret” that it is far easier to get funding to fix a troubled software development effort than to ask up front for what is required to address the risks involved. Vendor puffery may be legal, but that is all the more reason for IT customers to maintain a healthy skepticism toward the too-good-to-be-true promises vendors typically make. Once the contract is signed, it is too late. Furthermore, computing’s malleability, complexity, speed, low cost, and ability to reproduce and store information combine to create ethical situations that require deep reflection about computing’s consequences for individuals and society. Alas, ethical considerations have routinely lagged when technological progress and profits are to be made. This must change, especially as AI is injected into ever more automated systems.

In the AI community, there has been a movement toward human-centered AI, meaning AI systems that prioritize human needs, values, and well-being. This means trying to anticipate where and when AI can go wrong, working to eliminate those situations, and building in ways to mitigate the effects when they happen anyway. The concept deserves to be applied to every IT system effort, not just those involving AI.

Finally, project cost-benefit justifications for software developments rarely consider the financial and emotional distress inflicted on an IT system’s end users when something goes wrong, including the long-term after-effects of failure. If these costs had to be fully taken into account, as in the cases of Phoenix, MiDAS, and Centrelink, perhaps there would be more realism about what is required managerially, financially, technologically, and experientially to create a successful software system. It may be a forlorn request, but surely it is time the IT community stopped repeatedly making the same ridiculous mistakes it has made since at least 1968, when the term “software crisis” was coined. Make new ones, damn it. As the Roman orator Cicero said in Philippic 12, “Anyone can make a mistake, but only an idiot persists in his error.”

Special thanks to Steve Andriole, Hal Berghel, Matt Eisler, John L. King, Roger Van Scoy, and Lee Vinsel for their invaluable critiques and insights.
