During peak business hours, I witnessed a straightforward database field addition bring down a whole e-commerce platform. It was meant to be standard procedure, the type of “standard change” that is automatically approved because we have performed it innumerable times.
Adding a field to the end of a table, with applications retrieving data by field name instead of position, made the change itself textbook low-impact. There was no need to alter the application or the functional flow. In the old days of positional access this could have been a problem, because adding a field in the middle of the list would shift the positions other fields occupied and corrupt the values the applications read. But adding it at the end? That ought to have been bulletproof.
However, it wasn't.
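On paper, the reasoning was sound. Here is a minimal sketch of why name-based access makes an appended column invisible to the code that reads it, using Python and an in-memory SQLite table purely for illustration; the schema and column names are mine, not the platform's.

```python
import sqlite3

# Illustrative schema; the real platform's tables and columns were different.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by column name
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.execute("INSERT INTO orders (customer, total) VALUES ('Alice', 99.50)")

# The "standard change": append a new column at the end of the table.
conn.execute("ALTER TABLE orders ADD COLUMN loyalty_tier TEXT")

row = conn.execute("SELECT * FROM orders").fetchone()

# Name-based access is unaffected by the appended column...
print(row["customer"], row["total"])

# ...whereas positional access (row[1], row[2], ...) is the style that breaks
# when a column is inserted in the middle and the ordinals shift.
print(row[1], row[2])
```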
Before I tell you what went wrong, let me explain why this is important to all of the IT professionals who are reading this.
Over the past three decades, industry data has repeatedly confirmed what this incident taught me: our assumptions about “safe” changes are frequently our greatest weakness. Reviewing the ITIL research, I was not surprised to learn that failed changes, many of them categorized as “standard” or “low-risk,” are responsible for about 80% of unplanned outages.
The numbers become even more concerning when you look closely. Having followed the Ponemon Institute's work for years, I wasn't surprised to learn that companies with well-established change management procedures have 65% fewer unscheduled outages. What did surprise me was the paradox: many of these “mature” procedures still operate on the premise that safety correlates with repetition.
What I had been observing in the field for decades was confirmed when Gartner published research showing that standard changes are responsible for almost 40% of change-related incidents. The very changes we consider safe enough to skip thorough review quietly create some of our greatest risks. IBM's analysis supports the pattern I've seen in countless organizations: while emergency changes receive all the attention and scrutiny, standard changes cause three times as much business disruption, owing to their sheer volume and our reduced vigilance around them.
The financial reality is stark: Aberdeen Group data puts the average cost of an unplanned outage at $300,000 per hour, with change-related failures accounting for the largest category of preventable incidents.
So what precisely went wrong when that database field addition brought our e-commerce platform down?
We were unaware that adding this one field would push the table past an internal optimizer threshold, prompting the database to re-evaluate its execution plans. In its algorithmic wisdom, the engine decided the table structure had changed enough to warrant rebuilding its access and retrieval mechanisms, and the new execution plan was badly unoptimized for the high-speed requests our applications depended on.
Applications began to time out, waiting for data that simply wasn't arriving within the expected response times. Instead of completing quotes or purchases, customers spent minutes staring at error pages. Thousands of transactions were affected by a single extra field that should have been invisible to the application layer.
The field addition itself was not the root cause. The root cause was our assumption that because we had made similar changes dozens of times before, this one would behave the same way. We had categorized it as a standard change based on superficial similarities, without accounting for the hidden complexity of database optimization thresholds.
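With hindsight, the kind of guard that might have caught this is a check on critical query latency before and after the schema change runs in a staging environment. The Python sketch below is only an illustration of that idea: the run_query and apply_migration helpers, the threshold, and the query text are placeholders I've invented, not the tooling we actually used.

```python
import statistics
import time

# Hypothetical hooks into a staging environment; these are assumed helpers,
# not a real library.
from staging_db import run_query, apply_migration

CRITICAL_QUERIES = [
    "SELECT id, customer, total FROM orders WHERE customer = ?",  # placeholder query
]
ALLOWED_SLOWDOWN = 1.5  # fail the change if p95 latency grows more than 50%

def p95_latency(query: str, samples: int = 50) -> float:
    """Run the query repeatedly and return its 95th-percentile latency in seconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        run_query(query)
        timings.append(time.perf_counter() - start)
    return statistics.quantiles(timings, n=20)[18]  # 95th percentile

baseline = {q: p95_latency(q) for q in CRITICAL_QUERIES}
apply_migration("add_loyalty_tier_column.sql")  # the "standard" change under test
after = {q: p95_latency(q) for q in CRITICAL_QUERIES}

for query, before in baseline.items():
    if after[query] > before * ALLOWED_SLOWDOWN:
        raise SystemExit(f"Plan regression detected for {query!r}: "
                         f"{before:.4f}s -> {after[query]:.4f}s")
```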
This experience completely altered my approach to standard changes, and it is even more relevant in today's DevOps-driven environments. Many organizations rely on pipeline deployments, which effectively create and execute a standard change at runtime. That is great for speed and reliability, but it can fall into exactly the same trap.
I have witnessed pipeline deployments cause significant incidents for reasons that had nothing to do with the code. A deployment that performed flawlessly in development and staging abruptly failed in production because of timing, resource contention, or environmental differences that weren't visible in earlier runs. Automation boosts our confidence, but it can also create blind spots.
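One mitigation I've found useful is to make those environmental differences visible before the pipeline runs, rather than discovering them in the incident review. The sketch below is a rough Python illustration; the comparison keys and the describe_environment helper are assumptions of mine, not a specific tool.

```python
# Hypothetical pre-deployment drift check: compare the environment facts that
# tend to differ between staging and production before trusting an earlier run.
# describe_environment is an assumed helper, not a real library; in practice the
# values would come from your inventory, configuration store, or monitoring.
from envinfo import describe_environment

KEYS_TO_COMPARE = ["db_engine_version", "connection_pool_size", "feature_flags", "region"]

def drift(staging: dict, production: dict) -> dict:
    """Return the keys whose values differ between the two environments."""
    return {
        key: (staging.get(key), production.get(key))
        for key in KEYS_TO_COMPARE
        if staging.get(key) != production.get(key)
    }

differences = drift(describe_environment("staging"), describe_environment("production"))
for key, (stg_value, prod_value) in differences.items():
    # Surface the difference so a human or a policy can decide whether
    # "it worked in staging" still means anything for this deployment.
    print(f"DRIFT {key}: staging={stg_value!r} production={prod_value!r}")
```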
Over the course of thirty years, I have come to an unsettling realization: there is no such thing as a truly routine change in a complex system. Every modification lands in a slightly different context, with different environmental factors, data states, and system loads. What we call “standard changes” are really just changes with comparable procedures, not comparable risk profiles.
For this reason, I advocate contextual change management. Rather than categorizing changes only by their technical features, we must consider system state, timing, dependencies, and the cumulative effect of recent changes. A change made at two in the morning on a Sunday under minimal load is genuinely different from the same change made during peak business hours, after three other changes have already altered the system's behavior patterns.
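To make that concrete, here is a rough Python sketch of what contextual scoring could look like. The factors and weights are entirely illustrative; the point is that the same change earns a different score depending on the conditions it lands in.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ChangeContext:
    """Snapshot of the conditions a change would land in, not just what the change is."""
    scheduled_for: datetime
    current_load_pct: float                 # e.g. from monitoring, 0-100
    changes_in_last_24h: int                # how much the system has already shifted
    shared_dependencies: list[str] = field(default_factory=list)

def contextual_risk(ctx: ChangeContext) -> int:
    """Crude additive score: the same change scores differently in a different context."""
    score = 0
    if 8 <= ctx.scheduled_for.hour <= 18:   # scheduled during business hours
        score += 3
    if ctx.current_load_pct > 60:           # system already under load
        score += 2
    score += ctx.changes_in_last_24h        # cumulative effect of recent changes
    score += len(ctx.shared_dependencies)   # shared components raise interaction risk
    return score

# The "same" standard change, assessed in two different contexts:
quiet = ChangeContext(datetime(2025, 6, 1, 2, 0), current_load_pct=10, changes_in_last_24h=0)
busy = ChangeContext(datetime(2025, 6, 3, 14, 0), current_load_pct=75,
                     changes_in_last_24h=3, shared_dependencies=["orders-db"])
print(contextual_risk(quiet), contextual_risk(busy))  # prints 0 and 9
```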
Effective change advisory boards must therefore go beyond assessing each change in isolation. I've worked with organizations where the change board carefully considered and approved every modification on its own merits, only to find that the cumulative effect of seemingly unrelated changes produced unexpected interactions and stress on the system. The most mature change management processes I've encountered require their advisory boards to step back and review the whole change portfolio over a given period. They ask: Are we touching the database too many times in a single maintenance window? Could these three separate application updates interact in unanticipated ways? What is the total resource impact of this week's approved changes?
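As a sketch of what that portfolio view might look like in practice, the Python snippet below groups approved changes by maintenance window and component and flags clusters that deserve a combined review. The records, threshold, and component names are invented for illustration; real data would come from your change-management tooling.

```python
from collections import defaultdict
from datetime import date

# Illustrative portfolio of approved changes for one week.
approved_changes = [
    {"id": "CHG-101", "window": date(2025, 6, 7), "component": "orders-db"},
    {"id": "CHG-102", "window": date(2025, 6, 7), "component": "orders-db"},
    {"id": "CHG-103", "window": date(2025, 6, 7), "component": "payments-api"},
    {"id": "CHG-104", "window": date(2025, 6, 7), "component": "orders-db"},
]

MAX_PER_COMPONENT = 2  # arbitrary threshold for this sketch

# Group the portfolio by window and component instead of judging each change alone.
portfolio = defaultdict(list)
for change in approved_changes:
    portfolio[(change["window"], change["component"])].append(change["id"])

for (window, component), ids in portfolio.items():
    if len(ids) > MAX_PER_COMPONENT:
        print(f"{window}: {component} is touched by {len(ids)} changes "
              f"({', '.join(ids)}). Review the combined impact before approving.")
```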
It's the distinction between forest management and tree management. While each change may seem logical individually, when combined, they can create situations beyond the scope of any single change assessment.
Having worked in this field for thirty years, I've come to the conclusion that our greatest confidences frequently conceal our greatest vulnerabilities. Our primary blind spots frequently arise from the changes we've made a hundred times before, the procedures we've automated and standardized, and the adjustments we've labeled as “routine.”
The question is not whether we should slow down our deployment pipelines or stop using standard changes. In today's competitive environment, speed and efficiency are essential. The real question is whether we are asking the right questions before each change goes out. Are we considering not only what the change accomplishes, but also when it occurs, what else is changing at the same time, and what state our systems are actually in right now?
I've discovered that the phrase “we've done this before” is more dangerous in IT operations than “what could go wrong?” Because, despite what we may believe, we never actually perform the same action twice in complex systems.
Here is what I would like you to consider: which everyday changes are quietly putting your environment at risk? Which procedures have you standardized or automated to the point where you no longer question their assumptions? And most importantly, when was the last time your change advisory board examined your changes as a cohesive portfolio of system modifications rather than as discrete items on a checklist?
The next time you're tempted to wave through a standard change, remember that simple database field addition. The most routine adjustments can occasionally produce the most unexpected outcomes.
I'm always up for a conversation if you want to talk about your difficulties with change management.