There should never be only one.


Resilient IT operations implements the policies, guidelines, and instructions set forth by governance. This is where the rubber meets the road: where your IT systems and operations teams ensure your clients can access your services, no matter what.

Today, we're talking about a concept that’s both incredibly simple and dangerously overlooked: the single point of failure, or SPOF for short.

Imagine you’ve built an impenetrable fortress. It has high walls, a deep moat, and strong gates. But the entire fortress can only be accessed through a single wooden bridge. That bridge is your single point of failure. If it collapses or is destroyed, your magnificent fortress is completely cut off. It doesn't matter how strong the rest of it is; that one weak link renders the entire system useless.

In your work, across your team, your processes, and your technology, these single bridges are everywhere. A SPOF is any part of a system that, if it stops working, will bring the entire system down with it. It's the one critical component, the one indispensable person, or the one vital process that everything else depends on.

When you identify and fix these weak points, you aren't being pessimistic; you're laying the foundation for something that can withstand shocks and surprises. It's about creating truly resilient systems and teams, not just seemingly strong ones. So, let's explore where these risks hide and what you can do about them.

When People Become the Problem

For those of you who know me, saying something like this feels at odds with who I am. And yet, this is one of the most common and riskiest areas of fragility in any organization. Human single points of failure don't happen because of malicious intent. They typically grow out of good intentions, hard work, and necessity. But the result is the same: a fragile system completely dependent on an individual.

The Rise of the Hero

We all know a colleague like this. The “hero” is the one person who has all the answers. When a critical system goes down at 3 AM, they're the only one who can fix it. They understand the labyrinthine codebase nobody else dares to touch. They have the historical context for every major decision made in the last decade. On the surface, this person is invaluable. Management loves them because they solve problems. The team relies on them because they’re a walking encyclopedia.

But here’s the inconvenient truth: your hero is your biggest liability.

This isn't their fault. They likely became the hero by stepping up when no one else would or could. The hero may genuinely feel they are the only one qualified to handle the issue because "management" doesn't take the necessary steps to train other people, or sets other priorities. Be aware that this is a matter of perception: the manager is very likely deeply concerned about the well-being of their employee. (I'm leaving "black companies", akin to black sites, out of the equation for a moment and concentrating on generally healthy workplaces.) The hero will also likely feel a strong bond to their environment. And every hero is different: there is a single point of failure, but not a single type of person. Each person has a different driver.

I watched a YouTube video by a famous entrepreneur the other day, and she said something that triggered a response in me, because it sows the seeds of the hero. She asked: "Would you rather have an employee who just fixes it, handles it, and deals with it, or an employee who talks about it?" Obviously, the large majority will take the person behind door number one. I would too. But then you need to step up as a manager, as an owner, as an executive, and enforce knowledge sharing.

If you channel all critical knowledge and capabilities through one person, if you let this person become your go-to specialist for everything, you've created a massive SPOF. What happens when your hero gets sick, takes a well-deserved two-week vacation to a place with no internet, or leaves the company for a new opportunity? The system grinds to a halt. A minor issue becomes a major crisis because the only person who can fix it is unavailable.

This overreliance doesn't just create a risk; it stifles growth. Other team members don't get the opportunity to learn and develop new skills because the hero is always there to swoop in and save the day. The answer? That depends on your situation and on your ability to keep this person happy without alienating the rest of the team. It may lie in the options discussed later in this article around KPIs.

The Knowledge Hoarders

A step beyond the individual hero is the team that acts as a collective SPOF. This is the team that "protects" its know-how. They might use complex, undocumented tools, speak in a language of acronyms only they understand, or resist any attempts to standardize their processes. They've built a silo around their work, making themselves indispensable as a unit.

Unlike the hero, this often comes from a place of perceived self-preservation. If they are the only ones who understand how something works, their jobs are secure, right? But this behavior is incredibly damaging to the organization's resilience, not to mention just plain wrong. The team becomes inundated with requests for new features as well as for help in resolving incidents, and in many cases it succeeds at neither. Next, the manager is called before senior management because the business is complaining that things aren't progressing as expected.

The team has thus become a bottleneck. Any other team that needs to interact with their system is completely at their mercy. Progress slows to a crawl, dependent on their availability and willingness to cooperate. Preservation has turned into survival.

The root cause of both the hero and the knowledge-hoarding team is a failure of knowledge management. When information isn't shared, documented, and made accessible, you are actively choosing to create single points of failure. We'll dive deeper into building a robust knowledge-sharing culture in a future article, but for now, recognize that knowledge kept in one person's or one team's head is a disaster waiting to happen.

When Your Technology is a House of Cards

People aren't the only source of fragility. The way you build and manage your technology stacks can easily create critical SPOFs that leave you vulnerable. These are often less obvious at first, but they can cause dangerous failures when they finally break.

The Danger of the Single Node

Let's start with the most straightforward technical SPOF: the single-node setup. Imagine you have a critical application, say, your company's main website or an internal database. If you run that entire application on one single server (a single "node"), you've created a classic SPOF.

It’s like a restaurant with only one chef. If that chef goes home, the kitchen closes. It doesn't matter how many waiters or tables you have. If that single server experiences a hardware failure, a software crash, or even just needs to be rebooted for an update, your entire service goes offline. There is no failover. The service is simply down until that one machine is fixed, patched or rebooted.

You need to set up your systems so that when one node goes down, another takes over. This is not just something for large enterprises; SMEs must do the same. I've had numerous calls from business owners who did something to their web server or system and now "it doesn't work!" Not only are they down; they now have to call me, and I then have to arrange for subject-matter experts to fix it immediately, typically at a cost far greater than if they had set up their system with active, warm, or even cold standbys.
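
To make "another node takes over" concrete, here is a minimal sketch in Python of an active/standby health check. The endpoints and the promote_standby() action are hypothetical placeholders; in real deployments this job usually falls to a load balancer, a cluster manager, or your cloud provider's failover tooling.

    # Minimal active/standby failover sketch. URLs and promote_standby()
    # are placeholders; adapt them to your own environment.
    import time
    import urllib.request

    PRIMARY = "https://primary.example.internal/health"   # placeholder endpoint
    STANDBY = "https://standby.example.internal/health"   # placeholder endpoint

    def is_healthy(url, timeout=3.0):
        """Return True if the node answers its health endpoint with HTTP 200."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.status == 200
        except OSError:
            return False

    def promote_standby():
        """Placeholder: repoint DNS, a virtual IP, or a load balancer to the standby."""
        print("Primary is down; promoting the standby node.")

    if __name__ == "__main__":
        while True:
            if not is_healthy(PRIMARY):
                if is_healthy(STANDBY):
                    promote_standby()
                else:
                    print("Both nodes are down; page a human.")
            time.sleep(30)  # check every 30 seconds

Note that the loop also checks the standby, because a standby nobody monitors is just a second single point of failure waiting to be discovered at the worst moment.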

The Mystery of Closed Technologies

Another major risk comes from an overreliance on closed, proprietary technologies. This happens when you build a core part of your business on a piece of software or hardware that you don't control and can't inspect. It’s a “black box.” You know what it’s supposed to do, but you have no idea how it does it, and you can’t fix it if it breaks. When something goes wrong, you are completely at the mercy of the company that created it. You have to submit a support ticket and wait.

This relates closely to the next section; please follow along and take the advice there.

The Trap of Vendor Lock-In

Closely related to closed technology is the concept of vendor lock-in. This is a subtle but powerful SPOF. It happens when you become so deeply integrated with a single vendor's ecosystem that the cost and effort of switching to a competitor are impossibly high. Your vendor effectively becomes a strategic single point of failure. Your ability to innovate, control costs, and pivot your strategy is now tied to the decisions of another company.

This may even run afoul of legal requirements. In Europe, we have the DORA and NIS2 regulations. DORA specifically mandates that companies have exit plans for their systems, starting with their critical and important functions. "Functions" here refers to business services, to be clear.

But we get there so easily. The native functions of AWS, Azure, and Google Cloud, just to name a few, are very enticing to use. They offer convenience, low-code tooling, and performance on tap. It's just that once you integrate deeply with them, you're caught hook, line, and sinker. And then you have people like me, or worse, your regulator, asking: "What is your exit plan?"

Your Resilience Playbook: Practical Steps to Eliminate SPOFs

Identifying your single points of failure is the first step. The real work is in systematically eliminating them. This isn't about a single, massive project; it's about building new habits and principles into your daily work. Here's a playbook I think you can start using today.

Mitigate People-Based Risks

The cure for depending on one person is to create a culture where knowledge is fluid and shared by default. Your goal is to move from individual heroics to collective resilience.

  • Mandate real vacations. This might sound strange, but one of the best ways to reveal and fix a “hero” problem is to make sure your hero takes a real, disconnected vacation. This isn't a punishment; it's a benefit to them and a necessary stress test for the team. It forces others to step up and document their processes in preparation. The first time will be painful, but it gets easier each time as the team builds its own knowledge.

  • Adopt the "teach, don't just do" rule. Coach your senior experts to see their role as multipliers. When someone asks them a question, their first instinct should be to show how, not just to do it for them. This can be a five-minute screen-sharing session, grabbing a colleague to pair program on a fix, or taking ten minutes to write down the answer in a shared knowledge base so it never has to be asked again.

    Many companies have knowledge-sharing solutions in place. Take a moment to actually use them. Prepare for when new people come into the company: have a place where they can get into the groove and learn the heartbeat of the company. There is a reason why the Madonna song is so captivating to so many people; getting into the groove elevates you. And the same thing happens in your company.

  • Rotate responsibilities and run "game days". Actively move people around. Let a developer handle support tickets for a week to understand common customer issues. Have your infrastructure expert sit with the product team. Also, create “game days” where you simulate a crisis. For example: "Okay team, our lead developer is 'on vacation' today. Let's practice a full deployment without them.” This makes learning safe and proactive.

  • Celebrate team success, not individual firefighting. Shift your praise and recognition. Instead of publicly thanking a single person for working all night to resolve a problem, celebrate the team that built a system so resilient it didn't break in the first place. Reward the team that wrote excellent documentation that allowed a junior member to solve a complex issue. Culture follows what you celebrate. At the same time, if the team doesn't step up, definitely praise the person, and then follow up with the team to close the gap.

  • Host internal demos and tech talks. Create a regular, informal forum where people can share what they're working on. This could be a "brown bag lunch" session or a Friday afternoon demo. It demystifies what other teams are doing, breaks down silos, and encourages people to ask questions in a low-pressure environment.

  • Remunerate sharing. Make sharing knowledge a bonus-eligible key performance indicator. The more sharing an expert does, with their peers acknowledging this, the more the expert earns. You can easily incorporate this into your peer feedback system. 

  • Run disaster recovery plan (DRP) exercises without your top engineers. This is a leap of faith, and I would never recommend it until all of the above are in place and proven.

Building Resilient Technical Systems

The core principle here is to assume failure will happen and to design for it. A resilient system isn't one where parts never fail, but one where the system as a whole keeps working even when they do.

  • Embrace the rule of three. This is a simple but powerful guideline. For critical data, aim to have three copies on two different types of media, with one copy stored off-site (or in a different cloud region). For critical services, aim for at least three instances running in different availability zones. This simple rule protects you from a wide range of common failures.

  • Automate everything you can. Every manual process is a potential SPOF. It relies on a person remembering a series of steps perfectly, often under pressure. Automate your testing, your deployments, your server setup, and your backup procedures. Scripts are consistent and repeatable; tired humans at 3 AM are not.

  • Use health checks and smart monitoring. It's not enough to have a backup server; you need to know that it's healthy and ready to take over. Implement automated health checks that constantly monitor your primary and redundant systems. Your monitoring should alert you the moment a backup component fails, not just when the primary one does.

  • Practice chaos engineering. Don't wait for a real failure to test your resilience; intentionally introduce failures in a controlled environment. Start small. What happens if you turn off a non-critical service during work hours? Does the system handle it gracefully? Does the team know how to respond? This turns a potential crisis into a planned, educational drill; a minimal drill script is sketched below.
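
To show how small such a drill can start, here is a minimal sketch of a chaos-style exercise. The service names are hypothetical, and the Docker command is only one example of a stop action; adapt it to whatever runs your services, and only ever point it at non-critical components in a window the team has agreed on.

    # Minimal chaos-drill sketch: during an agreed window, stop one randomly
    # chosen non-critical service and let the team practice detection and recovery.
    # Service names are hypothetical; "docker stop" is just one possible stop command.
    import random
    import subprocess
    from datetime import datetime

    NON_CRITICAL_SERVICES = ["report-generator", "newsletter-worker"]  # hypothetical names

    def within_drill_window(now):
        """Only inject failures on weekdays during office hours, when people can respond."""
        return now.weekday() < 5 and 9 <= now.hour < 16

    def inject_failure():
        victim = random.choice(NON_CRITICAL_SERVICES)
        print(f"Chaos drill: stopping '{victim}'. Watch your monitoring and runbooks.")
        # Example for Docker-managed services; swap in your own stop command.
        subprocess.run(["docker", "stop", victim], check=False)

    if __name__ == "__main__":
        if within_drill_window(datetime.now()):
            inject_failure()
        else:
            print("Outside the agreed drill window; doing nothing.")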

Avoiding Technology and Vendor Traps

Your resilience also depends on the choices you make about the technology and partners you rely on. The goal is to maintain control over your destiny.

  • Build abstraction layers. Instead of having your application code talk directly to a specific vendor's service, create an intermediary layer that you control. This "abstraction layer" acts as a buffer. If you ever need to switch vendors, you only have to update your abstraction layer, not your entire application. It's more work up front but gives you immense flexibility later (see the sketch after this list).

  • Make “ease of exit” a key requirement. When you evaluate a new technology or vendor, make portability a primary concern. Ask tough questions: How do we get our data out? What is the process for migrating to a competitor? Is the technology based on open standards? Run a small proof of concept to test how hard it would be to leave before you commit fully.

  • Consider a multi-vendor strategy. For your most critical dependencies, like cloud hosting, avoid going all-in on a single provider if you can. Using services from two or more vendors is an advanced strategy, but it provides the ultimate protection against a massive, platform-wide outage or unfavorable changes in pricing or terms.
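
As a sketch of the abstraction-layer idea, here is what a thin, vendor-neutral storage interface might look like. The class and method names are illustrative, not taken from any particular SDK; the point is that application code depends only on the interface, so switching vendors means writing one new adapter rather than rewriting the application.

    # Minimal abstraction-layer sketch: the application sees only ObjectStore,
    # never a specific vendor SDK. All names here are illustrative.
    from abc import ABC, abstractmethod
    from pathlib import Path

    class ObjectStore(ABC):
        """The only storage API the rest of the application is allowed to see."""

        @abstractmethod
        def put(self, key, data):
            ...

        @abstractmethod
        def get(self, key):
            ...

    class LocalDiskStore(ObjectStore):
        """Implementation backed by the local filesystem (handy for tests)."""

        def __init__(self, root):
            self.root = Path(root)
            self.root.mkdir(parents=True, exist_ok=True)

        def put(self, key, data):
            path = self.root / key
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(data)

        def get(self, key):
            return (self.root / key).read_bytes()

    class CloudVendorStore(ObjectStore):
        """Wraps whichever vendor SDK you use today; switching vendors later means
        writing another adapter like this one, not rewriting the application."""

        def put(self, key, data):
            raise NotImplementedError("call your vendor's SDK here")

        def get(self, key):
            raise NotImplementedError("call your vendor's SDK here")

    # Application code only ever sees ObjectStore:
    def archive_invoice(store, invoice_id, pdf_bytes):
        store.put(f"invoices/{invoice_id}.pdf", pdf_bytes)

In tests you can pass a LocalDiskStore; in production, whichever adapter matches your current vendor. The calling code never notices the difference, which is exactly the flexibility you want when an exit plan suddenly becomes more than a paper exercise.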

It's a Journey, Not a Destination

You will never be "ready." Building resilience by eliminating single points of failure isn't a one-time project you can check off a list. It's a continuous process. New SPOFs will emerge as your systems evolve, people change roles, and your business grows.

The key is to make this thinking a part of your culture. Make "What's the bus factor for this project?" a regular question in your planning meetings. Make redundancy and documentation a non-negotiable requirement for new systems. By constantly looking for the one thing that can bring everything down, you can build teams and technology that don't just survive shocks but eat them for breakfast.

Client rating

High Impact