Early in my career, probably 30 years ago, I recall shipping a significant application at a bank. I don't even remember what the product was, but I do remember that management was really worried about things not going well. So we rolled out to a small subset of our customers for a couple of weeks, while developers were on call for the support desk. In the first few days, support called regularly and we'd come down and fix the problems right away.

[Image: a line of dominos falling. Image by Alicja from Pixabay]

By the second week, the support team had stopped calling the developers and when we checked in with them, they said everything was going great. So we rolled out to the entire user base of 10,000 people.

Fast forward a couple of days, and we discovered that there were lots of problems, and that there had been problems all the way through the support period. The gotcha was that the support team had found workarounds, so they had stopped reporting those problems. Once they'd figured out a workaround, they just walked users through it rather than reporting it. Problem solved, or so they thought.

Until they were suddenly supporting 10,000 people who all needed those workarounds at the same time. All of a sudden the support team didn't have enough capacity, and it was a full panic: all hands on deck, and development in full firefighting mode.

Where do we even start to unpack this?

Did we have quality issues? Obviously, yes. The fact that so many significant bugs had been found shows that there were quality problems.

Did we have a problem with support stopping at a workaround and not digging deeper to find the underlying problem? Again, yes. While this could have been a conscious choice (a satisficing decision), it was most likely a cognitive bias (the Einstellung effect) that led them to stop looking for a better solution.

Did we have communication problems between development and support? Yes. The fact that development wasn't even aware that things were breaking was a significant problem.

Did we have problems with management not seeing the whole picture? Again, yes. Despite their focus on wanting this to go smoothly, they didn't have any view of the overall picture. Each group continued to work in its own silo, optimizing for its own behaviour. Looking at the bigger picture is a management responsibility, and they had abdicated it to the individual teams, telling them to figure it out on their own. These were teams that were not in the habit of working together and had never built up the skills or processes that would have made this effective.

When we see a significant failure, it's rarely because one thing went wrong. We can usually correct for a single mistake. The significant failures are due to a cascade of failures, as in this case.

It wasn’t just that there were bugs. It wasn’t just that support stopped reporting problems when they had a workaround. It wasn’t just that groups weren’t talking to each other. It wasn’t just that management had taken their eyes off the ball.

All of these things had happened at once, and it was a disaster. Had only one thing happened, we likely would have compensated for it. The fact that they all happened at once made that impossible.

What made this a cascade rather than a recoverable problem was that each failure cut a feedback loop. Bugs had been reaching us through support calls, but once the workarounds were in place, we stopped hearing about them. We relied on support to escalate problems to development, but support had stopped calling. We expected management to track rollout health through what development reported, but we had nothing to report. Each failure silenced the signal we would have needed to catch the next one.

Each team was solving their own problem: support kept users moving, development fixed what they heard about, management tracked what development told them. Nobody was watching the whole system. Checking in with each team and hearing that everything is fine is not the same as understanding the health of the rollout. Silence is not a signal of success. Ask instead whether the feedback loops are still intact.

How could we have identified the individual problems before we had a cascading failure?

For people internal to the system, that's the main point of retrospectives: reflecting deeply on what's going on and surfacing problems before they turn into failures. If we're only having superficial conversations then we won't uncover these things, but if we're having the right retrospectives, we will. See my course Retrospective Magic for more on improving your retrospectives.

The other option is to bring in someone from outside the system, as it's often easier for an outsider to see things that the insiders have learned to ignore. That's not a criticism of the insiders, but rather an acknowledgement of the way human brains work. The more we ignore a thing, the less chance there is of that thing even being brought into conscious awareness - we literally stop seeing it. Refer to the reticular activating system in this article on poor code.

If you’d like help with either of those then let’s talk.

See also: