Problem diagnoses

July 19, 2020 ☼ writing

At a previous job of mine, I volunteered to be a part of the group that helped coordinate response to company-wide downtime, or incidents”. This role was called being an incident PM. My job wasn’t to fix the crisis at hand, but to help coordinate response and communications across many groups of people. If the website went down at 3am, my phone would ring. I’d be responsible for waking up whoever best understood the problem, helping people identify a fix, update the public with a status update if needed, and be present at debrief meetings where teams would discuss how to best prevent relevant issues from reoccurring. I enjoyed this work and I recently realized the biggest lesson I learned from it.

When a problem happens in a complex system, it’s not always apparent what the root cause of the issue is. Even if a problem is highly visible, in complex systems, the cause could be dozens of specific things. Sometimes when a cause seems obvious at a high level (e.g. the website is down because the database isn’t working”), the immediate action can be unclear without digging deeper. Is the database not working because we can’t handle a new wave of traffic, or is the computer it’s running on just out of disk space?

In times of crises, biasing towards action is natural. But actioning before understanding the cause can lead to wasted time and incorrect risk assessments. When I was working as an incident PM, promising-sounding solutions to complex problems were often presented before the cause of the problem was established. These solutions were often found to be wrong. It’s easy to say yes to a solution early on to stop the bleeding, but most of the time, it’s smarter to double down on establishing facts before blindly committing resources to an unclear solution.

It may be tempting to have someone with knowledge of a system to rush out a fix based on their best hunch, but it’s often faster to use the same person’s expertise to double down on understanding the problem. Fewer mistakes and better solutions are made this way. Biasing towards action is great, but it’s better to spend the needed time to pool resources towards the best action.