Many technical people “just know” how to troubleshoot a technical issue, whether from experience, example, or trial and error. Put on the spot, though, many of those same people can’t necessarily tell you HOW they troubleshoot.

How do YOU troubleshoot?

Basic Troubleshooting Framework

The obvious answer is, “It depends,” but that’s not very satisfying unless you can list a host of problem classes and how to deal with each. Instead, let’s look at some high-level steps that describe how you might approach any technical problem.

Incident Management Objectives

The above steps apply to almost any issue, but what if you’re on-call for a production system? Generally, production readiness includes an incident management process that people can be trained on and then follow, to ensure the following goals are met:

  • Resolve an incident as quickly as possible (low MTTR, or mean time to resolution)
  • Ensure client satisfaction with support quality
  • Keep clients and stakeholders up to date on what to expect so they can plan accordingly
  • Follow up with analysis and action items to avoid this issue in the future
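
To make the first goal measurable, here is a minimal sketch of how MTTR could be computed, assuming it is read as mean time to resolution and that you record when each incident was detected and when it was resolved. The incident records and the function name `mean_time_to_resolution` are hypothetical, made up for illustration; your paging or ticketing tool may define and report the metric differently.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, resolved_at).
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 45)),      # 45 minutes
    (datetime(2024, 1, 10, 22, 15), datetime(2024, 1, 11, 0, 15)),  # 2 hours
    (datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 14, 30)),  # 30 minutes
]


def mean_time_to_resolution(records):
    """Return the average of (resolved - detected) across all incidents."""
    if not records:
        raise ValueError("need at least one incident")
    # Sum the per-incident durations, starting from a zero timedelta.
    total = sum(
        (resolved - detected for detected, resolved in records),
        timedelta(),
    )
    return total / len(records)


mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:05:00
```

Tracking this number over time is one way to check whether changes to your process are actually shortening resolution time.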

Incident Management Framework

One nice thing about a DevOps culture is that the people who are on-call for the production systems are also developers. Having a developer on-call can shorten the resolution time of an incident and sometimes produce a better-informed solution, because developers have deeper domain knowledge of the system. Because of this, the process for incident management might look a little different, with fewer scripted runbooks to follow and more in-depth, iterative troubleshooting.

The following steps describe one way you can combine the Basic Troubleshooting Framework with additional process to meet the Incident Management Objectives.

Not Perfect, But Good Enough

No process is perfect, of course. This one was assembled from years of experience with both problem solving and being on-call, but you may have more experience yourself. So feel free to adopt the above steps and modify them as needed. If I missed something important or you think they’re out of order, feel free to drop me a note in the comments. I’ll be using this as a reference too, so I may update it over time.

Cloud Guy. Anthos Solutions Architect at Google (opinions my own). X-Cruise, X-Mesosphere, & X-Pivotal.