DevOps Troubleshooting

3 min read · Jul 1, 2020

Many technical people “just know” how to troubleshoot a technical issue, whether from experience, example, or trial and error. Put on the spot, though, many of those same people can’t necessarily tell you HOW they troubleshoot.

How do YOU troubleshoot?

Basic Troubleshooting Framework

The obvious answer is, “It depends,” but that’s not very satisfying unless you can list a host of problem classes and how to deal with each. Instead, let’s look at some high-level steps that describe how you might approach any technical problem.

Identify the symptoms
Gather and examine detailed information
Hypothesize potential causes
Verify hypotheses one by one, in order of most likely and/or simplest to fix
Devise a plan to solve the problem
Implement the solution plan
Verify the issue was resolved
Repeat as needed
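
The ordering in the “verify hypotheses” step can be made concrete with a small sketch. The Python below is only an illustration, not part of any real tool: the `Hypothesis` class, its `likelihood` and `effort` fields, the scoring rule, and the example checks are all hypothetical. It ranks hypotheses so that likely, cheap-to-check ones come first, then tests them one by one until one is confirmed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    description: str
    likelihood: float          # rough guess between 0 and 1 (assumed scale)
    effort: float              # rough cost to verify or fix, must be > 0 (assumed: minutes)
    check: Callable[[], bool]  # returns True if the hypothesis is confirmed


def rank(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Order by likelihood per unit of effort, highest first."""
    return sorted(hypotheses, key=lambda h: h.likelihood / h.effort, reverse=True)


def first_confirmed(hypotheses: List[Hypothesis]) -> Optional[Hypothesis]:
    """Verify hypotheses one by one and stop at the first confirmed one."""
    for h in rank(hypotheses):
        if h.check():
            return h
    return None  # nothing confirmed: gather more information and repeat


# Hypothetical example; real checks would inspect logs, metrics, or config.
candidates = [
    Hypothesis("disk full", likelihood=0.2, effort=1, check=lambda: False),
    Hypothesis("bad deploy", likelihood=0.6, effort=5, check=lambda: True),
    Hypothesis("kernel bug", likelihood=0.05, effort=120, check=lambda: False),
]
cause = first_confirmed(candidates)
print(cause.description if cause else "no hypothesis confirmed")
```

Dividing likelihood by effort is just one possible way to combine “most likely” and “simplest to fix”. Here it puts the quick disk check ahead of the more probable but slower deploy investigation.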

Incident Management Objectives

The above steps apply to almost any issue, but what if you’re on-call for a production system? Generally, production readiness includes having an incident management process that people can be trained on and follow to ensure the following goals are met:

Resolve an incident as quickly as possible (low mean time to resolution, or MTTR)
Ensure client satisfaction with support quality
Keep clients and stakeholders up to date on what to expect so they can plan accordingly
Follow up with analysis and action items to avoid the same issue in the future
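
As a rough illustration of the first goal, MTTR is read here as mean time to resolution: the average time from detecting an incident to resolving it. The sketch below assumes each incident is recorded as a (detected, resolved) pair of timestamps. The function name and the sample data are made up for the example.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_resolution(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) over all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Made-up (detected, resolved) pairs for illustration only.
incidents = [
    (datetime(2020, 7, 1, 9, 0), datetime(2020, 7, 1, 9, 30)),    # 30 minutes
    (datetime(2020, 7, 3, 14, 0), datetime(2020, 7, 3, 15, 30)),  # 90 minutes
]
mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```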

Incident Management Framework

One nice thing about a DevOps culture is that the people who are on-call for the production systems are also developers. Having a developer on-call can shorten the resolution time of an incident and sometimes produce a better-informed solution, because developers have more in-depth domain knowledge of the system. Because of this, the process for incident management might look a little different, with fewer scripted runbooks to follow and more in-depth, iterative troubleshooting.

The following steps describe one way you can combine the Basic Troubleshooting Framework with additional process to meet the Incident Management Objectives.

Identify the symptoms, scope, impact, and urgency
Raise the alarm and ask for help
Gather and examine detailed information (logs, metrics, errors, traces, current state)
Assign a communications lead (to allow the problem solvers to focus) & communicate status regularly
Establish a war room and/or video conference, if the severity merits it
Hypothesize potential causes
Try to reproduce the problem, and record the steps
Verify hypotheses one by one (or in parallel, if in a group) in order of most likely and/or simplest to fix
Stem the bleeding quickly, if possible
Brainstorm more hypotheses, if needed
Identify the chain of causality (root causes)
Devise and implement a plan to solve the problem
Closely monitor the behavior after a fix has been applied to quickly detect if the problem resurfaces or side effects occur
Document the symptoms, validated hypothesis, and changes made
Have a postmortem to share the knowledge and identify improvements to product and process
Admit mistakes and be honest, but avoid blame, shame, or punishment
Reward overtime so that being on-call isn’t just a chore
Praise participants who played their part well, not just the one who discovered or solved the issue
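
Documenting symptoms, hypotheses, and changes tends to be easier when notes are captured during the incident rather than reconstructed afterwards. As one possible sketch, a timestamped log could be kept and rendered for the postmortem. The `IncidentLog` class, its fields, and the entry kinds are hypothetical and not taken from any incident management tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Tuple


@dataclass
class IncidentLog:
    title: str
    entries: List[Tuple[datetime, str, str]] = field(default_factory=list)

    def record(self, kind: str, note: str) -> None:
        """Append a timestamped entry; kind is e.g. 'symptom', 'hypothesis', 'change'."""
        self.entries.append((datetime.now(timezone.utc), kind, note))

    def summary(self) -> str:
        """Render the timeline as plain text for the postmortem."""
        lines = [f"Incident: {self.title}"]
        for when, kind, note in self.entries:
            lines.append(f"{when:%H:%M:%S} UTC [{kind}] {note}")
        return "\n".join(lines)


# Made-up incident for illustration only.
log = IncidentLog("checkout errors")
log.record("symptom", "HTTP 500 rate above normal on checkout")
log.record("hypothesis", "latest deploy introduced a regression")
log.record("change", "rolled back to previous release")
print(log.summary())
```

Recording the kind of each entry keeps symptoms, hypotheses, and changes easy to separate later, which matches what the documentation step asks for.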

Not Perfect, But Good Enough

No process is perfect, of course. This one was assembled from years of experience both problem solving and being on-call, but you may have more experience yourself. So feel free to adopt the above steps and modify as needed. If I missed something important or you think they’re out of order, feel free to drop me a note in the comments. I’ll be using this as a reference too, so I may update it as needed.

Written by Karl Isenberg