DevOps Troubleshooting

Karl Isenberg
Jul 1, 2020

Many technical people “just know” how to troubleshoot a technical issue, from experience, example, or trial and error. But many of those same people, when put on the spot, can’t necessarily tell you HOW they troubleshoot.

How do YOU troubleshoot?

Basic Troubleshooting Framework

The obvious answer is, “It depends,” but that’s not very satisfying unless you can list a host of problem classes and how to deal with each. Instead, let’s look at some high-level steps that describe how you might approach any technical problem.

  1. Identify the symptoms
  2. Gather and examine detailed information
  3. Hypothesize potential causes
  4. Verify hypotheses one by one, in order of most likely and/or simplest to fix
  5. Devise a plan to solve the problem
  6. Implement the solution plan
  7. Verify the issue was resolved
  8. Repeat as needed
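
Step 4 is the one that most resembles an algorithm, so here is a minimal sketch of it in Python. Everything in it is an assumption made for illustration: the `Hypothesis` class, its `likelihood` and `fix_cost` fields, and the `check` callbacks are hypothetical names, and the ranking rule (most likely first, simplest fix as the tiebreaker) is only one reasonable reading of “most likely and/or simplest to fix.”

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Hypothesis:
    """A candidate cause. All fields are illustrative assumptions."""
    description: str
    likelihood: float          # rough guess between 0.0 and 1.0
    fix_cost: int              # rough effort to fix; lower means simpler
    check: Callable[[], bool]  # returns True if the evidence confirms it


def rank(hypotheses: Iterable[Hypothesis]) -> List[Hypothesis]:
    """Order by most likely first, then by simplest fix."""
    return sorted(hypotheses, key=lambda h: (-h.likelihood, h.fix_cost))


def first_confirmed(hypotheses: Iterable[Hypothesis]) -> Optional[Hypothesis]:
    """Verify hypotheses one by one and stop at the first confirmed one."""
    for hypothesis in rank(hypotheses):
        if hypothesis.check():
            return hypothesis
    return None  # nothing confirmed: gather more information and repeat


# Made-up example data; real checks would inspect logs, metrics, and so on.
candidates = [
    Hypothesis("disk full", 0.2, 1, lambda: False),
    Hypothesis("bad deploy", 0.6, 2, lambda: True),
    Hypothesis("DNS outage", 0.6, 5, lambda: False),
]
confirmed = first_confirmed(candidates)
print(confirmed.description if confirmed else "no hypothesis confirmed")
```

Returning `None` when nothing is confirmed corresponds to step 8: go back, gather more information, and come up with new hypotheses.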

Incident Management Objectives

The above steps apply to almost any issue, but what if you’re on-call for a production system? Generally, production readiness includes having an incident management process that people can be trained on and follow to ensure the following goals are met:

  • Resolve an incident as quickly as possible (small MTTR)
  • Ensure client satisfaction with support quality
  • Keep clients and stakeholders up to date on what to expect so they can plan accordingly
  • Follow up with analysis and action items to avoid this issue in the future
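
To make the first goal measurable, here is a minimal sketch of computing MTTR. It assumes MTTR means mean time to resolution and that each incident is recorded as a (detected, resolved) pair of timestamps. The function name and the sample data are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_resolution(
    incidents: List[Tuple[datetime, datetime]]
) -> timedelta:
    """Average of (resolved - detected) across all incidents."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so the sum works on timedeltas
    )
    return total / len(incidents)


# Hypothetical incident history: one 30-minute and one 90-minute incident.
incidents = [
    (datetime(2020, 7, 1, 9, 0), datetime(2020, 7, 1, 9, 30)),
    (datetime(2020, 7, 2, 14, 0), datetime(2020, 7, 2, 15, 30)),
]
mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```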

Incident Management Framework

One nice thing about a DevOps culture is that the people who are on-call for the production systems are also developers. Having a developer on-call can shorten the resolution time of an incident and sometimes provide a better, more informed solution, because developers have more in-depth domain knowledge of the system. Because of this, the process for incident management might look a little different, with fewer scripted runbooks to follow and more in-depth, iterative troubleshooting.

The following steps describe one way you can combine the Basic Troubleshooting Framework with additional process to meet the Incident Management Objectives.

  1. Identify the symptoms, scope, impact, and urgency
  2. Raise the alarm and ask for help
  3. Gather and examine detailed information (logs, metrics, errors, traces, current state)
  4. Assign a communications lead (to allow the problem solvers to focus) & communicate status regularly
  5. Establish a war room and/or video conference, if the severity merits it
  6. Hypothesize potential causes
  7. Try to reproduce & record the steps
  8. Verify hypotheses one by one (or in parallel, if in a group) in order of most likely and/or simplest to fix
  9. Stem the bleeding quickly, if possible
  10. Brainstorm more hypotheses, if needed
  11. Identify the chain of causality (root causes)
  12. Devise and implement a plan to solve the problem
  13. Closely monitor the behavior after a fix has been applied to quickly detect whether the problem resurfaces or side effects occur
  14. Document the symptoms, validated hypotheses, and changes made
  15. Have a postmortem to share the knowledge and identify improvements to product and process
  16. Admit mistakes and be honest, but avoid blame, shame, or punishment
  17. Reward overtime so that being on-call isn’t just a chore
  18. Praise participants who played their part well, not just the one who discovered or solved the issue
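
Documenting symptoms, hypotheses, and changes (step 14) and communicating status (step 4) are easier when someone keeps a timestamped record as the incident unfolds. Below is a minimal sketch of such a record. The `IncidentLog` and `Entry` classes and the four entry kinds are hypothetical, not part of any particular incident management tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Entry kinds chosen for this sketch; adjust to your own process.
KINDS = ("symptom", "hypothesis", "change", "status")


@dataclass
class Entry:
    kind: str
    text: str
    at: datetime


@dataclass
class IncidentLog:
    title: str
    entries: List[Entry] = field(default_factory=list)

    def record(self, kind: str, text: str, at: Optional[datetime] = None) -> Entry:
        """Append a timestamped entry; defaults to the current UTC time."""
        if kind not in KINDS:
            raise ValueError(f"unknown entry kind: {kind!r}")
        entry = Entry(kind, text, at or datetime.now(timezone.utc))
        self.entries.append(entry)
        return entry

    def summary(self) -> str:
        """Group entries by kind, as a starting point for the postmortem."""
        lines = [self.title]
        for kind in KINDS:
            for entry in self.entries:
                if entry.kind == kind:
                    lines.append(f"[{kind}] {entry.text}")
        return "\n".join(lines)


# Made-up incident used only to show the shape of the record.
log = IncidentLog("Checkout errors")
log.record("symptom", "HTTP 500s on checkout")
log.record("change", "rolled back to previous release")
log.record("hypothesis", "bad deploy (validated)")
print(log.summary())
```

The summary groups entries by kind rather than by time, so the postmortem in step 15 can start from symptoms, then hypotheses, then changes. The raw `entries` list still preserves the order in which things happened.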

Not Perfect, But Good Enough

No process is perfect, of course. This one was assembled from years of experience both problem solving and being on-call, but you may have more experience yourself. So feel free to adopt the above steps and modify as needed. If I missed something important or you think they’re out of order, feel free to drop me a note in the comments. I’ll be using this as a reference too, so I may update it as needed.

Karl Isenberg

Cloud Guy. Anthos Solutions Architect at Google (opinions my own). X-Cruise, X-Mesosphere, & X-Pivotal.