Writing About Things Going Wrong
In this post, I want to talk about one very important part of dealing with an outage or some other type of undesirable event - doing a write-up afterwards. As well as being a good exercise in introspecting behaviours in your team, a well-written report can have a huge impact on how your work, as an ops/sysadmin/SRE/whatever person, is perceived outside of your team. Whether you think it is or not, giving other people in your organisation a view into some of the details of what you do is a part of your job. As your maths teacher may have told you, you have to ‘show your work’.
Imagine you have just spent some time fighting a problem in production. You’ve managed to get to the bottom of it - what a relief. You’ve put a fix in place, and service has been restored. It may be the middle of the night depending on how your on-call rotation works. You might be cursing yourself for a stupid mistake you made months ago that has now come back to bite you. Or, perhaps you’re pissed at a vendor for screwing up. If you had to rope in other team members to develop a fix, they’re probably itching to move on to other things or get back to their evening or weekend. Chances are you’re stressed and annoyed. You probably just want to put the whole thing behind you. Before you do that though, you should spend a little time preparing a report.
Why should you do this?
Memory is perishable, especially the finer details. The sooner you record what happened, the more likely you are to accurately capture the events. Consequently, and more importantly, the more likely you are to accurately capture the corrective actions whose necessity has become apparent as a result of the incident. You may already have had a sense that these were needed, but they might not have crystallised into concrete actions just yet.
It also forces you to analyse what happened and make sure that you truly have a clear understanding of what went wrong. This is the old trick: having to explain something to someone else makes it clearer in your own mind.
You’ve just solved a problem involving some non-zero amount of effort. It had a cost to the business in the form of, at the very least, the time you put into it. To get the most value out of that cost, you should share what you’ve learned with others so that they can benefit from it as well. Regardless of whether it was something trivial or the toughest thing you’ve ever figured out in your whole career, it’s worth preserving. Remember that the person you might be saving from repeating this work may be some future version of yourself.
For example, you may have:
- Seen a new failure mode you weren’t aware of before.
- Found a dependency between two components you didn’t know of.
- Found a gap in your monitoring.
- Learned some other fun new thing about the behaviour of one of your systems you didn’t know before.
These nuggets of information should be captured somewhere. First and foremost, this should be in your incident report itself. But this isn’t enough. Out of every report should fall one or more corrective actions. These are the changes whose necessity has become apparent as a result of the incident. For most people this will take the form of one or more tickets to be dealt with at some later stage.
The outcome from these tickets can take a number of forms. In the simplest case this may just be a piece of documentation on your wiki. Hopefully this is a rarity, though, as the value you can hope to get out of it is generally pretty low. When was the last time you said to yourself: ‘I fancy a bit of a browse through our wiki’? Or, when last, in the heat of an incident, did you go searching through the wiki for an answer? It’s far more valuable for the knowledge to be captured as part of the system itself - for example, as a piece of monitoring or a piece of test automation.
Learning is, of course, a continuous process. We all have imperfect knowledge of the systems we run. After all, they change often. New versions get deployed, new components get added. There is always something new to discover, some new type of failure to see, some strange interplay that has not yet made itself obvious. This task is, in a sense, never complete. But, as you progress, two things tend to happen. First, as you’re consciously scrutinising all of the things that have gone wrong, you develop a better and better sense for what will work and what will wake you up at 4am. You start to see bad patterns, single points of failure, unnecessary complexity, brittleness, services which are hard to reason about or introspect, and so on. Second, as you begin to plug all of the more trivial issues, the complexity of the faults you face goes up. It’s not that the harder problems were not there; you just now have more space to deal with them.
Transparency and Trust
We ops people can get consumed by the infrastructure we manage. It’s easy for us to lose perspective sometimes. We forget that not everyone is embroiled in the details of operating the services in our charge. People won’t feel that you are being open and honest with them if you don’t clearly communicate what’s going on. This is where a little bit of empathy can go a long way.
If you are not reporting on incidents, something like this is all the information that your colleagues may be getting:
Server mail-prod-1 went down last night.
There is a lot of ambiguity in this. It could be interpreted many ways. If you leave gaps, others will fill them for you and they may not be as forgiving as you might like. In the above example, were there no mails sent last night at all? What does ‘last night’ mean? Is that from 6pm to 9am? What even is ‘mail-prod-1’ and what does it do? If you don’t write up a report, this may be the only version of events that your organisation is getting. A better version of the above might be:
Our primary email server (mail-prod-1) was out of service between 20:05 UTC and 20:17 UTC due to a disk failure. During this time, all mail was delivered to the backup server and queued there. When the primary server was brought back online these messages were relayed to the primary server and successfully delivered. 52 messages were generated during this time. At worst, customer mails would have been delayed by up to 12 minutes.
You also need to balance giving context with overloading the reader. Again, practice some empathy. Put yourself in the place of your audience. What are the facts which are relevant to them? What information do they need to understand what happened? In the above example, the server went down due to a hardware failure. They probably don’t need to know the model of the RAID controller or whether the drive was a spinning disk or an SSD, or which rack the machine was in. Unless, of course, those details are actually relevant. Say, if you were seeing high failure rates of this type of drive. Then a corrective action might be to stop buying these drives and proactively replace the existing units.
Hopefully, you’re working in a reasonably healthy organisation where finger-pointing and blame are not a thing. The biggest concern, then, for the readers of your report is the impact that the issue had on the business. They want to know: what were the implications of the incident? What was the context around the incident? Why did this happen? Will it happen again? What work do we need to do to ensure that it doesn’t happen again? If we can’t outright prevent it from happening, what can we do to mitigate the impact? These are the details which your report needs to address.
Transparency breeds trust. The more open you are the more people will trust you to do the right thing. This has to be an active process on your part though. It’s not enough to sit in your silo and feel open. You must broadcast this information. You’ve got to put it right in peoples’ faces. Email the reports to everyone. Don’t send them a link - put the actual text of the report in an email.
There are a few ways of looking at this. The first part is ensuring that you, as an ops team, are holding yourselves accountable for implementing improvements that you have discovered are required. Writing it down makes it real - means you have to do it.
Next, you’re holding the other teams in the organisation accountable too - developers, QA, product, marketing, finance and so on. Too often you see ops people grumble about things being broken, yet they themselves have neglected to clearly communicate the issues to those who need to know about them.
Then there is, in a broader sense, holding the business accountable. Is the business, for example, taking security seriously? Suppose you had a breach. If you propose a batch of required changes in an incident report, they are very hard to ignore.
Another, slightly more subtle aspect to this is the hard data it gives you. Making decisions based on hard data is a lot more likely to produce a good result. Not happy that one of your vendors is giving you the service you expect? Well, your historical incident reports give you a concrete picture of what the actual service level has been.
What should go in your report
So, how do you actually go about writing a report? In my mind, these are the most important properties of a good report:
- Keep it simple.
- Provide your audience with the key information needed to help them understand what happened.
- It should only contain objective facts. You should have very strong supporting evidence for everything you say in it.
- Well, OK, you can add things which you’re not 100% sure of as long as you are explicit about it.
- Try and keep your ego and your bias out of it. I know you hate that piece of software but, honestly, is it doing the job?
- Don’t make it personal. Don’t criticise an individual or a team.
- Don’t make judgements on the code.
- Don’t use it as an opportunity to whine.
My reports contain four sections.
Section 1 - Overview
This should be a quick summary of:
- The product or products impacted.
- The date on which the incident occurred.
- The severity of the incident. Use your own scale, but anything more complicated than small/medium/large is probably overdoing it.
Section 2 - Description
This section contains a step-by-step description of the sequence of events as they happened. Include times and dates of when things happened. For example, when services or components stopped or started working. When actions were taken. What were the effects of the actions? Include observations that were made and when they were made. Try to be as clear as possible.
Try to explain the thought process behind the actions taken. Remember that it may not be obvious to the reader why you did what you did at the time. Hindsight being 20-20 and all. An action that didn’t help, or, even made things worse, will probably seem obviously wrong after the fact. You need to remind your audience that the operator had imperfect knowledge of the situation at the time. Here is a hypothetical example.
At 20:18 UTC on the 25th November, our monitoring detected a problem with the foobar.com application. The issue was investigated and the site was found to be unresponsive.
A notice was sent to the email@example.com mailing list to alert the product team at 20:25 UTC.
High system load was observed across all web servers and database servers.
Examination of the logs showed higher traffic than usual. After a little analysis, it was found that a small number of IPs were making approximately 25% of all requests. All requests from these IPs were to /api/v1/login. Requests to this endpoint are made when a user logs in to the application.
At 20:41 UTC the offending IPs were completely blocked.
Traffic levels dropped immediately and service was restored at 20:43 UTC.
At 20:50 UTC traffic levels began to rise again. The same pattern of requests had begun again on a new set of IPs.
At 21:12 UTC the load-balancer configuration was updated. Rate limiting was added for the impacted URL only. A conservative limit of 100 requests per minute, per IP was used to avoid risking any adverse effects. Traffic levels dropped once again.
For 25 minutes (20:18 UTC - 20:43 UTC) users would have seen very slow or no service from the foobar.com application.
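As a rough illustration of the mitigation described above, here is a minimal sliding-window rate limiter in Python. The 100-requests-per-minute figure comes from the report; everything else, including the function and variable names, is made up for the sketch.

```python
import time
from collections import defaultdict, deque

# Per-IP sliding-window rate limiter. Purely illustrative - in the
# incident above this was done at the load balancer, not in code.
LIMIT = 100      # maximum requests per window, per IP
WINDOW = 60.0    # window length in seconds

_recent = defaultdict(deque)  # ip -> timestamps of requests still in window

def allow_request(ip, now=None):
    """Return True if a request from `ip` is within the rate limit."""
    if now is None:
        now = time.monotonic()
    hits = _recent[ip]
    # Discard timestamps that have aged out of the window.
    while hits and now - hits[0] >= WINDOW:
        hits.popleft()
    if len(hits) >= LIMIT:
        return False  # over the limit: the caller should reply 429
    hits.append(now)
    return True
```

In practice you would let the load balancer (nginx, HAProxy and friends all ship a rate-limiting facility of this general shape) do this work rather than the application, which is exactly what the hypothetical team did.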
Section 3 - Root Cause
This can be a contentious topic. Whether ‘root cause’ is actually what you should be looking for, and how you go about finding it, is a debate in itself. How I use this section is, perhaps, slightly different to some. Essentially, it is a mini-summary of the ‘what’ of the fault description. For example, for the above description, my root cause might look like this:
A brute force attack against the login system overloaded the application.
Now, the problem with root cause comes in when the conclusion you draw is ‘human error’. This is a slippery slope which you should avoid. Humans don’t decide to make mistakes. So, you need to look beyond this at the circumstances which led them to believe this was the correct course of action.
Section 4 - Corrective Action
This is a list of things you found you need to fix or improve in some way. For example, a few corrective actions for the above fault might be:
- Review the rate limiting put in place during the outage with dev and product to ensure that it is appropriate. [ops, dev, product]
- Review all API endpoints for CPU cost. Put in place rate limiting for ‘expensive’ endpoints. [ops, dev, product]
- Put in place test automation to ensure rate limit is effective. [ops, qa]
It’s OK to not have all of the answers straight away. An appropriate action might be to do further investigation and follow up with more information. It’s better to communicate soon after the fault, especially the more visible ones, to fill the organisation in on the details you do know, rather than delay until you have things 100% figured out. You can always follow up with more detail at a later stage.
Corrective actions may also be qualified recommendations. E.g., I think we should do X, but only if we can confirm Y to be true.
Put the ticket ids against the corrective actions.
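Putting the four sections together, a skeleton report for the hypothetical incident above might look something like this (the ticket ids are, of course, made up):

```text
Overview
--------
Products impacted: foobar.com
Date: 25th November
Severity: medium

Description
-----------
20:18 UTC - Monitoring detected a problem with the foobar.com application...
20:25 UTC - Notice sent to the product team mailing list...
21:12 UTC - Rate limiting added at the load balancer. Traffic levels dropped.

Root Cause
----------
A brute force attack against the login system overloaded the application.

Corrective Actions
------------------
- Review the rate limiting put in place during the outage. [ops, dev, product] (OPS-123)
- Review all API endpoints for CPU cost. [ops, dev, product] (OPS-124)
- Put in place test automation to ensure the rate limit is effective. [ops, qa] (OPS-125)
```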
Incident reports help to close the loop between fault and remediation. They help to ensure that you learn, as an organisation, from faults and work toward ensuring they are not repeated.
It’s better to send them out often - even for small things. If the only time you write a report is when things go really, really wrong, then any time someone sees a report from you, they’re anticipating a huge problem. If you send them often they become routine. Don’t be afraid to send a report along the lines of ‘something broke, but those changes we made a few months ago worked, so customers didn’t notice - we’re awesome’.
Whatever you do, things are going to break for everyone to see. If you’ve established a solid pattern of problem detection, solution design and solution application people will have much more trust in you when things go wrong.