Some thoughts on toil
The importance of distinguishing toil from other types of work and understanding its impact didn’t really click with me until recently when I read the Google SRE book. Toil is an insidious type of work. It’s impossible to completely eliminate, can be difficult to spot, and can have a huge impact on your productivity. If left unchecked, it can grow to consume the majority of your time, meaning you’re left with little or no capacity for work which improves your environment.
I really like the use of the word toil. It’s a fitting description of the effect and it feels right for it to have a distinct word to clearly delineate between toil and more productive forms of work.
In the SRE book, toil is defined as work which, among other things:
- Has no enduring value
- Scales linearly with growth
Think of the tasks which you are doing by hand, over and over, but could be automated. You know the ones - they’re not quite annoying enough (or you’ve been too busy) to write a script to do them for you instead. Or, perhaps you’ve written a script, but you don’t have 100% faith in it yet and you feel you have to baby-sit it still - so you’re running it by hand. After you’ve run that script, you’re not really any better off than you were beforehand. Maybe that disk won’t fill up today, but it’s going to keep slowly filling until you need to run that cleanup script again in a week or two. And, on top of that, you know that, the more users that are added to the application, the faster the disk is going to fill up. So, over time, you’re going to have to run it more often. This is toil.
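As a rough sketch of taking the human out of that loop, the cleanup could check disk usage itself and only delete anything once a threshold is crossed, so it can run unattended on a schedule. The threshold and retention values here are hypothetical, not taken from any real setup.

```python
import shutil
import time
from pathlib import Path


def disk_usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def remove_old_files(directory, max_age_days, now=None):
    """Delete regular files in `directory` last modified more than
    `max_age_days` ago, and return the names of the files removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed.append(entry.name)
    return removed


def cleanup_if_needed(directory, threshold=0.8, max_age_days=14):
    """Run the cleanup only when the disk is fuller than `threshold`.
    Both defaults are assumptions chosen for illustration."""
    if disk_usage_fraction(directory) < threshold:
        return []
    return remove_old_files(directory, max_age_days)
```

This only moves the toil out of your hands. The disk still fills faster as users are added, which is why a better fix may be worth looking for later on.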
Here are some other examples:
Code releases. This one can be overlooked. While the new code may bring new value, you executing the release process by hand doesn’t. Get out of the way of the release process.
On-call. OK, so you’ve booped that service, but you’ve just got back to where you were beforehand. Disaster averted, sure, but no real improvement has been made.
Say you have a low volume product. You might not have bothered to automate the setup of new customers. Every time you add a new customer to the product, a human has to run through a process. Maybe there is a script which automates part or all of the process, but a person is still interrupted from what they’re doing to go run that script. This is bad for staff and bad for customers. You may have added a bit of revenue to the product in the form of that new customer, but the product itself doesn’t have any new value.
Maybe you’re using configuration management for your infrastructure, but the actual application of changes is gated by a review process. Again, a human has to be interrupted to perform some manual steps and push the changes through. In reality, the vast majority of these changes are probably very small, and simply applying these changes to the testing environment first is probably mostly what people want to do.
Why is toil bad?
Knocking out the same repetitive tasks can give you a sense of achievement. Getting that little kick of satisfaction from crossing stuff off your list is nice. But, in reality, it’s false. It may seem that you’re progressing, but really, you’re just standing still. OK, from time to time, it might be fine to do this a little to give yourself a boost, but watch out. If you’re spending a lot of time on toil, then, by definition, you’re not doing anything new. If you’re not doing anything new, then you’re not learning anything new.
I think any moderately experienced engineer would agree that the value of their work has gone up over time, and should continue to do so. Or, to put it another way, you know a hell of a lot more now than you did on day one and you expect to know a hell of a lot more a year from now. For a lot of people, that’s why we’re in this job: we love to learn. It stands to reason that if the time you spend on something grows linearly with the size of what you look after, i.e. it is O(n), then you need to re-think how you’re doing that thing, or else you’ll eventually spend all of your time doing it.
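To make that arithmetic concrete, here is a small sketch with made-up numbers. It assumes each customer costs a fixed number of manual minutes per week and a 40 hour working week. Both figures are assumptions for illustration only.

```python
def weekly_toil_hours(customers, minutes_per_customer):
    """Hours of manual work per week when every customer costs a fixed amount."""
    return customers * minutes_per_customer / 60


def customers_until_saturated(minutes_per_customer, hours_per_week=40):
    """Smallest customer count at which the toil fills the whole working week."""
    total_minutes = hours_per_week * 60
    # Ceiling division without importing math.
    return -(-total_minutes // minutes_per_customer)


# At a hypothetical 15 minutes per customer per week:
print(weekly_toil_hours(40, 15))      # 10.0 hours, a quarter of the week
print(customers_until_saturated(15))  # 160 customers and the week is gone
```

Doubling the customers doubles the hours. Nothing in the process gets cheaper with scale unless you change the process itself.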
If you’re managing people, you need to watch out for the proportion of toil in other people’s work too. You’re not going to keep good engineers engaged if they’re not given the opportunity to make things better. It can spell boredom, burnout and possibly losing good people. Help them to see the toil in their own work - they may be having a hard time spotting it - and then make sure they have the time to pay it down.
High levels of toil can also help perpetuate the notion of ops as the janitors - there to clean up the messes left by others. Most people don’t want to do this type of work.
Ops people tend to be highly sensitive to repetition. The impulse to script things away is second nature.
We can have a bit of a dysfunctional attitude toward scoping a problem though. We see all of the myriad ways that things fail in production. At times, we can tend toward trying to flesh out all of the corner cases and find all of the odd failure modes we may encounter up front. Perhaps even to the point of paralysis sometimes. This probably comes, at least in part, from a fear that, while attention from the rest of the team may be on the problem now, in the future, we’re likely somewhat on our own. Plus, there is the desire to be done with a problem.
So, there can be this odd tension between the little script to fix this one problem, versus trying to front-load all of the work and deal with every possible incarnation of the problem and then fit that into some larger automation story.
More and more, it seems to me that the approach for reducing toil should be driven primarily by the value the changes will bring. We need to accept that an incomplete solution which solves a large chunk of the problem can be acceptable. If you solve for the most frequently occurring tasks or the tasks which take the longest time, it doesn’t matter that there is still toil which is not yet automated. The work you’ve done has still brought some value and reduced the net toil. And, there’s nothing stopping you from tackling the next-biggest chunk of toil later on. In any case, priorities could change, the nature of the application could change and you could end up spending a disproportionately high amount of time on a problem that simply disappears.
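One simple way to put a number on that value is to estimate how much time each task eats per month and start with the biggest. A minimal sketch follows. The task names and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToilTask:
    name: str
    runs_per_month: int
    minutes_per_run: float

    @property
    def minutes_per_month(self):
        return self.runs_per_month * self.minutes_per_run


def rank_by_time_cost(tasks):
    """Order tasks so the one consuming the most time per month comes first."""
    return sorted(tasks, key=lambda task: task.minutes_per_month, reverse=True)


# Hypothetical figures, for illustration only.
tasks = [
    ToilTask("disk cleanup", runs_per_month=4, minutes_per_run=10),
    ToilTask("customer setup", runs_per_month=6, minutes_per_run=30),
    ToilTask("access review", runs_per_month=3, minutes_per_run=20),
]

for task in rank_by_time_cost(tasks):
    print(f"{task.name}: {task.minutes_per_month:.0f} min/month")
```

With these numbers, customer setup (180 minutes a month) comes out well ahead of access review (60) and disk cleanup (40), so that is where a partial solution would pay off first.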
Remember, any work you do on reducing toil, by definition, gives you more time.
Real world example
Among other things, my team manages a handful of web properties. For various (good) reasons, the authentication to these properties is not centralised. Each one has local users with local passwords. When someone leaves the company, we need to review which applications, services, and so on that person had access to and ensure everything is revoked. Because they are standalone, this means logging into each one of these web properties and checking to see if this person had access to any of them. This falls very squarely in the territory of toil.
Your first impulse may be to create some sort of a super-dashboard which aggregates all of the data and actions from each of the properties, i.e. viewing, enabling, and disabling users. And this isn’t a bad idea, but it’s probably not the best use of time. There are actually not a lot of staff accounts in these properties. The majority of the time, what we’re actually doing is confirming that a given user doesn’t have an account in a given property. It’s better to solve for this, much simpler, use-case rather than trying to create a generic uber-user-management panel.
So, in the end, we wrote a little endpoint for each property which dumped out a list of users (IP white-listed and password protected of course). A cron job fetches the list from each property once per day and sticks the results in a little database to make it searchable. This was very quick and simple to implement. It hasn’t eliminated this manual process, but now it takes seconds instead of 15-20 minutes. There’s a lot of value in that and we can build on it over time.
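For a flavour of how small this can be, here is a sketch of the daily collector, not our actual code. It assumes each endpoint returns a JSON array of usernames behind HTTP basic auth and that the results go into SQLite. The property names, URLs, and table layout are all hypothetical.

```python
import json
import sqlite3
import urllib.request

# Hypothetical property names and endpoint URLs.
PROPERTIES = {
    "blog": "https://blog.example.com/internal/users.json",
    "shop": "https://shop.example.com/internal/users.json",
}


def fetch_users(url, username, password):
    """Fetch a JSON array of usernames from one property using basic auth."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, url, username, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(password_mgr)
    )
    with opener.open(url, timeout=30) as response:
        return json.load(response)


def store_users(conn, property_name, users):
    """Replace the stored user list for one property with today's snapshot."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "property TEXT NOT NULL, username TEXT NOT NULL, "
        "PRIMARY KEY (property, username))"
    )
    with conn:  # commit the delete and inserts together
        conn.execute("DELETE FROM users WHERE property = ?", (property_name,))
        conn.executemany(
            "INSERT OR REPLACE INTO users (property, username) VALUES (?, ?)",
            [(property_name, user) for user in users],
        )


def properties_for(conn, username):
    """Return the properties in which `username` currently has an account."""
    rows = conn.execute(
        "SELECT property FROM users WHERE username = ? ORDER BY property",
        (username,),
    )
    return [row[0] for row in rows]


def collect(conn, username, password):
    """The part the daily cron job would run: refresh every property."""
    for name, url in PROPERTIES.items():
        store_users(conn, name, fetch_users(url, username, password))
```

Checking a leaver then becomes a single call to `properties_for(conn, "some.user")`, and an empty list is the common answer.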