On-Call Done Right
Derek Power, Infrastructure Head of Engineering, explains how we set up our on-call strategy and how we improved it to meet business needs without burning people out.
On-call, when done correctly, is not a Big Monster that engineers need to be afraid of. It can and should be viewed as a learning tool that allows engineers to better understand the platform they work on every day. With proper and measured alerts, coupled with a fair on-call rotation, you will find yourself in a situation where engineers will actively apply to join your rotations.
How it was
Originally Jimdo’s on-call was looked after by eight heroes, two of whom were engineering managers. In the days when the company’s headcount was small, this setup worked fine. The problem is that you have the same eight people looking after everything that is considered production, regardless of whether they work in the team whose product may fire an alert at night. This, coupled with a small pool of folk in the rotation, could lead to on-call fatigue.
But as with all companies that grow, both in terms of people and platform, your on-call rotation and setup must grow too. The trick is how you do this. If you have a solid platform and few alerts fire at night, volunteers are easy to come by. All that is required is a little refactoring of which alerts go to which people. By wrapping a little structure around an existing framework you can modify it easily so that it works better for both the engineers and the company.
How it is now
Having one large rotation looking after everything is not the best way to ensure your Mean Time To Resolution (MTTR) is as low as it can be. Instead, you should consider having targeted rotations that look after specific areas of the platform, supported by the best-placed engineers: those who work in such areas.
Ideally, you would have a mix of developers and operations folk in your on-call rotations, to ensure that the right people are available to handle anything that fires. Too often companies think that having Operations or Site Reliability Engineering (SRE) look after all the out-of-hours alerts is the right thing to do, regardless of whether or not these teams have development experience.
But if you structure the rotations correctly, you should be able to avoid having the wrong team alerted about an issue they cannot resolve without assistance.
So how did Jimdo get volunteers to join the on-call schedules? By breaking down production into three distinct pillars, as shown above.
We were then able to build three on-call teams, each looking after the pillar best suited to its skillset. This is a two-fold win for the company. The first win is that the right people are alerted to issues that need investigating. The second is that, because we only wake up the right engineer for the relevant alert, more people are willing to sign up and be part of on-call rotations.
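As a rough illustration of routing alerts by pillar, here is a minimal Python sketch. The pillar and rotation names are placeholders we made up for the example (the real pillars are not named in this text), and the Alert type is hypothetical rather than part of any tool we use.

```python
from dataclasses import dataclass

# Placeholder names: stand-ins for the three real pillars and their rotations.
ROTATIONS = {
    "pillar-a": "rotation-a",
    "pillar-b": "rotation-b",
    "pillar-c": "rotation-c",
}


@dataclass(frozen=True)
class Alert:
    name: str
    pillar: str  # the area of production the alerting system belongs to


def route(alert: Alert) -> str:
    """Return the rotation that should be paged for this alert."""
    try:
        return ROTATIONS[alert.pillar]
    except KeyError:
        # Fail loudly instead of silently paging the wrong team.
        raise ValueError(
            f"alert {alert.name!r} has unknown pillar {alert.pillar!r}"
        ) from None
```

Rejecting an alert with an unknown pillar, rather than falling back to a catch-all rotation, is one way to keep every alert owned by the people best placed to fix it.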
So how do we bring people into the on-call fold? To begin with, it’s all about how your on-call is configured. At Jimdo an alert that will wake somebody up cannot ‘go live’ unless it has an associated runbook. These runbooks should be an almost ‘paint by numbers’ guide on how to resolve the alert. Beyond making it easier for the on-call engineer to resolve the issue, following this principle gives the MTTR the best chance of being short.
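The runbook rule lends itself to an automated check. The sketch below assumes alert rules can be loaded into simple records; the AlertRule type, its field names and the example URL are hypothetical, not a description of our actual tooling.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class AlertRule:
    name: str
    pages_on_call: bool                # True if the alert can wake somebody up
    runbook_url: Optional[str] = None  # hypothetical field linking the runbook


def rules_missing_runbooks(rules: Iterable[AlertRule]) -> List[str]:
    """Return the names of paging rules that have no runbook attached."""
    return [
        rule.name
        for rule in rules
        if rule.pages_on_call and not (rule.runbook_url or "").strip()
    ]


rules = [
    AlertRule("api-latency", pages_on_call=True,
              runbook_url="https://example.invalid/runbooks/api-latency"),
    AlertRule("disk-full", pages_on_call=True),
    AlertRule("nightly-report-late", pages_on_call=False),
]
print(rules_missing_runbooks(rules))
```

A check like this could run before an alert definition is merged, so that a paging alert without a runbook never reaches production. Non-paging alerts are deliberately left alone.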
You never want on-call engineers to go into their first rotation without a sense of support. Luckily part of the company culture ingrained at Jimdo is that folk help each other out. Six of the original on-call rotation remained in the new setup, taking shifts just as they always had. They also volunteered to go into a secondary rotation - the Backup. Should anyone in one of the three pillar rotations feel that they needed help, there was a scheduled engineer they could escalate to as required. For the initial three months of the new rotation setup, we had this Backup engineer, making it easier for those going into on-call to not feel like they would be doing it solo.
This is a key part of bringing anybody into an on-call rotation: providing a safety net. They may never escalate to that person, but having the option removes some of the stress that on-call can cause. It is another thing that some companies overlook. You can have a team of talented engineers during the day, but at 2 a.m. it is nice to know you have a friend to call on should the need arise.
The best on-call process in the world is nothing without the tools to make it run.
We use PagerDuty, as many in the industry do, for sending out the alerts that get people fixing problems out of hours. It is plugged into our monitoring solution and configured so that, when an alert fires, the platform is given a small window to ‘self-heal’ before PagerDuty is triggered.
Again, an important element often overlooked by other companies.
The last thing you want is for your alert to immediately get an engineer out of bed. What if the application or system recovers between an alert triggering and the engineer booting up their laptop? The number one complaint from engineers on-call is that they get woken up by events that have resolved themselves. People should only get woken up when they absolutely must.
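The idea of a self-heal window can be sketched in a few lines of Python. This is an illustration of the principle only: the SelfHealGate class and the 300-second window are assumptions for the example, and in practice this behaviour is usually a setting in the monitoring tool rather than hand-written code.

```python
class SelfHealGate:
    """Page only when an alert has been firing for a full grace window."""

    def __init__(self, grace_seconds: float) -> None:
        self.grace_seconds = grace_seconds
        self._firing_since = {}  # alert name -> time it started firing

    def observe(self, alert_name: str, is_firing: bool, now: float) -> bool:
        """Record one check result; return True if somebody should be paged.

        Once the window has elapsed this keeps returning True on every
        check, so de-duplication is left to the paging tool.
        """
        if not is_firing:
            # The system recovered on its own: forget the alert entirely.
            self._firing_since.pop(alert_name, None)
            return False
        started = self._firing_since.setdefault(alert_name, now)
        return now - started >= self.grace_seconds


gate = SelfHealGate(grace_seconds=300)  # assumed window, for illustration
print(gate.observe("api-errors", True, now=0))    # just started firing
print(gate.observe("api-errors", False, now=180)) # recovered by itself
print(gate.observe("api-errors", True, now=240))  # firing again, clock restarts
print(gate.observe("api-errors", True, now=540))  # still firing 300s later
```

Only the last check asks for a page. The recovery at 180 seconds resets the clock, which is exactly the case where an engineer would otherwise have been woken up for nothing.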
The Depth of Night
So what happens when alerts go ping in the night? We have a nice and tidy process wrapped around that. Using Slack as our main messaging service in Jimdo, we have dedicated channels to discuss issues when they arise. If an alert fires, what typically happens is that the on-call engineer, upon beginning their investigation, posts into one of the channels that they are looking at the alert. Should the alert be an easy-to-resolve one, as in simply following the runbook, an update is posted and we continue as normal.
If the alert requires more than just the on-call engineer involved then we start an escalation process into the relevant teams or even go up the hierarchy to pull in senior engineers. The actions being taken are updated in the Slack channel for all to see and follow easily. Depending on the severity of the issue being worked on we may declare an incident and then follow our incident management process.
There is an important and often ignored task that must be done to ensure you are running a successful on-call rotation: The Post Morning Read. This isn’t something that the engineers should do, but rather the managers who are in charge of running the rotation. You want to check the usual channels that issues are flagged in, but also then check the alerts that fired during the night. It should involve checking PagerDuty to see what actually woke somebody up. All of this is important as it provides metrics that can be used to gauge not only how well on-call is doing, but also how well the product is doing.
If you see ‘wake up’ alerts in triple digits it might be time to focus on the reliability of the platform over feature rollouts. Should you see fewer than ten alerts fired over the course of an entire week, yet customers are complaining about poor service, maybe you need to revisit some thresholds or add in more alerts.
Maybe low double-digit alerts firing over a week is a sign that we’re doing a good job, things are reliable and we are not impacting customers with poor service. But if you skip the Morning Rituals you run the risk of having your on-call go from something people can learn from to something people will run from.
Which will make your next incident much harder to resolve.
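The weekly rules of thumb above can be written down as a small helper for whoever does the morning read. This is a sketch only: the function name and the wording of the results are ours, the cut-offs of 100 and 10 simply encode ‘triple digits’ and ‘fewer than ten’, and the final catch-all branch covers cases the rules of thumb do not spell out.

```python
def weekly_on_call_review(wake_up_alerts: int, customers_complaining: bool) -> str:
    """Suggest a follow-up from one week of 'wake up' alert counts."""
    if wake_up_alerts < 0:
        raise ValueError("alert count cannot be negative")
    if wake_up_alerts >= 100:
        # Triple digits: the platform itself needs attention.
        return "focus on platform reliability over feature rollouts"
    if wake_up_alerts < 10 and customers_complaining:
        # Quiet pager but unhappy customers: alerting is missing problems.
        return "revisit thresholds or add more alerts"
    # Assumption: everything else is treated as healthy for now.
    return "looks healthy, keep reviewing"


print(weekly_on_call_review(140, customers_complaining=False))
print(weekly_on_call_review(4, customers_complaining=True))
print(weekly_on_call_review(15, customers_complaining=False))
```

The count itself would come from the paging tool's record of what actually woke somebody up, and the judgement about customer complaints stays with a human.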