Initial episode: Stopping services which run too well
I’m an IT consultant, and I have never suggested to any of my clients that they should interrupt a well-running application in production. Maybe that was not as professional as it might seem. Let’s see why.
Working as a consultant at many different companies over time means experiencing that IT is different everywhere. One observation holds, though: employees are never fully satisfied with their IT. People often struggle to get things done and to keep everything running. They would never go so far as to reduce the uptime of an IT service, right?
In general, the differences between large banks and medium-sized software providers are minor, and they all eventually pick up the main trends of our industry, such as Agile Development, Big Data, DevOps, or Cloud. Many of the evolutions and paradigm shifts we see today are well known. However, the industry leader in software development, Google, still does some things very differently from everyone else.
Stop services which run too well
Email is one of those well-known services that is basically as old as the Internet. When it first came into use, email delivery was much less reliable than it is today. Some may remember phone calls like “Did you receive my email yet?” “Nope!” “But I sent it three hours ago!” The email transport protocol, SMTP, has retries built in: if the receiving server is not available, the sending server keeps trying to deliver a message, typically for up to five days. After that, the message bounces. The designers of the protocol anticipated delivery issues.
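The retry-then-bounce behaviour can be sketched in a few lines of Python. This is only an illustration under assumptions, not how any real mail server is implemented: `deliver_with_retries`, `try_send`, and the fixed 30-minute retry interval are hypothetical, and waiting is simulated by advancing a counter instead of sleeping.

```python
from dataclasses import dataclass
from typing import Callable

GIVE_UP_AFTER_S = 5 * 24 * 3600  # five days, as in the text
RETRY_INTERVAL_S = 30 * 60       # assumption: fixed 30-minute retry interval


@dataclass
class DeliveryResult:
    delivered: bool
    attempts: int
    elapsed_s: int


def deliver_with_retries(
    try_send: Callable[[int], bool],
    retry_interval_s: int = RETRY_INTERVAL_S,
    give_up_after_s: int = GIVE_UP_AFTER_S,
) -> DeliveryResult:
    """Call try_send(elapsed_seconds) until it succeeds or the give-up time passes."""
    elapsed = 0
    attempts = 0
    while elapsed <= give_up_after_s:
        attempts += 1
        if try_send(elapsed):
            return DeliveryResult(True, attempts, elapsed)
        # Simulated wait; a real mail queue would reschedule the message instead.
        elapsed += retry_interval_s
    # Give-up time reached: the caller would now send a bounce to the author.
    return DeliveryResult(False, attempts, elapsed)


if __name__ == "__main__":
    # Hypothetical destination that becomes reachable after three hours.
    print(deliver_with_retries(lambda t: t >= 3 * 3600))
    # Hypothetical destination that never comes back: the message bounces.
    print(deliver_with_retries(lambda t: False))
```

The point of the design is that the sender never assumes the destination is reachable right now; it only assumes that it will be reachable at some point within the give-up window.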
Today, basic internet services are much more reliable, and we have become used to relying on that. So why would you intentionally pull the plug on a flawlessly running service? Isn’t that insane? In fact, Google does exactly that!
Some large companies not only make contracts with their customers about the quality and availability of the services they provide, they also do this for internal services. Google is no exception. These contracts are called Service Level Agreements (SLAs). The way Google designs them is documented in Chapter 4 of the SRE book. A set of measurable performance indicators (Service Level Indicators, SLIs) is used to establish verifiable objectives (Service Level Objectives, SLOs) for a service, and the user-facing objectives are exposed as SLAs. SLAs are much more fine-grained than a simple “is up” or “is down”. If consumers of a service get time-outs on their requests, a service that reports being “up” in the sense of “the service is running” is practically useless. It’s like saying “Please come in!” when the door handle is broken. Availability is not just black or white.
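To make the difference between “the process is running” and “the service is usable” concrete, here is a minimal Python sketch of one possible availability indicator checked against an objective. The names `Request`, `availability_sli`, and `meets_slo`, the one-second timeout, and the 99.9% target are assumptions chosen for illustration; they are not taken from the SRE book.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    ok: bool           # did the server return a successful answer?
    latency_ms: float  # how long the consumer waited for it


def availability_sli(requests: Sequence[Request], timeout_ms: float = 1000.0) -> float:
    """Fraction of requests answered successfully within the timeout."""
    if not requests:
        raise ValueError("no requests observed, the indicator is undefined")
    good = sum(1 for r in requests if r.ok and r.latency_ms <= timeout_ms)
    return good / len(requests)


def meets_slo(sli: float, slo_target: float = 0.999) -> bool:
    """True if the measured indicator reaches the objective."""
    return sli >= slo_target


if __name__ == "__main__":
    # Hypothetical sample: 998 fast answers, one error, one answer after 5 seconds.
    sample = [Request(True, 120.0)] * 998 + [Request(False, 80.0), Request(True, 5000.0)]
    sli = availability_sli(sample)
    print(f"SLI: {sli:.4f}, meets 99.9% SLO: {meets_slo(sli)}")
```

Note that the slow answer counts against the indicator even though the server was technically “up” the whole time.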
No service is available all the time. Think about the email example. As long as the sending email server is unable to reach the destination server, it retries the delivery multiple times over a long period. The ability of a program to recover from errors is called “resiliency”, and making two systems independent of one another is called “loose coupling”. The sender does not assume that the destination is available whenever it wants to send. If you fail to implement your software in such a resilient way, you will not run into trouble as long as the service you consume is available. But even minor issues on the other side can then be fatal for your own program. To detect consumers that are unable to cope with an outage of a particular service, Google actually decides to turn off services from time to time, but only after they have met their SLA for the given timespan. Just to make sure that service users don’t feel too safe. For example, for Chubby, Google’s central coordination component, the tale of “The Global Planned Chubby Outage” is told in Chapter 4 of the SRE book:
SRE makes sure that global Chubby meets, but does not significantly exceed, its service level objective. In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system. In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added. Doing so forces service owners to reckon with the reality of distributed systems sooner rather than later.
Seen from the other end, this also means that Google achieves very good uptimes for its services.
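One way to read the rule quoted from the SRE book is as simple budget arithmetic: an objective below 100% leaves a certain amount of tolerated downtime per quarter, and whatever real failures have not used up is available for a controlled outage. The Python sketch below assumes a 90-day quarter and a hypothetical 99.9% target; the function names are made up for this example, and it is not Google’s actual procedure.

```python
QUARTER_S = 90 * 24 * 3600  # assumption: a quarter has 90 days


def allowed_downtime_s(slo_target: float, period_s: int = QUARTER_S) -> float:
    """Downtime in seconds that the objective tolerates within the period."""
    return (1.0 - slo_target) * period_s


def controlled_outage_s(
    slo_target: float,
    observed_downtime_s: float,
    period_s: int = QUARTER_S,
) -> float:
    """Unused downtime a planned outage could spend; 0.0 if nothing is left."""
    remaining = allowed_downtime_s(slo_target, period_s) - observed_downtime_s
    return max(0.0, remaining)


if __name__ == "__main__":
    budget = allowed_downtime_s(0.999)
    print(f"tolerated downtime per quarter: {budget / 60:.0f} minutes")
    left = controlled_outage_s(0.999, observed_downtime_s=20 * 60)
    print(f"left after a 20-minute real failure: {left / 60:.0f} minutes")
```

If real failures have already consumed the whole budget, the function returns zero and no outage needs to be synthesized, which matches the condition in the quote.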
Now, what about you? What’s the first essential service you will take down in your company? First, of course, you would need SLAs to judge how well it performs. Then you need buy-in from everyone to take the plunge and actually stop it.