Last week, a number of services hosted in Google Cloud suffered a dramatic outage. Following a maintenance glitch, services like YouTube, Shopify, Snapchat, and thousands of others became unavailable or very slow to respond. Overall, the services were down for more than four hours before the platform's availability was finally restored.
The curious thing about this incident was not the outage itself (stuff happens), but the circumstances that made it last that long. Cloud service providers, as a rule, aim for the highest levels of availability, which are enshrined in their SLAs. So how could one of the leading global computing platforms be taken down for more than four hours? Happily, Google is very good at debriefing its failures, so we can have a sneak peek at what actually happened behind the scenes.
It all started with a few computing nodes that needed to undergo routine maintenance and thus had to be temporarily removed from the cloud – a common day-to-day activity. And then something went wrong. Due to a glitch in the internal task scheduler, many more worker nodes were mistakenly dismissed, drastically reducing the total throughput of the platform and causing a Chertsey-style gridlock.
Ironically, Google did everything right, exceptionally right. They had considered that risk at the design stage. They had a smart recovery mechanism in place that should have kicked in to recover from the glitch and provide the necessary continuity. The problem was that the recovery mechanism itself was supposed to be run by the faulty scheduler. Being a system-management task with a lower priority than the affected production services, it was pushed far back in the execution queue. And since the queue was miles long by that time, the recovery service in the choking cloud never got its time slice.
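To make that failure mode concrete, here is a minimal sketch in Python – emphatically not Google's actual scheduler, just a toy strict-priority queue – showing how a low-priority recovery task starves whenever higher-priority production work arrives faster than it can be drained:

```python
import heapq

# A strict priority queue: lower number = higher priority.
queue = []

# The recovery task is queued with a lower priority than production work.
heapq.heappush(queue, (10, "recovery-service"))

for tick in range(5):
    # More production tasks arrive each tick than the scheduler can run.
    for i in range(3):
        heapq.heappush(queue, (1, f"production-task-{tick}-{i}"))
    # Only two execution slots per tick -- always taken by priority-1 work.
    for _ in range(2):
        priority, name = heapq.heappop(queue)
        print(f"tick {tick}: running {name}")

# The recovery task is still sitting in the ever-growing queue.
print("still waiting:", [name for _, name in queue if name == "recovery-service"])
```

Translated back to the real incident: as long as the backlog of production work keeps growing, a recovery task that competes for the same scheduler on the same terms will simply never run.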
Are there any lessons we can learn from this incident? There are myriads: the deeper your knowledge of cloud infrastructures, the more conclusions you can draw. A security architect can draw at least the following two:
1. Backing up systems is a process, not a one-off task. Your backup routine might have worked at the time you set it up, but things break, media dies, and passwords change. Don't take the risk: go and test your backups now – emulate a disaster, pull that cord, and see whether your arrangements are capable of providing continuity. Don't be tempted to just check the scripts – try the actual process in the field. Put this check on your schedule and make it a routine (see the sketch after this list).
2. When designing a backup or recovery system, take extra care to minimize its dependencies on the system being recovered. It is worth remembering that modern digital environments are very complex, and you might need to be quite imaginative to recognise all possible interdependencies. The recovery system should live in its own world, with its own operating environment, connectivity, and power supply.
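Here is what point 1 can look like in practice – a minimal, hypothetical restore drill. The layout is assumed (gzipped tar archives in a backup directory plus a JSON manifest of SHA-256 checksums); adapt the paths and the verification logic to your own environment, and run it on whatever schedule you already trust:

```python
#!/usr/bin/env python3
"""A hypothetical backup restore drill -- a sketch, not a drop-in tool."""
import hashlib
import json
import tarfile
import tempfile
from pathlib import Path

# Assumed layout: .tar.gz archives plus a manifest of expected checksums.
BACKUP_DIR = Path("/var/backups/app")        # hypothetical path
MANIFEST = BACKUP_DIR / "manifest.json"      # {"relative/path": "sha256-hex", ...}

def restore_and_verify(archive: Path) -> bool:
    """Restore the archive into a scratch directory and check every file."""
    expected = json.loads(MANIFEST.read_text())
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)
        for rel_path, digest in expected.items():
            restored = Path(scratch) / rel_path
            if not restored.is_file():
                print(f"MISSING: {rel_path}")
                return False
            if hashlib.sha256(restored.read_bytes()).hexdigest() != digest:
                print(f"CORRUPT: {rel_path}")
                return False
    return True

if __name__ == "__main__":
    # Drill against the most recent archive; fail loudly if anything is off.
    latest = max(BACKUP_DIR.glob("*.tar.gz"), default=None)
    if latest is None or not restore_and_verify(latest):
        raise SystemExit("Backup drill FAILED -- fix it before you actually need it")
    print(f"Backup drill passed for {latest.name}")
```

And, in the spirit of point 2, run the drill from a machine that does not depend on the system being backed up – otherwise the check itself may become unavailable exactly when you need it.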
It is very easy to fall into the trap of an untested, entangled recovery setup, because it gives us the illusory peace of mind we crave. We know the system is there for us, and we sleep well at night. We know that should a bad thing happen, it will lend us its shoulder. We only realise it won't when it is too late to do anything about it.
Just as I was writing this, a friend called me with a story. She had gone on an overseas trip and, while there, wanted to Skype home. Skype, however, having noticed that her IP address was unusual, applied extra security and sent her a verification e-mail. It would all have ended there, if only her Skype account hadn't been bound to a very old e-mail account at an ISP that was blocked in that country for political reasons – so she couldn't get to her inbox to confirm her identity. Luckily it was just Skype, and luckily she knew about VPNs – but things might have become far more complicated with a different, life-critical service.
So, really, you never know how a cow may catch a hare. There are far too many factors that can kick in unexpectedly, and, worst of all, unknown unknowns are among them. Still, by applying the above two approaches wisely and persistently, you can reduce the risks to a negligible level, which is well worth the effort.
Picture credit: danielcheong1974