The Dropped Washer Effect

One of these buildings can melt your car. Can you spot the culprit?

Have you ever come across a situation where something utterly minor and negligible became the cause of a major disruption, or even an accident? Such as a small crack in an underground water pipe, dripping inconspicuously for a couple of years and eventually causing a landslide once a critical mass of water had accumulated? Or a seemingly ordinary glass building capable of focusing sunlight so intensely that it melts the bodywork of cars parked nearby?

If so, chances are high that you observed an example of the Dropped Washer effect. Named after a Boeing 737 accident in Okinawa, Japan, the dropped washer effect describes large-scale adverse events brought about by causes of incomparably lower significance. The unfortunate Boeing burned out completely because of a missing slat-mechanism washer, 0.625 inches wide, that the engineering crew had forgotten to replace after the aircraft’s last service.

One characteristic of potential dropped-washer features that makes them particularly nasty is their zero perceived value for the business. Offering no added opportunities and presenting no apparent risks to the product, they often do not even exist in the minds of the product stakeholders. This peculiarity makes it all too easy for them to slip past every safety measure employed in modern production flows – from risk assessment to quality control.

Happily, in many cases there are techniques that can help increase our chances of spotting and eliminating dropped washers in our projects.

Check out my new paper here.

Picture credit: Reuters

Check Your Backups, Now

Last week, a number of services hosted on Google Cloud suffered a dramatic outage. Following a maintenance glitch, services like YouTube, Shopify, Snapchat, and thousands of others became unavailable or very slow to respond. Overall, the services were down for more than four hours before the platform’s availability was finally restored.

The curious thing about this incident was not the outage itself (sweet happens), but the circumstances that made it last so long. Cloud service providers, as a rule, aim for the highest levels of availability, which are enshrined in their SLAs. So how could one of the leading global computing platforms be taken down for more than four hours? Happily, Google is very good at debriefing its failures, so we can have a sneak peek at what actually happened behind the scenes.

It all started with a few computing nodes that needed to undergo routine maintenance and thus had to be temporarily removed from the cloud – a common day-to-day activity. And then something went wrong. Due to a glitch in the internal task scheduler, many more worker nodes were mistakenly dismissed – drastically reducing the total throughput of the platform and causing a Chertsey-style gridlock.

Ironically, Google did everything right, exceptionally right. They had considered that risk at the design stage. They had a smart recovery mechanism in place that should have kicked in to recover from the glitch and provide the necessary continuity. The problem was that the recovery mechanism itself was supposed to be run by the faulty scheduler. Being a system-management task with a lower priority than the affected production services, it was pushed far back in the execution queue. And since the queue was miles long by that time, the recovery service in the choking cloud never got its time slice.

Are there any lessons we can learn from this incident? There are myriads; the deeper your knowledge of cloud infrastructures, the more conclusions you can draw. A security architect can draw at least the following two:

1. Backing up systems is a process, not a one-off task. Your backup routine might have worked at the time you set it up, but things break, media dies, and passwords change. Don’t take the risk: go and test your backups now – emulate a disaster, pull that cord, and see whether your arrangements can actually provide continuity. Don’t be tempted to just check the scripts – try the actual process in the field (see the sketch after this list). Put this check on your schedule and make it a routine.

2. When designing a backup or recovery system, take extra care to minimize its dependencies on the system being recovered. It is worth remembering that modern digital environments are very complex, and you might need to be quite imaginative to recognise all possible interdependencies. The recovery system should live in its own world, with its own operating environment, connectivity, and power supply.
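To make point 1 concrete, below is a minimal sketch of such a drill in Java. Everything in it is an assumption for illustration: the paths, the file layout, and the idea that a digest of the data was recorded at backup time. A real drill would go further and actually bring a service up against the restored data.

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

// Hypothetical restore drill: restore the latest backup into a scratch
// location and verify it against the digest recorded at backup time,
// instead of merely trusting that the backup job reported success.
public class RestoreDrill {
    public static void main(String[] args) throws Exception {
        Path backup   = Path.of("/backups/latest/app-data.tar.gz");  // assumed layout
        Path manifest = Path.of("/backups/latest/app-data.sha256");  // assumed layout
        Path scratch  = Files.createTempDirectory("restore-drill");

        // 1. Pull the backup the same way a real recovery would.
        Path restored = Files.copy(backup, scratch.resolve(backup.getFileName()));

        // 2. Verify integrity against the recorded digest.
        String expected = Files.readString(manifest).trim();
        byte[] digest = MessageDigest.getInstance("SHA-256")
                                     .digest(Files.readAllBytes(restored));
        String actual = HexFormat.of().formatHex(digest);

        if (!expected.equalsIgnoreCase(actual)) {
            throw new IllegalStateException("Restore drill failed: digest mismatch for " + backup);
        }
        System.out.println("Restore drill passed for " + backup);
    }
}

Put a run like this on a schedule, and a dead disk, an expired credential, or a silently failing backup job announces itself on a quiet weekday rather than in the middle of an outage.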

It is very easy to get caught in the untested-backup trap, as it gives us the imaginary peace of mind we’re craving. We know that the system is there for us, and we sleep well at night. We know that should a bad thing happen, it will lend us its shoulder. We only realise it won’t when it’s too late to do anything about it.

Just as I was writing this, my friend called me with a story. She went on an overseas trip and, while there, wanted to Skype home. Skype, however, having noticed that her IP address was unusual, applied extra security and sent her a verification e-mail. It would all have ended there, if only her Skype account hadn’t been bound to a very old e-mail account at an ISP that was blocked in that country for political reasons – so she couldn’t get to her inbox to confirm her identity. Luckily it was just Skype, and luckily she knew about VPNs – but things might have become far more complicated with a different, life-critical service.

So, really, you never know how a cow may catch a hare. There are far too many factors that may kick in unexpectedly, and, worst of all, unknown unknowns are among them. Still, by applying the two approaches above wisely and persistently, you can reduce the risks to a negligible level, which is well worth the effort.

Picture credit: danielcheong1974

That is no question

Back in 1854, the renowned mathematician George Boole was the first to describe the concepts of algebra and logic over a binary field, which were eventually named after him and are now regarded as one of the pillars of the information age.

The power and universality of the foundations that Boole’s work gave to IT engineers and scholars had one adverse effect, though. The Boolean landed such a major role in software development tools and in developers’ minds that the concept started to be abused and misused, employed in scenarios for which it wasn’t exactly fit.

For as long as software programming was primarily a transcription of logical chains into English words, and consisted largely of unequivocal instructions like ‘is the value stored in CX greater than zero?’, everything worked well.

And then everything went out of sync. From around the 1970s, software programming started making its way up to higher, much higher abstraction layers. C arrived, followed by OOP and C++, and then Java, Python, and Ruby. The complexity of programming tasks skyrocketed. No one cared about the contents of CX anymore. The questions programmers answered in their code started to resemble the non-trivial, day-to-day questions we come across in real life. Yet the tools in the box, despite looking smart, shiny, and new, remained largely the same.

Let me ask you a simple question.

Can the outcome of a friend-or-foe identification – e.g. that of an aircraft – be represented with a Boolean type?

What could be easier, at first glance: the aircraft is either a friend or a foe, right?

Wrong. There are at least two more possible outcomes: “the aircraft has not been positively identified (it can be either friend or foe)” and “no aircraft has ultimately been found.” Those two outcomes are no less important than the ‘primary’ ones and, if ignored, may lead to erroneous or even catastrophic decisions.
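In a language like Java, the honest way to model this is not a boolean but a dedicated type that names every outcome explicitly. A minimal sketch (the type and constant names are mine, purely for illustration):

// All four outcomes of the identification attempt, spelled out.
// A boolean could carry only the first two and would silently drop the rest.
enum IdentificationResult {
    FRIEND,        // positively identified as a friend
    FOE,           // positively identified as a foe
    UNIDENTIFIED,  // an aircraft was found but could not be positively identified
    NOT_FOUND      // no aircraft was found at all
}

The point is not the enum itself but the act of listing the outcomes: the two ‘inconvenient’ ones now exist in the code and have to be dealt with.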

If you answered yes, don’t be too hard on yourself. The human brain is a skilful optimizer. Despite often being referred to as ‘intelligent’, when left to its own devices it actually does everything in its power to think less. It operates an impressive arsenal of corner-cutting techniques, such as question substitution, simplification, framing, priming, and around a hundred others, to avoid actual thinking in favour of pattern-based decisions.

And this doesn’t marry well with the Boolean type. The problem with Boolean is that it offers the illusion of an obvious answer, suggesting a simple choice between two options where there is no actual choice at all, or where there might be something beyond that choice.

Working hard to optimize its decision-making process, our brain celebrates the chance to substitute the whole set of outcomes with an easier choice between two opposites: yes-or-no, friend-or-foe, right-or-left, good-or-bad, a-boy-or-a-girl. Inspired by the simplicity of the answer, the analytic part of our brain gives up and accepts the choice – even if the two opposites together cover only a fraction of the whole variety of outcomes.

Development environments kindly assist the irrational part of our brain by providing the tools. I find it amusing that, in line with the evolution of programming languages, the Boolean was given an increasingly prominent presence: from none in assembly language, through an int-emulated surrogate in C, to a dedicated type in C# and Java. That is, as software developers had to deal with ever vaguer questions, the development frameworks kindly offered ever simpler answers.

“Wait,” a smart programmer would say, “what about exceptions? What about nullable types? Aren’t those supposed to deal with everything that goes beyond true and false?”

In some scenarios they do – and in others they don’t. Exceptions may work well where there clearly is a yes-or-no choice that falls in, and a marginal alternative that falls out. The problem is that in many instances there is no yes-or-no choice at all, but our little grey cells tell us there is. Apart from that, exceptions are an opt-in technique for our brain: something that needs to be considered proactively – and therefore they will be among the first to be ‘optimized’ away and neglected. How many programmers do you personally know who do exception handling right? And how many do you know who don’t?
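The same goes for nullable types. Here is a quick Java sketch of the opt-in problem (the identification routine and the contact name are hypothetical). The third state is there, but nothing forces anyone to look at it: forget the null branch, and the missing case resurfaces as a runtime crash in the field rather than as a compile-time obligation.

// A nullable (boxed) Boolean gives three states: true, false, and null.
// That is still not four, and handling the third one is entirely voluntary.
class OptInDemo {
    // Hypothetical identification routine: null means "could not identify".
    static Boolean identify(String contact) { return null; }

    static void decide(String contact) {
        Boolean isFriend = identify(contact);
        if (isFriend == null) {
            // the branch everyone is tempted to skip
            System.out.println("Needs manual identification: " + contact);
        } else if (!isFriend) {
            // Without the null check above, unboxing null here would throw
            // a NullPointerException at runtime, in the field, not at compile time.
            System.out.println("Engage: " + contact);
        }
    }

    public static void main(String[] args) { decide("bogey-42"); }
}

And even then, the fourth outcome from the example above simply has no value it can be mapped to.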

And so it goes. It’s Friday, well after 7pm. Only a programmer and a QA guy remain in the deserted office. Their deadline passed a few days ago, and they are rushing to finish the system tonight. The programmer starts typing ‘if…’ and stops for a moment. He quickly glances at the bottom-right corner of his screen: 7:53pm. He sighs, takes a sip of his cooled-down tea, and completes the line:

if (!friend) { missile.launch(); }

His code is complete now. He commits the changes, writes a quick note to the client, and drives home to join his family for a late dinner. The QA chap runs a quick round of positive tests and follows suit.

You already know what happened next.

* * *

This story is not about negligent programmers. Rather, it is about the dangerous mix of the peculiarities of the human mind and the conveniences offered by modern development environments, which together give rise to serious logical errors in programs.

Most real-life questions that arise on the uneven ground under our feet have no black-or-white answers. Yet, for many of them, it is all too easy to get caught in the trap of narrowing the whole set of answers down to two mutually exclusive absolutes. The narrower the gap between the programmer’s way of thinking and the ordinary human’s becomes, the more clearly this problem exposes itself in the software development profession.
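For contrast, here is a hedged sketch of how that Friday-evening decision could have looked once all the outcomes are acknowledged. The names are hypothetical and reuse the IdentificationResult idea from the earlier sketch; a Java 14+ switch expression over an enum must be exhaustive, so none of the cases can quietly disappear at 7:53pm.

class FireControlSketch {
    enum IdentificationResult { FRIEND, FOE, UNIDENTIFIED, NOT_FOUND }

    // The compiler rejects this switch expression if any outcome is left
    // unhandled, so the 'inconvenient' cases cannot be optimized away by
    // a tired brain.
    static String decide(IdentificationResult result) {
        return switch (result) {
            case FRIEND       -> "stand down";
            case FOE          -> "escalate to a human operator";  // still not missile.launch()
            case UNIDENTIFIED -> "keep tracking, request manual identification";
            case NOT_FOUND    -> "widen the search";
        };
    }
}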

So the next time you are tempted to model some characteristic as a boolean, do make the effort to ask yourself: does this choice really have only two possible options? Have I neglected any important outcomes? Is my mind trying to cut corners and take advantage of me?

Because it most certainly will.

Picture credit: mindfulyourownbusiness.com

7 Security Mistakes Boeing Made

The story of the two recent Boeing 737 MAX crashes is packed with questions we have yet to find answers to, but it is already clear that the distinctive feature of the double tragedy is the overwhelming number of gross blunders – far more than you would expect in a field as attentive to security and safety as commercial aviation.

While we don’t know all the details of the crashes yet, what we do know points to a number of grievous security flaws:

  • a security feature as a paid option, not by default: Boeing charged airlines extra for sensor-discrepancy detectors; neither the Lion Air nor the Ethiopian aircraft had them installed;
  • hiding information: Boeing hid from 737 pilots the fact that their new aircraft featured a new system, MCAS, which could quietly intervene and override the pilots’ control of the aircraft;
  • ignoring feedback: MAX pilots complained to the FAA about issues with the aircraft’s in-flight behaviour, but those complaints were largely silenced or ignored;
  • no safeguards against MCAS failure: this has not been officially confirmed, but it looks like pilots could not switch MCAS off when they needed to, effectively leaving them unable to fly the aircraft fully manually to recover from an MCAS or sensor failure;
  • creating workarounds rather than fixing bugs: MCAS was introduced to counter the MAX’s tendency to raise its nose, caused by the changes in the aircraft’s aerodynamics that came with its bigger engines. In other words, MCAS is effectively a ton of BBQ sauce poured onto an over-peppered steak, rather than a well-seasoned steak cooked properly from the very start;
  • conflict of interest: it appears that a great deal of the safety testing of the new aircraft was performed by its very creators;
  • trust compromise: this is by far the grossest mistake made by Boeing and the FAA, and one that might well affect the success of the whole MAX family and of its freshest 777X machine, which was quietly (guess why) introduced two days ago. While the whole world was grounding its MAX fleets, Boeing chose the tactic of hushing the matter up, denying any allegations, and refusing to admit similarities between the Lion Air and Ethiopian crashes. The only statement from them that made sense was about introducing a vague ‘software update.’ Most tellingly, by Boeing’s own account, that prospective change had been in the works well before the second crash.

I feel incredibly sorry for those who lost their friends and relatives in the crashes, and I feel sorry for the designers of the MAX, which is without doubt a great aircraft. I only hope that the investigation goes smoothly (however reluctant Boeing’s bosses apparently are for it to) and uncovers the full truth about the crashes. As sensible humans, the best we can do for those who lost their lives in the tragedy is to learn our lessons, write down all the mistakes that were made, and then do everything in our power to prevent anything similar from happening in the future.

Picture credit: Boeing