That is no question

Back in 1854 the renowned mathematician George Boole was the first to describe the concepts of algebra and logic over a binary field, which were eventually named after him and are now regarded as one of the pillars of the information age.

The power and universality of the foundations that Boole’s work gave to IT engineers and scholars had one adverse effect, though. The Boolean landed such a major role in software development tools and in developers’ minds that the concept started to be abused and misused, employed in scenarios for which it wasn’t exactly fit.

For as long as software programming was primarily a transcription of logical chains into English words and consisted largely of unequivocal instructions like ‘is the value stored in CX greater than zero?’, everything worked well.

And then everything went out of sync. Starting from around the 1970s, software programming made its way up to higher, much higher abstraction layers. C arrived, followed by OOP and C++, and then Java, Python, and Ruby. The complexity of programming tasks skyrocketed. No one cared about the contents of CX anymore. The questions programmers answered in their code started resembling the non-trivial, day-to-day questions we come across in real life. Yet the tools in the box, despite looking smart, shiny, and new, remained largely the same.

Let me ask you a simple question.

Can the outcome of a friend-or-foe identification – e.g. that of an aircraft – be represented with a Boolean type?

What could be easier, at first glance: the aircraft is either friend or foe, right?

Wrong. There are at least two more possible outcomes: “the aircraft has not been positively identified (it could be either friend or foe),” and “no aircraft has ultimately been found.” These two outcomes are no less important than the ‘primary’ ones, and, if ignored, may lead to erroneous or even catastrophic decisions.
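If we wanted to represent that honestly in code, a minimal sketch might look like this (Java; the type and value names are purely illustrative, not taken from any real system):

enum IdentificationResult {
    FRIEND,         // positively identified as friendly
    FOE,            // positively identified as hostile
    UNIDENTIFIED,   // a contact exists, but it could be either friend or foe
    NO_CONTACT      // no aircraft has ultimately been found
}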

If you answered yes, don’t be too hard on yourself. The human brain is a skilful optimizer. Despite being often referred to as ‘intelligent’, when left to its own devices it actually does everything in its power to think less. It operates an impressive arsenal of corner-cutting techniques, such as question substitution, simplification, framing, priming, and around a hundred others, to avoid actual thinking in favour of pattern-based decisions.

And this doesn’t marry well with the Boolean type. The problem with Boolean is that it offers the illusion of an obvious answer, suggesting a simple choice between two options where there is no actual choice, or where there might be something besides that choice.

Working hard on optimizing its decision-making process, our brain celebrates the chance to substitute the whole set of outcomes with an easier choice between two opposites: yes-or-no, friend-or-foe, right-or-left, good-or-bad, a-boy-or-a-girl. Inspired by the simplicity of the answer, the analytic part of our brain gives up and accepts the choice – even if the two opposites together cover only a fraction of the whole variety of outcomes.

Development environments kindly assist the irrational part of our brain by providing the tools. I find it amusing that, in line with the evolution of programming languages, Boolean was given an increasingly significant presence: from none in assembly language, through an int-emulated surrogate in C, to a dedicated type in C# and Java. That is, as software developers had to deal with ever vaguer questions, the development frameworks kindly offered ever simpler answers.

“Wait,” a smart programmer would say, “what about exceptions? What about nullable types? Aren’t those supposed to deal with everything that goes beyond true and false?”

In some scenarios they do – and in others they don’t. Exceptions may work well where there is clearly a yes-or-no choice that falls in, and a marginal alternative that falls out. The problem is that in many instances there is no yes-or-no choice at all, yet our little grey cells tell us there is. Apart from that, exceptions are an opt-in technique for our brain: something that has to be considered proactively – and therefore they will be among the first to be ‘optimized’ away and neglected. How many programmers do you personally know who do exception handling right? And how many do you know who don’t?
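A nullable Boolean, for one, buys us a single extra state at most, and that state has no name and no meaning of its own; remembering to check it is, again, opt-in. A sketch (the parameter name is hypothetical):

static void handleContact(Boolean identifiedAsFriend) {
    // forget the null check and the unboxing below throws at runtime
    if (identifiedAsFriend != null && !identifiedAsFriend) {
        // even here, "unidentified contact" and "no contact at all" look exactly the same
    }
}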

And so it goes. It’s Friday, well after 7pm. Only a programmer and a QA guy are left in the deserted office. Their deadline passed a few days ago. They are rushing to finish the system tonight. The programmer starts typing ‘if…’ and stops for a moment. He quickly glances at the bottom-right corner of his screen: 7:53pm. He sighs, takes a sip of his cooled-down tea, and completes the line:

if (!friend) { missile.launch(); }

His code is complete now. He commits the changes, writes a quick note to the client, and drives home to join his family for a late dinner. The QA chap runs a quick round of positive tests and follows suit.

You already know what happened next.

* * *

This story is not about negligent programmers. Rather, it is about the dangerous mix of the peculiarities of the human mind and the conveniences offered by modern development environments, which together give rise to serious logical errors in programs.

Most real-life questions that arise on the uneven ground beneath our feet have no black-or-white answers. Yet, for many of them, it is way too easy to get caught in the trap of narrowing the whole set of answers down to two mutually exclusive absolutes. And the narrower the gap between the programmer’s way of thinking and the ordinary human’s becomes, the more clearly this problem exposes itself in the software development profession.

So the next time you are tempted to think of some characteristic as a Boolean, do make an effort to ask yourself: does this choice really have only two possible options? Have I neglected any important outcomes? Isn’t my mind trying to cut corners and take advantage of me?

Because it most certainly will.
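For the record: written against the four-value outcome sketched earlier, the Friday-evening line could no longer hide behind a careless negation (names illustrative, as before):

// with an explicit outcome type, "not a friend" no longer silently means "foe"
if (identification == IdentificationResult.FOE) { missile.launch(); }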


Good News

One of the English translations of Victor Hugo’s words On résiste à l’invasion des armées; on ne résiste pas à l’invasion des idées reads as No army can stop an idea whose time has come. In our case, the army is even going to help promote such an idea instead of resisting it.

The Atlantic Council is set to host a discussion long awaited by me and by a solid crowd of experts in information security, business continuity, and cyber risk management. The Cyber Risk Wednesday: Software Liability discussion will take place on November 30th in Washington, DC.

The discussion will be dedicated to the difficult question of increasing the liability of software vendors for defects in their products, and the ways of balancing it against economic factors. Considering the extent to which software, in a variety of forms, infiltrates the inmost aspects of our lives (such as a smart house running a hot tub for you), as well as the extent to which we trust software to manage our lives for us (letting it run driverless cars and smart traffic systems), the question of liability is vital – primarily as a trigger for vendors to employ proper quality assurance and quality control processes. That’s why I wholly welcome the Atlantic Council’s initiative, and truly hope that it will help raise awareness of the problem and give a push to a wide public discussion of it.

Once upon a time on the twenty-ninth

I had been thinking for quite some time about which topic to start my blog with, but the topic suddenly came up by itself.

Yesterday we came across a sudden issue in our component library. Due to a three-year-old typo in a low-level piece of code, the library turned out to be, so to say, not entirely leap-year-friendly. Once every four years, on the 29th of February, the typo came into effect, altering the behaviour of the containing function. The function started producing wrong results at 00:00 on February 29 and kept doing so until 23:59:59, returning to normal with the first second of March (all times UTC). The most unpleasant part was that the wrong results propagated up to a higher-level piece of the API, blocking a good share of the product’s functionality. As a result, our day started with angry (and quite understandably so) customers at all our support channels, sharing their dissatisfaction and demanding a solution.
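I won’t reproduce the actual typo here, but a minimal sketch of the kind of one-character slip that can sleep in low-level date code for years might look like this (hypothetical Java, not our library’s code):

// Hypothetical sketch, not the actual library code: a date check where a
// one-character slip stays invisible for almost four years at a time.
static boolean isValidDate(int year, int month, int day) {
    int[] daysInMonth = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    if (month < 1 || month > 12 || day < 1) {
        return false;
    }
    boolean leap = (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
    // mistyping the 29 below as 28 would only ever surface on February 29 of a leap year
    int max = (month == 2 && leap) ? 29 : daysInMonth[month - 1];
    return day <= max;
}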

To make a long story short, that was followed by a fairly busy day and most of the night. Thanks to the selfless efforts of our team, we managed to employ the emergency procedures and come up with a temporary, and then a permanent, solution for our customers. Now that the customers can relax and sleep well, we can catch our breath and draw some initial conclusions.

The first conclusion is that however unlikely an issue is, it still can happen. Yesterday’s issue was caused by a combination of different factors. The typo shouldn’t have been there. Even if it was, it should’ve been caught by the QA routine. Even if it wasn’t caught by QA, there was a fuse that was supposed to prevent the error from affecting any higher-level components. The fuse, alas, didn’t work either.

This was topped up by the absence of our primary build person from the company premises due to their day off, and by the fact that the 29th of February fell on a Monday this year. Had it fallen on a Tuesday or any other weekday, we’d have discovered the problem much earlier, as our US people would still have been at work when the problem started exposing itself.

Therefore, be prepared. Prepare an emergency plan and check and update it regularly. Be prepared for the bad. Be prepared for the worst you can imagine – and for even worse than that. Don’t expect bad and good things to trade off at some ‘average failure’ – assume all the worst things will happen at once.

Second, create backup functions. By concentrating a particular business function in the hands of one person or department, you are taking on a huge risk of losing that function if that person or department becomes unavailable. There is no need to imagine disastrous pictures of a PM run over by a bus or a department catching fire – a broken car, a poorly child, an Internet cable accidentally cut by a gardener, or something as simple as the responsible person’s day off, as it was in our case, will be quite enough to lose vital time. As we encourage the members of our team to share their knowledge and skills with each other (I believe ‘encourage’ isn’t the right word here – here at EldoS we are all passionate about sharing our knowledge and learning new things, so basically all we do is not get in the way), we managed to find a competent replacement for the build person quickly and launch the build process once the broken functionality was fixed.

If there is no way to back up a particular function, try to create a contingency plan offering a temporary solution until the function is restored.

Aim for the organisation to be capable of performing most of its critical functions even under a severe shortfall of available personnel. You never know when a problem will happen or which of the functions will be unavailable.

Third, communicate. There is nothing worse than uncertainty for a customer facing a problem with your product. Tell your customers everything you know about the problem, in as much detail as possible. Let them know about any estimated time scales for the fix/solution to be available. Tell them what kind of consequences to expect. Don’t try to hide anything, as it will most likely become evident anyway, and you will lose your customers’ trust.

Create a prioritized backlog of the customers affected by the issue, based on the scale, criticality, and urgency of the problem for each of them. Handle those in critical situations individually. Think about whether you can create a bespoke solution for them quicker. Sometimes a dumb and cumbersome workaround – like the advice to move the computer clock a day ahead in our case – may turn out to be a viable temporary solution for some of your customers until the proper update is prepared and deployed.

Fourth, don’t stop once all your customers are up and running again. Treat every fault as an opportunity to review and improve your processes and procedures. Don’t just search for similar issues and fix them; ask yourself: are there any flaws in the way you create your product that could have triggered the issue? Are you and your customers totally happy with the response times? Are they happy with the form in which the fix was provided? Is there anything you can do to prevent anything similar from happening in the future, to decrease the scale of the impact, or to speed up the delivery of the fix?

Bad things do happen, and often, despite our constant efforts at preventing them, we can’t really do anything about that. However, once a bad thing has happened, the best and most reasonable thing we can do (apart from dealing with the consequences, of course) is to learn from it and use the new experience to improve our processes – ending up with a much better product and customer experience.