I had been thinking for quite some time about which topic to start my blog with, and then a topic suddenly came up by itself.
Yesterday we ran into a sudden issue in our component library. Due to a three-year-old typo in a low-level piece of code, the library turned out to be, so to speak, not entirely leap-year-friendly. Once every four years, on the 29th of February, the typo came into effect and altered the behaviour of the containing function. The function started producing wrong results at 00:00 on February 29 and kept doing so until 23:59:59, returning to normal with the first second of March (all times UTC). The most unpleasant part was that the error propagated up to a higher-level piece of the API, blocking a good share of the product's functionality. As a result, our day started with angry (and quite understandably so) customers turning up at all our support channels, sharing their dissatisfaction and demanding a solution.
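I won't reproduce the actual code here, but to illustrate the class of bug, here is a purely hypothetical C sketch (the names and the routine itself are made up for this post, not taken from our library): a date-validation function whose leap-year branch is missing behaves perfectly on every other day of the year and misfires only on February 29, which is exactly how such a typo can sit unnoticed for years.

```c
#include <stdbool.h>
#include <stdio.h>

static bool is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

static const int days_in_month[] = { 31, 28, 31, 30, 31, 30,
                                     31, 31, 30, 31, 30, 31 };

/* Buggy: the leap-year branch is missing, so February 29 of any year
   is treated as out of range -- and every caller that handles "today"
   gets a wrong result, but only once every four years. */
static bool is_valid_date_buggy(int year, int month, int day)
{
    (void)year;                                /* the year is never consulted */
    if (month < 1 || month > 12 || day < 1)
        return false;
    return day <= days_in_month[month - 1];
}

/* Fixed: February gets one extra day in leap years. */
static bool is_valid_date_fixed(int year, int month, int day)
{
    if (month < 1 || month > 12 || day < 1)
        return false;
    int max_day = days_in_month[month - 1];
    if (month == 2 && is_leap_year(year))
        max_day = 29;
    return day <= max_day;
}

int main(void)
{
    /* On February 29 the two versions disagree; on every other day
       they behave identically, which is why the typo could survive
       for three years. */
    printf("2016-02-29  buggy: %d  fixed: %d\n",
           is_valid_date_buggy(2016, 2, 29),    /* 0: rejected */
           is_valid_date_fixed(2016, 2, 29));   /* 1: accepted */
    return 0;
}
```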
To make a long story short, that was followed by a fairly busy day and most of the night. Thanks to the selfless efforts of our team, we managed to put our emergency procedures into action and come up with first a temporary and then a permanent solution for our customers. Now that the customers can relax and sleep well, we can catch our breath and draw some initial conclusions.
The first conclusion is that however unlikely an issue is, it still can happen. Yesterday's issue was caused by a combination of different factors. The typo shouldn't have been there in the first place. Even if it was, it should've been caught by the QA routine (more on that below). Even if it wasn't caught by QA, there was a fuse that was supposed to prevent the error from affecting any higher-level components. The fuse, alas, didn't work either.
This was compounded by the absence of our primary build person from the company premises due to their day off, and by the fact that the 29th of February fell on a Monday this year. Had it fallen on a Tuesday or any other day later in the working week, we'd have discovered the problem much earlier, as our US colleagues would still have been at work when the problem started exposing itself.
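Coming back to the QA point: in hindsight, a handful of boundary-date tests run against fixed, awkward dates (rather than against "today") would have flagged the typo years before it fired in production. Again, this is only a sketch, reusing the hypothetical is_valid_date_fixed() from above rather than showing our real test suite:

```c
#include <assert.h>

/* Boundary-date checks of the kind that would have caught the bug early.
   is_valid_date_fixed() is the hypothetical routine from the sketch above,
   standing in for whatever date function the library really uses. */
static void test_leap_year_boundaries(void)
{
    assert( is_valid_date_fixed(2016, 2, 29));   /* leap year            */
    assert(!is_valid_date_fixed(2015, 2, 29));   /* ordinary year        */
    assert(!is_valid_date_fixed(1900, 2, 29));   /* century, not a leap  */
    assert( is_valid_date_fixed(2000, 2, 29));   /* 400-year rule        */
    assert( is_valid_date_fixed(2016, 2, 28));   /* day before           */
    assert( is_valid_date_fixed(2016, 3,  1));   /* day after            */
}
```

The point is not these six lines in particular, but the habit of pinning tests to the calendar's awkward corners instead of relying on whatever date the build machine happens to be running on.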
Therefore, be prepared. Draw up an emergency plan, and check and update it regularly. Be prepared for the bad. Be prepared for the worst you can imagine, and for even worse than that. Don't expect the bad and the good to trade off at some 'average failure'; assume that all the worst things will happen at once.
Second, create backup functions. By concentrating a particular business function in the hands of one person or department, you take on a huge risk of losing that function if that person or department becomes unavailable. There is no need to imagine disastrous pictures of a PM run over by a bus or a department catching fire: a broken car, a poorly child, an Internet cable accidentally cut by a gardener, or something as simple as the responsible person's day off, as it was in our case, is quite enough to lose vital time. As we encourage the members of our team to share their knowledge and skills with each other (though 'encourage' is hardly the right word: here at EldoS we are all passionate about sharing our knowledge and learning new things, so basically all we have to do is not get in the way), we managed to find a competent replacement for the build person quickly and launch the build process once the broken functionality was fixed.
If there is no way to back up a particular function, try to create a contingency plan that offers a temporary solution until the function is restored.
Aim for the organisation to be able to perform most of its critical functions even under a severe shortfall of available personnel. You never know when a problem will happen or which functions will become unavailable.
Third, communicate. For a customer facing a problem with your product, there is nothing worse than uncertainty. Tell your customers everything you know about the problem, in as much detail as possible. Let them know the estimated time scales for a fix or solution to become available. Tell them what kind of consequences to expect. Don't try to hide anything: it will most likely become evident anyway, and you will lose your customers' trust.
Create a prioritized backlog of customers affected by the issue, based on the scale, criticality and urgency of the problem for each of them. Handle those in a critical situation individually. Consider whether you can create a bespoke solution for them more quickly. Sometimes a dumb and cumbersome workaround, like the advice to move the computer clock a day ahead in our case, can prove a viable temporary solution for some of your customers until the proper update is prepared and deployed.
Fourth, don’t stop once all your customers are up and running again. Treat every fault as an opportunity to review and improve your processes and procedures. Don’t just search for similar issues and fix them; ask yourself: are there any flaws in the way you create your product that could have triggered the issue? Are you and your customers completely happy with the response times? Are they happy with the form in which the fix was provided? Is there anything you can do to prevent anything similar from happening in the future, to reduce the scale of the impact, or to speed up the delivery of the fix?
Bad things do happen, and often, despite our constant efforts to prevent them, there is little we can do about that. However, once a bad thing has happened, the best and most reasonable thing we can do (apart from dealing with the consequences, of course) is to learn from it and use the new experience to improve our processes, ending up with a much better product and customer experience.