


How to… Gain enemies and reduce the wage bill

"Let's automate this..."

Being on callout has some benefits, primarily financial and, err... Nope, can’t think of any more.

Being on callout is also plain annoying. That film you were watching, forget it. Pub quiz and a few beers? Nope. Quiet night of uninterrupted sleep? Nope.

This used to grate on me. A lot. Almost every other night there would be a callout, or in those days a paging alert – “Please call service centre urgently, Priority 1 fault on your service”. Woohoo! Another hour or so of joy.

In fact, this annoyed me so much I took it upon myself to fix the problem. Our service ran on a resilient pair of Unix servers, with the software running in a primary/standby setup, so it shouldn’t really fail. But it did. Frequently.

I couldn’t fix the software itself, and moaning at the vendor wasn’t going to get results quickly, so it had to be something else. My thought process was simply, “When I get called out, what steps do I take to detect and remedy the fault condition?” I could then script that process to run automatically every minute, detecting a failure quickly and applying the remedies before anyone had a chance to call me out. Brilliant!
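As a rough illustration of that idea, here is a minimal sketch in Python of one detection pass. The original script is not shown in this post, so everything here is an assumption: the names `run_once`, `is_healthy`, `remedies` and `notify` are hypothetical, and the real health check and fix steps would depend entirely on the software involved.

```python
from typing import Callable, Optional, Sequence


def run_once(
    is_healthy: Callable[[], bool],
    remedies: Sequence[Callable[[], None]],
    notify: Callable[[str], None],
) -> Optional[int]:
    """Run one detection pass (all names here are hypothetical).

    Returns 0 if the service was already healthy, the number of the
    remedy stage that restored it, or None if every stage failed.
    """
    if is_healthy():
        return 0

    # Try each remedy in order, mildest first, re-checking after each one.
    for stage, remedy in enumerate(remedies, start=1):
        notify(f"Failure detected, script stage {stage} invoked")
        remedy()
        if is_healthy():
            return stage

    # Nothing worked, so a human is still needed.
    notify("Auto-fix failed, manual callout required")
    return None
```

Something like this could be run every minute from a scheduler such as cron, with `notify` wired up to whatever paging system is in use.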

Did it work? After a number of trials and failures, of course it did. Pages changed from calling me out to “Failure detected, script stage x invoked”. In fact it all worked rather too well. After a couple of months I got called in to my boss’s office for a discussion.

The basic gist was this – we had a three-person on-call rota paid at a very high rate (£x,000 per year) due to the importance of the software function, the impact when it was down and the known fragility of the service. My fixes had highlighted the large fee being paid to the team for no callouts at all. The auto-fix script had to be removed immediately, or we would be losing that pay within the month. Arrghh!

So, having made myself terribly unpopular by doing the right thing, I now had to go back to being annoyed by a software failure that didn’t need to be there and that could be auto-fixed. That’s nuts.

In reality, I ended up putting in a fast-detect and delayed auto-fix script to deal with the issue and keep the political world happy.
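For illustration only, a fast-detect, delayed-fix arrangement might look something like the sketch below. The post doesn’t say how long the delay was or how the real script tracked failures, so the 15-minute default and all the names (`DelayedAutoFix`, `tick`, `fix_delay_s`) are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DelayedAutoFix:
    """Detect failures immediately but hold the remedy back for a while.

    Hypothetical sketch: every name and the default delay are assumptions.
    """
    is_healthy: Callable[[], bool]
    remedy: Callable[[], None]
    notify: Callable[[str], None]
    fix_delay_s: float = 900.0           # assumed delay of 15 minutes
    first_seen: Optional[float] = None   # time the current failure was first seen

    def tick(self, now: float) -> str:
        """Run one pass at time `now` (in seconds) and report what happened."""
        if self.is_healthy():
            self.first_seen = None
            return "healthy"
        if self.first_seen is None:
            # Fast detect: raise the alert straight away, but don't fix yet.
            self.first_seen = now
            self.notify("Failure detected, auto-fix pending")
            return "detected"
        if now - self.first_seen >= self.fix_delay_s:
            # Delayed fix: only step in once the failure has persisted.
            self.notify("Delay elapsed, auto-fix invoked")
            self.remedy()
            self.first_seen = None
            return "remedy_invoked"
        return "waiting"
```

Calling `tick` once a minute would give detection within a minute while leaving a window in which the normal callout process still happens.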

Skip to a few months later and I had moved job, packaged up the callout allowance into standard salary and, just because I could, removed the script. Probably shouldn’t have done that bit in retrospect, but masking the problem doesn’t get it fixed properly.



Learning point for everyone:

Be careful with auto-fixes: you can make enemies. Deploying them at the outset, i.e. at the implementation stage, is an absolutely great idea if you want happier people, lower operational costs and a better end service. Let us never forget that in IT we are a service industry.

In fact, this is all such a good idea that Amazon has it as part of its core design/architecture standards. Now, it’s in there in part to draw your attention to some of the features unique to the Amazon cloud, thereby creating a bit of lock-in to the ecosystem, but I’ll forgive them if it allows some of the auto-scaling and fault recovery we really do need to be architecting with in IT today.

Business details

Registered company no. 11869849 
VAT number 317720513