


How to… Find difficult faults

It was blown over.

Every now and again in the world of IT there is a fault where, no matter what people try, they just can’t find the answer. Often in my world these faults seem to find their way out of the operational teams and land with me.

This one comes in as an intermittent switch reboot. Not just any switch: this one sits at the heart of a rather large client’s datacentre, and at this point in their IT history server connectivity resilience was only just being thought about and gradually implemented, so the impact was rather noticeable. In fact it was bordering on catastrophic.

A bit of investigation later and yes, it’s an odd one. One moment everything is brilliant, the next the switch is in a reboot cycle. No reason logged for the reboot, just a reboot. It is not regular but intermittent: it can be fine for days or weeks, then just go.

Now, for the unfamiliar, let me just say that these things do not happen on these particular devices. This is a Cisco 6500 with a Supervisor 2 and a selection of linecards. These things run for years without any intervention. I have installed hundreds in my time, so I know what I am dealing with.

The call has really come in because there have been repeated failures over the last few days; the customer is starting to get a little nervous and, in my opinion, rightly so.

This particular datacentre is only half an hour or so away, so I decide to do the thing that others don’t and get some eyes, and a physical presence, on site, with the aim of reassuring the customer that we do care and will do all we can to help.

I find myself, therefore, in a typically noisy, cold datacentre, sitting on the floor with cables all over the place, looking at a laptop screen which is telling me nothing much of use.

It must have been a good few hours in when a customer representative comes over to me asking for an update. What can I say but the truth? I can see nothing untoward; everything seems fine. It is at this point that he takes a deep breath and leans on the neighbouring rack, and the switch reboots.

As per previous experience there is no log of anything untoward. Nothing sinister, just nothing. I have no choice but to go for the only thing I know happened prior to the fault and ask him to lean on the rack again. Down goes the switch again, and now I have something. Something odd, admittedly, but I definitely have something.

Now it is time for some thinking. Something comes back to me from years ago, aided by a blanking plate halfway down the chassis.

Now we need to go back a number of years, to a colleague and me doing a fairly simple job: inserting a new linecard into this same chassis to give it a few more ports. This is a zero-outage hot insertion, but we cover ourselves with change control regardless. On this occasion it did not go to plan. On inserting the card and seating it into the chassis, the chassis rebooted. Then rebooted again, and continued to reboot. This is a seriously bad scenario. We power it off and remove the card.

Looking at the backplane with a torch, we can see the entire connector block of pins has been crushed by the card insertion. This should not and normally does not happen. In a moment of desperation we carefully perform surgery on the chassis, pulling each pin straight again with some long-nose pliers, take a deep breath and power back up. Amazingly, all is fine. The blanking plate goes on and we live to fight another day.

Back to the present day, and I am now thinking about the movement of the rack plus what I now remember about that old fault. What is going on? Gradually the pieces come together. The rack itself does not have its ‘feet’ on the ground; in fact, the more I look at it, the more it all seems a bit out of kilter. A bit of crude tape measuring and yes, the rack guides are out of line too. By quite a bit.

Now I look around further, questioning everything I see. Why are there so many floor tiles up? To strengthen the floor. Why? It moves. Why? Ah well, this was never originally designed as a datacentre, and actually the whole building moves a bit in the wind; we have been re-engineering the building over the last few months.

So yes, reader, I am actually saying I had a £60-100k switch being knocked over by a combination of bad rack installation and the wind.

We swapped out the chassis shortly afterwards and sent it for diagnostics; indeed, it was skewed by quite some margin. Then begins the “whose fault is this?” debate, and that is not for this story.



Learning point for everyone:

I have discussed previously that fault finding is not always about the obvious; it is about searching for the change, the thing that you don’t know, the external impacting entity.

Fault finding is, however, also about having a true history of faults against an asset and being able to assess it. On this occasion I didn’t have that until my own memories came back to me. Why didn’t the service desk tell me? Because during that original fault I gave them scant information, as I was just dealing with it all myself. So it was my own fault.

When you have faults and are troubleshooting or fixing them, verbose notes on the actions you are taking are critical to the future operation of that asset. It may seem more important to fix the item than to document the solution, but in truth both are critical to ensuring quality in a long-term operation. Skimping on this detail will increase risk and operational cost in the longer term.
