How to… Take down a call centre

Know your limits...

Following on from the previous story, this is still part of the same job. The building houses a large call centre for some of the more prestigious customers of this company. In all there are about 400 people in the building, all busy working.

The day in question is the day before the start of a few evenings' migration activity from the historic LAN to the new LAN. All switches are in place and connected to each other, a stack of testing has been run through to prove resilience, and all is good and peaceful.

The building has had a whole new set of cables run through it, upgrading it from CAT 3 to CAT 5e cabling, and all patching has been pre-completed into the waiting new switches. The new switches are already interconnected with the old switches to ease the migration path, which will happen over the course of a few evenings in a very careful manner.

Why such care? Simply because the routing tables we have inherited are a mess. Simplifying them will be part of the migration, moving from a large number of discontiguous subnets to a more meaningful summarisation, plus a handful of other subnets for management purposes.
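As a rough illustration of what that summarisation means (the subnets here are made up and not the real address plan), Python's ipaddress module can collapse contiguous ranges into supernets:

    import ipaddress

    # Hypothetical subnets standing in for the inherited, discontiguous address plan.
    inherited = [
        ipaddress.ip_network("10.20.0.0/24"),
        ipaddress.ip_network("10.20.1.0/24"),
        ipaddress.ip_network("10.20.2.0/24"),
        ipaddress.ip_network("10.20.3.0/24"),
        ipaddress.ip_network("10.45.7.0/24"),  # an outlier that cannot be folded in
    ]

    # collapse_addresses merges adjacent and overlapping networks into the fewest
    # possible summaries, leaving anything non-contiguous alone.
    print(list(ipaddress.collapse_addresses(inherited)))
    # -> [IPv4Network('10.20.0.0/22'), IPv4Network('10.45.7.0/24')]

Fewer, larger summaries mean fewer prefixes for anything upstream to carry, which, as you will see, matters to this story.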

I am operating remotely, double-checking all configuration, ensuring we are ready for migration and that there is nothing untoward waiting for us. What could go wrong?

As it turns out, quite a lot. It’s about 2pm and, as I said, I am remotely on the consoles looking around, cross-checking. The first indication I see of an issue is my terminal session locking up. Odd, but this is IT; the best thing to do is just re-establish the session. But I can’t. OK… That’s very odd. Let’s step back a bit and go to the old switch. No joy. Dead. Hmm. OK, back a bit further and to the WAN routers, but again no joy. That’s truly odd: this is a highly resilient delivery, and both routers being out typically suggests a power outage, yet I know this site has resilient power feeds and a generator.

So I call our operations teams and yes, they have seen the site disappear. No alarms beforehand, just a loss of site. Investigations underway. I take the decision to go to the site; it’s only 20 minutes away and they are, after all, our customer, so let’s see if I can help.

On arrival at the site I see people leaving the building in droves, and walking in against the flow I feel like the hero in one of those disaster movies. Of course, I am not the hero; I am the guide, or, at this point in the building manager’s eyes, a supervillain.

Having convinced the manager that I am there to help, but that I need to get in to see what is happening, I head to the main comms room. Everything is looking fine. Green lights everywhere, circuits up, switches behaving normally, no exciting spanning-tree events, routers all OK but not seeing routes coming in from the core service provider network. Hmmm.

There is probably someone out there who has worked this out by now, so I’ll just put you in the picture: the service provider’s BGP prefix limit had been hit. Once I had worked this out, the fix was simple enough, but we had clearly lost credibility in the eyes of the customer and the migration plan was stopped. The big question was: how on earth had that happened? What had changed to introduce more networks into the situation?

The answer took some finding, but I got there in the end. An end-user port in the new network had come live. That ‘livening’ of the port had moved a VLAN’s state from down to up, causing an additional network to be advertised to the service provider. Unknown to me, the number of networks already being advertised was 50; this one additional subnet broke the provider’s limit, and their behaviour when you breach it is to cut all service. Ouch.
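With hindsight, a simple pre-change check against that limit would have flagged the danger before any port came live. Here is a minimal sketch in Python, assuming you can export the list of networks the edge routers will advertise; the limit, headroom and prefixes are illustrative values, not the real ones:

    PROVIDER_PREFIX_LIMIT = 50   # the ceiling the service provider enforces
    SAFETY_HEADROOM = 5          # shout well before the hard cut-off

    def check_prefix_budget(advertised, limit=PROVIDER_PREFIX_LIMIT,
                            headroom=SAFETY_HEADROOM):
        """Classify the advertised-prefix count against the provider's limit."""
        count = len(set(advertised))
        if count > limit:
            return f"BREACH: {count} prefixes advertised, provider limit is {limit}"
        if count > limit - headroom:
            return f"WARNING: {count} of {limit} prefixes used - one stray VLAN could breach"
        return f"OK: {count} of {limit} prefixes in use"

    # One access port coming live adds one more VLAN subnet to the advertisement.
    current = [f"10.20.{i}.0/24" for i in range(50)]         # sitting right at the limit
    print(check_prefix_budget(current))                      # WARNING
    print(check_prefix_budget(current + ["10.99.0.0/24"]))   # BREACH

The script itself isn’t the point; the point is that the limit is written down somewhere a pre-change check can actually see it.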

Learning point for everyone:

What we never got to the bottom of was who plugged something in. In retrospect it didn’t matter; we should have mitigated the risk of a human doing something unexpected. It’s worth noting that this site was a very static PC setup, with no remote or floating workers and no guests, so someone turning up with their laptop and plugging it in wasn’t really perceived as a risk by me, or indeed by others.

These days I always like to question the limits. What are our known limits? Our unknown limits? Can we transition any unknown to known? Document the limits and what happens when they are hit. This will really help operational teams get to root causes quickly.
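One lightweight way to keep that documentation, sketched here as a Python structure (the entries are illustrative; the shape matters more than the numbers):

    # A 'known limits' register kept alongside the ops runbook.
    KNOWN_LIMITS = {
        "provider_bgp_max_prefix": {
            "limit": 50,
            "when_hit": "provider drops the BGP session - total loss of site",
            "early_warning": "alert at 45 advertised prefixes",
        },
        "switch_mac_address_table": {
            "limit": "unknown",
            "when_hit": "unknown - a candidate to move from unknown to known",
            "early_warning": "none yet",
        },
    }

Whether it lives in code, a wiki or a runbook, the value is the same: when something trips, the on-call engineer can see which limit was hit and what the expected behaviour is.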

Naturally in the world of IT we need to plan for failures we don’t expect, yet we often only plan for the best results. If you find yourself in this type of project and are struggling to identify risk, try an innovation trick. The inversion technique is used to identify great outcomes by looking at bad ones: ask the team “What would the worst migration/integration/solution for this look like?” or “How would we make this go wrong?” Capture the points on a whiteboard and you now have a view of a great many of the risks you are actually facing. Oh, and the team will find it therapeutic and enjoyable. You’ll have to put up with some war stories, but these are failure-learning opportunities! Learn from the team and we’ll help IT be brilliant, one avoided failure at a time.
