Friday's Network Outage: What Happened?

08/01/11

Contributed by Michael Taves

We experienced a major network outage from about 8:30 on Friday morning to about 12:30 Saturday afternoon. Most, though not all, of our network-based servers were inaccessible during that time. The servers themselves were fine, but there was so much error traffic on the network that it blocked access to them. We want to tell you what caused the problem, the steps we are taking to minimize the chance of it recurring, and what you can do to help us prevent this kind of thing.

 

The culprit turned out to be one small four-foot cable in a vacant office that was tampered with on Friday morning, most likely just to get it out of the way or off the floor. Without going into great technical detail, the effect was to create a "data storm" of error traffic on the network. That traffic consumed the processors of many of our network switches, which then could not process "legitimate" data packets.

The diagnostic process involves sequentially isolating segments of the network until you locate the segment from which the error traffic is originating. It's a laborious effort. ITS staff, with support from our network vendor, worked throughout the day and night on Friday and finally narrowed the source down to a network segment that included a main data center room in Philips Hall and all of Muller Center, and from there to somewhere in Muller. At that point we physically cut Muller Center off from the network, and all services began to come back up as normal. A subsequent physical inspection of all ports in Muller Center turned up the cable that was causing the problem.
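For readers curious about what "sequentially isolating segments" looks like in practice, here is a minimal sketch in Python. It is only an illustration and not the procedure or tooling ITS actually used: the function name, the probe function, and every segment name other than Muller Center are made up for the example, and the "network" is simulated.

```python
from typing import Callable, Optional, Sequence


def find_faulty_segment(
    segments: Sequence[str],
    storm_persists_without: Callable[[str], bool],
) -> Optional[str]:
    """Isolate one segment at a time and return the first one whose
    removal makes the error traffic stop, or None if none does."""
    for segment in segments:
        if not storm_persists_without(segment):
            return segment
    return None


# Simulated network: segment names other than Muller Center are hypothetical.
SEGMENTS = ["Segment A", "Segment B", "Muller Center", "Segment C"]
FAULTY = "Muller Center"


def simulated_probe(isolated: str) -> bool:
    """Pretend to unplug `isolated` and report whether the storm continues."""
    return isolated != FAULTY


print(find_faulty_segment(SEGMENTS, simulated_probe))
```

On a real network each probe means physically or logically disconnecting a segment and waiting to see whether the error traffic subsides, which is why the process takes so long. The same loop can be repeated inside the guilty segment (building, then floor, then port) to narrow things down further.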

We are conducting a thorough post-mortem technical analysis of our options for preventing an occurrence like this from affecting our entire network in the future, and we are pretty confident that there are some key steps we can take quickly. What you can do is help spread the word to call ITS if there is a cable, a computer, or anything else that might be network related that is in your way or needs to be moved for whatever reason. It's kind of like the New York State "Call Before You Dig" campaign. Call us. We'll be more than happy to address your problem in a way that does not disrupt the network. This incident brought virtually the entire campus to a standstill for a full business day and produced a sleepless night for a number of ITS staff. There are things we'll be doing to help prevent this, but you can help too! Just call the ITS Help Desk at 274-1000 if you need to change or move something that you think might be connected to the network (a device, a cable, anything).

We apologize for the inconvenience of this outage. No data network can be made 100% "bulletproof". Equipment failures can happen, and human errors will occur. But with our technical efforts and your cooperation, we can minimize the chances that this happens again. Thank you for listening, and for your patience.

 




https://www.ithaca.edu/intercom/article.php/20110801152950253