Archive for August 20th, 2007

Skype outage

Last Thursday, Skype had problems and was down for most of two days. There is an explanation on Skype’s website today.

There are a number of posts (look at the responses on the Skype blog) with various reactions to the explanation, but I have to start by taking it at face value. Microsoft released its monthly patches, and the reboots that followed meant that all of those machines had to reconnect to Skype. That mass reconnection caused a failure in one of Skype’s algorithms.

At Sapias, we would see mini-versions of this when our network connectivity failed, or when there was a problem with one of our carriers. When connectivity was restored, there was a bunch of data (the size of the bunch directly proportional to the down time) that devices started to deliver to us, all at once. We would also see it every morning, as vehicles which had gone out of coverage the night before (with some drivers taking their vehicles home, out of cell range) came back into coverage and had some quantity of messages to deliver all at once.
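One common way to soften this kind of all-at-once reconnection is for each client to wait a random, growing delay before retrying. The sketch below is only an illustration of that general technique (exponential backoff with jitter); it is not what Skype or Sapias actually did, and the function name `reconnect_delay` and its `base` and `cap` parameters are made up for the example.

```python
import random

def reconnect_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Return how many seconds to wait before reconnect attempt `attempt` (0-based).

    Hypothetical helper: the ceiling doubles with each failed attempt up to
    `cap`, and the actual wait is drawn uniformly between 0 and that ceiling,
    so clients that all lost connectivity together do not all return together.
    """
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(attempt, 32)
    ceiling = min(cap, base * (2 ** exponent))
    return rng.uniform(0.0, ceiling)

if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, round(reconnect_delay(attempt), 2))
```

The randomness is the important part: a fixed delay would just move the spike later, while a random one spreads it out.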

Of course, these things are easier to handle when they happen often (like our every-morning load at Sapias). If they happen often, they aren’t unusual, and you know that you can react well to them. They are more troublesome when they are rare. Although some people are skeptical that the monthly Microsoft patch could cause this, I can imagine a case where that patch was unusual and required some different level of restart. The key is that a company like Skype (and I aspire to having as many users as they have one day) has to figure out how to test for things that are very difficult to test for. Unless you have millions of internal users, or can simulate millions of users with millions of machines, it is pretty hard to do.
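Short of millions of real machines, a rough model can at least show the shape of the problem. The sketch below is a toy simulation, not a substitute for a real load test: it assumes every client reconnects exactly once, at a delay drawn uniformly from a window, and it counts the busiest one-second bucket. The function name and parameters are invented for the example.

```python
import random
from collections import Counter

def peak_arrivals_per_second(num_clients, spread_seconds, seed=0):
    """Return the largest number of reconnects landing in any one-second bucket.

    Toy model: each client picks a delay uniformly in [0, spread_seconds].
    A spread of 0 models everyone reconnecting at the same instant.
    """
    rng = random.Random(seed)
    buckets = Counter()
    for _ in range(num_clients):
        delay = rng.uniform(0.0, spread_seconds) if spread_seconds > 0 else 0.0
        buckets[int(delay)] += 1  # group arrivals by whole second
    return max(buckets.values())

if __name__ == "__main__":
    print("no spread:    ", peak_arrivals_per_second(100_000, 0))
    print("10 min spread:", peak_arrivals_per_second(100_000, 600))
```

With no spread, the peak is simply the number of clients; with a spread, it drops to roughly the client count divided by the window length. A model like this says nothing about how the real servers or algorithms behave under that load, which is exactly the part that is hard to test.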

It is another reminder that you can’t test quality in - you have to figure out in the architecture how to be bullet-proof. And regardless of how you test, there are always going to be cases that you didn’t anticipate. In this case, it sounds like Skype had algorithms to try to react to similar scenarios - but they failed. Perhaps another argument for not getting too fancy. Sometimes the fancy scheme is the one that fails.


Jim on August 20th 2007 in Problem Solving, Technologies