System outages follow-up
Jesse Robbins at O’Reilly Radar has followed up on the outages at 365 Main. (I wrote about this a couple of days ago in Hosted Data Centers and Outages.) His post is a post-mortem on the failure itself and includes some commentary on the design approach 365 Main has taken.
The challenge in designing and implementing architectures to deal with failure is that the complexity of the solution often increases the likelihood of failure. One of the comments on my earlier post, from Jeff Dao, was that you should do a dry run of your solution: simulate, or create, a failure. But the more complex your solution, the more difficult it may be to simulate a failure, and in particular, the more difficult it is to simulate every failure scenario.
Having said that, in the case of 365 Main, it seems like they could have tested the three generators that failed to start up.
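To put a rough number on why simulating every failure scenario gets hard, here is a minimal sketch in Python. It assumes each component either works or fails, independently of the others, and the component names are hypothetical placeholders, not a description of 365 Main's actual setup.

```python
from itertools import combinations


def failure_scenarios(components):
    """Yield every non-empty set of components that could fail together."""
    for size in range(1, len(components) + 1):
        for combo in combinations(components, size):
            yield combo


def count_scenarios(n):
    """Number of non-empty failure combinations for n independent components."""
    return 2 ** n - 1


# Hypothetical component names, for illustration only.
components = ["utility_feed", "generator_a", "generator_b", "transfer_switch"]

for scenario in failure_scenarios(components):
    print(" + ".join(scenario))

print(count_scenarios(len(components)), "scenarios to test")
```

Under those assumptions, every component you add doubles the number of combinations, so an exhaustive dry run quickly stops being practical. Testing single-component failures, such as whether each generator starts, stays cheap by comparison.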
Jim on July 28th 2007 in Problem Solving, Technologies