Archive for the 'Problem Solving' Category

Jim’s first Heisenbug

Back to a theme of problem solving. In this case - the interesting part is the problem - not how I found it…

I was in college, working with ‘Real Hessenberg Matrices’ - which are generally sparse matrices. So it was reasonable to read in the values from punched cards. As I recall, some of them were 500×500.

I wrote a bunch of code to calculate eigenvalues (the real task was comparing different algorithms - optimized and not, with various compilers, and counting instructions, to evaluate what order things really were, and at what point the size was overwhelmed by the order). The program would work - and then not work. And it failed in the ugliest way imaginable.

The problem turned out to be in the routine (which I was given by my advisor) to load the sparse matrices.

The values came in four to a card (yes - punched cards). The Read statement that read them was something like:

READ (5, 5) i1, j1, x(i1,j1), i2, j2, x(i2,j2), i3, j3, x(i3,j3), i4, j4, x(i4,j4)
5 FORMAT (i2, i2, f7.2, i2, i2, f7.2, i2, i2, f7.2, i2, i2, f7.2)

This said (cleverly) read in the coordinates, and then the value, for four elements of the matrix. So a typical card looked like:
01 01 12345.22 01 07 12346.00 99 99 12344.11 50 50 12341.12
This loaded x(1,1), x(1,7), x(99,99), x(50,50).

It drove me crazy for way too long. Turned out that there were not a multiple of four elements in the sparse matrix. The last card only had two sets of values on it:

01 01 12345.22 01 07 12346.0

The blank fields on the short card read as zeros, so the READ statement cleverly put the value 0.0 at location 0, 0 of the NxN matrix. That location was outside the array - somewhere in my code… It would work for a while - and would work on any matrix with a multiple of four data elements.
The fix was trivial (because my thesis was almost done and I was tired of school). I replaced the last data card (above) with:

01 01 12345.22 01 07 12346.0 01 07 12346.0 01 07 12346.0

Yes - repeating the last value two times…
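
For anyone who wants to see the failure without a card reader, here is a minimal sketch in Python (not the original Fortran). It assumes whitespace-separated fields rather than the fixed columns of the real FORMAT, and the names parse_card and load_sparse are made up for illustration. The first function mimics the READ by always producing four triples, with missing fields read as zero; the second shows the bounds check the original loader lacked.

```python
def parse_card(card, groups_per_card=4):
    """Mimic the READ: always return four (i, j, value) triples.

    Fields missing from a short card are read as zero, which is what
    blank card columns did in the original program.
    """
    fields = card.split()
    fields += ["0"] * (3 * groups_per_card - len(fields))
    return [
        (int(fields[k]), int(fields[k + 1]), float(fields[k + 2]))
        for k in range(0, 3 * groups_per_card, 3)
    ]


def load_sparse(cards, n):
    """Load 1-based (i, j) -> value entries, skipping out-of-range indices.

    The skip is the guard that was missing: without it, the padding
    triples (0, 0, 0.0) are stored at a location outside the matrix.
    """
    x = {}
    for card in cards:
        for i, j, value in parse_card(card):
            if not (1 <= i <= n and 1 <= j <= n):
                continue  # padding from a short card, not real data
            x[(i, j)] = value
    return x
```

Repeating the last value on the card, as I did, avoids the problem in the data; the bounds check avoids it in the code.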

Next installment - the infinite loop that ate my entire budget of $150 of computer time at the UCSF data center.

Jim on December 12th 2008 in Problem Solving, Technologies

Managing by Data

Last month, Scott Thurm wrote an article in the Wall Street Journal titled “Now, It’s Business By Data, but Numbers Still Can’t Tell Future”. The article discusses how a number of companies are managing by the numbers. But he points out, “Running a complex enterprise can’t be reduced to a spreadsheet, however.”

In the early 1970s I was working as a contractor for the Oakland Public Schools data processing department. I was writing (and fixing) Fortran and Cobol programs, supporting some old 1401 applications, and serving as a liaison with the statistics department. One September, enrollment took a dive. It was the first year that enrollment had decreased year over year. For a school district (as I recall Oakland was something like the fifth largest in the nation), declining enrollment is devastating to budgets. So they asked me to help them model (predict) what the next year’s enrollment would be. I spent a bunch of time using various curve fitting techniques to come up with the best estimate I could of the next year’s enrollment. The statistics department was thrilled - and prepared to take their numbers to the board.

The problem was - none of these techniques applied a year earlier would have predicted the decline that we saw. To me that proved that my methodology could not be trusted. The statisticians were happy to have an ‘answer’ and ignored my concerns.
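
The check I was applying is what would now be called a backtest: fit the model on everything except the most recent year, then see whether it would have predicted that year. Here is a minimal sketch in Python, assuming a plain straight-line fit (I don't recall exactly which curve fitting techniques I used). The function names are hypothetical, and any numbers you feed it are yours, not Oakland's.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def backtest_last_year(years, values):
    """Fit on all years but the last, then predict the last one.

    Returns (predicted, actual) so the miss can be inspected.
    """
    slope, intercept = fit_line(years[:-1], values[:-1])
    predicted = slope * years[-1] + intercept
    return predicted, values[-1]
```

If the prediction for the year you already know is far from what actually happened, there is little reason to trust the same method for the year you don't know.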

That is not to say that I don’t like to look at the numbers. I love numbers. But I have learned to take them with a grain of salt.

Jim on August 21st 2007 in Problem Solving, Technologies

Skype outage

Last Thursday, Skype had problems and was down for most of two days. There is an explanation on Skype’s website today.

There are a number of posts (look at the responses to the Skype blog) with various reactions to the blog - but I have to start by taking it at face value. Microsoft released their monthly patches, and the reboots that ensued meant that all of the machines had to reconnect to Skype. That mass reconnection caused a failure in one of Skype’s algorithms.

At Sapias, we would see mini-versions of this when our network connectivity failed, or when there was a problem with one of our carriers. When connectivity was restored, there was a bunch of data (the size of the bunch directly proportional to the down time) that devices started to deliver to us - all at once. We would also see it every morning, as vehicles which had gone out of coverage the night before (with some drivers taking their vehicles home - out of cell range) came back into coverage, and had some quantity of messages to deliver all at once.

Of course these things are better when they happen often (like our every morning load at Sapias). If they happen often, they aren’t unusual, and you know that you can react well to them. They are more troublesome when they are rare. Although some people are skeptical that the monthly Microsoft patch could cause this, I can imagine a case where that patch was unusual - and required some different level of restart. The key is that a company like Skype (and I aspire to having as many users as they have one day), has to figure out how to test for things that are very difficult to test for. Unless you have millions of internal users, or can simulate millions of users - with millions of machines, it is pretty hard to do.

It is another reminder that you can’t test quality in - you have to figure out in the architecture how to be bullet-proof. And regardless of how you test, there are always going to be cases that you didn’t anticipate. In this case, it sounds like Skype had algorithms to try to react to similar scenarios - but they failed. Perhaps another argument for not getting too fancy. Sometimes the fancy scheme is the one that fails.

Jim on August 20th 2007 in Problem Solving, Technologies

System outages followup

Jesse Robbins at O’Reilly Radar has followed up on the outages at 365 Main. (I had written about this a couple of days ago: Hosted Data Centers and Outages). His post is a post mortem on the failure itself, and includes some commentary on the design approaches that 365 Main has taken.

The challenge in designing and implementing architectures to deal with failure is that often the complexity of the solution increases the likelihood of failure. One of the comments in my earlier post from Jeff Dao was that you should do a dry run of your solutions - simulate, or create a failure. But the more complex your solution, the more difficult it may be to simulate a failure - and in particular, the more difficult to simulate every failure scenario.

Having said that, in the case of 365 Main, it seems like they could have tested the three generators that failed to start up.

Jim on July 28th 2007 in Problem Solving, Technologies

Problem solving techniques: Stuck on one path

This is about getting stuck down a path. Going down just one path or exploring just one solution, you are not able to see any other possible solutions. This is similar to another one of our techniques: The buzzsaw moment. In that post, I described having a single solution taking all of your attention. And perhaps it isn’t even really a solution at all.

In this case - you get so drawn into one path - that you don’t even notice (can’t even notice?) the other paths anymore. If you are on a waterslide - there is really no choice until you come out the other end. Hopefully you can find a way to jump paths - or at least notice that there are other paths to consider.

Water Slide

[The cartoons were done for us several years ago by Marc Schmid. His website is www.cartooncity.net ]

Jim on July 13th 2007 in Problem Solving

Problem solving techniques: Insight from goats

We used to raise dairy goats. This insight is from those goats.

If you are flying a plane, and you see a goat eye to eye - there is probably a mountain he is standing on.

Sometimes you see a glitch once in your system. It only happened once. You can’t figure out how to duplicate it. So you decide to release anyway. Not a good idea. You need to figure out how to look harder - or look at it a different way. You are seeing the goat. Watch out for the mountain!

Mountain Goat

Every day at 5:30 performance goes to a crawl. But you can’t figure it out. Watch out for the mountain! There might be a process that is running away at 5:30…

At Sapias, we sometimes saw trucks that appeared to be in the ocean (I am not mixing metaphors here). The truck shouldn’t be in the ocean. We eventually traced it to a combination of antenna problems and the way the device reported its location accuracy.

We once saw network traffic go out of sight. We had a hard time figuring out where the traffic was coming from. But there was a mountain, and we finally tracked it down.

If you see a goat - find the mountain before it finds you.

Jim on July 2nd 2007 in Problem Solving

Problem solving techniques: Upset the apple cart

The next technique is characterized in Douglas Hofstadter’s book: Fluid Concepts and Creative Analogies. He describes enjoying doing the game Jumble where you take a set of letters and try to make a word out of them. Sometimes the words just appear, sometimes he uses specific techniques (try the ‘u’ after the ‘q’), and other times he writes the letters out in a random order. The goal is to get yourself out of a mental rut.

Apple Cart

When you are debugging code - or dealing with a production down situation, it is certainly useful to consider the obvious things (the last module you touched for instance). But sometimes you have to upset the apple cart - and consider something ‘unexpected’. Or introduce some randomness to try to find a pattern.

Jim on June 29th 2007 in Problem Solving

Problem solving techniques: Don’t just look where the light is good

We often watch people concentrating their debugging or problem solving in one area. It is the old joke depicted in the image below. You walk up to someone looking for their key. After helping them for a while, you start asking a few questions - finally asking how they lost the key, or where they lost the key. The answer is “over there”. You then ask the obvious question - “If you dropped it over there, why are you looking for it here?”. They answer “Because the light is better”.

Street Light

Sometimes you are looking at one piece of your code, or your process, because you know that part best - not because it is the likely source (no pun intended) of the problem. You need to look other places. Maybe look at the place you understand the least. Maybe look at the piece of code that you hate to look in (maybe there is a reason you don’t like to go there). Maybe you should spawn some agents.

Jim on June 23rd 2007 in Problem Solving

Problem solving techniques: Spawn some agents

Several of the techniques I am describing involve making sure you don’t get stuck on a single track. This technique provides a good way to make sure you aren’t stuck. You should work down multiple tracks at once. In a ‘production down’ scenario this is particularly important. If you have one approach - that might work - but will take an hour - you really want to work some other path in parallel - since that first path may not work out.

This can be applied by yourself as you make sure you are thinking of other solutions - or you can send other people down other tracks. At Sapias the engineering team would set up IM chat sessions when there were production issues - and during deployments. The team had worked together for a long time, so individuals would automatically start down alternate paths - and would IM their colleagues about their direction and progress.

Agents

As a leader or manager, you can explicitly suggest alternate paths. Sometimes it is better to solicit ideas, and then pick some to send people off in. The collective thoughts as well as collective hands to move the thoughts forward will get you a more timely - and better - solution.

Jim on June 19th 2007 in Problem Solving

Problem solving techniques: the buzzsaw moment

Several years ago, Mary and I gave a presentation on problem solving techniques. I would like to highlight a few of these techniques. These techniques apply in general to problem solving, and in particular to debugging of software. The first one is the buzzsaw moment. This technique applies especially to those instances where you have a production system that is down - or that you know is about to go down. It also applies when you have made a mistake, that is going to have bad consequences, and you are trying to figure out how to undo it.

You are sitting on the log - the log is moving toward a saw - and you can see what the problem is. The problem is this darn rope tying you down. So you focus all of your attention on untying that rope. You know deep inside that it isn’t going to work - that you can’t do it fast enough - but there it is - a rope that needs to be untied.

Buzzsaw Cartoon (small)


But - it turns out that there is another option. Close at hand (pun intended). Up to your right is the ‘off’ button. All you need to do is a) see it, b) realize that it could lead to a good result, and c) HIT IT.

The really good news about this technique is that many people can apply it (more on this in a subsequent post).

One of my favorite instances of the buzzsaw moment was when I was looking over the shoulder of a database system administrator. We had some data to clean up, and were going to selectively delete some records. This is one of those moments that require care. Usually another cup of coffee first, and always another set of eyes. I was the second set of eyes. The person at the keyboard entered his DELETE tablename WHERE… command, then selected DELETE tablename (but not the WHERE clause), and hit enter. And at that point the damage was done - all of the rows in the table were being deleted. The really good news was that there were a lot of them. In this case, the off button really was the off button. We were sitting next to the server. The rows were being deleted, and logged. I reached over, and threw the on/off switch, interrupting the delete. Since it hadn’t completed (ok - there was luck involved), all the rows were there (ensuring a consistent state) when we brought the server back up.
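
The power switch is not an off button I would recommend planning around. A software version of the same idea is to run the delete inside a transaction and roll it back if it touches more rows than expected. The sketch below is not what we did that day; it is a minimal illustration using Python's built-in sqlite3 module, and the function name guarded_delete, the max_rows limit, and the records table in the usage note are all hypothetical.

```python
import sqlite3


def guarded_delete(conn, sql, params, max_rows):
    """Run a DELETE, but undo it if it removes more rows than expected.

    Relies on sqlite3 opening a transaction implicitly before the DELETE
    (its default behavior), so rollback() restores the rows.
    Returns the number of rows actually deleted (0 if rolled back).
    """
    cur = conn.execute(sql, params)
    if cur.rowcount > max_rows:
        conn.rollback()  # the 'off' button: nothing is committed
        return 0
    conn.commit()
    return cur.rowcount
```

Called as guarded_delete(conn, "DELETE FROM records", (), max_rows=1), the forgotten WHERE clause is rolled back instead of emptying the table. It is still no substitute for the second set of eyes.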

Jim on June 13th 2007 in Problem Solving