Wednesday, February 16, 2011

Plan to Fail

During a particularly intense design session the other day, one of my more brilliant engineers and co-architect of CAStor asked what he thought to be a rhetorical question. If we can provide a deterministic algorithm that completely solves a problem except for failures, shouldn't we do that?

At the time, I was dumbfounded. The answer seems so obvious I had no reply. Of course we should do our best to push the failure cases into a corner, solve them separately, and keep the blue sky paths as true blue and predictable as possible. Now, after some careful and mostly sober consideration, I can revisit that moment and provide the answer I wish I had thought of at the time. That's because, I have a blog. Not just anybody can get one of these, you know.

The answer I wish I had given is, not always.

We software engineers tend to think of failures as exceptions, bad things that don't happen very often. When it does happen, a failure injects asynchrony and non-determinacy into the problem whether we like it or not. One can never be sure of when it will happen or exactly how the failure will exhibit itself. We can easily allow ourselves to believe that failures are tricks the universe plays on us. But really, a failure is just a mismatch between our imagination and reality. If something we expect to happen doesn't happen, or something we expect to happen quickly takes too long, or something we expect to behave in a certain way starts acting up, we call that a failure. But that doesn't mean reality has failed us, it just means our imaginations are limited.

Whoa! Too philosophical. Let's return to the cocktail party.

In a previous post, I described two different algorithms for answering questions like the one posed in the original scenario. We observed that the probabilistic one was better than the deterministic one in several ways, but it had a very serious drawback; it doesn't always give the correct answer! Now let's see what happens when we inject something that might be called a failure into the story. Every party I've ever attended had a bathroom for the guests' convenience. Regularly, but at unpredictable times, a guest will simply disappear for a few moments and be unavailable to send or receive messages with the other guests. This models a failure because a given guest may, at any moment, go to (sheer force of will allows me to refrain from potty humor here) the loo. We can now examine how failures impact the two algorithms previously described.

First the deterministic algorithm. If the host makes a request to look up a number while one of the guests happens to be indisposed, and if we're unlucky enough so that guest happens to hold the card for which we're searching, then the algorithm cannot possibly return the correct answer. Worse than that, since the ring has a temporary hole in it, the algorithm cannot even complete and it will hang. If there is only one water closet, we can overcome this problem by having each guest remember their nearest two neighbors on each side, allowing the algorithm to "hop over" any failure. If there are two bathrooms, raising the possibility of two simultaneous failures, there will need to be some sort of ring repair sub-algorithm that gets invoked whenever a failure is recognized. I'll leave the design of such a repair strategy to the reader, but suffice to say that the resulting overall algorithm will be even more complex in order to handle the exceptional failure cases. Even if there is only one copy of each card at the party (we'll talk about redundancy in a minute), this new algorithm will still be deterministic except for failures. Looking at things more holistically, however, it is clear that an inopportune call of nature can cause the "deterministic" algorithm to give the wrong answer, just like the probabilistic one.

Speaking of the probabilistic algorithm, what do we need to modify to accommodate the possibility of unpredictable bathroom breaks? The answer is, absolutely nothing. This algorithm is inherently potty tolerant. Assuming one copy of each card, it will still report a false negative result if we are unlucky enough that the person holding the exact card for which we are currently searching happens to be taking a break, just like the "deterministic" algorithm. But there is no disconnect between our expectations of the algorithm and its actual behavior either with or without failures. In all cases, there is an unlikely possibility of a false negative result.

There is an easy way to rescue the determinacy of the "deterministic" algorithm by making multiple copies of the cards. In general, b+1 copies are required to ensure no false negatives if there are b baƱos at the party. This trick works for both algorithms in fact, and for the probabilistic one, it will also reduce the probability of a missed shout, because any of the b+1 guests holding a copy of the card being searched can respond with the correct answer. In reality, it is not always easy to bound the number of simultaneous failures, so there remains, always, some possibility of a false negative being returned by either algorithm.

Thinking of failures as being exceptional, rare occurrences often leads to inefficient, and sometimes incorrect, algorithms. For those of us designing systems that are intended to scale without bounds, it is a mental convenience we cannot afford. As the number of guests at the party continues to increase, the probability of an inconvenient leak becomes completely unexceptional. Paradoxically, many of the algorithms we end up using for these sorts of problems, like the simple shout and listen algorithm at the party, feel almost too primitive to be useful, and yet, they often are the most robust. To be successful in this arena, we must view failures as a normal, expected part of the problem definition.

We must plan to fail.

No comments:

Post a Comment