Thursday, March 3, 2011


A couple months ago I wrote a useful tool to help us visualize the operation of a CAStor cluster in situ. It provides a bird's eye view of the nodes and communication among them by replaying a syslog file captured during an actual execution. We plan to use this tool for a number of different things, from marketing to tech support, even design validation. As a first application, I have begun using it as the basis for a series of tutorial videos explaining the inner workings of CAStor. Here's the first installment.

Check it out!

Chocolate Empathy

One reason it's so hard to design good distributed software is that programmers must, for lack of a less anthropomorphic term, empathize with multiple processes at once. A CAStor storage node, for example, is not omniscient. It knows only about its own state and enough about the global cluster to operate. Answering even very simple questions like, "How much available storage is there in the cluster?" involves communicating and agreeing with other nodes. From the programmers' perspective, we must imagine ourselves in the place of each of the individual nodes and model, in our heads, what is known and, more importantly sometimes, what cannot possibly be known by each node at each point in the algorithm.

This capability to maintain many different models of what others know, in addition to our own knowledge store, is an amazing feat of the human mind. And we do it all the time. Every conversation you have with another person relies on a gigantic store of knowledge about what she knows, whether she knows that I know she knows it, and so on. But this ability is definitely a higher function of the brain and does not come without some mental effort.

A famous psychological experiment performed by Wimmer and Perner in the '80s shows this mental ability does not mature until fairly late, four years old or so, in a child's life. The experiment that demonstrates this is elegant and simple. Children are told the following story. A little boy named Maxi and his mother return from a shopping trip at which they purchased a bar of chocolate. The mother places the chocolate in a blue cabinet and closes it. When Maxi goes outside to play, the mother moves the chocolate bar to another, green cabinet. When Maxi returns to the kitchen, he decides to have a bite of chocolate. Which cabinet, the children are asked, will Maxi look in to find the bar?

Before the age of three or four years, almost all children will say Maxi will look in the green cabinet, because that's where the mother put it! Proving they have not yet fully developed the ability to model someone else's mind or empathize with another, albeit fictional, person's knowledge base. It may also be the case that this mental ability is lost or stunted as we become old (speaking strictly from personal experience).

So this stuff is hard. It's the difference between teaching someone to dance and choreographing a ballet. CS curricula, by and large, train us to be omniscient, sequential programmers. Perhaps that's why so many otherwise talented developers mistakenly believe every storage node should just know to look in the green cabinet.

Wednesday, February 16, 2011

Plan to Fail

During a particularly intense design session the other day, one of my more brilliant engineers and co-architect of CAStor asked what he thought to be a rhetorical question. If we can provide a deterministic algorithm that completely solves a problem except for failures, shouldn't we do that?

At the time, I was dumbfounded. The answer seems so obvious I had no reply. Of course we should do our best to push the failure cases into a corner, solve them separately, and keep the blue sky paths as true blue and predictable as possible. Now, after some careful and mostly sober consideration, I can revisit that moment and provide the answer I wish I had thought of at the time. That's because, I have a blog. Not just anybody can get one of these, you know.

The answer I wish I had given is, not always.

We software engineers tend to think of failures as exceptions, bad things that don't happen very often. When it does happen, a failure injects asynchrony and non-determinacy into the problem whether we like it or not. One can never be sure of when it will happen or exactly how the failure will exhibit itself. We can easily allow ourselves to believe that failures are tricks the universe plays on us. But really, a failure is just a mismatch between our imagination and reality. If something we expect to happen doesn't happen, or something we expect to happen quickly takes too long, or something we expect to behave in a certain way starts acting up, we call that a failure. But that doesn't mean reality has failed us, it just means our imaginations are limited.

Whoa! Too philosophical. Let's return to the cocktail party.

In a previous post, I described two different algorithms for answering questions like the one posed in the original scenario. We observed that the probabilistic one was better than the deterministic one in several ways, but it had a very serious drawback; it doesn't always give the correct answer! Now let's see what happens when we inject something that might be called a failure into the story. Every party I've ever attended had a bathroom for the guests' convenience. Regularly, but at unpredictable times, a guest will simply disappear for a few moments and be unavailable to send or receive messages with the other guests. This models a failure because a given guest may, at any moment, go to (sheer force of will allows me to refrain from potty humor here) the loo. We can now examine how failures impact the two algorithms previously described.

First the deterministic algorithm. If the host makes a request to look up a number while one of the guests happens to be indisposed, and if we're unlucky enough so that guest happens to hold the card for which we're searching, then the algorithm cannot possibly return the correct answer. Worse than that, since the ring has a temporary hole in it, the algorithm cannot even complete and it will hang. If there is only one water closet, we can overcome this problem by having each guest remember their nearest two neighbors on each side, allowing the algorithm to "hop over" any failure. If there are two bathrooms, raising the possibility of two simultaneous failures, there will need to be some sort of ring repair sub-algorithm that gets invoked whenever a failure is recognized. I'll leave the design of such a repair strategy to the reader, but suffice to say that the resulting overall algorithm will be even more complex in order to handle the exceptional failure cases. Even if there is only one copy of each card at the party (we'll talk about redundancy in a minute), this new algorithm will still be deterministic except for failures. Looking at things more holistically, however, it is clear that an inopportune call of nature can cause the "deterministic" algorithm to give the wrong answer, just like the probabilistic one.

Speaking of the probabilistic algorithm, what do we need to modify to accommodate the possibility of unpredictable bathroom breaks? The answer is, absolutely nothing. This algorithm is inherently potty tolerant. Assuming one copy of each card, it will still report a false negative result if we are unlucky enough that the person holding the exact card for which we are currently searching happens to be taking a break, just like the "deterministic" algorithm. But there is no disconnect between our expectations of the algorithm and its actual behavior either with or without failures. In all cases, there is an unlikely possibility of a false negative result.

There is an easy way to rescue the determinacy of the "deterministic" algorithm by making multiple copies of the cards. In general, b+1 copies are required to ensure no false negatives if there are b baƱos at the party. This trick works for both algorithms in fact, and for the probabilistic one, it will also reduce the probability of a missed shout, because any of the b+1 guests holding a copy of the card being searched can respond with the correct answer. In reality, it is not always easy to bound the number of simultaneous failures, so there remains, always, some possibility of a false negative being returned by either algorithm.

Thinking of failures as being exceptional, rare occurrences often leads to inefficient, and sometimes incorrect, algorithms. For those of us designing systems that are intended to scale without bounds, it is a mental convenience we cannot afford. As the number of guests at the party continues to increase, the probability of an inconvenient leak becomes completely unexceptional. Paradoxically, many of the algorithms we end up using for these sorts of problems, like the simple shout and listen algorithm at the party, feel almost too primitive to be useful, and yet, they often are the most robust. To be successful in this arena, we must view failures as a normal, expected part of the problem definition.

We must plan to fail.

Friday, February 11, 2011

Back to the Party

I recently posed the party problem while interviewing a candidate. There are many good solutions to the problem, and his approach was on target, if conventional. Typically, he pointed out, these problems are solved by forming the distributed processes, ahem, party guests, into a simple topology, like a ring. Most textbooks on distributed algorithms treat this case at some length because it lends itself nicely to inductive proofs. Of course, a fixed topology wasn't part of the original problem statement, so any algorithm that answers the original question, What word is on the card whose number is 50342?, would need to include a sub-algorithm to first organize the guests into a ring where each party-goer has exactly two neighbors.

Let's describe one such algorithm in more detail. I'd like to defer, until a later post, the part of the problem that allows guests to swap cards in any way they choose. We'll assume for now that each guest continues to hold the same deck of cards they are given when they enter the room. Our initial goal is to introduce each new guest to two others, whom we'll call the R-neighbor and the L-neighbor. A guest's R-neighbor is the one person in the room who has the smallest ID that is larger than his own. Similarly, the L-neighbor is the one who has the largest ID that is smaller than his. The only exception is where the tail of the ring joins the head, at which point the guest with the largest ID in the entire room has as his R-neighbor the guest with the smallest ID, and vice versa. Note, this is a logical ring, not a physical one. If we don't allow the guests to eat and drink, the algorithm will surely fail.

Organizing the guests into such a logical ring isn't too difficult. We could say each person, when she enters the room, chooses a four-digit ID number that is unlikely to be chosen by anyone else and writes it down. (Notice how randomness enters the story, literally, at the front door.) If the guest is the first to arrive, there's nothing more to do; she comprises a degenerate ring of one and is her own R-neighbor and L-neighbor. Subsequent guests all choose their own, probably unique, ID. Each one then chooses some other person in the room, it doesn't matter who, and hands them a message that says, My ID is XXXX. Please find my place in the ring. Upon receiving such a message, written on a slip of paper, each guest checks to see if the new guest's ID falls in between its own ID and the ID of its R-neighbor. If not, he passes the message on to his R-neighbor, who does exactly the same thing. If the new ID does fall between my own ID and that of my R-neighbor's, the new guest is inserted into the ring, which is done by sending messages to the new guest and to his original R-neighbor, introducing them to each other and informing them of the new neighborhood arrangement.

Once the logical ring is in place, the original question can be answered with each guest passing slips of paper to one of only two other guests, its neighbors in the ring. Here's one such algorithm. Upon receiving a message, each guest performs the following sequence of actions (remember, we're looking for a symmetric solution, so every guest must behave the same way) starting with whomever receives the original note from our host.

  1. If there isn't already an ID at the top of the message, write my own ID there.
  2. If there is already an ID, and it's mine, we're finished. Hand the answer back to the host and skip the remaining steps.
  3. Search through my deck of cards looking for the number 50342.
  4. If I find a card with that number on it, write the associated word on the slip of paper; this is the answer.
  5. Pass the message on to my R-neighbor.

If we discount the possibility that two guests might choose the same ID at the front door, this algorithm is deterministic, both for the positive and negative cases. That is, if the card in question exists in any guest's deck, the algorithm will discover it and answer the question, and if there is no card, the algorithm will report that it doesn't exist. This particular version of the algorithm will provide both answers after sending exactly n+2 messages (counting the two in which the host is involved), where n is the total number of guests at the party. There are various ways to reduce the total number of messages required, but in all cases the number of messages needed will grow as the number of guests increases. Still, this is a pretty darn good algorithm that behaves just the way we like our computer software to behave — deterministically.

Just for grins, lets also describe a probabilistic algorithm that solves the same problem given the same constraints. This one doesn't presuppose any particular topology, so we can skip the organizational first phase. Instead, we proceed as follows. When the host hands the slip of paper with the question on it to a random guest, he shouts to everybody in the room,

"Attention! What word is on the card whose number is 50342?"

Upon receiving this message, each guest searches through his deck of cards and, if he finds the number 50342, writes the associated word on a slip of paper and hands it to the guest who shouted the question, who then passes the answer along to the host. If no one answers within some reasonable amount of time, we assume the card does not exist and send a message back to the host to that effect.

Which approach is "better?" This second algorithm is better in a number of regards. It's certainly a lot simpler. It's also more efficient, requiring only four messages in the positive case — the original question from the host, the shout, the reply from the question, and the reply to the host — and only three messages in the negative case. It also scales better, since neither of these performance metrics grows as the number of guests increases.

But the probabilistic algorithm has a major drawback that would prevent its use by most any red-blooded programmer: it is, well, probabilistic. There is some chance, however small, that the guest holding the requested card does not hear the shout and therefore fails to answer. The shouting guest, in that case, will wait a respectable period of time and then declare, the card doesn't exist. To be precise, he can only truthfully claim that he doesn't know whether it exists or not, since the negative case and the failure case are indistinguishable.

Another drawback of the latter algorithm, also related to its probabilistic nature, is in choosing the right "reasonable amount of time" to wait before declaring the card can't be found. If the chosen timeout is too short, some of the guests may not have enough time to search through their card decks to find the card. If it's too long, the nominal performance of the negative case suffers. Yes, we could have the guest repeat the shout after a short timeout in order to increase the chance that everybody hears it and to give everyone a longer period in which to search their individual card decks. However, there will always be some remaining possibility of a false negative.

Does that mean the probabilistic algorithm is fatally flawed and should be avoided in all cases? More on that later...