Wednesday, March 16, 2011
Religion has long been obsessed with the true name of God. The ancient Jews revered it and believed it held such power that they forbade its utterance by ordinary people, permitting it to be spoken officially only once every seven years. Sikhs have a word for the concept: satnam. The true name of true names, I guess.
The concept of a true name is ancient and pervasive in the secular world too. Rumpelstiltskin cherished his so much he sang songs about it, then tore himself in two when it was discovered. One of my favorite science fiction stories is actually titled True Names. Written by the mathematician Vernor Vinge in 1981, the novella explored the promise and threat of one's true name in cyberspace, and incidentally invented the cyberpunk genre that was later fleshed out by William Gibson, Neal Stephenson, Bruce Sterling, and others. The title of this post is borrowed from Vinge's first anthology.
Back here in the cyberpresent, names for persistent objects in computer technology have an unfortunately muddled story. Traditional file systems conflate the notions of name and containment for files. We say that a file is "in a directory" or "contained in a folder" when really we're just speaking about a prefix of its name. This leads to some weird, but by now familiar, holes in the metaphor. When you "move" a file from one folder to another, usually the file's contents aren't moved at all; the file is simply renamed. In Unix-based file systems, the "hard link" mechanism allows a single file to have more than one name. The metaphorical blemish of having to explain that an object can be in two places at once is often papered over by pretending hard links are a distinctly different kind of name from the original, so a file has one definite location and perhaps some other, lesser references to it from other locations. This notion is patently false; all names are created equal in this case, and if you delete the "original" name the other links will continue to work as usual. The fact is, files can have different names in different contexts. But do they have a satnam that is the same in every context?
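On a POSIX system you can watch this equality of names directly. Here's a small Python sketch (the file names are hypothetical) that creates a hard link, confirms both names map to the same inode, deletes the "original," and happily reads the file through the surviving name:

```python
import os
import tempfile

# Work in a scratch directory; both names of a hard link must live
# on the same file system.
scratch = tempfile.mkdtemp()
original = os.path.join(scratch, "original.txt")
alias = os.path.join(scratch, "alias.txt")

with open(original, "w") as f:
    f.write("the file's contents")

os.link(original, alias)    # give the same file a second name

# Both names resolve to the same inode -- they are equal citizens.
assert os.stat(original).st_ino == os.stat(alias).st_ino

# Deleting the "original" name leaves the file intact under the other name.
os.remove(original)
with open(alias) as f:
    print(f.read())
```

The `st_ino` field in the result of `os.stat` is the inode number, which brings us to the next question.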
Sort of. In traditional Unix/Linux file systems, a file's inode number is its true name. All other names are mapped to the inode by the file system's directory mechanism. The inode data structure itself contains information about where to find the various parts and pieces (blocks) that make up the file's data content. An inode number is a 32-bit or, in more recent releases, 64-bit binary persistent name for a file. Ideally, a satnam does not depend on context; it refers to the same thing everywhere and everywhen. But a 32-bit inode is not long enough to allow that since it accommodates only about 4 billion files. This may seem like a big number, but it isn't even close to being big enough to uniquely identify every computer file in existence. So the Unixen came up with the concept of a file system, which effectively scopes inodes so they make sense only within the limited context of a single file system, usually hosted on one server. In other words, inodes do not have global scope. Nor is 4 billion inodes enough even to uniquely identify all the files in a single file system, if we remember that files may be created and then deleted. Again, ideally each instance of a file would have a satnam that's different from every other instance that ever existed. But inodes do not have such universal scope either, since they must be reused. The newer 64-bit inodes do allow up to about 18 × 10^18 (that is, 2^64) distinct inodes, which helps us build larger file systems that span several servers, but the namespace is still not large enough to support universally scoped true names.
Distributed storage systems that aspire to scale without limit must support an essentially infinite (that is, a very, very large finite) number of distinct objects over the entire lifetime of, well, everything. Further, the degree of cooperation and agreement required to implement a coordinated distributed namespace is, to say the least, exorbitant, and in most cases impossible. Conventional, hierarchical naming schemes always require a centralized naming authority somewhere along the way (that's why the IANA exists). These types of distributed storage systems really do need a satnam, a universally scoped name for each stored object.
My colleagues here at Caringo, Paul Carpentier, Jan Van Riel, and Tom Teugels, invented a process (US Patent 7,793,112) by which an intrinsic satnam can be derived from any binary object. This technique, the basis for Content Addressable Storage, or CAS, uses a cryptographic hash to digest the bits and spit out a 128-bit name. Depending on the strength of the hash algorithm, this can produce a universally unique identifier, a true name. The drawback is that smart hackers keep finding ways to defeat the hash algorithms, producing completely different objects that have the same names.
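The patent spells out its own construction, which I won't reproduce here, but the flavor of content-derived naming is easy to sketch in Python using MD5, a 128-bit digest (and, as it happens, one of the algorithms for which engineered collisions have been demonstrated):

```python
import hashlib

def content_name(data: bytes) -> str:
    """Derive a 128-bit name from an object's bits (MD5 yields 128 bits)."""
    return hashlib.md5(data).hexdigest()

a = content_name(b"the quick brown fox")
b = content_name(b"the quick brown fox")
c = content_name(b"the quick brown fox!")

assert a == b   # identical bits always produce the identical name...
assert a != c   # ...and, barring a collision attack, different bits differ
print(a)        # 32 hex digits: a 128-bit content-derived name
```

The same bits yield the same name wherever and whenever they are hashed, which is exactly the property a content address needs, and exactly the property a collision attack destroys.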
At Caringo, we built a clustered, scalable storage system called Swarm using a different technique that doesn't suffer this drawback. We assign a satnam to an object when it is initially created, a 128-bit name that, with very high probability is different from any other satnam being used anywhere in the universe at any time, past, present, or future. This universally unique identifier is stored with the object's data bits and becomes part of the object as it moves around the world, gets replicated, and so on.
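Swarm's actual name-generation scheme isn't described in this post; as a sketch of the idea, a random (version 4) UUID, which packs 122 random bits into a 128-bit identifier, gives the flavor: the name records the act of creation, not the content.

```python
import uuid

# Two objects created from exactly the same bits still get different
# names, because the name is minted at creation time, not derived
# from the content.
name_austin = uuid.uuid4()   # 122 random bits inside a 128-bit identifier
name_mars = uuid.uuid4()     # same hypothetical bits, different creation event

assert name_austin != name_mars   # different objects, different satnams
assert name_austin.int.bit_length() <= 128
print(name_austin)
```

With 122 random bits, the probability of two independently minted names colliding anywhere, ever, is vanishingly small, which is what "universally unique with very high probability" cashes out to.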
The differences between these two approaches suggest a rather deep philosophical question. Is it the case that one's fundamental identity is embodied in the sum of all his traits (the data bits) or is it the fact of his creation that identifies him? If I create a data object today here in Austin, and my great-great-great-granddaughter also creates a data object on Mars using precisely the same bits in the same order, are they the same object? Should they have the same satnam? Philosophically, I have no idea, but Caringo Swarm says, no. Objects created at different times or places are different objects, even if they appear otherwise to be the same. An object's satnam is universally scoped and intractable to guess, unless you have the object itself.
Which brings us back to Rumplestiltskin. The only way to access an object stored in CAStor is to produce its satnam. If you wish to keep it confidential, then you simply do not ever divulge its true name to anyone you don't trust. Don't go singing it around the campfire.
Thursday, March 3, 2011
A couple months ago I wrote a useful tool to help us visualize the operation of a CAStor cluster in situ. It provides a bird's eye view of the nodes and communication among them by replaying a syslog file captured during an actual execution. We plan to use this tool for a number of different things, from marketing to tech support, even design validation. As a first application, I have begun using it as the basis for a series of tutorial videos explaining the inner workings of CAStor. Here's the first installment.
Check it out!
This capability to maintain many different models of what others know, in addition to our own knowledge store, is an amazing feat of the human mind. And we do it all the time. Every conversation you have with another person relies on a gigantic store of knowledge about what she knows, whether she knows that I know she knows it, and so on. But this ability is definitely a higher function of the brain and does not come without some mental effort.
An experiment performed by Wimmer and Perner in the '80s shows this mental ability does not mature until fairly late in a child's life, at four years old or so. The experiment that demonstrates this is elegant and simple. Children are told the following story. A little boy named Maxi and his mother return from a shopping trip on which they purchased a bar of chocolate. The mother places the chocolate in a blue cabinet and closes it. When Maxi goes outside to play, the mother moves the chocolate bar to another, green cabinet. When Maxi returns to the kitchen, he decides to have a bite of chocolate. Which cabinet, the children are asked, will Maxi look in to find the bar?
Before the age of three or four, almost all children will say Maxi will look in the green cabinet, because that's where the mother put it! This shows they have not yet fully developed the ability to model someone else's mind or to empathize with another, albeit fictional, person's knowledge base. It may also be the case that this mental ability is lost or stunted as we grow old (speaking strictly from personal experience).
So this stuff is hard. It's the difference between teaching someone to dance and choreographing a ballet. CS curricula, by and large, train us to be omniscient, sequential programmers. Perhaps that's why so many otherwise talented developers mistakenly believe every storage node should just know to look in the green cabinet.
Wednesday, February 16, 2011
At the time, I was dumbfounded. The answer seemed so obvious I had no reply. Of course we should do our best to push the failure cases into a corner, solve them separately, and keep the blue-sky paths as true blue and predictable as possible. Now, after some careful and mostly sober consideration, I can revisit that moment and provide the answer I wish I had thought of at the time. That's because I have a blog. Not just anybody can get one of these, you know.
The answer I wish I had given is, not always.
We software engineers tend to think of failures as exceptions, bad things that don't happen very often. When one does happen, a failure injects asynchrony and non-determinacy into the problem whether we like it or not. One can never be sure of when it will happen or exactly how the failure will exhibit itself. We can easily allow ourselves to believe that failures are tricks the universe plays on us. But really, a failure is just a mismatch between our imagination and reality. If something we expect to happen doesn't happen, or something we expect to happen quickly takes too long, or something we expect to behave in a certain way starts acting up, we call that a failure. But that doesn't mean reality has failed us; it just means our imaginations are limited.
Whoa! Too philosophical. Let's return to the cocktail party.
In a previous post, I described two different algorithms for answering questions like the one posed in the original scenario. We observed that the probabilistic one was better than the deterministic one in several ways, but it had a very serious drawback; it doesn't always give the correct answer! Now let's see what happens when we inject something that might be called a failure into the story. Every party I've ever attended had a bathroom for the guests' convenience. Regularly, but at unpredictable times, a guest will simply disappear for a few moments and be unable to send messages to or receive messages from the other guests. This models a failure because any given guest may, at any moment, go to (sheer force of will allows me to refrain from potty humor here) the loo. We can now examine how failures impact the two algorithms previously described.
First, the deterministic algorithm. If the host makes a request to look up a number while one of the guests happens to be indisposed, and if we're unlucky enough that the missing guest happens to hold the card for which we're searching, then the algorithm cannot possibly return the correct answer. Worse than that, since the ring has a temporary hole in it, the algorithm cannot even complete, and it will hang. If there is only one water closet, we can overcome this problem by having each guest remember his nearest two neighbors on each side, allowing the algorithm to "hop over" any failure. If there are two bathrooms, raising the possibility of two simultaneous failures, there will need to be some sort of ring-repair sub-algorithm that gets invoked whenever a failure is recognized. I'll leave the design of such a repair strategy to the reader, but suffice it to say that the resulting overall algorithm will be even more complex in order to handle the exceptional failure cases. Even if there is only one copy of each card at the party (we'll talk about redundancy in a minute), this new algorithm will still be deterministic except for failures. Looking at things more holistically, however, it is clear that an inopportune call of nature can cause the "deterministic" algorithm to give the wrong answer, just like the probabilistic one.
Speaking of the probabilistic algorithm, what do we need to modify to accommodate the possibility of unpredictable bathroom breaks? The answer is, absolutely nothing. This algorithm is inherently potty tolerant. Assuming one copy of each card, it will still report a false negative result if we are unlucky enough that the person holding the exact card for which we are currently searching happens to be taking a break, just like the "deterministic" algorithm. But there is no disconnect between our expectations of the algorithm and its actual behavior either with or without failures. In all cases, there is an unlikely possibility of a false negative result.
There is an easy way to rescue the determinacy of the "deterministic" algorithm by making multiple copies of the cards. In general, b+1 copies are required to ensure no false negatives if there are b baños at the party. This trick works for both algorithms in fact, and for the probabilistic one, it will also reduce the probability of a missed shout, because any of the b+1 guests holding a copy of the card being searched can respond with the correct answer. In reality, it is not always easy to bound the number of simultaneous failures, so there remains, always, some possibility of a false negative being returned by either algorithm.
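The b+1 claim is easy to convince yourself of by brute force. This sketch (the numbers are made up for illustration) checks every possible combination of b simultaneous absences against b+1 card holders:

```python
from itertools import combinations

guests = list(range(10))   # ten guests at the party
b = 2                      # two banos -> at most two simultaneous absences
holders = {0, 4, 7}        # b + 1 = 3 guests each hold a copy of the card

# For every possible set of b simultaneously indisposed guests,
# at least one holder of the card is still in the room.
for absent in combinations(guests, b):
    assert holders - set(absent), "some combination of absences hid every copy"

print("b + 1 copies survive any", b, "simultaneous bathroom breaks")
```

The pigeonhole argument is the same at any scale: b absences can cover at most b of the b+1 holders. The hard part in reality, as noted above, is bounding b at all.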
Thinking of failures as being exceptional, rare occurrences often leads to inefficient, and sometimes incorrect, algorithms. For those of us designing systems that are intended to scale without bounds, it is a mental convenience we cannot afford. As the number of guests at the party continues to increase, the probability of an inconvenient leak becomes completely unexceptional. Paradoxically, many of the algorithms we end up using for these sorts of problems, like the simple shout and listen algorithm at the party, feel almost too primitive to be useful, and yet, they often are the most robust. To be successful in this arena, we must view failures as a normal, expected part of the problem definition.
We must plan to fail.
Friday, February 11, 2011
I recently posed the party problem while interviewing a candidate. There are many good solutions to the problem, and his approach was on target, if conventional. Typically, he pointed out, these problems are solved by forming the distributed processes, ahem, party guests, into a simple topology, like a ring. Most textbooks on distributed algorithms treat this case at some length because it lends itself nicely to inductive proofs. Of course, a fixed topology wasn't part of the original problem statement, so any algorithm that answers the original question, What word is on the card whose number is 50342?, would need to include a sub-algorithm to first organize the guests into a ring where each party-goer has exactly two neighbors.
Let's describe one such algorithm in more detail. I'd like to defer, until a later post, the part of the problem that allows guests to swap cards in any way they choose. We'll assume for now that each guest continues to hold the same deck of cards they are given when they enter the room. Our initial goal is to introduce each new guest to two others, whom we'll call the R-neighbor and the L-neighbor. A guest's R-neighbor is the one person in the room who has the smallest ID that is larger than his own. Similarly, the L-neighbor is the one who has the largest ID that is smaller than his. The only exception is where the tail of the ring joins the head, at which point the guest with the largest ID in the entire room has as his R-neighbor the guest with the smallest ID, and vice versa. Note, this is a logical ring, not a physical one. If we don't allow the guests to eat and drink, the algorithm will surely fail.
Organizing the guests into such a logical ring isn't too difficult. We could say each person, when she enters the room, chooses a four-digit ID number that is unlikely to be chosen by anyone else and writes it down. (Notice how randomness enters the story, literally, at the front door.) If the guest is the first to arrive, there's nothing more to do; she forms a degenerate ring of one and is her own R-neighbor and L-neighbor. Subsequent guests each choose their own, probably unique, ID. Each one then chooses some other person in the room, it doesn't matter who, and hands them a message that says, My ID is XXXX. Please find my place in the ring. Upon receiving such a message, written on a slip of paper, a guest checks whether the new guest's ID falls between his own ID and the ID of his R-neighbor. If not, he passes the message on to his R-neighbor, who does exactly the same thing. If the new ID does fall between the two, the new guest is inserted into the ring, which is done by sending messages to the new guest and to the original R-neighbor, introducing them to each other and informing them of the new neighborhood arrangement.
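Here's a minimal Python sketch of this insertion phase (the names Guest and insert are mine, invented for illustration). Each guest keeps references to his two neighbors, and a newcomer's slip of paper is forwarded around the ring until the right gap, including the seam where the largest ID wraps to the smallest, is found:

```python
import random

class Guest:
    def __init__(self, gid):
        self.gid = gid
        self.r = self    # a party of one: she is her own R- and L-neighbor
        self.l = self

def insert(anyone, newcomer):
    """Forward the newcomer's slip around the ring until its place is found."""
    g = anyone
    while True:
        nxt = g.r
        between = g.gid < newcomer.gid < nxt.gid
        # the seam, where the largest ID wraps around to the smallest
        seam = nxt.gid <= g.gid and (newcomer.gid > g.gid or newcomer.gid < nxt.gid)
        if between or seam or g is nxt:
            break
        g = nxt
    newcomer.r, newcomer.l = nxt, g    # splice the newcomer in between g and nxt
    g.r = newcomer
    nxt.l = newcomer

random.seed(1)
ids = random.sample(range(1000, 10000), 8)   # eight distinct four-digit IDs
guests = [Guest(ids[0])]
for gid in ids[1:]:
    g = Guest(gid)
    insert(random.choice(guests), g)   # hand the slip to anyone in the room
    guests.append(g)

# Walking R-neighbors from the smallest ID visits the IDs in sorted order.
start = min(guests, key=lambda x: x.gid)
seen, g = [], start
while True:
    seen.append(g.gid)
    g = g.r
    if g is start:
        break
assert seen == sorted(ids)
print(seen)
```

Note that the walk touches only one guest at a time and never consults any global view of the party; each step uses purely local knowledge, which is the whole point.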
Once the logical ring is in place, the original question can be answered with each guest passing slips of paper to one of only two other guests, its neighbors in the ring. Here's one such algorithm. Upon receiving a message, each guest performs the following sequence of actions (remember, we're looking for a symmetric solution, so every guest must behave the same way), starting with whoever receives the original note from our host.
- If there isn't already an ID at the top of the message, write my own ID there.
- If there is already an ID, and it's mine, we're finished. Hand the answer back to the host and skip the remaining steps.
- Search through my deck of cards looking for the number 50342.
- If I find a card with that number on it, write the associated word on the slip of paper; this is the answer.
- Pass the message on to my R-neighbor.
If we discount the possibility that two guests might choose the same ID at the front door, this algorithm is deterministic, both for the positive and negative cases. That is, if the card in question exists in any guest's deck, the algorithm will discover it and answer the question, and if there is no card, the algorithm will report that it doesn't exist. This particular version of the algorithm will provide both answers after sending exactly n+2 messages (counting the two in which the host is involved), where n is the total number of guests at the party. There are various ways to reduce the total number of messages required, but in all cases the number of messages needed will grow as the number of guests increases. Still, this is a pretty darn good algorithm that behaves just the way we like our computer software to behave — deterministically.
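The five steps above can be simulated directly. This sketch (the decks and ring order are invented for illustration) walks a note around a three-guest ring and counts messages, confirming the n + 2 total for both the positive and negative cases:

```python
# Each guest's deck maps card numbers to words; list order is ring order.
decks = [
    {11111: "apple", 50342: "zebra"},
    {22222: "mango"},
    {33333: "peach", 44444: "grape"},
]
n = len(decks)

def ring_lookup(target, start=0):
    messages = 1                      # the host hands the note to a guest
    note = {"top_id": None, "answer": None}
    i = start
    while True:
        if note["top_id"] is None:
            note["top_id"] = i        # step 1: write my ID at the top
        elif note["top_id"] == i:
            return note["answer"], messages + 1   # step 2: back to the host
        if target in decks[i]:
            note["answer"] = decks[i][target]     # steps 3-4: search my deck
        messages += 1                 # step 5: pass to my R-neighbor
        i = (i + 1) % n

word, msgs = ring_lookup(50342)
assert (word, msgs) == ("zebra", n + 2)     # exactly n + 2 messages, as claimed
assert ring_lookup(99999) == (None, n + 2)  # the negative case costs the same
print(word, msgs)
```

Either way the note makes one full lap, so the message count grows linearly with the number of guests, which is the scaling cost noted above.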
Just for grins, let's also describe a probabilistic algorithm that solves the same problem given the same constraints. This one doesn't presuppose any particular topology, so we can skip the organizational first phase. Instead, we proceed as follows. When the host hands the slip of paper with the question on it to a random guest, he shouts to everybody in the room,
"Attention! What word is on the card whose number is 50342?"
Upon receiving this message, each guest searches through his deck of cards and, if he finds the number 50342, writes the associated word on a slip of paper and hands it to the guest who shouted the question, who then passes the answer along to the host. If no one answers within some reasonable amount of time, we assume the card does not exist and send a message back to the host to that effect.
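For comparison, shout-and-listen fits in a few lines. The hears_shout hook is my own invention for modeling a guest who misses the shout, which produces exactly the false negative the algorithm is prone to:

```python
def shout_lookup(decks, target, hears_shout=lambda guest: True):
    """Host -> guest -> shout -> (maybe) a reply -> host; count the messages."""
    messages = 1                      # host hands the question to some guest
    messages += 1                     # that guest shouts to the whole room
    for guest, deck in enumerate(decks):
        if hears_shout(guest) and target in deck:
            messages += 1             # the holder hands the word to the shouter
            messages += 1             # the shouter passes it along to the host
            return deck[target], messages
    messages += 1                     # timeout: report "not found" to the host
    return None, messages

decks = [{11111: "apple"}, {50342: "zebra"}, {33333: "peach"}]

assert shout_lookup(decks, 50342) == ("zebra", 4)   # positive case: 4 messages
assert shout_lookup(decks, 99999) == (None, 3)      # negative case: 3 messages

# A missed shout is indistinguishable from a genuinely missing card.
out_of_room = lambda guest: guest != 1              # guest 1 is in the loo
assert shout_lookup(decks, 50342, out_of_room) == (None, 3)  # false negative!
print("shout-and-listen checks out")
```

Notice that the message counts never mention the number of guests at all; that constant cost is the scaling advantage discussed next.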
Which approach is "better"? This second algorithm is better in a number of regards. It's certainly a lot simpler. It's also more efficient, requiring only four messages in the positive case -- the original question from the host, the shout, the reply from the guest holding the card, and the reply to the host -- and only three messages in the negative case. It also scales better, since neither of these performance metrics grows as the number of guests increases.
But the probabilistic algorithm has a major drawback that would prevent its use by most any red-blooded programmer: it is, well, probabilistic. There is some chance, however small, that the guest holding the requested card does not hear the shout and therefore fails to answer. The shouting guest, in that case, will wait a respectable period of time and then declare, the card doesn't exist. To be precise, he can only truthfully claim that he doesn't know whether it exists or not, since the negative case and the failure case are indistinguishable.
Another drawback of the latter algorithm, also related to its probabilistic nature, is in choosing the right "reasonable amount of time" to wait before declaring the card can't be found. If the chosen timeout is too short, some of the guests may not have enough time to search through their card decks to find the card. If it's too long, the nominal performance of the negative case suffers. Yes, we could have the guest repeat the shout after a short timeout in order to increase the chance that everybody hears it and to give everyone a longer period in which to search their individual card decks. However, there will always be some remaining possibility of a false negative.
Does that mean the probabilistic algorithm is fatally flawed and should be avoided in all cases? More on that later...
Friday, January 28, 2011
In this paper, we show the surprising result that no completely asynchronous consensus protocol can tolerate even a single unannounced process death. We do not consider Byzantine failures, and we assume that the message system is reliable -- it delivers all messages correctly and exactly once. Nevertheless, even with these assumptions, the stopping of a single process at an inopportune time can cause any distributed commit protocol to fail to reach agreement. Thus, this important problem has no robust solution without further assumptions about the computing environment or still greater restrictions on the kind of failures to be tolerated!
Wednesday, January 26, 2011
The first time I encountered this was with a canonical problem in distributed fault tolerance, called the Byzantine Agreement problem. Don't Google it! There's been so much written and so many results about the topic over the last thirty years that you'll never return if you vector off now. Instead, let me present a concrete instance of the general problem in order to talk about why randomness is such a useful, and in some cases essential, part of the solution.
Suppose there are four computerized weather stations distributed about a certain city. Each station has, among many other sensors, a light detector, and the stations can communicate with each other via whatever protocol we wish to devise. Every hour, on the hour, each station samples its own light sensor and makes an independent judgement about whether it is "daytime" or "nighttime." At noon and at midnight, it is highly likely that all stations will make the same assessment about whether it's day or night. But at dusk and dawn, local variations in light and sensor sensitivity will probably cause the stations to arrive at different conclusions about which is the best classification. Our job is to devise an algorithm and a protocol that allows the stations to all agree on a single answer, and that has two very reasonable properties: 1) if all the stations individually come to the same conclusion about whether it's day or night, then the group must agree on that answer, and 2) even if they don't start out with the same answer initially, they must still agree on some value (it doesn't matter which).
But wait, there's no villain in this story yet. Suppose one of the stations doesn't follow the rules. It can say anything at any time to any other station, without constraint. Maybe it's faulty, or maybe it's nefarious. Crazy or evil -- it doesn't matter. We may as well assume the bad guy is a malicious spy who knows everything that has happened and uses that information to thwart our little plot and prevent the remaining stations from agreeing whether it's day or night. Now there's a worthy antagonist.
There is a simple, but subtly wrong, deterministic algorithm that attempts to let the three non-faulty stations reach agreement at all times of the day or night, even with the omniscient meddling of the faulty one. It goes like this: Each station sends its initial value to every other station. After collecting all four values (including its own), each station determines whether three of the four have the same value. If so, the station takes this value as its final answer. If it's a tie, the station picks a predetermined default, say "daytime," and the stations exchange values in another round of communication. The algorithm terminates when all the correct stations have chosen the (same) final answer.
You can see that, if it's noon or midnight and all three non-faulty stations report the same value, there is little the nefarious foe can do to confuse this algorithm; they will all see at least three identical answers and immediately choose it. There is danger at dusk or dawn, however. We're still okay if two of the non-faulty stations think it's day and the other one thinks it's night. In this case, there's no way the nefarious foe can influence the vote so that "nighttime" is chosen by any of the non-faulty stations. They will all pick the default value, "daytime," after the first round and confirm it as the agreed value on the second round.
The remaining case is vulnerable to attack. When two of the non-faulty stations initially think it's night and the other thinks it's day, the nefarious foe, in full omniscient glory, can whisper "nighttime" to the nighttime stations and "daytime" to the daytime station. The nighttime stations will therefore continue to vote nighttime, because they see three votes for that value, while the minority station will continue to choose the default value, "daytime." Thus, the faulty foe can ensure the stations never reach agreement on any value.
Surprisingly, this simple but incorrect deterministic algorithm can easily be converted to a simple and correct probabilistic one, just by giving each station a fair coin to flip. Instead of picking a predetermined default value (which is subject to being discovered and exploited by the evil one), each station flips its coin in the event there are not three votes for the same value. Heads, it's daytime, tails it's night. If the non-faulty stations do not unanimously agree, The Faulty One can still influence one of the two cases, as described above, but since it cannot predict the value of the coin flip (that's what random means) there's no way to tell which is the vulnerable case. There's only a 50% chance our malicious adversary will be able to thwart agreement on any given round, so if we just keep repeating the algorithm round after round, the stations will eventually reach agreement, despite the best efforts of the know-it-all evil station.
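Here's a small simulation of the coin-flip fix, under the assumption that the adversary plays its strongest strategy: whispering to each station whatever reinforces that station's current vote. The same harness also shows the predetermined-default version pinned in disagreement forever (the function name agree and the harness are mine, not from any paper):

```python
import random

def agree(initial, rng, max_rounds=10_000):
    """Three honest stations vote; the adversary whispers to each station
    whatever reinforces its current vote; 2-2 ties are settled by coin flip."""
    values = list(initial)
    for rounds in range(max_rounds + 1):
        if len(set(values)) == 1:
            return values[0], rounds
        new = []
        for v in values:
            votes = values + [v]      # three honest votes + the whisper
            maj = [x for x in ("day", "night") if votes.count(x) >= 3]
            new.append(maj[0] if maj else rng.choice(["day", "night"]))
        values = new
    raise RuntimeError("astronomically unlucky run")

rng = random.Random(2011)

# If the honest stations already agree, the adversary is powerless.
assert agree(["day", "day", "day"], rng) == ("day", 0)

# A split vote converges after a handful of unpredictable coin flips.
value, rounds = agree(["night", "night", "day"], rng)
print(value, rounds)

# The predetermined-default ("daytime") version, by contrast, stays split
# forever under the same whispering attack described above.
values = ["night", "night", "day"]
for _ in range(100):
    values = ["night" if (values + [v]).count("night") >= 3 else "day"
              for v in values]
assert len(set(values)) == 2          # still no agreement after 100 rounds
```

Each round of the randomized version succeeds with probability at least one half, so the expected number of rounds is small even though no fixed bound can be promised.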
Probably. We have a choice with this probabilistic algorithm. We can either stop it after a fixed number of rounds, and accept the (tunably small) possibility that agreement has not yet been reached, or we can run it until agreement is reached, and accept the (tunably small) possibility that it takes so long we no longer care whether it's night or day. This exact trade-off is typical of distributed, probabilistic algorithms.
To be fair, there is a simple deterministic algorithm that solves this problem (see the seminal 1980 paper by Pease, Shostak, and Lamport), but it depends on a subtle condition that doesn't always exist in real-life distributed systems -- synchronous communication. Another famous result in this research area is that of Fischer, Lynch, and Paterson, which proves no deterministic agreement algorithm can exist unless we can give, with certainty, an upper bound on how long it will take messages to be delivered between processes (weather stations in this case). And yet, there are many good probabilistic solutions to this important practical problem!
Back in 1987, a fellow graduate student at the University of Texas at Austin, Russell Turpin, and I were struck by this result. So much so that we spent several months reading and analyzing the plethora of research on probabilistic agreement and writing a survey paper on the topic. Now, twenty-five years later, Dr. Turpin and I are still friends and colleagues, and we expect our interest and passion for distributed solutions to continue for many more years to come.