Friday, April 15, 2011

Cloudy with a Chance of FUD

I can predict the future. It just takes me a long time.

It's becoming clear to pretty much everyone now that the not-so-distant future of IT infrastructure will include something called a cloud. The exact meaning of that term is currently being, um, remodeled by established companies who want to co-opt it to refer to all sorts of crazy things. But to me, the term cloud has a simple, specific meaning and a well-defined time of origin. It is essentially the set of storage and compute web services offered by Amazon's Web Services division starting in March of 2006 under the names Simple Storage Service (S3) and Elastic Compute Cloud (EC2). From the outside, the cloud looks like an infinite-capacity computer and disk drive with a REST interface that allows one to store data and programs and to execute programs. Think of a computer with a processor, a disk drive, and a network, then explode the whole thing so it vaporizes into a dispersed cloud of resources. The processor becomes EC2, the disk is S3, and the network is the Internet itself.

But that's not much of a prognostication. We are already there, functionally if not yet at scale. Many small Web 2.0 companies use the Amazon cloud, or its major competitor Rackspace, exclusively. Soon, SMBs of all sorts will be doing the same. They don't buy racks of servers or network gear or manage power and bandwidth costs, they simply rent IT infrastructure from the cloud and focus on their core businesses. Netflix, the world's leading provider of Internet subscription videos, has recently ported most of its enormous application to the Amazon cloud and is now Amazon's biggest customer, despite the fact they are also competitors.

This trend toward commoditization of IT infrastructure will likely continue until we reach a point where computation and storage become utilities like Internet connectivity already is. The only important difference between Infrastructure as a Service (IaaS) providers will be price. Those providers will still buy plenty of equipment and software from Dell, HP, EMC, Oracle, etc., but the SMB market will evaporate for those infrastructure vendors. Furthermore, complex, high margin hardware, like RAID controllers, NAS, and distributed file systems, will be a thing of the past. Clever, fault-tolerant distributed software like CAStor (if I do say so myself) will allow data centers to be built with simpler, cheaper hardware, much like Amazon, Google, and Facebook build them today. Computation and storage, like network connectivity before them, will become information utilities. IaaS vendors will sell megabyte-hours and CPU minutes the same way electric utilities sell kilowatt-hours. Larger enterprise customers will still have a need for "private" data centers, but even those will be API-compatible with the cloud, making it effortless for applications to execute and store into local clouds or the Intercloud. Enterprises will either choose to under-provision their data centers and buy from the cloud during peak loads, or over-provision and sell excess capacity to the cloud, the same way private electric generation facilities today sell power back to the grid.

Who will the cloud vendors, the IaaS providers, be? Probably not an online book seller or a hosting provider. It's not likely even to be one of the big seven infrastructure hardware/software companies around today, Dell, HP, EMC, NetApp, Oracle, HDS, or IBM. Remember when AOL and CompuServe were going to own the Internet? It didn't happen, and today the Internet lives with the big Telcos, AT&T, Verizon, Vodafone, Time-Warner, etc. That's because they are utilities already. They know how to run digital services with razor-thin margins, they have high-volume billing and support organizations, and they already own the business accounts. Telcos are slow but persistent, and they will eventually own the cloud too.

For those of us who design software for a living, the high value, high margin applications will execute in the compute cloud and manage content flowing into and out of the content cloud. I expect such applications to fall into five categories. Production software will aid content authors, whether the content is entertainment video or corporate email, in preparing and submitting content objects to the cloud. On the other end, distribution applications, like YouTube, Netflix, eBay, and Amazon-the-etailer, will help content consumers find and view/read/play content objects. In between, there are analytics applications to discover trends and measure consumption rates, and monetization applications that use the analytics to make money for the cloud providers and/or content creators. Finally, there are promotional and marketing applications, like search, that help consumers find relevant content and, not incidentally, influence them to make more profitable choices. That's right, I said search is a promotional tool.

Monetization of stored content will become more of a predictable science, based in part on content value (which is mostly, but not entirely, production cost) and content volume, the number of consumers of a particular piece of content.

Although some of the words I'm using here come from the media and entertainment space, "production", "distribution", "promotion", etc., this model really applies to all content in the cloud, including corporate data, personal backups, and so on. When I write an email, for example, it doesn't cost very much to produce and the intended audience is no more than a few people, usually. This is an example of low value, low volume content, and the best way to monetize its storage, archival, and management in the cloud is to charge the person to whom it is most important, namely the producer or the producer's company. In the high value, high volume quadrant is the more traditional M&E content like full-length movies, books, and musical compilations. Consumers typically pay for this kind of content and the revenue is shared between the cloud vendor and the content provider. The low value, high volume quadrant includes typical short form videos on YouTube and Hulu. It didn't cost very much to produce "Charlie bit my finger" or "Double rainbow guy," but such content is viewed by millions of people and could be monetized most effectively with advertising, again sharing the revenue with the producers.

I'm pretty confident these things will come to pass eventually. How long will it take? Where will we be in, say, 2015? I can answer those questions precisely and with great confidence. Just give me a few years to think about it.

Thursday, March 17, 2011

Visualizing Replication

Here's the second in my video tutorial series.  This one explains how CAStor replicates objects in order to protect against hardware failures and ensure data integrity.

Wednesday, March 16, 2011

Satnam and Other Dangers

The idea of a True Name for something or someone is very old. A true name captures the essence, the intrinsic identity of something in a short moniker that is distinguished from every other name. For centuries, mankind has recognized the fascinating duality that one's true name is both a treasure and a vulnerability.

Religion has long been obsessed with the true name of God. The ancient Jews revered it and believed it had such power they forbade its utterance by ordinary people and spoke it officially once every seven years. Sikhs have a word for the concept, satnam. The true name of true names I guess.

The concept of true name is ancient and pervasive in the secular world too. Rumpelstiltskin cherished his so much he sang songs about it, then tore himself in half when it was discovered. One of my favorite science fiction stories is actually titled True Names. Written by the mathematician Vernor Vinge in 1981, the novella explored the promise and threat of one's true name in cyberspace, and incidentally invented the cyberpunk genre which was later fleshed out by William Gibson, Neal Stephenson, Bruce Sterling and others. The title of this post is borrowed from Vinge's first anthology.

Back here in the cyberpresent, names for persistent objects in computer technology have an unfortunately muddled story. Traditional file systems conflate the notions of name and containment for files. We say that a file is "in a directory" or "contained in a folder" when really we're just speaking about a prefix of its name. This leads to some weird, but by now familiar, holes in the metaphor. When you "move" a file from one folder to another, usually the file's contents aren't moved at all; the file is simply renamed. In Unix-based file systems, the "hard link" mechanism allows a single file to have more than one name. This metaphorical blemish of having to explain that an object can be in two places at once is often gotten around by pretending the hard links are distinctly different kinds of names from the originals, so a file has one definite location and perhaps some other, lesser references to it from other locations. This notion is patently false; all names are created equal in this case and if you delete the "original" hard link the others will continue to work as usual. The fact is, files can have different names in different contexts. But do they have a satnam that is the same in every context?

Sort of. In traditional Unix/Linux file systems, a file's inode number is its true name. All other names are mapped to the inode by the file system's directory mechanism. The inode data structure itself contains information about where to find the various parts and pieces (blocks) that make up the file's data content. An inode number is a 32-bit or, in more recent releases, 64-bit binary persistent name for a file. Ideally, a satnam does not depend on context; it refers to the same thing everywhere and everywhen. But a 32-bit inode is not long enough to allow that since it accommodates only about 4 billion files. This may seem like a big number, but it isn't even close to being big enough to uniquely identify every computer file in existence. So the Unixen came up with the concept of a file system, which effectively scopes inodes so they make sense only within the limited context of a single file system, usually hosted on one server. In other words, inodes do not have global scope. Nor is 4 billion inodes enough even to uniquely identify all the files in a single file system, if we remember that files may be created and then deleted. Again, ideally each instance of a file would have a satnam that's different from every other instance that ever existed. But inodes do not have such universal scope either, since they must be reused. The newer 64-bit inodes do allow up to about 18 × 10¹⁸ distinct inodes, which helps us build larger file systems that span several servers, but the namespace is still not large enough to support universally scoped true names.
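The equal standing of hard links is easy to see in code. Here's a small Python sketch (written for a POSIX-style file system; file names are invented for the example) that gives one file two names, checks that both names resolve to the same inode, and shows the file surviving the deletion of its "original" name.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original.txt")
    alias = os.path.join(d, "alias.txt")

    with open(original, "w") as f:
        f.write("satnam?")
    os.link(original, alias)  # a second, equal name for the same file

    # Both names map to the same inode, the file's true name within
    # this file system.
    same_inode = os.stat(original).st_ino == os.stat(alias).st_ino

    # Deleting the "original" name leaves the file intact under its
    # other name; no name is more original than the rest.
    os.remove(original)
    with open(alias) as f:
        recovered = f.read()

print(same_inode, recovered)  # True satnam?
```

Note that the inode is only meaningful inside this one file system, which is exactly the scoping problem described above.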

Distributed storage systems that aspire to scale without limit must support an essentially infinite (that is, a very, very large finite) number of distinct objects, over the entire lifetime of, well, everything. Further, the degree of cooperation and agreement required to implement a coordinated distributed namespace is, to say the least, exorbitant, even impossible in most cases. Conventional, hierarchical naming schemes always require a centralized naming authority somewhere along the way (that's why the IANA exists). These types of distributed storage systems really do need a satnam, a universally-scoped name for each stored object.

My colleagues here at Caringo, Paul Carpentier, Jan Van Riel, and Tom Teugels, invented a process (US Patent 7,793,112) by which an intrinsic satnam can be derived from any binary object. This technique, the basis for Content Addressable Storage, or CAS, uses a cryptographic hash to digest the bits and spit out a 128-bit name. Depending on the strength of the hash algorithm, this can produce a universally unique identifier, a true name. The drawback is, smart hackers keep finding ways to defeat the crypto algorithms to produce completely different objects that have the same names.
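As a sketch of the general idea (not the patented process itself), here's how a 128-bit content-derived name might be computed in Python. MD5 is used here only because its digest happens to be 128 bits; its known collisions are exactly the kind of weakness described above.

```python
import hashlib

def content_name(data: bytes) -> str:
    """Derive a 128-bit name from nothing but the object's own bits."""
    return hashlib.md5(data).hexdigest()

a = content_name(b"the quick brown fox")
b = content_name(b"the quick brown fox")
c = content_name(b"the quick brown fox!")

# Same bits yield the same name, wherever and whenever it's computed;
# change one bit and the name is completely different.
print(a == b, a != c)  # True True
```

The name requires no naming authority at all: anyone holding the bits can compute it independently.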

At Caringo, we built a clustered, scalable storage system called Swarm using a different technique that doesn't suffer this drawback. We assign a satnam to an object when it is initially created, a 128-bit name that, with very high probability is different from any other satnam being used anywhere in the universe at any time, past, present, or future. This universally unique identifier is stored with the object's data bits and becomes part of the object as it moves around the world, gets replicated, and so on.
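A minimal sketch of the creation-time approach (illustrative only, not Swarm's actual implementation): draw a random 128-bit identifier when the object is created and carry it alongside the bits. Python's `uuid.uuid4` supplies 122 random bits, enough that a collision is astronomically unlikely.

```python
import uuid

class StoredObject:
    """Toy stored object: its identity is fixed at creation, not derived
    from its content, and travels with the object thereafter."""
    def __init__(self, data: bytes):
        self.satnam = uuid.uuid4().hex  # random 128-bit name
        self.data = data

a = StoredObject(b"same bits")
b = StoredObject(b"same bits")

# Identical contents, yet two distinct identities: two creations,
# two objects.
print(a.data == b.data, a.satnam != b.satnam)  # True True
```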

The differences between these two approaches suggest a rather deep philosophical question. Is it the case that one's fundamental identity is embodied in the sum of all his traits (the data bits) or is it the fact of his creation that identifies him? If I create a data object today here in Austin, and my great-great-great-granddaughter also creates a data object on Mars using precisely the same bits in the same order, are they the same object? Should they have the same satnam? Philosophically, I have no idea, but Caringo Swarm says, no. Objects created at different times or places are different objects, even if they appear otherwise to be the same. An object's satnam is universally scoped and intractable to guess, unless you have the object itself.

Which brings us back to Rumpelstiltskin. The only way to access an object stored in CAStor is to produce its satnam. If you wish to keep it confidential, then you simply do not ever divulge its true name to anyone you don't trust. Don't go singing it around the campfire.

Thursday, March 3, 2011


A couple months ago I wrote a useful tool to help us visualize the operation of a CAStor cluster in situ. It provides a bird's eye view of the nodes and communication among them by replaying a syslog file captured during an actual execution. We plan to use this tool for a number of different things, from marketing to tech support, even design validation. As a first application, I have begun using it as the basis for a series of tutorial videos explaining the inner workings of CAStor. Here's the first installment.

Check it out!

Chocolate Empathy

One reason it's so hard to design good distributed software is that programmers must, for lack of a less anthropomorphic term, empathize with multiple processes at once. A CAStor storage node, for example, is not omniscient. It knows only about its own state and enough about the global cluster to operate. Answering even very simple questions like, "How much available storage is there in the cluster?" involves communicating and agreeing with other nodes. From the programmers' perspective, we must imagine ourselves in the place of each of the individual nodes and model, in our heads, what is known and, more importantly sometimes, what cannot possibly be known by each node at each point in the algorithm.

This capability to maintain many different models of what others know, in addition to our own knowledge store, is an amazing feat of the human mind. And we do it all the time. Every conversation you have with another person relies on a gigantic store of knowledge about what she knows, whether she knows that you know she knows it, and so on. But this ability is definitely a higher function of the brain and does not come without some mental effort.

A famous psychological experiment performed by Wimmer and Perner in the '80s shows this mental ability does not mature until fairly late, four years old or so, in a child's life. The experiment that demonstrates this is elegant and simple. Children are told the following story. A little boy named Maxi and his mother return from a shopping trip on which they purchased a bar of chocolate. The mother places the chocolate in a blue cabinet and closes it. When Maxi goes outside to play, the mother moves the chocolate bar to another, green cabinet. When Maxi returns to the kitchen, he decides to have a bite of chocolate. Which cabinet, the children are asked, will Maxi look in to find the bar?

Before the age of three or four years, almost all children will say Maxi will look in the green cabinet, because that's where the mother put it! This proves they have not yet fully developed the ability to model someone else's mind or empathize with another, albeit fictional, person's knowledge base. It may also be the case that this mental ability is lost or stunted as we grow old (speaking strictly from personal experience).

So this stuff is hard. It's the difference between teaching someone to dance and choreographing a ballet. CS curricula, by and large, train us to be omniscient, sequential programmers. Perhaps that's why so many otherwise talented developers mistakenly believe every storage node should just know to look in the green cabinet.

Wednesday, February 16, 2011

Plan to Fail

During a particularly intense design session the other day, one of my more brilliant engineers and co-architect of CAStor asked what he thought to be a rhetorical question. If we can provide a deterministic algorithm that completely solves a problem except for failures, shouldn't we do that?

At the time, I was dumbfounded. The answer seems so obvious I had no reply. Of course we should do our best to push the failure cases into a corner, solve them separately, and keep the blue sky paths as true blue and predictable as possible. Now, after some careful and mostly sober consideration, I can revisit that moment and provide the answer I wish I had thought of at the time. That's because I have a blog. Not just anybody can get one of these, you know.

The answer I wish I had given is, not always.

We software engineers tend to think of failures as exceptions, bad things that don't happen very often. When one does happen, the failure injects asynchrony and non-determinacy into the problem whether we like it or not. One can never be sure of when it will happen or exactly how the failure will exhibit itself. We can easily allow ourselves to believe that failures are tricks the universe plays on us. But really, a failure is just a mismatch between our imagination and reality. If something we expect to happen doesn't happen, or something we expect to happen quickly takes too long, or something we expect to behave in a certain way starts acting up, we call that a failure. But that doesn't mean reality has failed us, it just means our imaginations are limited.

Whoa! Too philosophical. Let's return to the cocktail party.

In a previous post, I described two different algorithms for answering questions like the one posed in the original scenario. We observed that the probabilistic one was better than the deterministic one in several ways, but it had a very serious drawback; it doesn't always give the correct answer! Now let's see what happens when we inject something that might be called a failure into the story. Every party I've ever attended had a bathroom for the guests' convenience. Regularly, but at unpredictable times, a guest will simply disappear for a few moments and be unavailable to send or receive messages with the other guests. This models a failure because a given guest may, at any moment, go to (sheer force of will allows me to refrain from potty humor here) the loo. We can now examine how failures impact the two algorithms previously described.

First the deterministic algorithm. If the host makes a request to look up a number while one of the guests happens to be indisposed, and if we're unlucky enough so that guest happens to hold the card for which we're searching, then the algorithm cannot possibly return the correct answer. Worse than that, since the ring has a temporary hole in it, the algorithm cannot even complete and it will hang. If there is only one water closet, we can overcome this problem by having each guest remember their nearest two neighbors on each side, allowing the algorithm to "hop over" any failure. If there are two bathrooms, raising the possibility of two simultaneous failures, there will need to be some sort of ring repair sub-algorithm that gets invoked whenever a failure is recognized. I'll leave the design of such a repair strategy to the reader, but suffice to say that the resulting overall algorithm will be even more complex in order to handle the exceptional failure cases. Even if there is only one copy of each card at the party (we'll talk about redundancy in a minute), this new algorithm will still be deterministic except for failures. Looking at things more holistically, however, it is clear that an inopportune call of nature can cause the "deterministic" algorithm to give the wrong answer, just like the probabilistic one.

Speaking of the probabilistic algorithm, what do we need to modify to accommodate the possibility of unpredictable bathroom breaks? The answer is, absolutely nothing. This algorithm is inherently potty tolerant. Assuming one copy of each card, it will still report a false negative result if we are unlucky enough that the person holding the exact card for which we are currently searching happens to be taking a break, just like the "deterministic" algorithm. But there is no disconnect between our expectations of the algorithm and its actual behavior either with or without failures. In all cases, there is an unlikely possibility of a false negative result.

There is an easy way to rescue the determinacy of the "deterministic" algorithm by making multiple copies of the cards. In general, b+1 copies are required to ensure no false negatives if there are b baños at the party. This trick works for both algorithms in fact, and for the probabilistic one, it will also reduce the probability of a missed shout, because any of the b+1 guests holding a copy of the card being searched can respond with the correct answer. In reality, it is not always easy to bound the number of simultaneous failures, so there remains, always, some possibility of a false negative being returned by either algorithm.
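Assuming guests disappear uniformly at random, the chance of a false negative can be worked out with a little counting. This hypothetical sketch (party sizes and replica counts are invented for illustration) computes the probability that every holder of a card's copies is indisposed at the same moment:

```python
from math import comb

def p_false_negative(n: int, copies: int, absent: int) -> float:
    """Probability that all `copies` holders of a card are among the
    `absent` guests, out of `n` total, under uniformly random absences."""
    if absent < copies:
        return 0.0  # b banos, b+1 copies: someone always answers
    return comb(n - copies, absent - copies) / comb(n, absent)

# A hypothetical party of 100 guests with 2 bathrooms:
print(p_false_negative(100, copies=1, absent=2))  # 0.02
print(p_false_negative(100, copies=3, absent=2))  # 0.0
```

With a single copy there's a two percent chance of a wrong answer; with b+1 copies the risk vanishes, exactly as argued above.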

Thinking of failures as being exceptional, rare occurrences often leads to inefficient, and sometimes incorrect, algorithms. For those of us designing systems that are intended to scale without bounds, it is a mental convenience we cannot afford. As the number of guests at the party continues to increase, the probability of an inconvenient leak becomes completely unexceptional. Paradoxically, many of the algorithms we end up using for these sorts of problems, like the simple shout and listen algorithm at the party, feel almost too primitive to be useful, and yet, they often are the most robust. To be successful in this arena, we must view failures as a normal, expected part of the problem definition.

We must plan to fail.

Friday, February 11, 2011

Back to the Party

I recently posed the party problem while interviewing a candidate. There are many good solutions to the problem, and his approach was on target, if conventional. Typically, he pointed out, these problems are solved by forming the distributed processes, ahem, party guests, into a simple topology, like a ring. Most textbooks on distributed algorithms treat this case at some length because it lends itself nicely to inductive proofs. Of course, a fixed topology wasn't part of the original problem statement, so any algorithm that answers the original question, What word is on the card whose number is 50342?, would need to include a sub-algorithm to first organize the guests into a ring where each party-goer has exactly two neighbors.

Let's describe one such algorithm in more detail. I'd like to defer, until a later post, the part of the problem that allows guests to swap cards in any way they choose. We'll assume for now that each guest continues to hold the same deck of cards they are given when they enter the room. Our initial goal is to introduce each new guest to two others, whom we'll call the R-neighbor and the L-neighbor. A guest's R-neighbor is the one person in the room who has the smallest ID that is larger than his own. Similarly, the L-neighbor is the one who has the largest ID that is smaller than his. The only exception is where the tail of the ring joins the head, at which point the guest with the largest ID in the entire room has as his R-neighbor the guest with the smallest ID, and vice versa. Note, this is a logical ring, not a physical one. If we don't allow the guests to eat and drink, the algorithm will surely fail.

Organizing the guests into such a logical ring isn't too difficult. We could say each person, when she enters the room, chooses a four-digit ID number that is unlikely to be chosen by anyone else and writes it down. (Notice how randomness enters the story, literally, at the front door.) If the guest is the first to arrive, there's nothing more to do; she comprises a degenerate ring of one and is her own R-neighbor and L-neighbor. Subsequent guests all choose their own, probably unique, IDs. Each one then chooses some other person in the room, it doesn't matter who, and hands them a message that says, My ID is XXXX. Please find my place in the ring. Upon receiving such a message, written on a slip of paper, a guest checks to see whether the new guest's ID falls between his own ID and the ID of his R-neighbor. If not, he passes the message on to his R-neighbor, who does exactly the same thing. If the new ID does fall in that interval, the new guest is inserted into the ring, which is done by sending messages to the new guest and to the original R-neighbor, introducing them to each other and informing them of the new neighborhood arrangement.
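Here's one way the insertion walk might look in code, a sketch with the ring as linked `Guest` objects and the message passing collapsed into an ordinary loop (the class and function names are mine, not part of any real system):

```python
import random

class Guest:
    """A party guest in the logical ring; she knows only her two neighbors."""
    def __init__(self, gid):
        self.gid = gid
        self.r = self  # the first arrival is her own R- and L-neighbor
        self.l = self

def insert(member, newcomer):
    """Pass the "find my place" note around the ring until the gap is found."""
    node = member
    while True:
        nxt = node.r
        at_seam = node.gid >= nxt.gid  # where the largest ID joins the smallest
        fits = node.gid < newcomer.gid < nxt.gid
        if fits or (at_seam and (newcomer.gid > node.gid or newcomer.gid < nxt.gid)):
            newcomer.r, newcomer.l = nxt, node
            node.r, nxt.l = newcomer, newcomer
            return
        node = nxt  # not my gap; forward the note to my R-neighbor

first = Guest(500)  # a degenerate ring of one
for gid in random.sample(range(1000), 20):
    if gid != 500:
        insert(first, Guest(gid))

# Walking once around the ring visits the IDs in circular sorted order.
walk, node = [], first
while True:
    walk.append(node.gid)
    node = node.r
    if node is first:
        break
start = walk.index(min(walk))
print(walk[start:] + walk[:start] == sorted(walk))  # True
```

The seam test is the only subtlety: the guest with the largest ID must recognize that both larger-than-everyone and smaller-than-everyone newcomers belong in her gap.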

Once the logical ring is in place, the original question can be answered with each guest passing slips of paper to one of only two other guests, its neighbors in the ring. Here's one such algorithm. Upon receiving a message, each guest performs the following sequence of actions (remember, we're looking for a symmetric solution, so every guest must behave the same way) starting with whomever receives the original note from our host.

  1. If there isn't already an ID at the top of the message, write my own ID there.
  2. If there is already an ID, and it's mine, we're finished. Hand the answer back to the host and skip the remaining steps.
  3. Search through my deck of cards looking for the number 50342.
  4. If I find a card with that number on it, write the associated word on the slip of paper; this is the answer.
  5. Pass the message on to my R-neighbor.

If we discount the possibility that two guests might choose the same ID at the front door, this algorithm is deterministic, both for the positive and negative cases. That is, if the card in question exists in any guest's deck, the algorithm will discover it and answer the question, and if there is no card, the algorithm will report that it doesn't exist. This particular version of the algorithm will provide both answers after sending exactly n+2 messages (counting the two in which the host is involved), where n is the total number of guests at the party. There are various ways to reduce the total number of messages required, but in all cases the number of messages needed will grow as the number of guests increases. Still, this is a pretty darn good algorithm that behaves just the way we like our computer software to behave — deterministically.
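A toy simulation makes the message count concrete. The card numbers and words here are hypothetical, and the ring is modeled as a plain Python list where index i's R-neighbor is index (i + 1) mod n:

```python
import random

n = 10
decks = [dict() for _ in range(n)]  # guest i's deck: number -> word
for number, word in [(50342, "zymurgy"), (17, "aardvark"), (999, "quux")]:
    decks[random.randrange(n)][number] = word

def ring_search(decks, number):
    """Token-passing lookup; returns (answer, total messages sent)."""
    n = len(decks)
    start = random.randrange(n)  # the host hands the note to any guest
    messages = 1                 # host -> first guest
    answer, i = None, start
    while True:
        if number in decks[i]:
            answer = decks[i][number]  # write the word on the slip
        i = (i + 1) % n          # pass the note to my R-neighbor
        messages += 1
        if i == start:           # my own ID is on the note: we're done
            break
    return answer, messages + 1  # final hop: hand the slip back to the host

print(ring_search(decks, 50342))  # ('zymurgy', 12): always n + 2 messages
print(ring_search(decks, 11111))  # (None, 12): same cost to say "no such card"
```

Both the positive and the negative answer cost exactly n + 2 messages, and that cost grows with every additional guest.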

Just for grins, let's also describe a probabilistic algorithm that solves the same problem given the same constraints. This one doesn't presuppose any particular topology, so we can skip the organizational first phase. Instead, we proceed as follows. When the host hands the slip of paper with the question on it to a random guest, he shouts to everybody in the room,

"Attention! What word is on the card whose number is 50342?"

Upon receiving this message, each guest searches through his deck of cards and, if he finds the number 50342, writes the associated word on a slip of paper and hands it to the guest who shouted the question, who then passes the answer along to the host. If no one answers within some reasonable amount of time, we assume the card does not exist and send a message back to the host to that effect.
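In the same spirit, here's a hedged toy model of shout-and-listen (again with invented card values; the `hears` parameter is my own device for staging a missed shout):

```python
import random

def shout_search(decks, number, hears=lambda guest: True):
    """Shout-and-listen lookup; returns (answer, total messages sent)."""
    n = len(decks)
    messages = 1                     # host -> the guest who will shout
    messages += 1                    # one shout reaches the whole room at once
    answer = None
    for g in range(n):
        if hears(g) and number in decks[g]:
            answer = decks[g][number]
            messages += 1            # cardholder hands a slip to the shouter
            break
    return answer, messages + 1      # shouter -> host (the answer, or a timeout)

decks = [dict() for _ in range(1000)]  # a much bigger party changes nothing
decks[123][50342] = "zymurgy"

print(shout_search(decks, 50342))  # ('zymurgy', 4)
print(shout_search(decks, 99999))  # (None, 3): nobody answered; assume no card
# If guest 123 misses the shout, the false negative is indistinguishable
# from "no such card":
print(shout_search(decks, 50342, hears=lambda g: g != 123))  # (None, 3)
```

Four messages in the positive case, three in the negative, no matter how large the party grows.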

Which approach is "better"? This second algorithm is better in a number of regards. It's certainly a lot simpler. It's also more efficient, requiring only four messages in the positive case — the original question from the host, the shout, the reply from the guest holding the card, and the reply to the host — and only three messages in the negative case. It also scales better, since neither of these performance metrics grows as the number of guests increases.

But the probabilistic algorithm has a major drawback that would prevent its use by most any red-blooded programmer: it is, well, probabilistic. There is some chance, however small, that the guest holding the requested card does not hear the shout and therefore fails to answer. The shouting guest, in that case, will wait a respectable period of time and then declare, the card doesn't exist. To be precise, he can only truthfully claim that he doesn't know whether it exists or not, since the negative case and the failure case are indistinguishable.

Another drawback of the latter algorithm, also related to its probabilistic nature, is in choosing the right "reasonable amount of time" to wait before declaring the card can't be found. If the chosen timeout is too short, some of the guests may not have enough time to search through their card decks to find the card. If it's too long, the nominal performance of the negative case suffers. Yes, we could have the guest repeat the shout after a short timeout in order to increase the chance that everybody hears it and to give everyone a longer period in which to search their individual card decks. However, there will always be some remaining possibility of a false negative.

Does that mean the probabilistic algorithm is fatally flawed and should be avoided in all cases? More on that later...

Friday, January 28, 2011

Impossibility Result for Asynchronous Consensus

From Fischer, Lynch, Paterson, "Impossibility of Distributed Consensus with One Faulty Process"

In this paper, we show the surprising result that no completely asynchronous consensus protocol can tolerate even a single unannounced process death. We do not consider Byzantine failures, and we assume that the message system is reliable: it delivers all messages correctly and exactly once. Nevertheless, even with these assumptions, the stopping of a single process at an inopportune time can cause any distributed commit protocol to fail to reach agreement. Thus, this important problem has no robust solution without further assumptions about the computing environment or still greater restrictions on the kind of failures to be tolerated!

Wednesday, January 26, 2011

Provably Probable Is Better Than Probably Provable

Randomness and distributed algorithms are often found together, like movie stars and orphans. It's not just that distributed algorithms often exhibit stochastic behavior, which they do; randomness is also an essential ingredient of many distributed solutions. I'm not speaking loosely when I use the word "essential" here. There are many problems in the distributed arena that are provably impossible to solve deterministically, but if you give processes access to a fair coin and allow them to flip it occasionally, voila!, an efficient, reliable solution emerges that yields the correct behavior. Probably.

The first time I encountered this was with a canonical problem in distributed fault tolerance, called the Byzantine Agreement problem. Don't Google it! There's been so much written and so many results about the topic over the last thirty years that you'll never return if you vector off now. Instead, let me present a concrete instance of the general problem in order to talk about why randomness is such a useful, and in some cases essential, part of the solution.

Suppose there are four computerized weather stations distributed about a certain city. Each station has, among many other sensors, a light detector, and the stations can communicate with each other via whatever protocol we wish to devise. Every hour, on the hour, each station samples its own light sensor and makes an independent judgement about whether it is "daytime" or "nighttime." At noon and at midnight, it is highly likely that all stations will make the same assessment about whether it's day or night. But at dusk and dawn, local variations in light and sensor sensitivity will probably cause the stations to arrive at different conclusions about which is the best classification. Our job is to devise an algorithm and a protocol that allows the stations to all agree on a single answer, and that has two very reasonable properties: 1) if all the stations individually come to the same conclusion about whether it's day or night, then the group must agree on that answer, and 2) even if they don't start out with the same answer initially, they must still agree on some value (it doesn't matter which).

But wait, there's no villain in this story yet. One of the stations doesn't follow the rules. It can say anything at any time to any other station without constraint. Maybe it's faulty, or maybe it's nefarious. Crazy or evil -- it doesn't matter. We may as well assume the bad guy is a malicious spy who knows everything that has happened, using that information to thwart our little plot and prevent the remaining stations from agreeing whether it's day or night. Now there's a worthy antagonist.

There is a simple, but subtly wrong, deterministic algorithm that attempts to allow the three non-faulty stations to reach agreement at all times of the day or night, even with the omniscient meddling of the faulty one. It goes like this: Each station sends its initial value to every other station. After collecting all four values (including its own), each station checks whether at least three of the four responses have the same value. If so, the station takes that value as its final answer. If no value garners three votes, the station picks a predetermined default, say "daytime," and the stations exchange values in another round of communication. The algorithm terminates when all the correct stations have chosen the (same) final answer.

You can see that, if it's noon or midnight and all three non-faulty stations report the same value, there is little the nefarious foe can do to confuse this algorithm; they will all see at least three identical answers and immediately choose that value. There is danger at dusk or dawn, however. We're still okay if two of the non-faulty stations think it's day and the other one thinks it's night. In this case, there's no way the nefarious foe can influence the vote so that "nighttime" is chosen by any of the non-faulty stations. They will all pick the default value, "daytime," after the first round and confirm it as the agreed value on the second round.

The remaining case is vulnerable to attack. When two of the non-faulty stations initially think it's night and the other thinks it's day, the nefarious foe, in full omniscient glory, can whisper "nighttime" to the nighttime stations and "daytime" to the daytime station. The nighttime stations will therefore continue to vote nighttime, because they see three votes for that value, while the minority station will continue to choose the default value, "daytime." Thus, the faulty foe can ensure the stations never reach agreement on any value.

Surprisingly, this simple but incorrect deterministic algorithm can easily be converted to a simple and correct probabilistic one, just by giving each station a fair coin to flip. Instead of picking a predetermined default value (which is subject to being discovered and exploited by the evil one), each station flips its coin in the event there are not three votes for the same value. Heads, it's daytime; tails, it's night. If the non-faulty stations do not unanimously agree, The Faulty One can still influence one of the two cases, as described above, but since it cannot predict the value of the coin flip (that's what random means), there's no way to tell which is the vulnerable case. There's only a 50% chance our malicious adversary will be able to thwart agreement on any given round, so if we just keep repeating the algorithm round after round, the stations will eventually reach agreement, despite the best efforts of the know-it-all evil station.
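To make the contrast concrete, here is a minimal Python sketch of the four-station rounds, assuming the whispering adversary described above (it echoes each honest station's own value back to it). The names and structure are mine, invented for illustration; swapping the tie-breaking rule is all it takes to go from the breakable deterministic algorithm to the probabilistic one:

```python
import random

DAY, NIGHT = "daytime", "nighttime"

def run(initial, tie_breaker, max_rounds=10_000):
    """Simulate 3 honest stations plus one adversary that whispers each
    honest station's own value back to it, trying to prolong any split."""
    values = list(initial)            # current votes of the honest stations
    decided = [False, False, False]
    for rnd in range(1, max_rounds + 1):
        snapshot = list(values)       # everyone broadcasts simultaneously
        for i in range(3):
            if decided[i]:
                continue
            # four votes: the three honest broadcasts plus the adversary's echo
            votes = snapshot + [snapshot[i]]
            if votes.count(DAY) >= 3:
                values[i], decided[i] = DAY, True
            elif votes.count(NIGHT) >= 3:
                values[i], decided[i] = NIGHT, True
            else:
                values[i] = tie_breaker()   # no three-vote majority
        if all(decided) and len(set(values)) == 1:
            return values[0], rnd     # agreement reached
    return None, max_rounds           # never agreed

# Deterministic default: the vulnerable split case stalls forever.
stuck, _ = run([NIGHT, NIGHT, DAY], lambda: DAY)

# Fair coin: the same split reaches agreement after a few rounds.
rng = random.Random(2011)
value, rounds = run([NIGHT, NIGHT, DAY], lambda: rng.choice((DAY, NIGHT)))
```

With the fixed "daytime" default, the two nighttime stations lock in "nighttime" while the minority station defaults to "daytime" every round, so agreement never comes; with the coin flip, the minority station escapes the trap with probability 1/2 per round.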

Probably. We have a choice with this probabilistic algorithm. We can either stop it after a fixed number of rounds, and accept the (tunably small) possibility that agreement has not yet been reached, or we can run it until agreement is reached, and accept the (tunably small) possibility that it takes so long we no longer care whether it's night or day. This exact trade-off is typical of distributed, probabilistic algorithms.
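That trade-off is easy to quantify. Using the bound above of at most a 50% chance of adversarial interference per round, the chance that agreement is still blocked after k rounds is at most (1/2)^k. A small sketch (the helper name `rounds_needed` is my own, not from the post):

```python
import math

def rounds_needed(target_failure_prob):
    """Smallest number of rounds k with (1/2)**k <= target_failure_prob,
    assuming each round independently fails with probability at most 1/2."""
    return math.ceil(math.log(target_failure_prob, 0.5))
```

About 30 rounds are enough to push the residual probability of non-agreement below one in a billion, which is the sense in which the possibility is "tunably small."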

To be fair, there is a simple deterministic algorithm that solves this problem (see the seminal 1980 paper by Pease, Shostak, and Lamport), but it depends on a subtle condition that doesn't always exist in real-life distributed systems -- synchronous communication. Another famous result in this research area is that of Fischer, Lynch, and Paterson, which proves no deterministic agreement algorithm can exist unless we can give, with certainty, an upper bound on how long it will take messages to be delivered between processes (weather stations in this case). And yet, there are many good probabilistic solutions to this important practical problem!

Back in 1987, a fellow graduate student at the University of Texas at Austin, Russell Turpin, and I were struck by this result. So much so that we spent several months reading and analyzing the plethora of research on probabilistic agreement and writing a survey paper on the topic. Now, twenty-five years later, Dr. Turpin and I are still friends and colleagues, and we expect our interest and passion for distributed solutions to continue for many more years to come.


Wednesday, January 19, 2011

A Party Game

Let's play a party game. Here are the rules. When each of us arrives at the party, the host hands us some flash cards. Each card has a single number and a single word written on it. Generally, cards have different number-word pairs, but the same pair might be repeated on more than one card. A given number is always associated with the same word, however.

Everybody gets some number of cards, from just a few to a handful or even a whole deck. We are told we may swap cards with others, in pretty much any way we want, except we cannot look at others' cards. In fact, we can communicate with others in only two limited ways: we can write a short note on a slip of paper and hand it to one other person, or we can shout a message to the entire group. The latter mechanism is somewhat error-prone since we can never be sure others are listening and can hear our shout. Also, someone else may decide to shout at the same time, in which case neither of you may be heard.

After a few drinks, the game commences. The host chooses someone at the party and hands her a slip of paper on which is written a simple question, "What word is on the card whose number is 50342?"

Staying within the rules and constraints outlined above, how would you behave, and how would you want other party goers to behave, in order to "efficiently" answer that question?

Sunday, January 16, 2011

Automatons, Organisms, and Teenagers

Wouldn't it be great if you could tell your child exactly what to do in a given situation and be completely confident he or she would do just that, every time, without question? If you've ever raised a teenager, I promise you've wished for that before. But would it really be all that great?

Yes, we could, theoretically, help our kids avoid many of the mistakes that we made growing up. On the other hand, they could never be better than us. It turns out, all the pleasures and pains of child rearing stem from the unexpected, emergent behavior of our children. Thank goodness kids are organisms, not automatons.

My software team and I at Caringo, Inc. are raising our own child, named CAStor -- a symmetrically distributed, massively scalable, clustered storage system. CAStor is not quite a teenager yet, but already I'm getting questions from customers and partners.

"Why does it behave this way in that situation?" they ask. To which I'm often forced to answer simply, "I don't know."

When the behavior in question results in a positive outcome (which it frequently does), they wonder why I don't take credit for it. After all, CAStor is a computer program, a digital automaton, and we programmed the darned thing. How could we not know why it behaves the way it does?

The answer is, computers may be automatons, but computer networks are organisms.

In the past two decades, we have seen the emergence of a whole new class of application and infrastructure software: scalable, robust, distributed solutions capable of handling hundreds of millions of users, performing billions of transactions, storing and retrieving millions of gigabytes, and maintaining high availability despite hardware failures, software upgrades, and malicious attacks. Those of us who design and architect such solutions understand (sometimes) that there is a qualitative difference between these organic, adaptive computer networks and the "do this now" sort of programs we learned to write in school.

This blog is about the practical potential and limitations of distributed computing, the trade-offs between determinism and probability, certainty and surprise. Along the way, I hope we'll explore all kinds of topics, from space photography to train tracks, thermodynamics to wine glasses, the meaning of knowledge, randomness, and time.