## Wednesday, January 26, 2011

### Provably Probable Is Better Than Probably Provable

Randomness and distributed algorithms are often found together, like movie stars and orphans. It's not just that distributed algorithms often exhibit stochastic behavior, which they do; randomness is also an essential ingredient of many distributed solutions. I'm not speaking loosely when I use the word "essential" here. There are many problems in the distributed arena that are provably impossible to solve deterministically, but if you give processes access to a fair coin and allow them to flip it occasionally, voila!, an efficient, reliable solution emerges that yields the correct behavior. Probably.

The first time I encountered this was with a canonical problem in distributed fault tolerance, called the Byzantine Agreement problem. Don't Google it! There's been so much written and so many results about the topic over the last thirty years that you'll never return if you vector off now. Instead, let me present a concrete instance of the general problem in order to talk about why randomness is such a useful, and in some cases essential, part of the solution.

Suppose there are four computerized weather stations distributed about a certain city. Each station has, among many other sensors, a light detector, and the stations can communicate with each other via whatever protocol we wish to devise. Every hour, on the hour, each station samples its own light sensor and makes an independent judgement about whether it is "daytime" or "nighttime." At noon and at midnight, it is highly likely that all stations will make the same assessment about whether it's day or night. But at dusk and dawn, local variations in light and sensor sensitivity will probably cause the stations to arrive at different conclusions about which is the best classification. Our job is to devise an algorithm and a protocol that allows the stations to all agree on a single answer, and that has two very reasonable properties: 1) if all the stations individually come to the same conclusion about whether it's day or night, then the group must agree on that answer, and 2) even if they don't start out with the same answer initially, they must still agree on some value (it doesn't matter which).

But wait, there's no villain in this story yet. One of the stations doesn't follow the rules. It can say anything at any time to any other station without constraint. Maybe it's faulty, or maybe it's nefarious. Crazy or evil -- it doesn't matter. We may as well assume the bad guy is a malicious spy who knows everything that has happened, using that information to thwart our little plot and prevent the remaining stations from agreeing whether it's day or night. Now there's a worthy antagonist.

There is a simple, but subtly wrong, deterministic algorithm that attempts to allow the three non-faulty stations to reach agreement at all times of the day or night, even with the omniscient meddling of the faulty one. It goes like this: Each station sends its initial value to every other station. After collecting all four values (including its own), each station determines whether three of the four responses have the same value. If so, the station takes this value as its final answer. If it's a tie, the station picks a predetermined default, say "daytime," and the nodes exchange values in another round of communication. The algorithm terminates when all the correct nodes have chosen the (same) final answer.

You can see that, if it's noon or midnight and all three non-faulty stations report the same value, there is little the nefarious foe can do to confuse this algorithm; they will all see at least three identical answers and immediately choose it. There is danger at dusk or dawn, however. We're still okay if two of the non-faulty stations thinks it's day and the other one thinks it's night. In this case, there's no way the nefarious foe can influence the vote so that "nighttime" is chosen by any of the non-faulty stations. They will all pick the default value, "daytime," after the first round and confirm it as the agreed value on the second round.

The remaining case is vulnerable to attack. When two of the non-faulty stations initially think it's night and the other thinks it's day, the nefarious foe, in full omniscient glory, can whisper "nighttime" to the nighttime stations and "daytime" to the daytime station. The nighttime stations will therefore continue to vote nighttime, because they see three votes for that value, while the minority station will continue to choose the default value, "daytime." Thus, the faulty foe can ensure the stations never reach agreement on any value.

Surprisingly, this simple but incorrect deterministic algorithm can easily be converted to a simple and correct probabilistic one, just by giving each station a fair coin to flip. Instead of picking a predetermined default value (which is subject to being discovered and exploited by the evil one), each station flips its coin in the event there are not three votes for the same value. Heads, it's daytime, tails it's night. If the non-faulty stations do not unanimously agree, The Faulty One can still influence one of the two cases, as described above, but since it cannot predict the value of the coin flip (that's what random means) there's no way to tell which is the vulnerable case. There's only a 50% chance our malicious adversary will be able to thwart agreement on any given round, so if we just keep repeating the algorithm round after round, the stations will eventually reach agreement, despite the best efforts of the know-it-all evil station.

Probably. We have a choice with this probabilistic algorithm. We can either stop it after a fixed number of rounds, and accept the (tunably small) possibility that agreement has not yet been reached, or we can run it until agreement is reached, and accept the (tunably small) possibility that it takes so long we no longer care whether it's night or day. This exact trade-off is typical of distributed, probabilistic algorithms.

To be fair, there is a simple deterministic algorithm that solves this problem (see the seminal 1980 paper by Pease, Shostak, and Lamport), but it depends on a subtle condition that doesn't always exist in real-life distributed systems -- synchronous communication. Another famous result in this research area is that of Fischer, Lynch, and Paterson, which proves no deterministic agreement algorithm can exist unless we can give, with certainty, an upper bound on how long it will take messages to be delivered between processes (weather stations in this case). And yet, there are many good probabilistic solutions to this important practical problem!

Back in 1987, a fellow graduate student at the University of Texas at Austin, Russell Turpin, and I were struck by this result. So much so that we spent several months reading and analyzing the plethora of research on probabilistic agreement and writing a survey paper on the topic. Now, twenty-five years later, Dr. Turpin and I are still friends and colleagues, and we expect our interest and passion for distributed solutions to continue for many more years to come.

Probably.