The first time I encountered this was with a canonical problem in distributed fault tolerance, called the Byzantine Agreement problem. Don't Google it! There's been so much written and so many results about the topic over the last thirty years that you'll never return if you vector off now. Instead, let me present a concrete instance of the general problem in order to talk about why randomness is such a useful, and in some cases essential, part of the solution.

Suppose there are four computerized weather stations distributed about a certain city. Each station has, among many other sensors, a light detector, and the stations can communicate with each other via whatever protocol we wish to devise. Every hour, on the hour, each station samples its own light sensor and makes an independent judgement about whether it is "daytime" or "nighttime." At noon and at midnight, it is highly likely that all stations will make the same assessment about whether it's day or night. But at dusk and dawn, local variations in light and sensor sensitivity will probably cause the stations to arrive at different conclusions about which is the best classification. Our job is to devise an algorithm and a protocol that allows the stations to all agree on a single answer, and that has two very reasonable properties: 1) if all the stations individually come to the same conclusion about whether it's day or night, then the group must agree on that answer, and 2) even if they don't start out with the same answer initially, they must still agree on

*some*value (it doesn't matter which).

But wait, there's no villain in this story yet. One of the stations doesn't follow the rules. It can say anything at any time to any other station without constraint. Maybe it's faulty, or maybe it's nefarious. Crazy or evil -- it doesn't matter. We may as well assume the bad guy is a malicious spy who knows everything that has happened, using that information to thwart our little plot and prevent the remaining stations from agreeing whether it's day or night. Now there's a worthy antagonist.

There is a simple, but subtly wrong, deterministic algorithm that attempts to allow the three non-faulty stations to reach agreement at all times of the day or night, even with the omniscient meddling of the faulty one. It goes like this: Each station sends its initial value to every other station. After collecting all four values (including its own), each station determines whether three of the four responses have the same value. If so, the station takes this value as its final answer. If it's a tie, the station picks a predetermined default, say "daytime," and the nodes exchange values in another round of communication. The algorithm terminates when all the correct nodes have chosen the (same) final answer.

You can see that, if it's noon or midnight and all three non-faulty stations report the same value, there is little the nefarious foe can do to confuse this algorithm; they will all see at least three identical answers and immediately choose it. There is danger at dusk or dawn, however. We're still okay if two of the non-faulty stations thinks it's day and the other one thinks it's night. In this case, there's no way the nefarious foe can influence the vote so that "nighttime" is chosen by any of the non-faulty stations. They will all pick the default value, "daytime," after the first round and confirm it as the agreed value on the second round.

The remaining case is vulnerable to attack. When two of the non-faulty stations initially think it's night and the other thinks it's day, the nefarious foe, in full omniscient glory, can whisper "nighttime" to the nighttime stations and "daytime" to the daytime station. The nighttime stations will therefore continue to vote nighttime, because they see three votes for that value, while the minority station will continue to choose the default value, "daytime." Thus, the faulty foe can ensure the stations never reach agreement on any value.

Surprisingly, this simple but incorrect deterministic algorithm can easily be converted to a simple and correct probabilistic one, just by giving each station a fair coin to flip. Instead of picking a predetermined default value (which is subject to being discovered and exploited by the evil one), each station flips its coin in the event there are not three votes for the same value. Heads, it's daytime, tails it's night. If the non-faulty stations do not unanimously agree, The Faulty One can still influence one of the two cases, as described above, but since it cannot predict the value of the coin flip (that's what random means) there's no way to tell which is the vulnerable case. There's only a 50% chance our malicious adversary will be able to thwart agreement on any given round, so if we just keep repeating the algorithm round after round, the stations will eventually reach agreement, despite the best efforts of the know-it-all evil station.

Probably. We have a choice with this probabilistic algorithm. We can either stop it after a fixed number of rounds, and accept the (tunably small) possibility that agreement has not yet been reached, or we can run it until agreement is reached, and accept the (tunably small) possibility that it takes so long we no longer care whether it's night or day. This exact trade-off is typical of distributed, probabilistic algorithms.

To be fair, there is a simple deterministic algorithm that solves this problem (see the seminal 1980 paper by Pease, Shostak, and Lamport), but it depends on a subtle condition that doesn't always exist in real-life distributed systems -- synchronous communication. Another famous result in this research area is that of Fischer, Lynch, and Paterson, which proves no deterministic agreement algorithm can exist unless we can give, with certainty, an upper bound on how long it will take messages to be delivered between processes (weather stations in this case). And yet, there are many good probabilistic solutions to this important practical problem!

Back in 1987, a fellow graduate student at the University of Texas at Austin, Russell Turpin, and I were struck by this result. So much so that we spent several months reading and analyzing the plethora of research on probabilistic agreement and writing a survey paper on the topic. Now, twenty-five years later, Dr. Turpin and I are still friends and colleagues, and we expect our interest and passion for distributed solutions to continue for many more years to come.

Probably.

With the coin toss, The Faulty One can not thwart agreement because he can not predict how the others will vote. What it comes down to is prediction. If the other voters were to come up with their own algorithm and not tell the otherwise-omniscient Faulty One, that would also prevent prediction and thwart the malicious plan. The secret algorithm could be deterministic or not. Therefore it is not probabilistic algorithms per se that save the day, but insufficient predictive ability.

ReplyDeleteRandom means unpredictable, but unpredictable to whom?

Carlos, you are exactly right. It's about the ability of the faulty process to predict the future states of the non-faulty ones. In fact, there are a number of practical solutions to the asynchronous BA problem that involve encryption. However, if the communication model is asynchronous (no bound on the delivery times of messages) all these solutions MUST be probabilistic. Lynch, et el, prove it in the paper I referenced.

ReplyDeleteYour last question is something I plan to revisit in a future post. Briefly, if random means unpredictable, then truly random means unpredictable to everyone and everything. The question of whether such a phenomenon exists in the universe gets deep very quickly, touching on quantum physics, the foundations of logic, and theology, all of which I have only a hobbyist familiarity with.