Thursday, March 17, 2011

Visualizing Replication

Here's the second in my video tutorial series.  This one explains how CAStor replicates objects in order to protect against hardware failures and ensure data integrity.

Wednesday, March 16, 2011

Satnam and Other Dangers

The idea of a True Name for something or someone is very old. A true name captures the essence, the intrinsic identity of something in a short moniker that is distinguished from every other name. For centuries, mankind has recognized the fascinating duality that one's true name is both a treasure and a vulnerability.

Religion has long been obsessed with the true name of God. The ancient Jews revered it and believed it had such power that they forbade its utterance by ordinary people and spoke it officially once every seven years. Sikhs have a word for the concept, satnam. The true name of true names, I guess.

The concept of a true name is ancient and pervasive in the secular world too. Rumpelstiltskin cherished his so much he sang songs about it, then tore himself in half when it was discovered. One of my favorite science fiction stories is actually titled True Names. Written by the mathematician Vernor Vinge in 1981, the novella explored the promise and threat of one's true name in cyberspace, and incidentally invented the cyberpunk genre which was later fleshed out by William Gibson, Neal Stephenson, Bruce Sterling and others. The title of this post is borrowed from Vinge's first anthology.

Back here in the cyberpresent, names for persistent objects in computer technology have an unfortunately muddled story. Traditional file systems conflate the notions of name and containment for files. We say that a file is "in a directory" or "contained in a folder" when really we're just speaking about a prefix of its name. This leads to some weird, but by now familiar, holes in the metaphor. When you "move" a file from one folder to another, usually the file's contents aren't moved at all; the file is simply renamed. In Unix-based file systems, the "hard link" mechanism allows a single file to have more than one name. This metaphorical blemish, having to explain that an object can be in two places at once, is often papered over by pretending that hard links are a different kind of name from the original, so that a file has one definite location and perhaps some other, lesser references to it from other locations. This notion is patently false; all names are created equal in this case, and if you delete the "original" hard link, the others will continue to work as usual. The fact is, files can have different names in different contexts. But do they have a satnam that is the same in every context?

Sort of. In traditional Unix/Linux file systems, a file's inode number is its true name. All other names are mapped to the inode by the file system's directory mechanism. The inode data structure itself contains information about where to find the various parts and pieces (blocks) that make up the file's data content. An inode number is a 32-bit or, in more recent releases, 64-bit binary persistent name for a file. Ideally, a satnam does not depend on context; it refers to the same thing everywhere and everywhen. But a 32-bit inode is not long enough to allow that since it accommodates only about 4 billion files. This may seem like a big number, but it isn't even close to being big enough to uniquely identify every computer file in existence. So the Unixen came up with the concept of a file system, which effectively scopes inodes so they make sense only within the limited context of a single file system, usually hosted on one server. In other words, inodes do not have global scope. Nor is 4 billion inodes enough even to uniquely identify all the files in a single file system, if we remember that files may be created and then deleted. Again, ideally each instance of a file would have a satnam that's different from every other instance that ever existed. But inodes do not have such universal scope either, since they must be reused. The newer 64-bit inodes do allow up to about 18 × 10^18 distinct inodes, which helps us build larger file systems that span several servers, but the namespace is still not large enough to support universally scoped true names.
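To make the inode-as-true-name point concrete, here is a small sketch in Python. It assumes a file system that supports hard links (any ordinary Unix/Linux one does), and the function and file names are made up for illustration. It gives one file two names, renames it, deletes the "original" name, and reports the inode number seen at each step.

```python
import os
import tempfile


def inode_demo():
    """Give one file several names and report the inode seen under each."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original.txt")
        alias = os.path.join(d, "alias.txt")
        moved = os.path.join(d, "moved.txt")

        with open(original, "w") as f:
            f.write("hello")

        # A hard link is just a second directory entry for the same inode.
        os.link(original, alias)
        ino_original = os.stat(original).st_ino
        ino_alias = os.stat(alias).st_ino
        link_count = os.stat(original).st_nlink

        # "Moving" the file only changes its name; the inode stays put.
        os.rename(original, moved)
        ino_moved = os.stat(moved).st_ino

        # Delete the "original" name; the other name still reaches the data.
        os.remove(moved)
        with open(alias) as f:
            survivor = f.read()

    return ino_original, ino_alias, ino_moved, link_count, survivor


if __name__ == "__main__":
    print(inode_demo())
```

All three inode numbers come back equal, and the data is still readable through the surviving name. The actual numbers will differ from machine to machine and run to run, which is exactly the scoping problem described above.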

Distributed storage systems that aspire to scale without limit must support an essentially infinite (that is, a very, very large finite) number of distinct objects, over the entire lifetime of, well, everything. Further, the degree of cooperation and agreement required to implement a coordinated distributed namespace is, to say the least, exorbitant, even impossible in most cases. Conventional, hierarchical naming schemes always require a centralized naming authority somewhere along the way (that's why the IANA exists). These types of distributed storage systems really do need a satnam, a universally scoped name for each stored object.

My colleagues here at Caringo, Paul Carpentier, Jan Van Riel, and Tom Teugels, invented a process (US Patent 7,793,112) by which an intrinsic satnam can be derived from any binary object. This technique, the basis for Content Addressable Storage, or CAS, uses a cryptographic hash to digest the bits and spit out a 128-bit name. Depending on the strength of the hash algorithm, this can produce a universally unique identifier, a true name. The drawback is, smart hackers keep finding ways to defeat the crypto algorithms to produce completely different objects that have the same names.
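The general idea of content addressing can be sketched in a few lines of Python. This is only an illustration of "hash the bits to get the name", not the patented process itself. The function name content_address is hypothetical, and BLAKE2b with a 16-byte digest is my stand-in for "a cryptographic hash", chosen simply because it yields 128 bits.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Derive a 128-bit name (32 hex digits) from the object's bits alone."""
    # digest_size=16 asks BLAKE2b for a 16-byte (128-bit) digest.
    return hashlib.blake2b(data, digest_size=16).hexdigest()


if __name__ == "__main__":
    first = content_address(b"the same bits in the same order")
    second = content_address(b"the same bits in the same order")
    other = content_address(b"slightly different bits")
    print(first == second)  # identical content always yields the same name
    print(first == other)   # different content yields a different name,
                            # unless someone finds a collision
```

The name depends on nothing but the content, so anyone holding the same bits, anywhere, computes the same name. That is both the appeal and, once collisions can be manufactured, the weakness.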

At Caringo, we built a clustered, scalable storage system called Swarm using a different technique that doesn't suffer this drawback. We assign a satnam to an object when it is initially created, a 128-bit name that, with very high probability, is different from any other satnam being used anywhere in the universe at any time, past, present, or future. This universally unique identifier is stored with the object's data bits and becomes part of the object as it moves around the world, gets replicated, and so on.
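Here is a rough sketch of the assigned-name idea, under stated assumptions: I use Python's uuid4 as a stand-in for whatever generator the product really uses (uuid4 puts 122 random bits into a 128-bit value), and the StoredObject class and replicate function are hypothetical names invented for this example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StoredObject:
    data: bytes
    # Assigned once, at creation time, and never derived from the data.
    name: str = field(default_factory=lambda: uuid.uuid4().hex)


def replicate(obj: StoredObject) -> StoredObject:
    """A replica carries the original's name along with its bits."""
    return StoredObject(data=obj.data, name=obj.name)


if __name__ == "__main__":
    austin = StoredObject(b"precisely the same bits")
    mars = StoredObject(b"precisely the same bits")
    print(austin.name == mars.name)             # same bits, separate creations
    print(replicate(austin).name == austin.name)  # a replica keeps its name
```

Two objects created separately get different names even when their bits match, while a replica of one object keeps the name it was born with.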

The differences between these two approaches suggest a rather deep philosophical question. Is it the case that one's fundamental identity is embodied in the sum of all his traits (the data bits) or is it the fact of his creation that identifies him? If I create a data object today here in Austin, and my great-great-great-granddaughter also creates a data object on Mars using precisely the same bits in the same order, are they the same object? Should they have the same satnam? Philosophically, I have no idea, but Caringo Swarm says no. Objects created at different times or places are different objects, even if they appear otherwise to be the same. An object's satnam is universally scoped and intractable to guess, unless you have the object itself.

Which brings us back to Rumpelstiltskin. The only way to access an object stored in CAStor is to produce its satnam. If you wish to keep it confidential, then you simply do not ever divulge its true name to anyone you don't trust. Don't go singing it around the campfire.

Thursday, March 3, 2011


A couple months ago I wrote a useful tool to help us visualize the operation of a CAStor cluster in situ. It provides a bird's eye view of the nodes and communication among them by replaying a syslog file captured during an actual execution. We plan to use this tool for a number of different things, from marketing to tech support, even design validation. As a first application, I have begun using it as the basis for a series of tutorial videos explaining the inner workings of CAStor. Here's the first installment.

Check it out!

Chocolate Empathy

One reason it's so hard to design good distributed software is that programmers must, for lack of a less anthropomorphic term, empathize with multiple processes at once. A CAStor storage node, for example, is not omniscient. It knows only about its own state and enough about the global cluster to operate. Answering even very simple questions like, "How much available storage is there in the cluster?" involves communicating and agreeing with other nodes. From the programmers' perspective, we must imagine ourselves in the place of each of the individual nodes and model, in our heads, what is known and, more importantly sometimes, what cannot possibly be known by each node at each point in the algorithm.
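A toy model may help show why even the "available storage" question is hard. This is not how CAStor works internally; the Node class, its fields, and the announce method are all hypothetical, invented only to show that each node can answer from nothing more than its own state plus whatever it has happened to hear.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_bytes: int
    # What this node has heard from peers; may be stale or incomplete.
    heard: dict = field(default_factory=dict)

    def announce(self, peers):
        """Tell every other node how much free space we have right now."""
        for peer in peers:
            if peer is not self:
                peer.heard[self.name] = self.free_bytes

    def estimate_cluster_free(self):
        """Best guess from local knowledge only; there is no global view."""
        return self.free_bytes + sum(self.heard.values())


if __name__ == "__main__":
    a, b, c = Node("a", 100), Node("b", 200), Node("c", 300)
    cluster = [a, b, c]

    a.announce(cluster)
    b.announce(cluster)        # c has not announced yet
    b.free_bytes -= 50         # b stores something after announcing

    print(a.estimate_cluster_free())          # a never heard from c
    print(c.estimate_cluster_free())          # c heard b's old figure
    print(sum(n.free_bytes for n in cluster))  # the truth no node can see
```

Node a underestimates because it never heard from c, node c overestimates because its figure for b is stale, and the true total is visible only to us, the omniscient observers outside the cluster.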

This capability to maintain many different models of what others know, in addition to our own knowledge store, is an amazing feat of the human mind. And we do it all the time. Every conversation you have with another person relies on a gigantic store of knowledge about what she knows, whether she knows that you know she knows it, and so on. But this ability is definitely a higher function of the brain and does not come without some mental effort.

A famous psychological experiment performed by Wimmer and Perner in the '80s shows that this mental ability does not mature until fairly late in a child's life, around four years old. The experiment is elegant and simple. Children are told the following story. A little boy named Maxi and his mother return from a shopping trip on which they purchased a bar of chocolate. The mother places the chocolate in a blue cabinet and closes it. When Maxi goes outside to play, the mother moves the chocolate bar to another, green cabinet. When Maxi returns to the kitchen, he decides to have a bite of chocolate. Which cabinet, the children are asked, will Maxi look in to find the bar?

Before the age of three or four years, almost all children will say Maxi will look in the green cabinet, because that's where the mother put it! This shows they have not yet fully developed the ability to model someone else's mind or empathize with another, albeit fictional, person's knowledge base. It may also be the case that this mental ability is lost or stunted as we become old (speaking strictly from personal experience).

So this stuff is hard. It's the difference between teaching someone to dance and choreographing a ballet. CS curricula, by and large, train us to be omniscient, sequential programmers. Perhaps that's why so many otherwise talented developers mistakenly believe every storage node should just know to look in the green cabinet.