Wednesday, March 16, 2011

Satnam and Other Dangers

The idea of a True Name for something or someone is very old. A true name captures the essence, the intrinsic identity of something in a short moniker that is distinguished from every other name. For centuries, mankind has recognized the fascinating duality that one's true name is both a treasure and a vulnerability.

Religion has long been obsessed with the true name of God. The ancient Jews revered it and believed it had such power that they forbade its utterance by ordinary people and spoke it officially once every seven years. Sikhs have a word for the concept, satnam. The true name of true names, I guess.

The concept of a true name is ancient and pervasive in the secular world too. Rumpelstiltskin cherished his so much he sang songs about it, then tore himself in half when it was discovered. One of my favorite science fiction stories is actually titled True Names. Written by the mathematician Vernor Vinge in 1981, the novella explored the promise and threat of one's true name in cyberspace, and incidentally invented the cyberpunk genre, which was later fleshed out by William Gibson, Neal Stephenson, Bruce Sterling and others. The title of this post is borrowed from Vinge's first anthology.

Back here in the cyberpresent, names for persistent objects in computer technology have an unfortunately muddled story. Traditional file systems conflate the notions of name and containment for files. We say that a file is "in a directory" or "contained in a folder" when really we're just speaking about a prefix of its name. This leads to some weird, but by now familiar, holes in the metaphor. When you "move" a file from one folder to another, usually the file's contents aren't moved at all; the file is simply renamed. In Unix-based file systems, the "hard link" mechanism allows a single file to have more than one name. This metaphorical blemish of having to explain that an object can be in two places at once is often gotten around by pretending the hard links are distinctly different kinds of names from the originals, so a file has one definite location and perhaps some other, lesser references to it from other locations. This notion is patently false; all names are created equal in this case and if you delete the "original" hard link the others will continue to work as usual. The fact is, files can have different names in different contexts. But do they have a satnam that is the same in every context?
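If you want to see the metaphor spring its leaks for yourself, here is a minimal Python sketch. The folder and file names are made up for illustration, and it assumes a file system that supports hard links and that both folders live on the same file system. It "moves" a file, gives it a second name, then deletes the "original" name.

```python
import os
import tempfile


def demo_names_vs_identity():
    """Show that rename and hard link change names, not the underlying file."""
    with tempfile.TemporaryDirectory() as root:
        folder_a = os.path.join(root, "a")  # hypothetical folder names
        folder_b = os.path.join(root, "b")
        os.mkdir(folder_a)
        os.mkdir(folder_b)

        original = os.path.join(folder_a, "report.txt")
        with open(original, "w") as f:
            f.write("same bits\n")
        inode_before = os.stat(original).st_ino

        # "Moving" within one file system is just a rename; no data is copied.
        moved = os.path.join(folder_b, "report.txt")
        os.rename(original, moved)
        inode_after_move = os.stat(moved).st_ino

        # A hard link gives the same file a second, equally valid name.
        alias = os.path.join(folder_a, "alias.txt")
        os.link(moved, alias)
        link_count = os.stat(moved).st_nlink

        # Deleting the "original" name leaves the other name working.
        os.remove(moved)
        with open(alias) as f:
            surviving_content = f.read()
        alias_inode = os.stat(alias).st_ino

        return (inode_before, inode_after_move, alias_inode,
                link_count, surviving_content)


if __name__ == "__main__":
    print(demo_names_vs_identity())
```

The three inode numbers come back identical, which is the point: the names changed, the file did not.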

Sort of. In traditional Unix/Linux file systems, a file's inode number is its true name. All other names are mapped to the inode by the file system's directory mechanism. The inode data structure itself contains information about where to find the various parts and pieces (blocks) that make up the file's data content. An inode number is a 32-bit or, in more recent releases, 64-bit binary persistent name for a file. Ideally, a satnam does not depend on context; it refers to the same thing everywhere and everywhen. But a 32-bit inode is not long enough to allow that since it accommodates only about 4 billion files. This may seem like a big number, but it isn't even close to being big enough to uniquely identify every computer file in existence. So the Unixen came up with the concept of a file system, which effectively scopes inodes so they make sense only within the limited context of a single file system, usually hosted on one server. In other words, inodes do not have global scope. Nor is 4 billion inodes enough even to uniquely identify all the files in a single file system, if we remember that files may be created and then deleted. Again, ideally each instance of a file would have a satnam that's different from every other instance that ever existed. But inodes do not have such universal scope either, since they must be reused. The newer 64-bit inodes do allow up to about 18 × 10^18 distinct inodes, which helps us build larger file systems that span several servers, but the namespace is still not large enough to support universally scoped true names.
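To make the scoping concrete, here is a small Python sketch. The function name `local_true_name` is mine, not anything standard. On a Unix-like system the closest thing to a true name is the pair of device ID and inode number, and even that pair only means something on one machine, for as long as the inode isn't recycled. The two constants are just the arithmetic behind the numbers quoted above.

```python
import os


def local_true_name(path):
    """Return (device id, inode number): meaningful only on this machine, for now."""
    info = os.stat(path)
    return info.st_dev, info.st_ino


# Namespace sizes mentioned in the text.
INODES_32_BIT = 2 ** 32  # 4,294,967,296, about 4 billion
INODES_64_BIT = 2 ** 64  # 18,446,744,073,709,551,616, about 18 x 10^18

if __name__ == "__main__":
    print(local_true_name(os.getcwd()))
    print(f"{INODES_32_BIT:,}")
    print(f"{INODES_64_BIT:,}")
```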

Distributed storage systems that aspire to scale without limit must support an essentially infinite (that is, a very, very large finite) number of distinct objects, over the entire lifetime of, well, everything. Further, the degree of cooperation and agreement required to implement a coordinated distributed namespace is, to say the least, exorbitant, even impossible in most cases. Conventional, hierarchical naming schemes always require a centralized naming authority somewhere along the way (that's why the IANA exists). These types of distributed storage systems really do need a satnam, a universally-scoped name for each stored object.

My colleagues here at Caringo, Paul Carpentier, Jan Van Riel, and Tom Teugels, invented a process (US Patent 7,793,112) by which an intrinsic satnam can be derived from any binary object. This technique, the basis for Content Addressable Storage, or CAS, uses a cryptographic hash to digest the bits and spit out a 128-bit name. Depending on the strength of the hash algorithm, this can produce a universally unique identifier, a true name. The drawback is, smart hackers keep finding ways to defeat the crypto algorithms to produce completely different objects that have the same names.
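The general idea of a content-derived name fits in a few lines of Python. To be clear, this is only a sketch of the concept, not the patented process. I'm assuming MD5 purely because it happens to produce a 128-bit digest, and MD5 is one of the algorithms for which colliding inputs have been found, so it also illustrates the drawback. The function name `content_name` is hypothetical.

```python
import hashlib


def content_name(data: bytes) -> str:
    """Derive a 128-bit name from the bits alone (MD5 is an assumed stand-in)."""
    return hashlib.md5(data).hexdigest()  # 32 hex characters = 128 bits


if __name__ == "__main__":
    first = content_name(b"precisely the same bits")
    second = content_name(b"precisely the same bits")
    other = content_name(b"slightly different bits")
    print(first == second)  # same bits always yield the same name
    print(first == other)   # different bits should yield a different name
```

Notice that nobody had to coordinate with anybody to agree on the name. Anyone holding the bits can recompute it.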

At Caringo, we built a clustered, scalable storage system called Swarm using a different technique that doesn't suffer this drawback. We assign a satnam to an object when it is initially created, a 128-bit name that, with very high probability, is different from any other satnam being used anywhere in the universe at any time, past, present, or future. This universally unique identifier is stored with the object's data bits and becomes part of the object as it moves around the world, gets replicated, and so on.
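Here is a rough Python sketch of a name assigned at creation. It is not Swarm's actual implementation, which I'm not describing here. I'm assuming Python's `uuid.uuid4` as the name generator (128 bits, of which 122 are random), and the `StoredObject` class and `replicate` function are hypothetical names for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StoredObject:
    """Hypothetical stored object: the data bits plus a name minted at creation."""
    data: bytes
    # The name is generated once, at creation, and never recomputed from the data.
    name: uuid.UUID = field(default_factory=uuid.uuid4)


def replicate(obj: StoredObject) -> StoredObject:
    """A replica carries the same bits and the same name as the original."""
    return StoredObject(data=obj.data, name=obj.name)


first = StoredObject(b"precisely the same bits")
second = StoredObject(b"precisely the same bits")
replica = replicate(first)

if __name__ == "__main__":
    print(first.name == second.name)   # separate creations get separate names
    print(first.name == replica.name)  # the name travels with the replica
```

Unlike a hash, the name here says nothing about the contents, so nobody can forge a colliding object by being clever with the bits.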

The differences between these two approaches suggest a rather deep philosophical question. Is it the case that one's fundamental identity is embodied in the sum of all his traits (the data bits) or is it the fact of his creation that identifies him? If I create a data object today here in Austin, and my great-great-great-granddaughter also creates a data object on Mars using precisely the same bits in the same order, are they the same object? Should they have the same satnam? Philosophically, I have no idea, but Caringo Swarm says, no. Objects created at different times or places are different objects, even if they appear otherwise to be the same. An object's satnam is universally scoped and intractable to guess, unless you have the object itself.

Which brings us back to Rumpelstiltskin. The only way to access an object stored in Swarm is to produce its satnam. If you wish to keep it confidential, then you simply do not ever divulge its true name to anyone you don't trust. Don't go singing it around the campfire.
