Barry Talks!
The Official Weblog of Barry Briggs


Terabytes, Petabytes and Metadata

Speculations on the Future of Computing

By Barry Briggs

September 2003 © Barry Briggs All Rights Reserved

Disk capacities double every year, and thus within a decade or so we will be seeing commodity 200 terabyte disks in home PCs. That is an enormous amount of data: by comparison, the entire contents of the United States Library of Congress require only about a tenth of that, around 20 terabytes, give or take a few hundred gigs. (It's somewhat humbling to think that a person's lifework, say, the collected works of Shakespeare, can easily fit on a floppy disk.)
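The arithmetic behind that projection is easy to check. The sketch below assumes a commodity disk of roughly 200 gigabytes today; that starting figure and the function name are mine, chosen for illustration only.

```python
def projected_capacity_tb(start_tb: float, years: int,
                          doubling_period_years: float = 1.0) -> float:
    """Capacity after `years`, assuming it doubles every `doubling_period_years`."""
    return start_tb * 2 ** (years / doubling_period_years)


# Assumption: about 0.2 TB (200 GB) per commodity disk in 2003.
start_tb = 0.2
for years in (5, 10):
    print(f"after {years} years: {projected_capacity_tb(start_tb, years):.1f} TB")
```

Ten doublings multiply capacity by 1,024, which is how a 200 gigabyte disk becomes roughly a 200 terabyte one. If the doubling period stretches to eighteen months, pass `doubling_period_years=1.5` and the same ten years yield far less.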

Put a different way, in 200 terabytes one can store the entire accumulated knowledge of the human race with much more than half the disk to spare.

All this capacity raises two fundamental questions: if we can put all this knowledge on a single commodity disk, how will we ever find anything? And if all that only requires a fraction of the available space, what will we use the rest for?

The answers, it turns out, are related; and, I think, as we outline them we will also begin to see the shape of the next great revolution in computing.

Let's start with the first question, how in a great ocean of information we will ever be able to find anything. To date, there have been two fundamentally different approaches to retrieving data, and they are both wholly inadequate to the coming challenge. SQL, a thirty-year-old technology, requires that data be first structured into tables, rows and columns. Having done so -- a nontrivial task -- queries of the form "find me all employees in California making over $100,000" become possible.

Here's how that table might look:

ID      Name           Address         Town       State  Salary
100001  Joe Schmoe     1 Elm Street    Anytown    CA     100000
100002  Mary Schmary   1 Maple Street  Everytown  MA      85000
100003  Jane Schmane   1 Oak Street    Sometown   AR     110000
100004  Bill Schmill   1 Palm Street   Manytown   HI      50000
100005  Frank Schmank  1 Cherry St     Notown     FL      75000
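To make the query concrete, here is a minimal sketch that loads the sample rows into an in-memory SQLite database using Python's standard sqlite3 module. The table name `employees`, the column names, and the helper `employees_over` are my own assumptions for illustration.

```python
import sqlite3

# In-memory database holding the five sample rows from the table above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, name TEXT, address TEXT, "
    "town TEXT, state TEXT, salary INTEGER)"
)
rows = [
    (100001, "Joe Schmoe", "1 Elm Street", "Anytown", "CA", 100000),
    (100002, "Mary Schmary", "1 Maple Street", "Everytown", "MA", 85000),
    (100003, "Jane Schmane", "1 Oak Street", "Sometown", "AR", 110000),
    (100004, "Bill Schmill", "1 Palm Street", "Manytown", "HI", 50000),
    (100005, "Frank Schmank", "1 Cherry St", "Notown", "FL", 75000),
]
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?)", rows)


def employees_over(conn, state, min_salary):
    """Find all employees in `state` making strictly more than `min_salary`."""
    cur = conn.execute(
        "SELECT name, salary FROM employees "
        "WHERE state = ? AND salary > ? ORDER BY name",
        (state, min_salary),
    )
    return cur.fetchall()


print(employees_over(conn, "CA", 100000))
```

With these sample rows the query comes back empty: Joe is in California but makes exactly $100,000, not more. The query can only ask about the six columns the schema designer thought of in advance.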

Note that there are few columns and many rows. This holds for most, if not all, production databases: the data rows vastly outnumber the categories (columns, or "metadata"), often by three, four, or even five orders of magnitude.

In the next ten years we will invert this relationship.

The other approach to finding data is search, best typified by the great Web engines: Google, Altavista, Overture. Although there were and are ongoing attempts to categorize and/or add "semantics" to Web pages now in existence, Web search remains largely a simple, brute-force operation. First all pages are scanned and their terms counted and ranked, according to various algorithms (such as how many other pages link to this one). Then when users type in a word or phrase, the engine returns all the pages containing it.

Thus when I type the word "data" into Altavista, for example, it returns some 58 million pages -- an utterly useless result.
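The brute-force approach described above can be sketched in a few lines. This is a toy, not how any real engine works: the three-page corpus is invented, and ranking by a raw count of inbound links is only a stand-in for the "various algorithms" the engines actually use.

```python
from collections import defaultdict

# Hypothetical toy corpus: page name -> (text, pages it links to).
PAGES = {
    "a.html": ("big data on big disks", ["b.html"]),
    "b.html": ("metadata describes data", ["c.html"]),
    "c.html": ("paintings and poetry", ["b.html"]),
}


def build_index(pages):
    """Scan every page and map each term to the set of pages containing it."""
    index = defaultdict(set)
    for name, (text, _links) in pages.items():
        for term in text.lower().split():
            index[term].add(name)
    return index


def inbound_link_counts(pages):
    """Count how many other pages link to each page."""
    counts = {name: 0 for name in pages}
    for _name, (_text, links) in pages.items():
        for target in links:
            if target in counts:
                counts[target] += 1
    return counts


def search(term, index, link_counts):
    """Return every page containing the term, most-linked first."""
    hits = index.get(term.lower(), set())
    return sorted(hits, key=lambda page: (-link_counts[page], page))


index = build_index(PAGES)
links = inbound_link_counts(PAGES)
print(search("data", index, links))
```

Note what the search does not know: it returns every page that contains the word, in link-count order, with no idea what any page is about or what the person asking wants.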

We are now on the verge of a metadata revolution. Metadata will allow us to find the data we are looking for; it will enable computers to predict what we need; it will make computers vastly more valuable to us; and over time metadata, coupled with new pattern-recognition algorithms, will enable computers to take another step toward intelligence and even independent thought.

Most database architects would cringe at the notion of radically more metadata. But we are in pursuit of an essential truth here, first articulated by no less an authority than Immanuel Kant, namely, that for any object there are an infinite number of ways to describe it, to classify it, and thus to compare it against other objects.

Consider the Mona Lisa. Database-style, we might design ways to classify paintings that result in a table like this (we've only shown a few sample rows):

Artist    Painting                  Date  Country      Medium
da Vinci  Mona Lisa                 1504  Italy        Oil
Warhol    Marilyn                   1967  USA          Silk screen
Titian    Worship of Venus          1518  Italy        Oil
Rubens    Union of Earth and Water  1618  Netherlands  Oil

With this table, we can issue queries like: "show me all Italian paintings before 1600," and the like, but this table hardly captures all the volumes of descriptive content that might be applied to the Mona Lisa: descriptive content that might be used by humans and computers to find it and compare it against other works of art. It's quite hard with the limited amount of metadata to issue deeply useful queries like "show me more like it," or "how is that like Titian?", or "who's Tony Bennett, anyway?".

Only with a vast amount of descriptive content can we begin to answer those questions. If we were to merely scratch the surface regarding the Mona Lisa, we would note it is:

  • By da Vinci
  • Around 1504
  • Other names: La Gioconda, La Joconde
  • Oil on wood
  • Of a Florentine woman
  • "The perfect beauty of a woman"
  • Kind of dark
  • In the Louvre
  • Stolen once
  • Picture of a road with a woman in front of it
  • Not photorealistic
  • Artist is dead
  • Not for sale

 

[Figure: Mona Lisa, by Leonardo da Vinci, ca. 1504]

And so on, quite literally ad infinitum. It's easy to see that the number of ways to describe an object requires infinitely more space than the object itself. As Kant might put it, we humans are unable to conceive of a "thing-in-itself" (an object, in our terms) outside of the dimensions of space and time in which we are trapped; but space and time being infinite, at least according to our perceptions, implies that there are an infinite number of ways of perceiving the thing-in-itself.

For an example perhaps more quantifiable in terms of information theory, consider Homer's Iliad. Composed of 24 books, each containing several hundred lines of poetry, the Iliad in and of itself is a fairly compact thing. However, for nearly three thousand years since its original composition people have been writing about Homer and his poetry, from nearly every conceivable standpoint: histories, grammars, analyses of diction, archaeology, theatre and movie adaptations, and so on.

This is what we mean when we say that the relationship between rows and columns is inverted. For everything -- person, artwork, web page, whatever -- there ought to be limitless ways to describe, categorize, and critique it. That being the case, there ought to be ways to put this metadata to truly good use, to find and compare the data you really need.
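One way to picture the inversion is to stop thinking of a painting as a row with fixed columns and treat it instead as an open-ended set of descriptors that anyone can add to. The sketch below does that and answers "show me more like it" by comparing descriptor sets with Jaccard similarity (shared descriptors divided by total distinct descriptors). The similarity measure is my choice for illustration, and the descriptor strings are drawn from the table and list above, with the obvious "artist is dead" extended to the other painters.

```python
# Each artwork is an open-ended set of descriptors, not a fixed row.
CATALOG = {
    "Mona Lisa": {
        "artist: da Vinci", "country: Italy", "medium: oil",
        "artist is dead", "in the Louvre", "stolen once",
        "not for sale", "not photorealistic",
    },
    "Marilyn": {
        "artist: Warhol", "country: USA", "medium: silk screen",
        "artist is dead",
    },
    "Worship of Venus": {
        "artist: Titian", "country: Italy", "medium: oil",
        "artist is dead",
    },
    "Union of Earth and Water": {
        "artist: Rubens", "country: Netherlands", "medium: oil",
        "artist is dead",
    },
}


def jaccard(a: set, b: set) -> float:
    """Shared descriptors divided by all distinct descriptors."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def more_like(title, catalog):
    """Rank every other work by descriptor overlap with `title`."""
    target = catalog[title]
    scored = [
        (jaccard(target, tags), name)
        for name, tags in catalog.items()
        if name != title
    ]
    return [name for _score, name in sorted(scored, key=lambda t: (-t[0], t[1]))]


print(more_like("Mona Lisa", CATALOG))
```

Adding a new way of describing a painting is just adding another string to a set; no schema has to change, and every added descriptor gives the comparison more to work with.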

With enormous disks now yawning before us, we have the technical ability to dramatically increase the metadata/data ratio. Today, that ratio is probably .001 at most (e.g., a million-record customer database may have anywhere from a hundred to -- at most -- a thousand columns). As the ratio approaches 1, imagine how useful our repositories of information will become; as we invert the current number (making it 1000), whole new capabilities will arise, as we shall see.
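For the record, the ratio is simply the number of descriptive categories divided by the number of data items. A tiny sketch, with a function name of my own invention:

```python
def metadata_ratio(categories: int, items: int) -> float:
    """Descriptive categories (columns) per data item (row)."""
    return categories / items


# Today's customer database: 100 to 1,000 columns over a million records.
print(metadata_ratio(100, 1_000_000))
print(metadata_ratio(1_000, 1_000_000))

# The inverted case: a thousand descriptors for a single object.
print(metadata_ratio(1_000, 1))
```

Going from the second figure to the third is a change of six orders of magnitude.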

Does this mean that we'll be employing armies of people keying in metadata? Certainly not. We have basic classification and taxonomy schemes -- like the Dewey Decimal System -- already in place. And through the greatest innovation in computing since the database -- the hyperlink -- we already have enormous numbers of associative connections between data. Lastly, human beings are inveterate content creators, collaborators, and critics, so new descriptive content appears constantly. (I wonder just how many new books are being written about Homer, around the world, this very moment.)

Still, what is necessary is a computer-specifiable way to describe all the different sorts of metadata there are such that the tools which follow can manipulate it, search it, and exchange it. This is not just a "better database schema" but rather a way to wrap everything that is known about an object, and to enable it to constantly expand and grow, for if there is one constant about the universe, it's that we gain more knowledge over time.

At this point the solution to one of the grand challenges of computing will be at hand. I can say, "Show me some art," and the computer's logic will first intuit that that really means "show me some art that I will like," and from there will apply all sorts of heuristics which act upon the accumulated metadata, such as:

  • Art that I've looked at before
  • Art that my friends like
  • Art that other people like me like
  • Art that will fit in my house
  • Art that I can afford (which will really constrain the set!)
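A rough sketch of how such heuristics might act on metadata follows. Everything here is hypothetical: the `Artwork` and `Viewer` fields, the sample data, and the one-point-per-heuristic scoring are stand-ins I made up. The "other people like me" heuristic is left out because it needs a notion of similarity between people that this toy does not model.

```python
from dataclasses import dataclass, field


@dataclass
class Artwork:
    title: str
    price: float
    width_cm: float
    liked_by: set = field(default_factory=set)


@dataclass
class Viewer:
    name: str
    budget: float
    wall_width_cm: float
    friends: set = field(default_factory=set)
    seen: set = field(default_factory=set)


def score(art: Artwork, viewer: Viewer) -> int:
    """One point for each heuristic the artwork satisfies."""
    points = 0
    if art.title in viewer.seen:              # art that I've looked at before
        points += 1
    if art.liked_by & viewer.friends:         # art that my friends like
        points += 1
    if art.width_cm <= viewer.wall_width_cm:  # art that will fit in my house
        points += 1
    if art.price <= viewer.budget:            # art that I can afford
        points += 1
    return points


def show_me_some_art(catalog, viewer):
    """Titles ordered from best match to worst, ties broken by title."""
    ranked = sorted(catalog, key=lambda art: (-score(art, viewer), art.title))
    return [art.title for art in ranked]


# Invented sample data.
me = Viewer("me", budget=500, wall_width_cm=120,
            friends={"ann", "bob"}, seen={"Poster A"})
catalog = [
    Artwork("Mural B", price=20000, width_cm=400, liked_by={"bob"}),
    Artwork("Print C", price=150, width_cm=90),
    Artwork("Poster A", price=40, width_cm=60, liked_by={"ann"}),
]
print(show_me_some_art(catalog, me))
```

Each heuristic is only as good as the metadata behind it. The scoring function cannot ask whether a painting fits on the wall unless someone, or something, has recorded how wide it is.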

Now what's really exciting is the possibility that we can teach computers to formulate metadata -- that is, perform their own categorization -- on their own. What is it about the art that I like? Maybe I only like art with a certain low reflectivity, because I don't like glare, or maybe I like only paintings that will fit on my wall.

We'll have more to say about this in a moment, but it's now easy to see that as both humans and computers start generating metadata at an explosive rate, even 200TB drives won't be able to hold it all, and so the intelligent exchange of metadata -- not unlike the weblog protocol RSS -- will permit its sharing. This is tremendously significant, since it means that our computers will be the repositories of how we think about the world. Moreover, since presumably metadata will fly back and forth depending upon its currency, its state at any given time will reflect the worldview of the human race.

One of the foundations of human intelligence lies in our ability to detect patterns; put in our language, that is the ability to distinguish metadata (classifiers, categories) and to rank an object's relative conformance to those criteria ("hey, all those balls are blue, mostly").

When computers can detect these same patterns, or different ones (better still), their value to us will be immensely magnified. They will be intelligent; we will see them as our indispensable assistants, always ready with information we need, when we need it.

We're not there yet. Indeed, we have yet to really begin this journey. First we have to create the mechanisms for storing all this metadata, and then build the intelligent mechanisms to search it, examine it, and retrieve the appropriate data. Only then can we layer on the inductive pattern determination and recognition which will be so crucial to computers' utility to us.

We haven't begun the journey, but parts of the road are becoming clearer.

 

References:

 

  1. Jim Gray and Dave Patterson, "A Conversation with Jim Gray," ACM Queue, June 2003. Fascinating discussion about the future of computer storage which inspired this essay. Online at http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=43

  2. Edgar Codd, "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM, Vol. 13, No. 6, June 1970. Reprinted online at http://www.acm.org/classics/nov95/toc.html

  3. Dave Winer, "RSS 2.0 Specification," Harvard Law School. Online at http://blogs.law.harvard.edu/tech/rss.

This page was last updated: Friday, April 16, 2004 at 1:58:51 PM
Copyright 2009 Barry Briggs