Terabytes, Petabytes and Metadata
Speculations on the Future of Computing
By
Barry Briggs
September 2003
© Barry Briggs All Rights Reserved
Disk capacities double every year, and thus within a decade
or so we will be seeing commodity 200-terabyte disks in home PCs. That
is an enormous amount of data: by comparison, the entire contents of the United
States Library of Congress requires something around a tenth of that, around 20
terabytes, give or take a few hundred gigs. (It's somewhat humbling to think that
a person's lifework, say, the collected works of Shakespeare, can easily fit on
a floppy disk.)
Put a different way, in 200 terabytes one can store the
entire accumulated knowledge of the human race with much more than half the
disk to spare.
All this capacity raises two fundamental questions: if we
can put all this knowledge on a single commodity disk, how will we ever find
anything? And if all that only requires a fraction of the available space, what
will we use the rest for?
The answers, it turns out, are related; and, I think, as we
outline them we will also begin to see the shape of the next great revolution
in computing.
Let's start with the first question, how in a great ocean of
information we will ever be able to find anything. To date, there have been two
fundamentally different approaches to retrieving data, and they are both wholly
inadequate to the coming challenge. SQL, a thirty-year-old technology, requires
that data be first structured into tables, rows and columns. Having done so -- a
nontrivial task -- queries of the form "find me all employees in California making over $100,000" become possible.
Here's how that table might look:
| ID     | Name          | Address        | Town      | State | Salary |
|--------|---------------|----------------|-----------|-------|--------|
| 100001 | Joe Schmoe    | 1 Elm Street   | Anytown   | CA    | 100000 |
| 100002 | Mary Schmary  | 1 Maple Street | Everytown | MA    | 85000  |
| 100003 | Jane Schmane  | 1 Oak Street   | Sometown  | AR    | 110000 |
| 100004 | Bill Schmill  | 1 Palm Street  | Manytown  | HI    | 50000  |
| 100005 | Frank Schmank | 1 Cherry St    | Notown    | FL    | 75000  |
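As a rough illustration of the kind of query described above, here is a minimal sketch that loads the five sample rows into an in-memory SQLite database through Python's standard `sqlite3` module. The table name `employees`, the column names, and the helper functions are assumptions made for this example; the essay does not specify a schema.

```python
import sqlite3

# The five sample rows from the table above.
EMPLOYEES = [
    (100001, "Joe Schmoe", "1 Elm Street", "Anytown", "CA", 100000),
    (100002, "Mary Schmary", "1 Maple Street", "Everytown", "MA", 85000),
    (100003, "Jane Schmane", "1 Oak Street", "Sometown", "AR", 110000),
    (100004, "Bill Schmill", "1 Palm Street", "Manytown", "HI", 50000),
    (100005, "Frank Schmank", "1 Cherry St", "Notown", "FL", 75000),
]


def build_db():
    """Create an in-memory database holding the sample table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, name TEXT, address TEXT, "
        "town TEXT, state TEXT, salary INTEGER)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?)", EMPLOYEES)
    return conn


def names_in_state_over(conn, state, salary):
    """Names of employees in `state` earning strictly more than `salary`."""
    rows = conn.execute(
        "SELECT name FROM employees WHERE state = ? AND salary > ? ORDER BY id",
        (state, salary),
    )
    return [name for (name,) in rows]


conn = build_db()
print(names_in_state_over(conn, "CA", 100000))
print(names_in_state_over(conn, "CA", 99999))
```

With only these five rows, the strict "over $100,000" query for California comes back empty, because Joe Schmoe earns exactly 100,000; lowering the threshold by a dollar returns him. The point is that the structure has to exist before a question this precise can even be asked.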
Note that there are few columns and many rows. This is true of most,
if not all, production databases: the data rows vastly outnumber
the categories (columns, or "metadata") -- often by
three, four, or even five orders of magnitude.
In the next ten years we will invert this relationship.
The other approach to finding data is search, best
typified by the great Web engines: Google, Altavista, Overture. Although there
were and are ongoing attempts to categorize and/or add "semantics" to Web pages
now in existence, Web search remains largely a simple, brute-force operation.
First, all pages are scanned, their terms counted, and the pages ranked according to various algorithms (such as how many other pages link to each one). Then, when users
type in a word or phrase, the engine returns all the pages containing it.
Thus when I type the word "data" into Altavista, for
example, it returns some 58 million pages -- an utterly useless result.
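The brute-force approach just described can be sketched as a tiny inverted index. This is a toy under stated assumptions: the three pages are made up for illustration, and ranking by raw term count stands in for the link-based ranking that real engines use.

```python
import re
from collections import defaultdict

# Three made-up pages standing in for the Web (illustrative only).
PAGES = {
    "page1": "Data about disk capacity and data growth",
    "page2": "The Mona Lisa hangs in the Louvre",
    "page3": "Relational data lives in tables, rows and columns",
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(pages):
    """Map each term to {page: number of occurrences on that page}."""
    index = defaultdict(dict)
    for page, text in pages.items():
        for term in tokenize(text):
            index[term][page] = index[term].get(page, 0) + 1
    return index


def search(index, term):
    """Return every page containing `term`, most occurrences first."""
    hits = index.get(term.lower(), {})
    return sorted(hits, key=lambda page: (-hits[page], page))


index = build_index(PAGES)
print(search(index, "data"))
```

Nothing in the index knows what a page is about; it only knows which words appear where. Scale the same idea up to the whole Web and a one-word query matches millions of pages.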
We are now on the verge of a metadata revolution. Metadata
will allow us to find the data we are looking for; it will enable computers to
predict what we need; it will make computers vastly more valuable to us; and
over time metadata, coupled with new pattern-recognition algorithms, will
enable computers to take another step toward intelligence and even independent
thought.
Most database architects would cringe at the notion of
radically more metadata. But we are in pursuit of an essential truth here,
first articulated by no less an authority than Immanuel Kant, namely, that
for any object there are an infinite number of ways to describe it, to
classify it, and thus to compare it against other objects.
Consider the Mona Lisa. Database-style, we might design ways
to classify paintings that result in a table like this (we've only shown a few
sample rows):
| Artist   | Painting                 | Date | Country     | Medium      |
|----------|--------------------------|------|-------------|-------------|
| da Vinci | Mona Lisa                | 1504 | Italy       | Oil         |
| Warhol   | Marilyn                  | 1967 | USA         | Silk screen |
| Titian   | Worship of Venus         | 1518 | Italy       | Oil         |
| Rubens   | Union of Earth and Water | 1618 | Netherlands | Oil         |
With this table, we can issue queries like: "show me all
Italian paintings before 1600," and the like, but this table hardly captures
all the volumes of descriptive content that might be applied to the Mona Lisa:
descriptive content that might be used by humans and computers to find it and
compare it against other works of art. It's quite hard with the limited amount
of metadata to issue deeply useful queries like "show me more like it," or "how
is that like Titian?", or "who's Tony Bennett, anyway?".
Only with a vast amount of descriptive content can we
begin to answer those questions. If we were to merely scratch the surface
regarding the Mona Lisa, we would note it is:
- By da Vinci
- Around 1504
- Other names: La Gioconda, La Joconde
- Oil on wood
- Of a Florentine woman
- "The perfect beauty of a woman"
- Kind of dark
- In the Louvre
- Stolen once
- Picture of a road with a woman in front of it
- Not photorealistic
- Artist is dead
- Not for sale
Mona Lisa, by Leonardo da Vinci, ca. 1504
And so on, quite literally ad infinitum. It's easy to
see that recording all the ways to describe an object requires vastly
more space than the object itself. As Kant might put it, we humans are unable
to conceive of a "thing-in-itself" (an object, in our terms) outside of the
dimensions of space and time in which we are trapped; but space and time
being infinite, at least according to our perceptions, implies that there are
an infinite number of ways of perceiving the thing-in-itself.
For an example perhaps more quantifiable in terms of
information theory, consider Homer's Iliad. Composed of 24 books, each
containing several hundred lines of poetry, the Iliad in and of
itself is a fairly compact thing. However, for close to three thousand years since
its original composition people have been writing about Homer and his poetry,
from nearly every conceivable standpoint: histories, grammars, analyses of
diction, archaeology, theatre and movie adaptations, and so on.
This is what we mean when we say that the relationship
between rows and columns is inverted. For everything (person,
artwork, web page, whatever) there ought to be limitless ways to describe,
categorize, and critique it. That being the case, there ought to be ways to use
this metadata to find and compare the data you really
need.
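To make "show me more like it" concrete, here is a minimal sketch that compares objects by the descriptors they share, using Jaccard similarity (shared descriptors divided by all descriptors used by either object). The technique is my choice for illustration, not something the essay prescribes. The tags come from the paintings table and the Mona Lisa list earlier in the essay; treating each one as a flat tag is a simplifying assumption.

```python
# Descriptors taken from the paintings table and the Mona Lisa list.
# Flattening them into simple tags is an assumption made for this sketch.
METADATA = {
    "Mona Lisa": {"Italy", "oil", "1500s", "in the Louvre",
                  "stolen once", "not for sale"},
    "Worship of Venus": {"Italy", "oil", "1500s"},
    "Union of Earth and Water": {"Netherlands", "oil", "1600s"},
    "Marilyn": {"USA", "silk screen", "1900s"},
}


def jaccard(a, b):
    """Shared descriptors divided by all descriptors used by either object."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def more_like(name, metadata):
    """Rank every other object by descriptor overlap with `name`."""
    target = metadata[name]
    others = [other for other in metadata if other != name]
    return sorted(others, key=lambda other: (-jaccard(target, metadata[other]), other))


print(more_like("Mona Lisa", METADATA))
```

With three or four tags per painting the ranking is crude: the Titian comes first simply because it shares country, medium, and century. The more descriptors each object carries, the finer the distinctions such a comparison can draw.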
With enormous disks now yawning before us, we have the
technical ability to dramatically increase the metadata/data ratio. Today, that
ratio probably sits between 0.0001 and 0.001 (e.g., a million-record customer database
may have anywhere from a hundred to -- at most -- a thousand columns). As the
ratio approaches 1, imagine how useful our
repositories of information will become; as we invert the current number
(making it 1000), whole new capabilities will arise, as we shall see.
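The arithmetic behind these figures is simple enough to check directly. The sketch below takes the ratio to mean columns divided by rows, which is how the customer-database example reads; the function names are mine.

```python
def metadata_ratio(columns, rows):
    """Metadata/data ratio: descriptive categories per data record."""
    return columns / rows


def columns_for_ratio(ratio, rows):
    """How many categories a store with `rows` records needs to hit `ratio`."""
    return ratio * rows


rows = 1_000_000
print(metadata_ratio(100, rows))      # the hundred-column case
print(metadata_ratio(1_000, rows))    # the thousand-column case
print(columns_for_ratio(1000, rows))  # categories needed for a ratio of 1000
```

Read literally, a ratio of 1000 for a million-record store means on the order of a billion descriptive categories, which is why disk capacity matters to the argument.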
Does this mean then that we'll be employing armies of people
keying in metadata? Certainly not. We have basic classification and taxonomy
schemes -- like the Dewey Decimal System -- already in place. And through the
greatest innovation in computing since the database -- the hyperlink -- we
already have an enormous number of associative connections between data. Lastly, human
beings are inveterate content creators, collaborators, and critics, so new
descriptive content appears constantly. (I wonder just how many new books are
being written about Homer, around the world, this very moment.)
Still, what is necessary is a computer-specifiable way to
describe all the different sorts of metadata there are such that the tools
which follow can manipulate it, search it, and exchange it. This is not just a
"better database schema" but rather a way to wrap everything that is known
about an object, and to enable it to constantly expand and grow, for if there
is one constant about the universe, it's that we gain more knowledge over time.
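One possible shape for such a description, sketched under heavy assumptions: each object carries an open-ended list of statements (attribute, value, and where the statement came from), and the whole record serializes to JSON so that tools can search and exchange it. The class name, field names, and the `source` labels are all hypothetical; the essay does not propose a concrete format.

```python
import json
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Everything known about one object, stored as open-ended statements."""
    subject: str
    statements: list = field(default_factory=list)

    def add(self, attribute, value, source):
        """Append a new statement; no fixed schema limits the attributes."""
        self.statements.append(
            {"attribute": attribute, "value": value, "source": source}
        )

    def find(self, attribute):
        """Return every value recorded for `attribute`."""
        return [s["value"] for s in self.statements if s["attribute"] == attribute]

    def to_json(self):
        """Serialize the record so another machine can import it."""
        return json.dumps({"subject": self.subject, "statements": self.statements})

    @classmethod
    def from_json(cls, text):
        """Rebuild a record from the output of `to_json`."""
        data = json.loads(text)
        return cls(subject=data["subject"], statements=data["statements"])


mona = MetadataRecord("Mona Lisa")
mona.add("artist", "da Vinci", source="catalogue")       # hypothetical source
mona.add("other name", "La Gioconda", source="catalogue")
mona.add("other name", "La Joconde", source="catalogue")
mona.add("location", "Louvre", source="visitor note")    # hypothetical source

copy = MetadataRecord.from_json(mona.to_json())
print(copy.find("other name"))
```

Unlike a table row, the record has no fixed set of columns: a new kind of statement can be added at any time without redesigning anything, and an attribute can hold many values at once.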
At this point the solution to one of the grand challenges of
computing will be at hand. I can say, "Show me some art," and the computer's
logic will first intuit that that really means "show me some art that I will
like," and from there will apply all sorts of heuristics which act upon the
accumulated metadata, such as:
- Art that I've looked at before
- Art that my friends like
- Art that other people like me like
- Art that will fit in my house
- Art that I can afford (which will really constrain the set!)
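The heuristics in the list above can be sketched as a filter plus a score. Everything here is an assumption for illustration: the `Artwork` and `Viewer` fields, the weights, and the four placeholder works. The "people like me" heuristic is left out, since it would need data about other viewers.

```python
from dataclasses import dataclass


@dataclass
class Artwork:
    title: str
    price: int             # asking price, arbitrary units
    width_cm: int
    viewed_by_me: bool     # have I looked at it before?
    liked_by_friends: int  # how many of my friends like it


@dataclass
class Viewer:
    budget: int
    wall_width_cm: int


def score(art, viewer):
    """Return None if the work is ruled out, else a preference score."""
    # Hard constraints: it must fit in the house and be affordable.
    if art.width_cm > viewer.wall_width_cm or art.price > viewer.budget:
        return None
    # Soft signals: prior interest and friends' taste (weights are arbitrary).
    return 2.0 * art.viewed_by_me + 1.0 * art.liked_by_friends


def show_me_some_art(catalog, viewer):
    """Titles of the works that pass the constraints, best score first."""
    scored = [(score(art, viewer), art.title) for art in catalog]
    kept = [(s, title) for s, title in scored if s is not None]
    return [title for s, title in sorted(kept, key=lambda p: (-p[0], p[1]))]


# Placeholder catalog; the works and numbers are invented for the example.
CATALOG = [
    Artwork("Work A", price=500, width_cm=80, viewed_by_me=True, liked_by_friends=1),
    Artwork("Work B", price=200, width_cm=60, viewed_by_me=False, liked_by_friends=4),
    Artwork("Work C", price=90000, width_cm=70, viewed_by_me=True, liked_by_friends=9),
    Artwork("Work D", price=300, width_cm=400, viewed_by_me=True, liked_by_friends=2),
]

print(show_me_some_art(CATALOG, Viewer(budget=1000, wall_width_cm=200)))
```

Every input to the score is itself a piece of metadata, about the work or about the viewer. Without it, "show me some art" has nothing to act on.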
Now what's really exciting is the possibility that we can
teach computers to formulate metadata -- that is, perform their own
categorization -- on their own. What is it about the art that I like? Maybe
I only like art with a certain low reflectivity, because I don't like glare, or
maybe I like only paintings that will fit on my wall.
We'll have more to say about this in a moment, but it's now
easy to see that as both humans and computers start generating metadata at an
explosive rate, even 200TB drives won't be able to hold it all, and so the
intelligent exchange of metadata -- not unlike the weblog protocol RSS -- will
permit its sharing. This is tremendously significant, since it means that our
computers will be the repositories of how we think about the world. Moreover,
since presumably metadata will fly back and forth depending upon its currency,
its state at any given time will reflect the worldview of the human race.
One of the foundations of human intelligence lies in our
ability to detect patterns, that is, put in our language, the ability to
distinguish metadata (classifiers, categories) and to rank an object's relative
conformance to those criteria ("hey, all those balls are blue, mostly").
When computers can detect these same patterns, or different
ones (better still), their value to us will be immensely magnified. They
will be intelligent; we will see them as our indispensable assistants,
always ready with information we need, when we need it.
We're not there yet. Indeed, we have yet to really begin
this journey. First we have to create the mechanisms for storing all this
metadata, and then build the intelligent mechanisms to search it, examine it,
and retrieve the appropriate data. Only then can we layer on the inductive
pattern determination and recognition which will be so crucial to computers'
utility to us.
We haven't begun the journey, but parts of the road are
becoming clearer.
References:
- Jim Gray and Dave Patterson, "A Conversation with Jim
Gray," ACM Queue, June 2003. Fascinating discussion about the
future of computer storage which inspired this essay. Online at http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=43
- Edgar Codd, "A Relational Model of Data for Large Shared
Data Banks", Communications of the ACM, Vol. 13, No. 6, June
1970. Reprinted online at http://www.acm.org/classics/nov95/toc.html
- Dave Winer, "RSS 2.0 Specification," Harvard Law School. Online at http://blogs.law.harvard.edu/tech/rss.