The Other Deep Web

Ashley and I went up to the California International Antiquarian Book Fair today. It was truly a humbling experience in many ways. While a large number of the cracked and creaking volumes at the fair were of little interest (there seemed to be an overabundance of antique fishing books for some reason), there were a few astounding gems.

Of particular interest were the original scientific volumes. A first edition copy of Newton’s “Opticks”. Treatises by Galileo, Copernicus, and Descartes. An original copy of Einstein’s publication introducing the theory of general relativity, and another introducing the photoelectric effect. A first edition of the King James Bible. A first edition copy of Ray Bradbury’s “Fahrenheit 451” (a limited edition bound in asbestos). But why do these books matter? We know the words, the ideas, and the illustrations. What’s left to captivate us? I pondered this while watching a woman nearly suffer an emotional implosion while examining some obscure volume of philosophy that obviously held some powerful sway over her.

I guess in some way, we all try to get closer to the original author of some book that meant something to us. To get inside their heads. Maybe if we can reach back in time far enough through these books, we think, we can actually touch their authors’ greatness (and maybe some of what they had will rub off on us). It starts innocently enough – a first edition here, a copy signed by the author there, an original marked-up copy of the manuscript, etc. – but sometimes it gets truly weird.

At some point today the obsessive nature of the antiquarian book trade became readily apparent when I found one vendor selling Robert Louis Stevenson’s matriculation card from the University of Edinburgh. It’s one thing to like the guy’s books; it’s entirely another to want to own the card attesting to his enrolment at the university. That’s kicking it up a notch. In another area, I found a copy of Mark Twain’s “The Celebrated Jumping Frog” originally owned by Theodore Roosevelt, along with a personal letter from Roosevelt attesting to how much it meant to him. How ironic.

Internet geeks have been talking for years about the “deep web” – the dark matter of the Internet universe that remains hidden from the pervasive prying eyes of voyeur search engines. While the term usually refers to data locked away in databases or curtained behind corporate firewalls, today it took on a new meaning for me. Today, it referred to all the “obsolete” knowledge trapped in ancient volumes that would never be visited by Googlebot or scanned by Amazon. Today, it referred to the collective emotional attachment of the human race to the books and authors they love. There’s no indexing that.

Book To The Future

Back when I wrote my book, I was surprised at the lack of sophistication in the publishing industry. I had always figured that the desktop publishing revolution would have streamlined the whole process – I envisioned elaborate templates and tools that would enable a publisher to easily choke down text and automatically pump out a finished book. Instead, the tools provided by my publisher consisted of a Word template that rendered everything (titles, headings, body text, etc.) as monospaced Courier – all of which was later laid out in QuarkXPress by hand.

Fast-forward to last week at Web 2.0: Brewster Kahle presented the seductive vision of universal access to knowledge that could be achieved by scanning the entirety of the Library of Congress for a pitiful $260 million. This revelation followed the announcement of Google Print, Google’s answer to Amazon.com’s Search Inside the Book feature, which will enable users to find information in books as part of their Google search experience.

While I applaud both Google’s and Brewster’s visions, I sense a gap: Brewster’s proposal will give digital access to books from the past; Google’s service will give (limited) digital access to books from the present. All I can wonder is: who will give digital access to books in the future?

While it is obvious that digitizing the Library of Congress is a manual procedure, it might come as a surprise that Google’s efforts are equally manual. Google generously offers to scan publishers’ content, thereby making it available via the Google Print service while keeping that content protected. Scanning. Just like Amazon.com. By hand. This means that when a future author’s books finally enter the public domain (currently 70 years after the author’s death under U.S. copyright law), Brewster’s organization will have to scan those books by hand – books that Google will probably already have in digital form.

All of these undertakings smack of massive amounts of physical (i.e. non-digital) labour. So, if Amazon.com and Google are both doing it, why not cut out the middleman? Why not just have the publishers provide the PDFs (or whatever the appropriate digital format is) of their content directly to Google or Amazon.com? Or, better yet, why not have the Library of Congress solicit electronic versions of books directly from publishers and escrow them until they enter the public domain, just as it does for physical copies? Aside from the Library of Congress’s efforts to digitize rare books, I don’t know whether they already do this – does anyone know?

My fear here is that Google and Amazon.com will amass a digital library of scanned books that will remain gated off from the public even once the books within it have entered the public domain. Do we really want to still be running Project Gutenberg in another hundred or so years? Probably not.

If the Library of Congress isn’t already cooperating with publishers to escrow electronic copies of books, wouldn’t it make sense for Google and Amazon.com to pledge to release their electronic copies to the public, the Library of Congress, or Brewster Kahle’s organization once the books are in the public domain? After all, it’s not like they even have to fulfill the pledge for another seventy years or so.

Does anyone know if this is already part of Google/Amazon/Brewster’s plans?