The Chronicle of Higher Education has an article by Geoffrey Nunberg, an adjunct full professor at the School of Information at the University of California at Berkeley. He demonstrates how Google’s poor use of metadata for its scanned books will make work extremely difficult for scholars who need to search the book database. A really fascinating read about a subject I bet none of us have thought about.

… But to pose those questions, you need reliable metadata about dates and categories, which is why it’s so disappointing that the book search’s metadata are a train wreck: a mishmash wrapped in a muddle wrapped in a mess. …

A search on “Internet” in books published before 1950 produces 527 results; “Medicare” for the same period gets almost 1,600. Or you can simply enter the names of famous writers or public figures and restrict your search to works published before the year of their birth. “Charles Dickens” turns up 182 results for publications before 1812, the vast majority of them referring to the writer. The same type of search turns up 81 hits for Rudyard Kipling, 115 for Greta Garbo, 325 for Woody Allen, and 29 for Barack Obama. (Or maybe that was another Barack Obama.) …

Then there are the classification errors, which taken together can make for a kind of absurdist poetry. H.L. Mencken’s The American Language is classified as Family & Relationships. A French edition of Hamlet and a Japanese edition of Madame Bovary are both classified as Antiques and Collectibles (a 1930 English edition of Flaubert’s novel is classified under Physicians, which I suppose makes a bit more sense.) An edition of Moby Dick is labeled Computers; The Cat Lover’s Book of Fascinating Facts falls under Technology & Engineering. And a catalog of copyright entries from the Library of Congress is listed under Drama (for a moment I wondered if maybe that one was just Google’s little joke). …

It’s clear that Google designed the system without giving much thought to the need for reliable metadata. In fact, Google’s great achievement as a Web search engine was to demonstrate how easy it could be to locate useful information without attending to metadata or resorting to Yahoo-like schemes of classification. But books aren’t simply vehicles for communicating information, and managing a vast library collection requires different skills, approaches, and data than those that enabled Google to dominate Web searching.

 