TeleRead: Bring the E-Books Home

News & views on e-books, libraries, publishing and related topics
September 2nd, 2009

Google Book Search’s metadata is a disaster

By Paul Biba

The Chronicle of Higher Education has an article by Geoffrey Nunberg, an adjunct full professor at the School of Information at the University of California at Berkeley. He demonstrates how Google’s poor use of metadata for its scanned books will make work extremely difficult for scholars who need to search the book database. A really fascinating read about a subject I bet none of us have thought about.

… But to pose those questions, you need reliable metadata about dates and categories, which is why it’s so disappointing that the book search’s metadata are a train wreck: a mishmash wrapped in a muddle wrapped in a mess. …

A search on “Internet” in books published before 1950 produces 527 results; “Medicare” for the same period gets almost 1,600. Or you can simply enter the names of famous writers or public figures and restrict your search to works published before the year of their birth. “Charles Dickens” turns up 182 results for publications before 1812, the vast majority of them referring to the writer. The same type of search turns up 81 hits for Rudyard Kipling, 115 for Greta Garbo, 325 for Woody Allen, and 29 for Barack Obama. (Or maybe that was another Barack Obama.) …

Then there are the classification errors, which taken together can make for a kind of absurdist poetry. H.L. Mencken’s The American Language is classified as Family & Relationships. A French edition of Hamlet and a Japanese edition of Madame Bovary are both classified as Antiques and Collectibles (a 1930 English edition of Flaubert’s novel is classified under Physicians, which I suppose makes a bit more sense.) An edition of Moby Dick is labeled Computers; The Cat Lover’s Book of Fascinating Facts falls under Technology & Engineering. And a catalog of copyright entries from the Library of Congress is listed under Drama (for a moment I wondered if maybe that one was just Google’s little joke). …

It’s clear that Google designed the system without giving much thought to the need for reliable metadata. In fact, Google’s great achievement as a Web search engine was to demonstrate how easy it could be to locate useful information without attending to metadata or resorting to Yahoo-like schemes of classification. But books aren’t simply vehicles for communicating information, and managing a vast library collection requires different skills, approaches, and data than those that enabled Google to dominate Web searching.


5 Responses to “Google Book Search’s metadata is a disaster”

  1. I guess Google never metadata it didn’t like.

  2. Google Book Search is certainly a disaster – but their policy of not making books available to non-USA viewers is just shameful. There are ways around this – but ones we won’t tell Google! Google is quite prepared to take money from overseas advertisers and overseas users, but makes sure any value stays at home. So I don’t buy anything through Google adverts.
    I am not sure who designed their user interface, but I get the impression they weren’t a big book user!

  3. I think the two comments left at the linked site by:

    mightythylacine – September 02, 2009 at 07:02 pm

    and

    gsheldon – September 02, 2009 at 08:59 pm

    sum it up pretty nicely.

  4. Felix Torres Says:
    September 2nd, 2009 at 9:59 pm

    It isn’t quite true that Google’s crappy effort leaves scholars “no worse off” than if it didn’t exist because the terms of the infamous “class-action settlement” preclude anybody else being able to take a crack at doing the job right. Absent Google’s land-grab, the possibility would exist that some kind of public-private consortium might arise to do the job properly and still be economically viable while protecting traditional copyrights. The Google library, however, has in effect poisoned the well to the extent that even if Google fails to gain their desired monopoly, no alternative is likely to emerge any time soon.
    The only alternatives left in this timeline are the crappy Google monopoly or nothing at all.

    So much for doing no evil.

  5. I agree with Felix Torres that “Google’s land grab,” if it is successful, is likely to prevent or at least delay the arrival of an online digital library that would be much better. It’s one of the central arguments in the letter I mailed yesterday to the judge in this dispute. You can download that letter before the judge gets a chance to read it from here:

    http://inklingbooks.com/googlesettlement/files/JudgeChinLetter.pdf

    Here’s a key passage in that letter:

    ****
    Still worse for research is Google’s quirky scheme to limit online display to only part of a book. Imagine a book from the 1930s in which a historian is describing some commonly held point of view. Just before his description ends, a researcher’s quota of screen displays runs out. He fails to read the next page, which begins, “The evidence, however, indicates that what I’ve just described is not true.” That’s bad. If you want to understand what a book says, you need the book itself, not this settlement’s castrated slice of it.

    Keep in mind that I’m not against displaying books online. I use them all the time. What I’m against is a scheme that violates copyright on a vast scale, creating a vast collection that benefits no one but Google, and then in an odd spasm of self-flagellation, appeases its guilt by ripping text out of its original context and stripping out maps, graphs, and material written by others (“inserts”). Books treated in such a brutal and cavalier fashion are virtually worthless for research.

    I don’t want to criticize without being constructive. What’s needed isn’t what Google wants to create, a kleptomaniac mad man’s attic cluttered, willy-nilly, with titles stolen (in copyright terms) from libraries. What’s needed is the sort of library we create for ourselves, one based on an intelligent selection of books that really matter, books that haven’t been mutilated in some mechanical fashion. Most important of all, we need a digital library that grows carefully over time rather than one that’s thrown together in haste like this one. Built wisely, the effort to locate an author and get the proper permission won’t require as much expense or trouble as this insane effort to digitize and database every paper text.

    *****

    It may be a bit vain to like your own words, but I love the expression, “a kleptomaniac mad man’s attic cluttered, willy-nilly, with titles stolen (in copyright terms) from libraries.” It’s hard to come up with a more apt expression of what Google is doing.
