Joseph Esposito decided to take a look at Jane Austen’s works, as represented by Google versions.  He comes to the conclusion that Google is doing the reading world a real disservice.  Here’s part of what he says in his article in The Scholarly Kitchen:

Rather than pay for the Penguin or any other edited version of Austen, I decided to be a cheapskate and searched for free Google versions.  And that’s when things began to go wrong.  The Google editions were packed with errors. If I were not studying Google Ebooks for professional reasons, if I were not already familiar with the works of Austen, would I have gone on? Would I have thought that Austen does not know how to place quotation marks, that she made grammatical mistakes that would embarrass even a high school freshman, or that her dialogue sometimes breaks off without explanation?  I began to wonder what service or disservice Google had performed, rendering one of the world’s most popular writers in a form as bizarre as the Zemblan translation of Shakespeare in Nabokov’s “Pale Fire.”

The problems with the Google versions of Austen potentially stem from four sources, though it is the third of these that is the principal culprit:

  1. The original print edition. Except for those books sent to Google directly by publishers, the books found in Google Ebooks derive from Google’s mass digitization project of library collections. The assumption is that if a university library puts a book on its shelf, that book must be okay.  This is a bad assumption, however.  Publishers make mistakes, libraries make mistakes; over time the seriousness of these mistakes becomes more apparent.  Texts have to be reviewed; sometimes explanatory notes are necessary to provide context.  Take a look at the public-domain 1911 edition of Encyclopaedia Britannica (not part of Google Ebooks) and ask yourself if some people will confuse the text with a modern one.
  2. Digital scans of the print edition. Digital scanning has gotten very good; as far as I know, Google does as good a job as anyone at this.  But scanning can nonetheless introduce errors and odd artifacts and in any event does not provide a text in a modern typeface that can fit any size screen.  This may not be a problem for a scholar working with obscure works, but for popular works like “Pride and Prejudice,” this raises a real barrier to readership.
  3. Optical character recognition (OCR). OCR has come a long way. Each Google Ebook of a scanned text is accompanied by an OCR’d version, which allows the user to change fonts and type size and reflow the text. Unfortunately, some of the OCR for the Austen volumes I read was simply terrible. Words got mashed together, spacing was bizarre, punctuation was simply not picked up, etc., etc. The problem for Austen is that it is the OCR version, not the scanned pages, that most readers are likely to use, as they will be reading from mobile devices with tiny screens, which require reflowed text.
  4. Metadata. The metadata for the public domain works in Google Ebooks is atrocious.  Geoffrey Nunberg has written forcefully about this.  To his comments I would add that the crime is worse for popular works such as Austen’s.  Scholars can struggle with poor metadata, but someone who might pick up a classic but once in his or her life is not likely to make much of an effort.  My experience with “Mansfield Park” is not atypical.  I began to read the work (in the corrupt OCR version) and then came to what appeared to be the end.  But it was marked the “end of Volume One.” There was no Volume Two.  I had a similar experience when researching books about the Beatles and encountered a title and cover for a Beatles book, but inside was a musical score by Mozart.

I wish to be clear that I am restricting my criticism to the small number of literary classics that continue to have a popular readership.  Google has done a disservice to these works and their readers.  Free is a terrible price, as many readers will flock to these free editions — not knowing that other things are not equal — bypassing the edited volumes prepared by scrupulous publishers.

Thanks to BookofJoe for the link.

17 COMMENTS

  1. This is just silly. There are wonderful error-free versions available at no cost from places other than Google Books. For example, the handcrafted versions in the mobileread.com ebook library and the amazing free ebook library at the University of Adelaide. Bad editions are bad for all of the reasons stated in the article, but that does not mean that good ones are not available.

    As an English major at university, I spent literally thousands of dollars on books that are now available for free. Is there a lack of metadata? Absolutely. Does it matter? Usually not.

  2. “The terrible price of free … ”

    This issue has absolutely nothing whatsoever to do with ‘free’. This is about Google Editions.
    It is clear from the many articles on the topic that paid-for eBooks are suffering from exactly the same kinds of issues, despite not being free at all.

  3. Just want to agree with the comments already made – free eBooks, notably those from Project Gutenberg, can be of very high quality, while purchased eBooks can exhibit all the flaws described above.

    I do appreciate this discussion – amid all the frenzy about devices, pricing models, and DRM, the issue of quality hasn’t received enough attention.

  4. Yep. What everybody else says. This incriminates Google, but not “free.” Gutenberg and Distributed Proofreaders are really wonderful services to literature, and exactly what the Internet should be all about.

  5. Nobody else is RTFA :). He mentions Project Gutenberg, actually bothering to give it a link.

    So thanks for letting me know I need to discount the comments here somewhat, even when Howard’s keyword-matching non-sequiturs aren’t among them :).

  6. While I haven’t checked recently, the PG version of “Heart of Darkness” had an error in the first sentence. Of course, some people might say the lack of italics is just a holdover from PG’s plain-vanilla days and isn’t truly an error, but I’d count it as a reason to avoid PG. Other texts may or may not have the same issue with missing italics, or use the horrible practice of substituting quotes or all caps.

  7. It’s probably self-evident, but there are a lot of volumes which PG won’t get to any time soon (because of lack of manpower). I’m just throwing out numbers, but I would estimate that for every PG title, there are probably 1000 public domain titles which haven’t made it to PG. Imperfect editions are much better than nothing.

  8. I’ve noticed a lot of evident digital conversion errors in contemporary e-books published by big (and little) houses as well. They’re unlikely to improve until publishers are willing to devote editors to ebooks – as it is, editors are used for print editions only and e-conversion is handled out-of-house or by the production and/or design depts.

  9. The metadata for the public domain works in Google Ebooks is atrocious.

    As if metadata in professional publishers’ e-books is better… Maybe the text inside the book is better, but anyone working with tools that can read metadata in e-books knows what kind of garbage is inside the metadata of the e-book.

  10. The best-known books at Project Gutenberg (PG) were done in the early years of the project, before Distributed Proofreaders was started. In those days, one person did a whole book. There was no proofreading, no review. Some texts are fine, some are terrible. Distributed Proofreaders (DP) did better immediately and has since improved (a text is now given four to seven rounds of review).

    DP is re-doing some of the early PG texts, but it’s a slow process.

    PG isn’t a guarantee of quality, but recent publications are usually better than even commercial ebooks.

  11. @Robert: “Imperfect editions are much better than nothing.” As the article suggests, I wonder if that’s true.

    It pains me to see discussions of bad scan-and-OCR, when a perfectly workable solution to the problem has been available for years: higher-resolution scans of larger type. I used to routinely scan and OCR books and documents, and I discovered that increasing the font size on a simple copier by even 10–15% resulted in better scannability. Paperbacks can often be increased in size over 20% and still fit neatly on a letter-sized piece of paper.

    Further, scanning the copier pages was easier than scanning the book pages, and using higher-resolution (300 DPI at least) better captured those characters and resulted in far fewer errors.

    The bulk of scan-and-OCR problems seen today are a result of not taking either of these common-sense steps, mainly because lower resolutions and one-step scanning of book pages take less time. But the steps I outlined severely cut down on the time needed for extensive proofing, so they are worth the extra effort (really, just one extra step).

    Google, of course, is not the only culprit here: Most backlist books are being scanned by this inaccurate low-res method by third-party contractors. Hopefully, sooner or later everyone will get the message that some simple improvements to their methods can result in significantly better copy.

  12. Google ebooks’ epub editions are, alas, worthless. No human eyes appear to have corrected a single page of any of them.

    Google ebooks’ pdf editions are, on the other hand, priceless. These are basically images of the scanned pages, and so are error-free reproductions. They also allow one of a scholarly bent to compare various editions, say, of Austen’s works. The files are big, and not to be read on small devices. But I love them and prefer them to PG titles if I’m reading on a laptop.

    Not mentioned here, though maybe in the original article: http://www.archive.org has lots of choices for public domain titles, including text versions (that you can correct yourself and resubmit), pdf versions, djvu, and online reading.
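The scan-quality advice in comment 11 reduces to simple arithmetic: a glyph’s height in pixels is roughly its point size (1 point = 1/72 inch) times the scan DPI, so either a higher DPI or a copier enlargement raises the pixel height OCR engines have to work with. A minimal sketch of that calculation; the 9 pt paperback body size and the DPI figures are illustrative assumptions, not measurements from the article:

```python
# Rough pixel height of scanned type: point size (1/72 in) times scan DPI.
# Used to compare a plain scan against comment 11's "enlarge first" workflow.

def glyph_height_px(point_size: float, dpi: int) -> float:
    """Approximate glyph height in pixels for type scanned at a given DPI."""
    return point_size / 72 * dpi

# A typical paperback body font (~9 pt, an assumed value) at two resolutions:
low = glyph_height_px(9, 200)    # 25.0 px
high = glyph_height_px(9, 300)   # 37.5 px

# Enlarging the page ~15% on a copier before a 300 DPI scan:
enlarged = glyph_height_px(9 * 1.15, 300)  # 43.125 px
```

The point of the sketch is only that the two steps compound: the enlarged 300 DPI scan yields glyphs roughly 70% taller in pixels than the plain 200 DPI scan, which is the kind of margin the commenter credits with fewer OCR errors.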