On October 1, Distributed Proofreaders quietly celebrated its fifth birthday. Distributed Proofreaders, in case you did not know, is the principal supplier of etexts to Project Gutenberg. It is a web-based environment for the computer-assisted, distributed proofreading of OCR-ed texts.

Project Gutenberg now hosts more than 16,000 ebooks, but until recently its catalog numbered in the hundreds. Much of that enormous growth came from a web application written by Charles Franks, which offered a way to contribute to readers who had always wanted to give back to the project, but had thus far been scared away by the long production time of any given ebook.

That web application was Distributed Proofreaders: a site where a volunteer could correct a single page of a book if that was all they had time for.

Michael Hart started Project Gutenberg in 1971 with the publication of an electronic version of the US Declaration of Independence, and in the process invented the ebook and the spam run (he sent the file to all hundred computers on the internet). In 1990, almost twenty years later, Project Gutenberg had 100 etexts to offer its visitors.

Until that time, the project’s volunteers pretty much did their own thing; there was a nebulous person at the equally opaque Project Gutenberg that they talked to, but other than that it was monk’s business as usual. In 1999, however, Greg Newby and Pietro Di Miceli introduced forums where volunteers could talk to each other. And apparently talk they did, coming up with new ways to share the burden of producing etexts.

“It was in this new, expansive atmosphere,” as Jim Tinsley writes in the preface to The Project Gutenberg FAQ 2002 (recommended reading for the introduction alone!) “with ideas flooding in from enthusiasts newly energized by the project, that Charles Franks (Charlz) came up with the idea of a web site that would serve to distribute the work of proofing a book among many volunteers. But not only did he think of the concept; he went ahead and did it!”

Distributed Proofreaders was run from a computer in Charles Franks’ home, until sometime in 2002 the Internet Archive offered to host the site instead. On November 8, Charles Franks and another volunteer, Charles Aldorondo, decided to try a little experiment to stress test the new hardware and to get more volunteers involved. They sent a story to the highly popular “News for Nerds” website Slashdot, knowing full well that if it got published, Distributed Proofreaders would experience a Slashdotting: an increase in visitors so high that only the strongest webservers could survive.

As it happened, the story did get published, the server held up, the old guard volunteers almost did not, and Distributed Proofreaders corrected a hundred etexts within a few days. Membership skyrocketed from 841 to 5,156 within a week.

Over the next few days I will be talking a little more about the history of Distributed Proofreaders, highlighting special books we produced, the sorts of texts a volunteer can run into, and the community that Distributed Proofreaders also is. And squirrels.

In all fairness, Distributed Proofreaders was not the first of its kind: the Christian Classics Ethereal Library, a “Project Gutenberg” for Christian etexts, uses a similar system that predates Charles Franks’ software.

5 COMMENTS

  1. The work being performed by the volunteers at “Distributed Proofreaders” is magnificent. Yet, I think that it is time to consider a new supplementary strategy for making public domain books available. Instead of waiting for each book to be proofread, why not release books in a preliminary form as a collection of scans together with the best current results from an automated optical character recognition analysis. This approach will allow for the release of myriad books. Each book should only require a few megabytes of storage using compression methods specialized for text. This is acceptable now because hard-disk and flash memories are capacious and relatively inexpensive. Also, storage costs continue to decline. I suspect that Google and Yahoo may take this approach because human proofreading is slow and expensive.

    Will the “OpenReader” format support a collection of scans together with OCR text that can be used for searching? The OCR text would be supplemented with “index” type information that would allow a search result to point to a specific location on a scanned page.
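    The index-plus-coordinates idea described above can be sketched as a simple mapping from recognized words to bounding boxes on the page scans. Everything below — the class name, field layout, and coordinate scheme — is a hypothetical illustration, not part of OpenReader, DjVu, or any real format:

    ```python
    # Hypothetical sketch: each OCR word is stored with the page it appears
    # on and its bounding box on the scan, so a text search can point back
    # to a location on the page image. Names and structure are illustrative.
    from collections import defaultdict

    class OcrIndex:
        def __init__(self):
            # lowercased word -> list of (page, x, y, width, height)
            self._index = defaultdict(list)

        def add_word(self, word, page, bbox):
            self._index[word.lower()].append((page, *bbox))

        def find(self, word):
            """Return every (page, x, y, w, h) where the word was recognized."""
            return self._index.get(word.lower(), [])

    # Example: index two occurrences of a word, then search for it.
    idx = OcrIndex()
    idx.add_word("Gutenberg", page=12, bbox=(120, 340, 90, 14))
    idx.add_word("Gutenberg", page=57, bbox=(80, 610, 90, 14))

    hits = idx.find("gutenberg")
    # Each hit tells a reader which page scan to open and where to highlight.
    ```

    A reader application could then open the scan for page 12 and draw a highlight at the stored rectangle, which is essentially what DjVu’s hidden text layer enables.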

  2. Garson, having quality and completeness levels, and including scans, would certainly seem to be useful, especially if it will enable you to lift a document to a next level. This should allow you to release early and often, as the FOSS people like to say.

    In a sense (a very vague and loose sense) this is already happening; many projects release tens of thousands of documents in scan form with uncorrected OCR, lots of scientific projects release a few heavily marked-up documents, and Distributed Proofreaders seems to be somewhere in between, with typographical characteristics often preserved in HTML versions. (HTML versions are not obligatory, but by my last count approximately 75% of the ebooks coming out of DP are accompanied by an HTML version.)

    As for pointing to locations on a scanned page, I know it is possible, but it needs to be implemented. If it needs to be implemented at DP, it needs to be implemented in the entire system, which is going to be a lot of work. Personally, I’d prefer to work first on moving to a generic XML/SGML format for DP that retains most typographical information.

  3. Branko, Thanks for your volunteer work and thanks for the link to the discussion of the DjVu format and XML. Several years ago I visited the website of the “Million Book Project” when it was active at Carnegie Mellon. Unfortunately, the project seemed to be stalled and inactive when I visited. It is good to see that Brewster Kahle and the Internet Archive are moving forward together with Canadian libraries.

    I agree that retaining typographical information when proofreading and converting text into a markup language would be highly desirable. The HTML versions from Distributed Proofreaders that preserve some typography are great. Preserving precise information about publishers, edition numbers, print runs etc is also important as David Rothman has emphasized on this blog.

    The digitization of millions of volumes is such a massive task that I think human proofreaders will only be able to perform markup on a subset of the volumes. For many books only limited formats will be available, e.g., scans, OCR data, and DjVu. I hope a widely deployed free reader with an openly specified format will be available for these books.
