Recapturing public domain texts with ReCAPTCHA

August 14, 2008

When is a CAPTCHA not annoying? When it is used to help digitize old public domain texts.

CAPTCHAs (an overly cutesy acronym standing for “Completely Automated Public Turing test to tell Computers and Humans Apart”) are those tests that are supposed to verify you are a real human by making you type some distorted letters or numbers. This is meant to keep spambots from registering for accounts, making spam posts on forums, sniffing out email addresses, or doing other things that might be considered harmful.

Two Problems That Solve Each Other

The problem is that spambot software has gotten to the point where it can often correctly parse even distorted CAPTCHAs, unless they are so distorted that humans cannot read them either. Also, users are not fond of being forced to waste time squinting at jumbles of meaningless letters and numbers, especially older users whose eyesight is poor to begin with.

Meanwhile, volunteers digitizing old public domain texts have known for a long time that optical character recognition (OCR) systems are imperfect. They make lots of transcription mistakes, and the only way to iron them out is to have a human look at the words in question and correct them. This is the rationale behind distributed proofreading projects such as Distributed Proofreaders. But it is hard to find people who have the time and inclination to read and correct entire pages of old literature.

Sometimes, though, two problems will solve each other. Researchers at Carnegie Mellon noticed the similarities between the two situations and realized it would be possible to harness the “wasted time” spent on CAPTCHAs as a force for transcribing mis-scanned words. They built this idea into a system called ReCAPTCHA.

ReCAPTCHA

Under ReCAPTCHA, the system pairs an image of a word that the OCR program could not read with an image of a word whose transcription has already been verified, and asks users to type both of them. If the user types the known word correctly, the answer for the unknown word is assumed to be correct as well. Once several users give the same answer for an unknown word, the word is considered to be correctly identified.
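For readers who think in code, the pairing-and-voting idea described above can be sketched roughly as follows. This is only an illustration, not ReCAPTCHA's actual implementation: the class name `WordVerifier`, the agreement threshold of 3, and the case-insensitive matching are all assumptions made for the example, since the article does not give those details.

```python
from collections import Counter, defaultdict

# Assumed value for illustration: how many matching answers settle a word.
AGREEMENT_THRESHOLD = 3


class WordVerifier:
    """Hypothetical sketch of the control-word / unknown-word scheme."""

    def __init__(self, threshold=AGREEMENT_THRESHOLD):
        self.threshold = threshold
        self.votes = defaultdict(Counter)  # unknown word id -> answer tallies
        self.resolved = {}                 # unknown word id -> accepted text

    def submit(self, control_answer, control_truth, unknown_id, unknown_answer):
        """Return True if the user passes the challenge."""
        # The user is judged only on the word whose answer is already known.
        if control_answer.strip().lower() != control_truth.strip().lower():
            return False
        # A correct control word earns the user's other answer one vote.
        if unknown_id not in self.resolved:
            guess = unknown_answer.strip().lower()
            self.votes[unknown_id][guess] += 1
            # Enough identical answers: treat the word as identified.
            if self.votes[unknown_id][guess] >= self.threshold:
                self.resolved[unknown_id] = guess
        return True


if __name__ == "__main__":
    verifier = WordVerifier()
    for answer in ["upon", "apon", "upon", "upon"]:
        verifier.submit("morning", "morning", "scan-17", answer)
    print(verifier.resolved)
```

Note that a user who fails the known word contributes nothing to the tally, which is what keeps a bot's random guesses from polluting the transcription.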

ReCAPTCHA has proven remarkably resistant to CAPTCHA-cracking software, for precisely the same reason that the words needed to be checked by humans in the first place: OCR software just cannot read them, but humans can. And most casual users will consider it anything but a waste of their time.

According to the Ars Technica article, ReCAPTCHA is used by over 40,000 sites and handles over 100 million words per day. It is currently being used to help digitize books for the Internet Archive and older editions of the New York Times.

(Additional coverage: NPR, Slashdot)

4 COMMENTS

Kam August 14, 2008 at 8:29 pm

Some enterprising and generous soul could create a Google widget for such distributed proofing efforts.

Hadrien August 15, 2008 at 9:44 am

Love the concept, I use it during the registration process on Feedbooks.
