We’ve mentioned ReCAPTCHA a time or two—the security effort by Carnegie Mellon researchers that took two problems and made them solve each other: how to make a “CAPTCHA” (an automated Turing test meant to prove that a human wants to access the website rather than a spambot) that couldn’t be solved by a computer optical character recognizer, and how to digitize words in old documents that a computer’s OCR couldn’t puzzle out.

By showing web users these unrecognizable words, each paired with a word the computer already knew, it both tested whether they were real people and taught the system what the OCR-resistant words actually said. It was used to digitize the 130-year archives of the New York Times, one word at a time over millions of web interactions, and proved so successful that Google ended up buying it.
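To make the pairing idea concrete, here is a minimal sketch in Python. It is not ReCAPTCHA's actual code: the function names, the in-memory vote store, and the three-vote agreement threshold are all assumptions made for illustration. The user's answer to the known word decides whether they pass, and only then is their reading of the unknown word counted as a vote.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory store: readings submitted for each unknown word image.
votes = defaultdict(Counter)

def check_response(control_answer, control_guess, unknown_id, unknown_guess):
    """Return True if the user typed the known (control) word correctly.

    Only users who pass the control word get their reading of the
    unknown word recorded as a vote.
    """
    passed = control_guess.strip().lower() == control_answer.lower()
    if passed:
        votes[unknown_id][unknown_guess.strip().lower()] += 1
    return passed

def consensus(unknown_id, min_votes=3):
    """Return the most common reading once it has min_votes votes, else None.

    The threshold of 3 is an assumption for this sketch.
    """
    if not votes[unknown_id]:
        return None
    word, count = votes[unknown_id].most_common(1)[0]
    return word if count >= min_votes else None
```

Once enough people who passed the control word agree on a reading, consensus() returns it, and that reading could be treated as the digitized word.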

However, technology does march on. At DEFCON 18, a researcher presented a paper claiming that he had devised an algorithm that would solve ReCAPTCHAs 30% of the time. While that is only about a 1 in 3 success rate, a website generally pops up another ReCAPTCHA when an attempt fails, so a bot using this algorithm would only need to keep trying until it got one right.
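The arithmetic behind "keep trying" can be sketched in a few lines of Python. This assumes each attempt succeeds independently with the claimed 30% probability and that the site allows unlimited retries; neither assumption comes from the researcher's paper, and the function names are made up for illustration.

```python
def success_within(attempts, p=0.30):
    """Chance of at least one success in `attempts` independent tries."""
    return 1 - (1 - p) ** attempts

def expected_attempts(p=0.30):
    """Average number of tries until the first success (geometric distribution)."""
    return 1 / p

if __name__ == "__main__":
    for n in (1, 5, 10):
        print(f"{n:2d} tries: {success_within(n):.1%} chance of getting through")
    print(f"average tries needed: {expected_attempts():.2f}")
```

Under those assumptions, a bot gets through about 83% of the time within five tries and about 97% of the time within ten, needing a little over three tries on average.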

The researcher’s website is a little hard to read (it was formerly black on red, and the researcher, in a fit of pique at complaints of unreadability, changed it to an equally hard-to-read black on grey), but it includes links to the paper, the PowerPoint he presented at DEFCON, and a Flash video of the decoding in action.

Perhaps a little amusingly, on July 21, just before DEFCON, the researcher notes:

“I cracked the old one at 10% and wrote a research paper on it, then they changed it a week before my presentation at DEFCON 18, so I recracked it. Only this time at 30%. I will post detailed information on how to properly crack it after the conference.”

On recent ReCAPTCHAs I encountered, I had noticed there were some changes—the words were distorted, or parts of them had been flipped to negative view. I had wondered if something like this might have happened. It’s both a little sad and kind of ironic to ponder the spectacle of having to make words a computer formerly couldn’t recognize even harder for it to recognize.

I wonder if anything in the crack—presented in full as it was at the conference—could be adapted to help book-scanning OCR systems crack those hard-to-decipher words as easily as the researcher could crack the ReCAPTCHAs?

(Found via Slashdot.)

