Given all the greats, from Austen to Dickens, the 19th century may have been the time for the novel. But zillions of words from that era still need to be scanned, and beyond Microsoft’s withdrawal from the public domain scanning scene, other challenges abound. Here’s one.

"OCR technology for 19th and early 20th century type fonts is not advancing," a Smithsonian staffer wrote during a discussion on the Open Library list.

True? If so, what about the massive Google digitization project?

How much of its technology is Google sharing to help address this particular problem? And if Google isn’t tackling these issues, should the company be spending more money on them? Are its existing open source efforts enough? Remember, the better the quality of optical character recognition, the less need for clean-up work afterward. More accurate OCRing could improve both the quality and quantity of public domain efforts, not just the commercial side of Google’s scanning initiative.
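For the record, Google’s best-known open source effort here is Tesseract, the OCR engine it released as open source in 2006. And “quality” needn’t be hand-waving; the standard yardstick is character error rate, the edit distance between the OCR output and a hand-corrected transcription, divided by the transcription’s length. Here’s a minimal sketch of that measurement, assuming the pytesseract wrapper and Pillow are installed; the file names are hypothetical stand-ins for a page scan and its proofread text.

```python
# A sketch of quantifying OCR quality: run a scanned page through
# Tesseract, then score the result against a hand-corrected
# transcription using character error rate (CER).
# "page_scan.png" and "page_truth.txt" are hypothetical file names.

from PIL import Image
import pytesseract

def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein edit distance / length of the reference."""
    # Classic dynamic-programming edit distance, two rows at a time.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference char
                            curr[j - 1] + 1,      # insert a stray char
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1] / max(len(reference), 1)

ocr_text = pytesseract.image_to_string(Image.open("page_scan.png"))
truth = open("page_truth.txt", encoding="utf-8").read()
print(f"Character error rate: {char_error_rate(truth, ocr_text):.2%}")
```

Roughly speaking, clean modern print can score under a percent or two, while foxed 19th-century pages, with their worn type and long-s characters, can do far worse — and every extra point of error is proofreading work for some volunteer downstream.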

The TeleRead angle: If even Google isn’t devoting enough resources to the problem, should a consortium of librarians and researchers try to address it? I know these matters concern Brewster Kahle and others at the Internet Archive. But should they be getting more help? And from where? Remember, Google is a rival of the Archive-related Open Content Alliance. Time for Google and OCA to get a little closer?

Disclosure/Reminder: though you’d never know it from the number of times I pick on Google, I own a microscopic speck of the company for retirement purposes.
