Over the years I’ve scanned and OCR’ed many printed books into electronic form for Gutenberg Australia (most of the Edgar Wallace collection there is my work, for instance), and during that time it’s become clear that not all typos are equal. After a while, in fact, it became possible for me to divide typos into categories, as follows:
Category 1: Typos due to English orthography
Some letter sequences in English serif text happen to resemble others. The sequence ‘of her’, for instance, looks very much like ‘other’, and ‘thing’ looks very much like ‘tiling’. Every second or third book I scanned had these mistakes in it somewhere.
Category 2: Typos due to the publishers’ choice of font
These used to arise when I was scanning a set of books in the same series. Some font/size combinations happened to trick the OCR software in consistent ways. Many series had very narrow risers on the ‘h’, for instance, making it easy to mistake it for an ‘n’. Extra spaces were also very common.
Category 3: Typos in recurring proper names
Many of the books I scanned belong to series featuring ongoing characters and ongoing locations. If the name ‘Winstanley’ was misread ‘Wihstanleg’ in the first of these, I could be reasonably sure that the OCR would get that name wrong in the same manner in the rest of the book—and the other books, too.
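Because a misread name tends to recur in exactly the same form, it can be hunted down mechanically. The following is only a rough sketch in Python, assuming you keep a short list of the series' character and place names; the function name `find_name_variants`, the `known_names` list and the 0.75 similarity cutoff are all illustrative choices, not part of any OCR package.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

def find_name_variants(text, known_names, cutoff=0.75):
    """Map each capitalised word that resembles (but is not) a known name
    to a tuple of (the name it resembles, how often the misreading occurs)."""
    words = Counter(re.findall(r"\b[A-Z][a-z]+\b", text))
    variants = {}
    for word, count in words.items():
        if word in known_names:
            continue
        for name in known_names:
            # ratio() is 1.0 for identical strings and falls towards 0.0
            if SequenceMatcher(None, word, name).ratio() >= cutoff:
                variants[word] = (name, count)
                break
    return variants

sample = "Wihstanleg rose. Then Wihstanleg sat. Winstanley smiled."
variants = find_name_variants(sample, ["Winstanley"])
```

Each hit is a candidate for a global replacement, not a certainty; a genuinely different surname that happens to look similar would be reported too, so the list still needs a human eye.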
Category 4: Words that just don’t belong
If I come across the word ‘modem’ in the OCR’ed scan of a book from 1935, for instance, I know I can change it to ‘modern’ without a second thought. The same applies to apparent U.S. spellings in books published in the UK, and vice versa.
Category 5: Typos that a spelling check will pick up
Good OCR programs have a plausibility tester built in to block words that don’t match English spelling, but this is often overridden. My final pass through the book is always done with a spelling checker, and I usually pick up a half-dozen errors.
Category 6: Someone else’s typos
There are always a few of these—errors by the author, the editor or the typesetter, which have crept into the original book.
* * *
What does this imply for efficient proofreading? It means that mindless page-by-page comparison of the original text with an OCR copy is just about the least efficient way to do it, because it requires predictable errors to be corrected over and over again, rather than through global changes. What’s more, it retains (and sometimes even cherishes) the typos in Category 6, just because they happen to be typos in the original book. And yet this is the method currently used by most proofreading projects, including Distributed Proofreaders. Here is the approach I use instead:
Step 1: OCR a cluster of books by the same author, in the same series, featuring the same characters, as far as possible. Gather these into a single word processing file. Use a master document if your software allows it.
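If you work in plain text rather than a word processor, the gathering in Step 1 might look something like this Python sketch. The `=== BOOK: ... ===` separator line and the function name `merge_books` are assumptions made for illustration; any marker that cannot occur in the books themselves would do.

```python
MARKER = "=== BOOK: {name} ==="  # hypothetical separator line, one per book

def merge_books(books):
    """Join (name, text) pairs into one working text, with a marker line
    ahead of each book so the set can be split apart again later."""
    parts = []
    for name, text in books:
        parts.append(MARKER.format(name=name))
        parts.append(text.strip("\n"))
    return "\n".join(parts) + "\n"

combined = merge_books([
    ("book1.txt", "Chapter I\nWihstanleg rose.\n"),
    ("book2.txt", "Chapter I\nThe modem world.\n"),
])
```

Reading the OCR output from disk is left to the caller, so the same function works whatever the files are called or wherever they live.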
Step 2: Run a macro to find and correct Category 1 and 4 errors. That includes changing straight quotes to smart quotes; removing characters like ‘>’ and ‘=’ that don’t occur in novels (at least in those written before the copyright period); highlighting potential error points like ‘of her’ and ‘other’; changing ‘modem’ to ‘modern’; and so on. Yes, this sometimes makes mistakes, but it fixes far more errors than it introduces, and if you highlight the changes as you make them, it’s easy to spot those rare points at which you’ve introduced an error rather than removing one.
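A macro of this kind could be sketched in Python roughly as follows. This is not the actual word-processor macro, only an illustration under stated assumptions: the rule tables are tiny samples, changes are "highlighted" by wrapping them in `[[...]]`, suspect sequences are flagged with `{{...}}`, and the smart-quote rule is deliberately naive.

```python
import re

# Illustrative rule tables; a real macro would carry many more entries.
STRAY_CHARS = ">="                                  # characters to delete outright
SAFE_FIXES = {"modem": "modern"}                    # Category 4: words that cannot belong
SUSPECTS = ["of her", "other", "thing", "tiling"]   # Category 1: flag, do not change

LDQ, RDQ, LSQ, RSQ = "\u201c", "\u201d", "\u2018", "\u2019"  # curly quotes

def run_macro(text):
    # Delete stray characters, then collapse the runs of spaces left behind.
    text = text.translate(str.maketrans("", "", STRAY_CHARS))
    text = re.sub(r" {2,}", " ", text)
    # Naive smart quotes: a quote at the start of the text, or after
    # whitespace or "(", opens; every remaining quote closes.
    text = re.sub(r'(?<![^\s(])"', LDQ, text)
    text = text.replace('"', RDQ)
    text = re.sub(r"(?<![^\s(])'", LSQ, text)
    text = text.replace("'", RSQ)
    # Apply the safe fixes, marking each change with [[...]] for review.
    for wrong, right in SAFE_FIXES.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b",
                      lambda m, r=right: f"[[{r}]]", text)
    # Flag ambiguous sequences with {{...}} but leave the wording alone.
    suspects = "|".join(re.escape(s) for s in SUSPECTS)
    text = re.sub(rf"\b(?:{suspects})\b",
                  lambda m: "{{" + m.group(0) + "}}", text)
    return text

cleaned = run_macro('He said, "The modem age => of her thing, don\'t you see?"')
```

The whole-word matching matters: without it, ‘mother’ and ‘something’ would be flagged along with ‘other’ and ‘thing’. The quote rule will get cases like a leading apostrophe in ‘tis wrong, which is one more reason to mark what the macro touches.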
Step 3: Start proofreading. When you find an error that doesn’t match English orthography—‘dosn’t’ for ‘doesn’t’, for instance—do a global search and replace, and highlight the replacements. It’s extremely likely that you’ll find the same error several times in the same file, though the frequency drops as the more common errors get corrected. The same applies to names of prominent places and ongoing characters like ‘Wihstanleg’. Common errors should be added to the macro from Step 2, so they’ll be found and fixed in the next batch.
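The global search and replace in Step 3, together with the habit of feeding common errors back into the macro, might be sketched like this in Python. The function name `global_fix`, the `[[...]]` highlight markers and the `rules` dictionary are assumptions made for illustration.

```python
import re

def global_fix(text, wrong, right, rules):
    """Replace every whole-word occurrence of `wrong` with `right`,
    mark each replacement with [[...]], and record the rule for next time."""
    pattern = re.compile(rf"(?<!\w){re.escape(wrong)}(?!\w)")
    new_text, count = pattern.subn(lambda m: f"[[{right}]]", text)
    if count:
        rules[wrong] = right  # carry the fix forward to the next batch
    return new_text, count

rules = {}
page = "It dosn't matter. Wihstanleg dosn't care, said Wihstanleg."
page, n_first = global_fix(page, "dosn't", "doesn't", rules)
page, n_second = global_fix(page, "Wihstanleg", "Winstanley", rules)
```

Returning the count gives a quick sense of how widespread an error was, and the accumulated `rules` can be saved and applied up front when the next cluster of books is scanned.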
Step 4: When you find an error that could be a real word—‘hell’ for ‘hello’—correct it in that one location only, but highlight the word throughout the document to make it easy to pick up similar errors where they occur.
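For errors that are also real words, the point is to flag rather than change. A minimal Python sketch, again assuming plain text and a `{{...}}` marker in place of word-processor highlighting (the name `flag_word` is hypothetical):

```python
import re

def flag_word(text, word):
    """Wrap every whole-word occurrence of `word` in {{...}}, ignoring case,
    without changing the word itself. Returns (flagged_text, count)."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return pattern.subn(lambda m: "{{" + m.group(0) + "}}", text)

flagged, hits = flag_word("Hell, said he. What the hell? Hello!", "hell")
```

The one confirmed error is still corrected by hand; the flags simply make sure the other occurrences catch your eye as you read on.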
Step 5: When you finally finish proofing the whole set of books, do a spelling check, remove the highlighting, then break them back up into their component files.
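The clean-up in Step 5 can be sketched the same way. This assumes, purely for illustration, that changes were marked with `[[...]]`, suspect words with `{{...}}`, and that each book in the combined text begins with a separator line of the form `=== BOOK: name ===`; none of these conventions comes from a particular tool.

```python
import re

MARKER_RE = re.compile(r"=== BOOK: (.+) ===")  # hypothetical separator line

def strip_highlights(text):
    """Remove [[...]] and {{...}} markers, keeping the text inside them."""
    return re.sub(
        r"\[\[(.*?)\]\]|\{\{(.*?)\}\}",
        lambda m: m.group(1) if m.group(1) is not None else m.group(2),
        text,
    )

def split_books(text):
    """Split the combined text back into {book name: book text}."""
    books = {}
    current = None
    for line in text.splitlines():
        match = MARKER_RE.fullmatch(line)
        if match:
            current = match.group(1)
            books[current] = []
        elif current is not None:
            books[current].append(line)
    return {name: "\n".join(lines) + "\n" for name, lines in books.items()}

combined = ("=== BOOK: a.txt ===\nThe [[modern]] age.\n"
            "=== BOOK: b.txt ===\nGo to {{hell}}.\n")
finished = split_books(strip_highlights(combined))
```

The spelling check itself is best left to whatever checker you already trust; this only covers removing the markers and breaking the set back into separate books.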
Seems like a lot of work? Yes, but it pays off. I estimate this approach saves me an hour of proofreading on a normal-sized novel—more on books with strange fonts or poor printing. And it’s much less stressful to fire off a whole volley of corrections at once, knowing you won’t have to deal with them again, than to meet your old friend ‘Wihstanleg’ for the eighth time in the same chapter.
This method also has the advantage that you very seldom need to refer to the source material. Sometimes, where whole lines have been omitted or garbled, you won’t have any choice, but I find I can correct 95 percent of a reasonably well-printed book without needing to refer to the original text at all.
Distributed Proofreaders regard this as anathema. I call it intelligent proofreading. You decide.