‘Digital Text Masters’ (Digitizing the classic public domain books)
By Jon Noring
The recent TeleBlog articles about the Project Gutenberg (PG) text Tarzan of the Apes (see 1, 2) suggest that not all is well in the existing corpus of public domain digital texts.
My personal experience over the last twelve years in digitizing several public domain books has helped me to see a number of problems, which I’ve mentioned in various forums, including the PG forums and The eBook Community. For the sake of not turning this already long article into a whole book, I won’t cover here the complete list of problems I found, plus those found by others.
To summarize what I believe should be done to resolve most of the known problems: when it comes to creating a digital text of any work in the public domain, we should first produce and make available what we call a “digital text master,” which meets a quite high degree of textual accuracy to an acceptable and known print source. From the “master,” various display formats and derivative types of texts (e.g., modernized, corrected, composite, bowdlerized, parodied, etc.) can then be produced to meet a variety of user needs.
(Btw, what better example to illustrate the concept of a “digital text master” than to show the self-portrait of the great 17th century Dutch master painter, Rembrandt van Rijn, whose attention to detail and exactness is renowned.)
We must have some fixed frame of reference by which we produce digital texts of the public domain, otherwise it leads to problems of all kinds (a couple of these problems, but by no means all of them, are illustrated by the Tarzan of the Apes PG etext.) This is especially true for projects which have the intent of producing unqualified texts of public domain works (thereby implying faithfulness and accuracy) — such projects have an obligation to offer a digital text faithful and reasonably accurate to a known source book so the user knows what they are getting. This is somewhat like food labeling, so one knows what ingredients they are getting in their food.
Fortunately, Distributed Proofreaders (DP) is dedicated to this very goal, and their finished digital texts are being donated to the Project Gutenberg collection at a fairly fast clip. However, DP came on the scene relatively late in the game, so the most popular, classic works were already in the PG collection when DP arrived. As a result, DP has mostly focused on the lesser-known works, many of which are good, but will never be widely read compared to the great classics.
Unfortunately, however, in the PG Collection the great classic works, such as Tarzan of the Apes, are of unknown faithfulness and accuracy to an unknown (not recorded) source work (is that enough unknowns?) Even if they were digitally transcribed with “rigor” (to be clear, I believe a number of the earlier PG texts are of high-quality), how does one know? In effect, PG does not support the concept of a “digital text master,” preferring to be a “free-for-all archive” of whatever someone wants to submit. Until recently when the policy was changed for new texts, PG wouldn’t even tell you the provenance of what had been submitted — that information was intentionally stripped out.
The ultimate losers here are the users of the digitized public domain texts. They are, by and large, a trusting group, and simply assume that those who created the digital texts did their homework and faithfully transcribed the best sources. One reason for this TeleBlog article is to point out to users that if it is of concern to them, they should be more demanding and wary of the digital texts they find and use on the Internet. Be good consumers!
Especially beware of boilerplate statements that say such-and-such a text may not be faithful to any particular source book — if not, what is it “faithful” to? Should you spend a significant part of your valuable free time reading something of unknown provenance and faithfulness? This is especially true in education, where it is important the digital texts students use be of known provenance, and that the process of text digitization was guided by experts (and in effect “signed” by them) to assure faithfulness and accuracy — to be trustworthy.
The “Digital Text Masters” Project
For the above reasons, a few of us are now studying a non-profit project to digitally remaster the most well-known public domain works of the English language (including translations.) We will focus on about 500 to 1000 works in the next decade or so, so unlike DP which is understandably focused on numbers because there’s a lot to digitally transcribe (and they will do a good job at getting those books done), our focus will be on a very small number of the great works, and we will give them the full, royal treatment with little compromise. We will “do them right” and when in doubt, will come down on the side of rigor even if it appears to some to be overkill.
Here’s what we tentatively have in mind:
-
Of course, we have to begin generating the ranked list of works we’d like to digitally master over time. This list will not be etched in concrete, but will continue to morph. We will not only focus on fiction (although fiction may dominate the early works due to fiction’s overall simpler text structure and layout), but we will consider some of the great works of non-fiction which had significant influence on human progress.
-
For each Work, we will consult with scholars and lay enthusiasts to select the one (or more) source books that should be digitally mastered. The Internet now makes it very easy to bring together a large number of experts and enthusiasts and draw upon their collective wisdom.
(Importantly, note that for some Works there may be more than one source edition selected to be digitally mastered. We do NOT plan to choose one particular edition and call that “canonical” and then eschew all others. Selection of source books to digitally master is on a case-by-case basis. If someone wants to put in the work/resources to focus on a particular source book, their work of love, we won’t stop them so long as what they do follows all the requirements and the resources are there to properly get the job done.)
-
We will find, or make ourselves, archival-quality scans of the selected source books. (Purchasing source books will be considered.) Archival quality means the master scans will likely be done at about 600 dpi full color with minimum distortion, and saved in lossless format. Calibration chart scans should accompany each scan set allowing for quality checking and normalization. Care will be taken to assure complete, quality scans of all pages, including the cover, back and spine. In essence, we don’t want someone to hurry to scan the source book, but rather take their time and do it right. Derivative page scan images (such as lower-rez versions) will be made available online alongside, and linked from, the digital text masters.
-
We will use a variety of processes to generate a very highly accurate text transcription of the source book. Such processes include OCR, multiple key entry, and a mix of the two in various combinations, along with running machine-checking algorithms for anomalies. The goal is a very low error rate. DP may be used, if DP agrees to participate (they are overwhelmed as it is with their current focus on the more obscure works), but we need to investigate multiple ways to do the actual textual transcription and to get a good measure of the likely error rate. Preservation of the actual characters used in the source books (including accented and special characters) will be done using UTF-8 encoding.
The process used to digitally master a given text will be meticulously recorded in a metadata supplement, including special notes particular to the source book. (Unusual and unique exceptions requiring special decisions and handling are likely to be encountered in most source books — this is where the consulting expertise of DP will be of great help.)
-
An XML version of the digital master will be created using a high-quality, structurally-oriented vocabulary, such as a selected subset of TEI. Original page numbers, exact line breaks, unusual errors which have to be corrected rather than flagged, and other information will be recorded right in the markup.
-
Library-quality metadata/cataloging will be produced for each digital master.
-
Several derivative user formats will be generated and distributed for each original digital master.
-
A database archive of all the digital text masters, associated page scan images, and all derivatives will be put together to allow higher level searching, annotation, and other kinds of interactivity.
-
A robust system will be set up to allow continued error reporting and correction of the existing digital text masters in the archive. Even though we plan very low error rates, we know some errors will slip through.
-
The project will bring together scholars and enthusiasts to build a library of annotations for each digital text master (especially useful for educational purposes), as well as encourage the addition of derivative editions (fully identified as such) for each Work.
-
We also would like to build real communities around the various works. For example, for each of the most popular works we may build an island in Second Life (or whatever will supplant Second Life in the future — Google is rumored to be working on a “Second Life killer.”) We want to make the books come alive, and not just be staid XML documents sitting in a dusty repository following the old-fashioned library model.
-
We will heavily promote the Digital Text Masters archive, especially for education and libraries since the collection will find ready acceptance there because of its quality, trustworthiness and metadata/cataloging. It will also be easier to produce and sell authoritative paper editions.
I could go on and list a few more, and expand on each one in more detail, but that should give a representative overview of the general vision.
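To make the XML item in the list above a little more concrete, here is a minimal sketch of how an original page number, a line break, and a corrected printing error might be recorded right in the markup. It assumes the TEI elements pb, lb, choice, sic and corr. The sample sentence, the page number, the misprint, and the function names are all made up for illustration, and the actual vocabulary subset has not been decided.

```python
import xml.etree.ElementTree as ET


def build_paragraph() -> ET.Element:
    """Build one TEI-style paragraph (hypothetical sample text, not from a real source book)."""
    p = ET.Element("p")

    # <pb/> marks the point where a page of the source book begins.
    pb = ET.SubElement(p, "pb", {"n": "17"})
    pb.tail = "The old house stood upon the "

    # <choice> pairs the reading as printed (<sic>) with the correction (<corr>).
    choice = ET.SubElement(p, "choice")
    ET.SubElement(choice, "sic").text = "hlil"
    ET.SubElement(choice, "corr").text = "hill"
    choice.tail = ", silent "

    # <lb/> marks the point where a new printed line begins in the source.
    lb = ET.SubElement(p, "lb")
    lb.tail = "and dark."
    return p


def reading_text(p: ET.Element, corrected: bool = True) -> str:
    """Flatten the paragraph to plain text, keeping either the <corr> or the <sic> readings."""
    skip = "sic" if corrected else "corr"
    parts = []

    def walk(el: ET.Element) -> None:
        if el.tag != skip:
            parts.append(el.text or "")
            for child in el:
                walk(child)
        # Tail text follows the element in the parent, so it is always kept.
        parts.append(el.tail or "")

    walk(p)
    return "".join(parts)


if __name__ == "__main__":
    para = build_paragraph()
    print(ET.tostring(para, encoding="unicode"))  # the master markup
    print(reading_text(para))                     # corrected reading text
    print(reading_text(para, corrected=False))    # text exactly as printed
```

The point of the sketch is that one master can yield both a faithful "as printed" text and a corrected reading text, without losing track of which is which.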
We are looking at a few funding/revenue models (one of which is quite innovative) to help launch and maintain the project. The highest costs may be for double or triple key entry should we have that done commercially for any if not all source books — the remaining major cost may be for the design and maintenance of the database as well as developing other tools, some of which might be useful for other projects to digitize texts such as DP, and of course to benefit PG as well.
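To illustrate what double key entry buys us, here is a minimal sketch that compares two independent keyings of the same line and flags every place they disagree, so a person can check those spots against the page scan. The sample sentence, the two slips, and the function name flag_disagreements are hypothetical, and Python's standard difflib module is just one convenient way to do the comparison, not a decided part of the workflow.

```python
import difflib


def flag_disagreements(keying_a: str, keying_b: str):
    """Return (offset_in_a, text_in_a, text_in_b) for each span where the two keyings differ."""
    matcher = difflib.SequenceMatcher(None, keying_a, keying_b, autojunk=False)
    flags = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            flags.append((a0, keying_a[a0:a1], keying_b[b0:b1]))
    return flags


if __name__ == "__main__":
    # Two hypothetical typists keyed the same line; each made a different slip.
    typist_1 = "The cafe door stood open to the evening air."  # dropped the accent
    typist_2 = "The café door stood open to the evenmg air."   # keyed "m" for "in"

    for offset, a_text, b_text in flag_disagreements(typist_1, typist_2):
        print(f"offset {offset}: typist 1 has {a_text!r}, typist 2 has {b_text!r}")
```

Note that a comparison like this only catches errors the typists do not share. If both make the same slip at the same spot, nothing is flagged, which is one reason to consider triple key entry or additional machine checking.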
The project plans to start small and controlled, especially in the early phase where R&D to shake out remaining unknowns will be conducted, and work our way up from there. Proper governance and management will be put into place as early as possible. Ties to academia and education, the library community, and various organizations involved with digitizing or adding to the public domain (such as the Internet Archive, Open Content Alliance, the Wikipedia, etc.) will be actively pursued, but the general vision must not be compromised by those ties. I am in informal talks now with several organizations.
We need you!
Of course, the most important thing we need is you. If you agree with the general goals and approach of the “Digital Text Masters” project, and are interested in being involved in some capacity, then step forward. We are especially looking for people who enjoy the great classics, are detail-oriented, and believe in doing things right the first time even if it takes more effort. If interested, contact me in private, jon@noring.name, and let’s talk about your thoughts and what interests you. A teleconference among the interested people (who will be known as the founders) is being planned once we get a minimum critical mass brought together.
I look forward to hearing from you. And it would not surprise me if we see a number of comments below, both critical and supportive, of the idea (if you support it, I hope you will comment!) I’m already anticipating some of the arguments to points which were not covered in this article for brevity’s sake.
(p.s., the use of the Digital Text Masters collection as a “test suite” to improve the quality of processes to auto-generate text from page scan images is discussed in my comment to the original Tarzan of the Apes TeleBlog article.)
(p.p.s., the “We Need You!” graphic is by Ben Bois. Although associated with OpenOffice, it is nevertheless a cool graphic!)

February 13th, 2007 at 1:51 am
Quietly, without much fanfare, DP has already started redoing some works that were done before DP joined Project Gutenberg. I don’t think that TPTB (The Powers That Be in DP) are opposed to this in principle; they’re just moving carefully.
I should think that DP would want to participate in the project. We already have the infrastructure and now, a five round system that is producing extremely high-quality works. Indeed, a Dickens novel would be a welcome break from working on The Diseases of the Horse and other such odd works meandering through the system.
We aren’t limited to offering works to PG, so far as I know. I would guess that TPTB would want to make any results available both to you and to PG. But that’s just a guess.
I can’t speak for TPTB. However, you shouldn’t fear to approach them.
February 13th, 2007 at 11:16 am
Karen, that is very good news about DP remastering some of the older PG works! It’s something several of us, for a long time, have encouraged them doing. I’ve mentioned it at least two times in the TeleBlog, and several times on gutvol-d over the last few years. With the new PG policy of allowing provenance information in the new texts (bravo to Michael Hart and Greg Newby), this is definitely a great development where the winners are the readers.
And I’ve not heard about the five round system. Does this mean five proofreading rounds? Looking at the DP home page, it still shows only three. One thing that interests me is how accurate the three-round proofreading system is. That’s one thing I hope DTM will be able to answer.
For others reading this, DP is focused on the production side of the texts. As far as I know they still only donate their work product to PG, and make no other use of the texts. In addition, they have not (as far as I know) instituted an archival scanning requirement or recommendation, although individuals scanning books could do so if they want. (For one book I submitted to them, an original of Burton’s Kama Sutra of Vatsyayana which still needs to be proofed, the scans were done at archival quality and then downsampled to meet DP’s preferences.)
I’ve even toyed for a while with the idea of starting a “Distributed Scanners” group (which could still be launched to support the “Digital Text Masters” project): just find a bunch of persnickety people (maybe with a touch of obsessive-compulsiveness to assure things are done right <laugh/>) with suitable scanners (e.g., high-grade sheet feed scanners and the Plustek OpticBook) and steer them to scan the great classics according to a set of requirements and guidelines. DS would, of course, build a database for archiving the scan sets along with donating copies of the originals to the Internet Archive (the scan sets, at archival quality, will take up a lot of space, like five or more gigs per book, but with terabyte drives getting dirt cheap, and burning DVDs also getting cheap, space is no longer an issue.)
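As a rough check on that "five or more gigs per book" figure, here is a back-of-the-envelope sketch. Only the 600 dpi full-color setting comes from the article. The 6 by 9 inch page, the 300-page book, and the 3:1 lossless compression ratio are my assumptions for illustration, and real books and real compressors will vary.

```python
def scan_set_bytes(width_in, height_in, pages, dpi=600, bytes_per_pixel=3,
                   compression_ratio=1.0):
    """Estimate the storage needed for one book's scan set, in bytes.

    bytes_per_pixel=3 corresponds to 24-bit full color.
    compression_ratio=1.0 means uncompressed; 3.0 means the files shrink to a third.
    """
    pixels_per_page = round(width_in * dpi) * round(height_in * dpi)
    raw_bytes = pixels_per_page * bytes_per_pixel * pages
    return raw_bytes / compression_ratio


if __name__ == "__main__":
    # Hypothetical book: 300 pages, each 6 x 9 inches.
    raw = scan_set_bytes(6, 9, 300)
    packed = scan_set_bytes(6, 9, 300, compression_ratio=3.0)  # assumed 3:1 lossless
    print(f"uncompressed: {raw / 1e9:.1f} GB")         # 17.5 GB
    print(f"lossless, assumed 3:1: {packed / 1e9:.1f} GB")  # 5.8 GB
```

Under those assumptions a single book lands a little under six gigabytes, so the estimate above is in the right neighborhood, and larger or longer books will take correspondingly more.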
So our vision for DTM, should it get launched, is much more comprehensive and wide-ranging. It certainly could take advantage of the DP system, if DP wants to be involved. (I don’t want to be presumptuous here; Juliet will understandably require that there be meat and real potential in the DTM proposal for her to commit any official DP mindshare to it.) A copy would go to PG, which is important, and we would get a copy as well. DTM would, of course, take the lead in securing the archival-quality and QC’d scans.
And DTM would work to start building the text annotation project (either alone, or preferably as part of David Rothman’s LibraryCity venture should he succeed at getting that funded in time) — important to make the texts much more useful in the educational market. Btw, since DTM will probably do a similar approach to mastering as what DP will do in the future with PGTEI, I see the works done by DP, which DTM won’t do, to augment the DTM works in the database we set up — again more synergies between the projects.
Since DP and PG are closely linked, PG will benefit from this as well. DTM is not intended to compete, but to complement PG and DP, and possibly open up doors the other projects will have more difficulty in opening. Visualize the association as a Venn diagram, with three circles intersecting with each other — DTM will have some of its circle outside of the others. The result is that the projects in toto will cover a bigger area.
February 13th, 2007 at 1:56 pm
It’s three proofreading rounds, then two formatting rounds — that’s five — and then a post-processor pulls everything together, makes sure that it’s consistent, and then sends it on to someone who checks and uploads the text.
As for redoing old texts: someone commented on it in the forums, but I don’t think there’s been any announcement. I’m not sure what’s been tackled other than Robert Louis Stevenson. We’re doing a multi-volume collection of his complete works.
The question of displaying the images as well as the etext is a vexing one. Many people in DP would like to do so and I understand that we’re holding the scans that we made and we control. However, we’ve done a fair number of books based on borrowed scans from Gallica, Google Books, and other such sites. Pictures of two-dimensional texts or pictures that are out of copyright can’t be copyrighted and if we were to display the “borrowed” pictures, we’d probably be OK but … it seems better not to push it.
I hope that I haven’t gotten myself in hot water here, by seeming to speak for DP. I’ve been volunteering there for four years, but I’m just a foot-soldier, not one of the generals.
February 13th, 2007 at 3:32 pm
Thanks, Karen, for clarifying.
We’ll see how this unfolds. Already I’ve gotten a couple of private messages from “highly placed people” with “well-known organizations”. We’ll see if there are any legs to these and other inquiries.
No matter what happens, I think the idea of “digital text masters”, no matter how it is implemented, is striking a positive chord with a lot of people. I believe a lot of people do care about the faithfulness and provenance of the public domain texts they read, especially when it is brought to their attention. The typical reaction appears to be “I hadn’t thought about it before, but yeah, maybe I should be more concerned.”
February 13th, 2007 at 7:27 pm
First of all, I am no part of any cabal; just another foot soldier at DP. By and large, this whole argument seems rather futile: was something done “wrong” by anyone? New tools and methodologies evolve; and that which was available years ago, now is antiquated. It does NOT mean that work is wasted: if nothing else, it served as the inspiration to demand better. How did the first, poor, translations of Gilgamesh make it into print? OTOH, should those first translators feel their work has been discounted? Egos, egos.
We throw words like ‘bowdlerized’ around as insults, but has anyone ever read his introduction? Poor Tom did his best, and is treated as the worst. We each fall somewhat shy of others’ expectations.
Right now, I’m trying to get my head around the Dublin Protocols which, I feel, would solve a lot of these issues. Shall we all, then, stop work on everything, until we know we’re working on the ultimate edition?
At least book burning is easier in this day:
pip *.*/del
Come on: we’re supposed to be the good guys! There are lots of real enemies. We don’t have to emulate a Canadian Prime Minister, who described his party as like the early settlers who would find themselves attacked by Indians, circle their wagons, and start firing inwards…
February 15th, 2007 at 3:37 am
Project Gutenberg and DP quality standards are improving, and so are the tools and procedures in use, and although I believe there will be a market for re-done, certified very high-quality versions of what some may consider the canon of English literature, I also think that a lot of other works warrant attention, in particular those works with considerable added value, such as reference works, or old magazine runs, which are often much more difficult to get access to. The canon of English literature is relatively easy to get hold of, but if you could scan and process a set of early magazines, you would add materials that are currently difficult to access.
Projects like Google Books are changing the playground as well: a tremendous number of books are now becoming available, which means our focus can move from scanning to harvesting and adding value (by proofreading and proper tagging) to what is already available.
Another point where PG is gradually improving is metadata. We now have a working catalog. It would also be very nice to build a working ‘reading room’ application around the collection, where people can read, annotate works, and share these annotations with friends or the community.
For languages other than English, we are where PG was 10 to 20 years ago. For both Dutch and Philippine related works, where I have been very active to grow the collection, we are still not through the canon of literature. That is, if such a thing can be demarcated; I have my reservations on an elitist view of literature, and would be just as happy with an obscure penny novel. Sometimes, these things can be hidden gems, and sometimes they have rightly been relegated to the realm of obscurity. Especially in developing countries, like the Philippines, works have become very difficult to obtain, due to low printing volumes, low quality paper, two devastating wars, and adverse climate conditions. Similarly, Dutch language literature from what is now Indonesia is extremely difficult to obtain, as such works were often printed in low volumes on cheap paper, and very few copies have ever reached Europe.
The current long copyright terms are also very harmful for the promotion of literature. In the time-span between commercial non-profitability and entering the public domain, where they are again free to build upon, works get so far out-dated and out-of-touch with living culture, that they have effectively died… One of PG’s purposes, in my opinion, is reviving such works from oblivion.
February 15th, 2007 at 11:50 am
Thanks, Jeroen, for bringing up some important points.
I agree that among the huge corpus of “non-canon” public domain books (in whatever language), there are a significant number which are truly gems that people should know about, and should warrant special digital mastering. They should be added to the “canon.”
Two further comments on this point:
The proposed “Digital Text Masters” will not etch in stone the “canon”, but rather will be quite flexible as to what works it encompasses. The emphasis, though, especially in the early years, will be on those public domain books oft-used in education (both K-12 and post-secondary) and regarded by experts and lay enthusiasts to represent the best or the most influential of the public domain.
If someone believes a particular book should be added to the DTM collection, when it might otherwise not be considered, they may make their case. If they are willing to share in some of the costs (if any) and human effort of the digital mastering, they may get the go-ahead to include that book in the DTM collection. The details of the whole selection process still need to be shaken out, and will probably evolve over time.
Once DTM is established and most of the bugs shaken out of it, DTM can certainly diversify into other books and periodicals that make sense to digitize to DTM quality and for inclusion in the DTM database/archive. It’s hard to know the future, really, but we should definitely be ready to diversify. I have some ideas, but they are still nascent so I won’t describe them here.
Finally, DTM is not intended to replace Distributed Proofreaders, which we fully support in its task to digitize the large number of public domain books without regard to “canon.” In fact, I see some sort of working relationship developing between the two organizations (even if DTM proofing ends up not using DP or its system), and with PG, which appears to be evolving into a general text archive from many sources.
For example, since DTM is intended to be a formal, more heavily-funded organization (while DP will necessarily always be more of a grassroots, volunteer-driven group even though it does have 501c3 status), I see DTM advancing the various technologies which could be shared with DP and with PG. So in some respects DTM might become the “technology development” arm of the large, multi-organizational community to digitize the public domain texts.
Something to think about, at least. If DTM becomes as successful at generating sustaining revenue as I believe it can, I envision DTM donating funds to DP and PG, preferably in a matching donation sense in order to spur others to donate to those organizations.
Well, I am getting well ahead of myself, since we haven’t yet organized, have no funds, and have not produced anything! But I wanted to share how I see DTM relating to DP and PG, and of course to share some long-term goals should DTM get launched.
If nothing else, this exercise provides one vision of the future of the effort to digitize public domain texts, and it will add to the “public domain idea database” that others may draw from.
A related point I think is important to mention which I haven’t yet:
I observe that the public mindshare these days is on the scanning of public domain texts (which is great to see happen!) The downside of this is that in various quarters this quite public focus on scanning is hiding the advantages to users of having structured digital texts. Even though we know the advantages, we are so far not getting the message out — many see having scanned images plus raw, unproofed OCR text as more than sufficient to use these texts.
I see DTM as helping to get the message out that there’s significant benefit to society to create proofed and structured digital texts of many if not all the public domain works. Focusing on the great classics, though they be few in number compared to the entire corpus of public domain texts, helps in explaining the benefits of structured digital texts. It’s a little tougher when the texts being showcased are really obscure, arcane, and in some cases truly bizarre works. But talk about Mark Twain, or the Brontë sisters, and everyone recognizes them.
August 19th, 2007 at 6:22 pm
[...] Several interesting comments were posted about this article at TeleRead. If you wish to read those or post your own comment, you can do so by clicking the link: ‘Digital Text Masters’ (Digitizing the classic public domain books). [...]
September 15th, 2008 at 11:15 am
[...] where I could discuss my experiences as I learn to produce .epub formatted documents from a ‘Master Format’, and with the hope that others may also find the information [...]
December 7th, 2009 at 9:45 am
I believe Google have been using book scanners which read the distance of the pages using infrared 3D scanners, including the curve of the pages, so that when scanned they appear as flat images with little or no black depth marks, which often come with book scanning. We usually carry out scanning using both ways. But the fastest way is always to slice the book and feed scan the pages if you are able to.