<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: 5 Years of Distributed Proofreaders</title>
	<atom:link href="http://www.teleread.com/2005/10/22/5-years-of-distributed-proofreaders/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.teleread.com/public-domain/5-years-of-distributed-proofreaders/</link>
	<description>News &#38; views on e-books, libraries, publishing and related topics</description>
	<lastBuildDate>Tue, 14 Feb 2012 21:55:20 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
	<item>
		<title>By: Garson Poole</title>
		<link>http://www.teleread.com/public-domain/5-years-of-distributed-proofreaders/comment-page-1/#comment-19872</link>
		<dc:creator>Garson Poole</dc:creator>
		<pubDate>Tue, 25 Oct 2005 10:22:13 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=3747#comment-19872</guid>
		<description>Branko, Thanks for your volunteer work and thanks for the link to the discussion of the DjVu format and XML.  Several years ago I visited the website of the &quot;Million Book Project&quot; when it was active at Carnegie Mellon. Unfortunately, the project seemed to be stalled and inactive when I visited. It is good to see that Brewster Kahle and the Internet Archive are moving forward together with Canadian libraries.

I agree that retaining typographical information when proofreading and converting text into a markup language would be highly desirable. The HTML versions from Distributed Proofreaders that preserve some typography are great. Preserving precise information about publishers, edition numbers, print runs etc is also important as David Rothman has emphasized on this blog.

The digitalization of millions of volumes is such a massive task that I think human proofreaders will only be able to perform markup on a subset of the volumes. For many books only limited formats will be available, e.g., scans, OCR data, and DjVu. I hope a widely deployed free reader with an openly specified format will be available for these books.</description>
		<content:encoded><![CDATA[<p>Branko, Thanks for your volunteer work and thanks for the link to the discussion of the DjVu format and XML.  Several years ago I visited the website of the &#8220;Million Book Project&#8221; when it was active at Carnegie Mellon. Unfortunately, the project seemed to be stalled and inactive when I visited. It is good to see that Brewster Kahle and the Internet Archive are moving forward together with Canadian libraries.</p>
<p>I agree that retaining typographical information when proofreading and converting text into a markup language would be highly desirable. The HTML versions from Distributed Proofreaders that preserve some typography are great. Preserving precise information about publishers, edition numbers, print runs etc is also important as David Rothman has emphasized on this blog.</p>
<p>The digitalization of millions of volumes is such a massive task that I think human proofreaders will only be able to perform markup on a subset of the volumes. For many books only limited formats will be available, e.g., scans, OCR data, and DjVu. I hope a widely deployed free reader with an openly specified format will be available for these books.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Branko Collin</title>
		<link>http://www.teleread.com/public-domain/5-years-of-distributed-proofreaders/comment-page-1/#comment-18731</link>
		<dc:creator>Branko Collin</dc:creator>
		<pubDate>Mon, 24 Oct 2005 08:57:33 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=3747#comment-18731</guid>
		<description>Garson, having quality and completeness levels, and including scans, would certainly seem to be useful, especially if it will enable you to lift a document to a next level. This should allow you to release early and often, as the FOSS people like to say. 

In a sense (a very vague and loose sense) this is already happening; many projects release tens of thousands of documents in scan form with uncorrected OCR, lots of scientific projects release a few heavily marked-up documents, and Distributed Proofreaders seems to be somewhere in between, with typographical characteristics often preserved in HTML versions. (HTML versions are not obligatory, but by my last count approximately 75% of the ebooks coming out of DP are accompanied by an HTML version.)

As for pointing to locations on a scanned page, &lt;a href=&quot;http://www.archive.org/iathreads/post-view.php?id=25169&quot; rel=&quot;nofollow&quot;&gt;I know it is possible&lt;/a&gt;, but it needs to be implemented. If it needs to be implemented at DP, it needs to be implemented in the entire system, which is going to be a lot of work. Personally, I&#039;d prefer to work first on moving to a generic XML/SGML format for DP that retains most typographical information.</description>
		<content:encoded><![CDATA[<p>Garson, having quality and completeness levels, and including scans, would certainly seem to be useful, especially if it will enable you to lift a document to a next level. This should allow you to release early and often, as the FOSS people like to say. </p>
<p>In a sense (a very vague and loose sense) this is already happening; many projects release tens of thousands of documents in scan form with uncorrected OCR, lots of scientific projects release a few heavily marked-up documents, and Distributed Proofreaders seems to be somewhere in between, with typographical characteristics often preserved in HTML versions. (HTML versions are not obligatory, but by my last count approximately 75% of the ebooks coming out of DP are accompanied by an HTML version.)</p>
<p>As for pointing to locations on a scanned page, <a href="http://www.archive.org/iathreads/post-view.php?id=25169" rel="nofollow">I know it is possible</a>, but it needs to be implemented. If it needs to be implemented at DP, it needs to be implemented in the entire system, which is going to be a lot of work. Personally, I&#8217;d prefer to work first on moving to a generic XML/SGML format for DP that retains most typographical information.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Garson Poole</title>
		<link>http://www.teleread.com/public-domain/5-years-of-distributed-proofreaders/comment-page-1/#comment-18541</link>
		<dc:creator>Garson Poole</dc:creator>
		<pubDate>Mon, 24 Oct 2005 04:57:07 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=3747#comment-18541</guid>
		<description>The work being performed by the volunteers at &quot;Distributed Proofreaders&quot; is magnificent. Yet, I think that it is time to consider a new supplementary strategy for making public domain books available. Instead of waiting for each book to be proofread, why not release books in a preliminary form as a collection of scans together with the best current results from an automated optical character recognition analysis? This approach will allow for the release of myriad books. Each book should only require a few megabytes of storage using compression methods specialized for text. This is acceptable now because hard-disk and flash memories are capacious and relatively inexpensive. Also, storage costs continue to decline. I suspect that Google and Yahoo may take this approach because human proof-reading is slow and expensive.

Will the “OpenReader” format support a collection of scans together with OCR text that can be used for searching? The OCR text would be supplemented with “index” type information that would allow a search result to point to a specific location on a scanned page.</description>
		<content:encoded><![CDATA[<p>The work being performed by the volunteers at &#8220;Distributed Proofreaders&#8221; is magnificent. Yet, I think that it is time to consider a new supplementary strategy for making public domain books available. Instead of waiting for each book to be proofread, why not release books in a preliminary form as a collection of scans together with the best current results from an automated optical character recognition analysis? This approach will allow for the release of myriad books. Each book should only require a few megabytes of storage using compression methods specialized for text. This is acceptable now because hard-disk and flash memories are capacious and relatively inexpensive. Also, storage costs continue to decline. I suspect that Google and Yahoo may take this approach because human proof-reading is slow and expensive.</p>
<p>Will the “OpenReader” format support a collection of scans together with OCR text that can be used for searching? The OCR text would be supplemented with “index” type information that would allow a search result to point to a specific location on a scanned page.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Rothman</title>
		<link>http://www.teleread.com/public-domain/5-years-of-distributed-proofreaders/comment-page-1/#comment-17876</link>
		<dc:creator>David Rothman</dc:creator>
		<pubDate>Sun, 23 Oct 2005 13:01:32 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=3747#comment-17876</guid>
		<description>That&#039;s a valuable mini-history, Branko--thanks very much for your latest post. I hope readers will pick up on the fact that DP&#039;s contribution isn&#039;t &lt;em&gt;just&lt;/em&gt; in quantity of texts, but also in quality.  - David</description>
		<content:encoded><![CDATA[<p>That&#8217;s a valuable mini-history, Branko&#8211;thanks very much for your latest post. I hope readers will pick up on the fact that DP&#8217;s contribution isn&#8217;t <em>just</em> in quantity of texts, but also in quality.  &#8211; David</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon Noring</title>
		<link>http://www.teleread.com/public-domain/5-years-of-distributed-proofreaders/comment-page-1/#comment-17423</link>
		<dc:creator>Jon Noring</dc:creator>
		<pubDate>Sun, 23 Oct 2005 05:31:06 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=3747#comment-17423</guid>
		<description>Congratulations to Distributed Proofreaders!</description>
		<content:encoded><![CDATA[<p>Congratulations to Distributed Proofreaders!</p>
]]></content:encoded>
	</item>
</channel>
</rss>

