<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Google Books makes hash out of classics</title>
	<atom:link href="http://www.teleread.com/2009/09/01/google-books-makes-hash-out-of-classics/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.teleread.com/paul-biba/google-books-makes-hash-out-of-classics/</link>
	<description>News &#38; views on e-books, libraries, publishing and related topics</description>
	<lastBuildDate>Tue, 14 Feb 2012 21:55:20 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
	<item>
		<title>By: Mags</title>
		<link>http://www.teleread.com/paul-biba/google-books-makes-hash-out-of-classics/comment-page-1/#comment-1140761</link>
		<dc:creator>Mags</dc:creator>
		<pubDate>Tue, 01 Sep 2009 22:15:47 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/?p=27862#comment-1140761</guid>
		<description>I used to work for an online text aggregator. They would take content from providers in whatever electronic format the provider could manage (this was about ten years ago so it was all over the place) and write a Perl filter to apply standard formatting so the end user had a consistent look and feel. My job was to QA the end product and troubleshoot any issues caused by the filter. It went back to the programmers until they got it right; only then was the content released to end users.

Interestingly, we were bought out by a competitor, which actually took hard copies of the documents and scanned and OCRed them. When their content began to be incorporated into ours, I noticed a lot of OCR errors (my favorite being &quot;World War Ill&quot;--that&#039;s capital i, and two lower-case ls; a word synonymous with &quot;sick.&quot;). The Powers That Be were shocked, shocked! to learn that there were OCR errors in their content. 

Ours was cleaner, because A. we started with electronic text (which of course is not possible with old books) and B. &lt;i&gt;there was a sentient human being checking that it wasn&#039;t junk.&lt;/i&gt; Computers are great for the heavy lifting but we primates haven&#039;t yet made ourselves totally obsolete. And I&#039;ve never worked with an OCRed text that was 100% clean.</description>
		<content:encoded><![CDATA[<p>I used to work for an online text aggregator. They would take content from providers in whatever electronic format the provider could manage (this was about ten years ago so it was all over the place) and write a Perl filter to apply standard formatting so the end user had a consistent look and feel. My job was to QA the end product and troubleshoot any issues caused by the filter. It went back to the programmers until they got it right; only then was the content released to end users.</p>
<p>Interestingly, we were bought out by a competitor, which actually took hard copies of the documents and scanned and OCRed them. When their content began to be incorporated into ours, I noticed a lot of OCR errors (my favorite being &#8220;World War Ill&#8221;&#8211;that&#8217;s capital i, and two lower-case ls; a word synonymous with &#8220;sick.&#8221;). The Powers That Be were shocked, shocked! to learn that there were OCR errors in their content. </p>
<p>Ours was cleaner, because A. we started with electronic text (which of course is not possible with old books) and B. <i>there was a sentient human being checking that it wasn&#8217;t junk.</i> Computers are great for the heavy lifting but we primates haven&#8217;t yet made ourselves totally obsolete. And I&#8217;ve never worked with an OCRed text that was 100% clean.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Elie Morisse</title>
		<link>http://www.teleread.com/paul-biba/google-books-makes-hash-out-of-classics/comment-page-1/#comment-1140710</link>
		<dc:creator>Elie Morisse</dc:creator>
		<pubDate>Tue, 01 Sep 2009 20:58:13 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/?p=27862#comment-1140710</guid>
		<description>It&#039;s not the scans that fall short, but their OCR software, OCRopus, which is open-source and constantly improving.
As OCRopus improves, the EPUB versions will improve too, and for the time being you can always download the original image-based PDF instead.

OCRed PDF and DjVu files from the Internet Archive, produced with other OCR software, aren&#039;t proofread either; they are full of typos and unsuitable for e-book readers.

Only the Gutenberg books are proofread and correctly formatted, but their number is several orders of magnitude smaller.</description>
		<content:encoded><![CDATA[<p>It&#8217;s not the scans that fall short, but their OCR software, OCRopus, which is open-source and constantly improving.<br />
As OCRopus improves, the EPUB versions will improve too, and for the time being you can always download the original image-based PDF instead.</p>
<p>OCRed PDF and DjVu files from the Internet Archive, produced with other OCR software, aren&#8217;t proofread either; they are full of typos and unsuitable for e-book readers.</p>
<p>Only the Gutenberg books are proofread and correctly formatted, but their number is several orders of magnitude smaller.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mike Perry</title>
		<link>http://www.teleread.com/paul-biba/google-books-makes-hash-out-of-classics/comment-page-1/#comment-1140678</link>
		<dc:creator>Mike Perry</dc:creator>
		<pubDate>Tue, 01 Sep 2009 19:19:20 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/?p=27862#comment-1140678</guid>
		<description>Quality is one appropriate gripe about what Google is doing. Quantity is another. Internet Archive seems to be filling up with multiple scans of the same title, typically from Google&#039;s library scans, leaving visitors confused about which would be the best download. 

I generally go with the version that has the most downloads, but that may be just because it was the first one out and because others are doing as I am.

Google&#039;s behavior leaves me longing for the era when ebooks meant books done by Gutenberg, books legally copied and carefully proofed.</description>
		<content:encoded><![CDATA[<p>Quality is one appropriate gripe about what Google is doing. Quantity is another. Internet Archive seems to be filling up with multiple scans of the same title, typically from Google&#8217;s library scans, leaving visitors confused about which would be the best download. </p>
<p>I generally go with the version that has the most downloads, but that may be just because it was the first one out and because others are doing as I am.</p>
<p>Google&#8217;s behavior leaves me longing for the era when ebooks meant books done by Gutenberg, books legally copied and carefully proofed.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: pond</title>
		<link>http://www.teleread.com/paul-biba/google-books-makes-hash-out-of-classics/comment-page-1/#comment-1140631</link>
		<dc:creator>pond</dc:creator>
		<pubDate>Tue, 01 Sep 2009 17:39:10 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/?p=27862#comment-1140631</guid>
		<description>Google has a history of releasing online apps with the &#039;Beta&#039; label on them. But these things usually function.

Imagine if Gmail had launched with comparable quality to the Gulliver OCR! Gmail never would have taken off.

This really is a tremendous black eye to Google&#039;s reputation, and the widely-announced deal with Sony is only going to expose these warts and hurt Google even more.

Google really needs to fix this, fast.</description>
		<content:encoded><![CDATA[<p>Google has a history of releasing online apps with the &#8216;Beta&#8217; label on them. But these things usually function.</p>
<p>Imagine if Gmail had launched with comparable quality to the Gulliver OCR! Gmail never would have taken off.</p>
<p>This really is a tremendous black eye to Google&#8217;s reputation, and the widely-announced deal with Sony is only going to expose these warts and hurt Google even more.</p>
<p>Google really needs to fix this, fast.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Steve Jordan</title>
		<link>http://www.teleread.com/paul-biba/google-books-makes-hash-out-of-classics/comment-page-1/#comment-1140535</link>
		<dc:creator>Steve Jordan</dc:creator>
		<pubDate>Tue, 01 Sep 2009 15:03:54 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/?p=27862#comment-1140535</guid>
		<description>To me, this alone is a good enough reason to forbid Google from posting scanned material. If quality control is beyond them, they shouldn&#039;t be allowed to do the job, any more than I should be allowed to build the next rocket to Mars in my garage.</description>
		<content:encoded><![CDATA[<p>To me, this alone is a good enough reason to forbid Google from posting scanned material. If quality control is beyond them, they shouldn&#8217;t be allowed to do the job, any more than I should be allowed to build the next rocket to Mars in my garage.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Rothman</title>
		<link>http://www.teleread.com/paul-biba/google-books-makes-hash-out-of-classics/comment-page-1/#comment-1140534</link>
		<dc:creator>David Rothman</dc:creator>
		<pubDate>Tue, 01 Sep 2009 15:03:32 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/?p=27862#comment-1140534</guid>
		<description>OLD STORY, good cause, lol. David</description>
		<content:encoded><![CDATA[<p>OLD STORY, good cause, lol. David</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mike Cane</title>
		<link>http://www.teleread.com/paul-biba/google-books-makes-hash-out-of-classics/comment-page-1/#comment-1140531</link>
		<dc:creator>Mike Cane</dc:creator>
		<pubDate>Tue, 01 Sep 2009 14:54:38 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/?p=27862#comment-1140531</guid>
		<description>*snort*
http://ebooktest.blogspot.com/2009/08/suckups-suckers-and-sloppiness-mislead.html</description>
		<content:encoded><![CDATA[<p>*snort*<br />
<a href="http://ebooktest.blogspot.com/2009/08/suckups-suckers-and-sloppiness-mislead.html" rel="nofollow">http://ebooktest.blogspot.com/2009/08/suckups-suckers-and-sloppiness-mislead.html</a></p>
]]></content:encoded>
	</item>
</channel>
</rss>

