‘Generating high-quality PDFs from HTML + CSS’: Newspaper, book and OLPC angles
December 23, 2007 | 8:53 pm
By David Rothman
“Prince is a computer program that converts XML and HTML into PDF documents. Prince can read many XML formats, including XHTML and SVG. Prince formats documents according to style sheets written in CSS.” – Google video here, Prince-related site here.
The TeleRead take: The word is that the tech is good enough for books and newspapers. Are we headed toward, “Just print from the Web—and get book quality”? What’s more, there might be some OLPC apps. Imagine the possibilities in the mountains and jungles of developing countries. Local people and outside experts could collaborate on wikis on the Web—for example, localized agricultural and health manuals—and then micro-publishers could print heavily illustrated POD copies for the computerless. Not to mention more literary works in some cases. Who says Western public domain books should be the only ones churned out via POD?
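For readers who want to try the conversion the Prince quote describes, here is a minimal Python sketch that shells out to the command-line tool. It assumes a `prince` executable is installed and on the PATH, and it uses only the `-s` (style sheet) and `-o` (output file) options. The helper names `prince_command` and `html_to_pdf` and any file names are hypothetical, chosen for illustration.

```python
import shutil
import subprocess


def prince_command(html_path, pdf_path, stylesheets=()):
    """Build the argument list for one Prince run (nothing is executed here)."""
    cmd = ["prince"]
    for css_path in stylesheets:
        # Each extra CSS file is passed with its own -s option.
        cmd += ["-s", css_path]
    cmd += [html_path, "-o", pdf_path]
    return cmd


def html_to_pdf(html_path, pdf_path, stylesheets=()):
    """Run Prince; raises if the tool is missing or exits with an error."""
    if shutil.which("prince") is None:
        raise RuntimeError("prince executable not found on PATH")
    subprocess.run(prince_command(html_path, pdf_path, stylesheets), check=True)
```

A call such as `html_to_pdf("manual.html", "manual.pdf", ["print.css"])` would then produce the PDF, with pagination and other print layout controlled by the CSS.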



Comments:
OK. So can OpenOffice or a variety of other utilities. Why would you want to take a small-to-medium file and turn it into a large one, which loads more slowly and doesn’t work as well on varied screen sizes? The quality of print depends primarily on the printer, not the layout software.
True about OpenOffice—excellent point. I’ve actually used OpenOffice to produce PDFs for Sony E Ink machines. That said, I’m assuming that Prince will justify itself in terms of special features in areas such as pagination. And Prince does run on the XO (though I’d love to see at what speed). Be interesting to hear Jon Noring’s take on this. Thanks and happy holidays. David
1. I still don’t understand why you’d want to do this. Prince can’t perform miracles, and many sites just won’t translate well to the limited page size of a PDF. PDFs are good for a few things: creating a printable document, embedding fonts, and including vector graphics (though SVG is bringing vector graphics to browsers too). For reading online, PDFs suck IMO. Why do you need page numbers? Just to remember where you were? I’m sure we can come up with something better than PDF.
2. Creating PDFs is a CPU-intensive task. The XO would be the last machine on which you’d want to do it.
3. Why not just use wget -m or the Firefox plugin “ScrapBook” to archive the site for offline browsing? The HTML version will be smaller than the corresponding PDF.
OK, I read the rest of your post (I had started with the comments) and see the answer to #1: printed docs for the computerless.
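The wget route from point 3 can be sketched in Python as follows. This is only an illustration: it assumes GNU wget is installed, the helper names `wget_mirror_command` and `mirror_site` are hypothetical, and the options beyond `--mirror` (the long form of `-m`) are additions that are often useful for offline browsing, not something the comment itself specified.

```python
import shutil
import subprocess


def wget_mirror_command(url, dest_dir="mirror"):
    """Build a wget command line for an offline copy (nothing is executed here)."""
    return [
        "wget",
        "--mirror",            # long form of -m: recursive download with timestamping
        "--convert-links",     # rewrite links so pages work from local disk
        "--page-requisites",   # also fetch the images and CSS each page needs
        "--no-parent",         # stay below the starting directory
        "--directory-prefix=" + dest_dir,
        url,
    ]


def mirror_site(url, dest_dir="mirror"):
    """Run the mirror; raises if wget is missing or exits with an error."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget executable not found on PATH")
    subprocess.run(wget_mirror_command(url, dest_dir), check=True)
```

For example, `mirror_site("http://example.org/manual/", "offline")` would save a browsable copy under `offline/` (the URL here is a placeholder).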
I’ve tried using Prince for this purpose (we’re very interested in converting Web pages to PDF for UpLib), and gave up on it quickly. It converts properly formatted XML to PDF, and there’s precious little properly formatted XML out there on the Web. You want a tool that’s more tolerant of the typical mistakes one finds in real Web pages. And most sites still aren’t XML, they’re HTML of some flavor.
As far as I know, there are basically two ways to go with this. The program HTMLDOC (http://www.htmldoc.org/) will convert older HTML to good-looking PDF, with lots of options. Unfortunately, Apple hired the guy who was working on it to fix/improve their CUPS printing system, and he hasn’t updated this freeware in over a year. The current version will convert HTML 3.2, but won’t handle newer constructs, and, worse yet, won’t handle Unicode.
Last January, I wrote a program (soon to be available as part of the UpLib open source release at http://uplib.parc.com/) called uplib-openoffice-convert-to-pdf, which takes a mass of XML and/or HTML and uses the OpenOffice macro system to drive a conversion process, generating a PDF file. It can handle XHTML and HTML 4.x constructs, and Unicode. It has other problems that sometimes make the HTMLDOC route superior, but in general it works better than HTMLDOC for this purpose.
On the other hand, if you just want to generate PDF, and have control over the generation program, why not skip the XML step altogether, and use a package like ReportLab instead?