Joe Clark on web standards for e-books
March 10, 2010 | 7:15 am
By Chris Meadows
On the A List Apart website, Joe Clark has written an extremely good, extremely long essay on why HTML-based formats are becoming the new standard for e-books, and what needs to be done to clean that standard up.
Clark points out that HTML “is great for expressing words”—and not just words in websites, but the form of words used for most fiction and some non-fiction books—what Craig Mod called “Formless Content”. Every e-book reader on the market can display some HTML-based formats—everything but the Kindle can do ePub, and the Kindle’s AZW format is just HTML-based in a different way.
Of course, every format decision blocks off other avenues, possible roads not taken. Clark readily acknowledges that, in advocating adoption of HTML, he may be blocking off new forms of “book” that have yet to be invented. But on the other hand:
I am happy to contribute to the death of “vooks” and other multimedia websites masquerading as books. (I do not want a rectangle of video yammering at me while I’m trying to read.) They’re like animated popunder ads in that no actual “user” wants them, but somebody with an agenda does. Exterminating that species is something to which I am proud to contribute. For other forms of books, advocating strict HTML markup will cause as-yet-unknowable harm.
He then goes into detail about the problems that need to be solved for HTML to succeed as an e-book format of choice. The semantics have to be cleaned up and standardized, so that e-books can be created with valid HTML code. Production methods also have a lot of room for improvement, especially given the early generations of e-books created largely out of unproofed scans of paper books.
Clark goes so far as to suggest that manuscripts should be written in HTML, then converted to Word for editing and change tracking, then passed to InDesign. (Though he does admit this point of view is “so optimistic as to be ridiculous.”)
Instead of avoiding errors to begin with, the publishing industry may choose to fix errors after they’re made—but only if authors, especially big-name authors with ruthless literary agents, complain loudly until publishers have entire imprints’ E-books repaired. This will not result in authors writing good strong HTML for new books, but will clean up part of the mess.
After this, Clark goes into considerable detail about which formatting tasks should be handled by CSS and which by the reader software. He also talks about changes that should be made to e-book text to make format conversions easier (such as using an en dash with spaces instead of an em dash with none).
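To give a sense of how small a change the dash suggestion is, here is a minimal Python sketch. It is my own illustration, not anything from Clark's essay: the function name is made up, and the choice to swallow any spaces already sitting around the em dash is an assumption.

```python
import re

EM_DASH = "\u2014"  # the unspaced em dash common in print
EN_DASH = "\u2013"  # the en dash, to be set with a space on each side

def em_to_spaced_en(text):
    """Replace each em dash, plus any spaces or tabs around it,
    with an en dash that has exactly one space on either side."""
    pattern = r"[ \t]*" + EM_DASH + r"[ \t]*"
    return re.sub(pattern, " " + EN_DASH + " ", text)
```

An em dash at the very start or end of the text will pick up a stray space under this sketch, so a real conversion pass would probably want to trim afterward.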
This article is well worth reading—and in a perfect world, publishers would be taking it to heart. I have to agree that adopting these standards would go a long way toward cleaning up the currently execrable state of a lot of e-book conversion efforts.
Standards would mean that a lot of conversion tasks could be automated, and it would be considerably easier for publishers and self-publishing authors to create e-books out of completed manuscripts.
But standards or not, there is no denying that HTML is becoming a de facto standard for presenting text, simply because it is in so much use on the web and relatively easy to work with. And it will be perfectly adequate for most of the e-texts people read.
It will be interesting to see where it goes from here.



Comments:
I don’t see the problem with existing HTML that Clark obviously does. The “structural deficiencies” he cites are largely elements created for the print era… which we need to move beyond. Trying to continue to render digital copy as if it is a projected sheet of paper is old-thinking; it’s time to develop new ways to express information, designed for the digital realm.
Further, there are few deficiencies he mentions that the use of style sheets, the occasional table or column format, and some truly creative digitally-minded design work, won’t solve. If Clark is really advocating books, as opposed to “vooks” and such, HTML can do that job ably.
And for an HTML fan, I’m not sure why he keeps coming back to Word and InDesign. Again, those are products that are optimized for print-based content. It’s time to move beyond that. Creating in HTML sounds good to me, but then, it can be converted directly into e-book formats. Using Word or a similar program to spell-check, etc, makes sense; but the content could as easily be created in Word or a similar app for those features, then pasted into HTML (to remove the extraneous coding that Word and other apps introduce) and formatted for e-book production.
As this is a method I use now, and as it results in very clean and compliant ePub formatting, I’d highly recommend it… pick your favorite word processor, and no InDesign needed.
This makes me wonder if he’s ever done print and digital concurrently.
Well, that depends on what the coders of the software want. You can’t CSS/hard format your way around some devices’ kinks.
Okay, never mind. I just went and read HIS article (which I should’ve done before I opened my mouth).
Interesting stuff.
Clark’s suggestions and pleas for standards to address such things as ligatures, varying width spaces, and the like, should be read and studied by everybody working on the next version of ePub.
Notions of ‘proper small caps’ are more difficult; but the ePub group could commission a set of fonts that are unicode complete (or at least including all the publishing-centered unicode special characters) and make them available to all device makers and software coders producing ePub reading agents.
This is exactly the sort of demand that should have gone into ePub in the first place; somehow, typesetters and typography experts were left out of the design group? That’s shameful, if true.
— asotir
Quoth Steve Jordan:
Footnotes, endnotes, sidebars, and callouts, to name a few, are not relics of a bygone past but legitimate entities that need their own structures. Some other structured formats, like DTBook and tagged PDF, provide for them. HTML doesn’t, though HTML5 fills in some gaps here and there.
His next two paragraphs demonstrate he didn’t actually read the article, so instead of wasting my time refuting claims I didn’t make, I’ll skip to the next item.
Because people use them.
Give that a try sometime. People have been publishing utilities to clean up MS Word “HTML” for ten years and nothing has worked. MS Word will not produce usable HTML.
I am not a newbie at this, and I am in fact here reading your comments, so don’t act like I’m in the other room out of earshot. Also, I answer all my mail, so you can talk to me directly, assuming spam filters don’t eat your message.
No, but it only takes about 15 minutes to strip it by hand if you have a routine in place for it, especially for things like ligatures, em dashes, curly quotes, small caps, and other things that are problematic. Lots of find-and-replace and you have to know what you’re looking for, but still.
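For what it's worth, a find-and-replace routine like the one described above could be scripted. The Python below is only a sketch: the table of characters and what each one becomes are assumptions for illustration (a given workflow might prefer HTML character references to plain ASCII), and the names are made up.

```python
# Hypothetical replacement table: typographic ligatures and curly quotes
# mapped to plain equivalents. Adjust to whatever your target format needs.
REPLACEMENTS = {
    "\ufb00": "ff",   # ff ligature
    "\ufb01": "fi",   # fi ligature
    "\ufb02": "fl",   # fl ligature
    "\ufb03": "ffi",  # ffi ligature
    "\ufb04": "ffl",  # ffl ligature
    "\u201c": '"',    # left double curly quote
    "\u201d": '"',    # right double curly quote
    "\u2018": "'",    # left single curly quote
    "\u2019": "'",    # right single curly quote or apostrophe
}

# str.maketrans accepts a dict of single characters to replacement strings.
TABLE = str.maketrans(REPLACEMENTS)

def strip_problem_characters(text):
    """Apply every replacement in one pass over the text."""
    return text.translate(TABLE)
```

Small caps are a different matter, since they live in formatting rather than in the characters themselves, so a pass like this will not touch them.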
Yeah, it’s a hassle. I’ve never used a cleaner because the one that was recommended to me was $99 & I didn’t want to spend the money.
I don’t think they should. IMO, in an ebook, footnotes and endnotes are worthless because there are no page numbers. Reciprocal hyperlinks serve for that function. Sidebars and callouts can be handled with blockquotes.
YES, it’s rudimentary, but trying to get HTML to render an ebook “page” (aka position) like a print page is not the point. The point is to treat the ebook like its own thing and work within its limitations.
Also, until the big pubs start correcting their ebooks for basic problems like…editing…some of these formatting discussions are way ahead of the game. People who read ebooks want content but they want it clean and they’re not even getting that much consideration.
I like these ideas, Joe, I do. But I *really* think that there are some devices you just can’t CSS your way around. They don’t honor much of anything, and even then, not even the rudiments. I’m looking at YOU, Kindle.
(Heh. I’ve never been “quothed” before.) Joe, I took the tone I did because Chris posted your article, not you, and I did not know if you spent time on this forum to respond to comments. Now I know.
In fact, I did read the article. And I stand by my statement that the entities you list can be created in HTML… they will simply have to take another form that is more suited to HTML. If you insist on putting footnotes, endnotes, etc, etc, in the same places you would on a printed page… then just print it. Digital is different… it’s time to do things differently.
Yes, the Word “cleaning” utilities all suck. But doing a surface copy of Word text and pasting it into a design-based HTML app like Dreamweaver strips out all of the Word garbage and creates compliant HTML. I use it for all my novels, and I can accomplish the whole thing in less than 30 minutes. As far as I know, this can be done with any word processing program… yes, most people use Word (including myself), but there are other options.
Confidential to Steve Jordan: The elements (not entities) in XHTML 1.1 represent a closed and finite list. Your imagination may take flight and convince you that, yes, Virginia, there is a footnote in XHTML, but like all childhood dreams, this one is destined to be shattered.
Public to all: Yes, Joe (and Virginia, too), I am very well aware that there are no “footnotes” in XHTML.
In XHTML, there are “links”… and we use them. If you work in a new medium, you use the tools that are designed into it. If all you want to do is force the new medium to act like the old one… maybe you shouldn’t be using the new medium after all…
A link isn’t a footnote any more than it’s an ostrich.
Whether a link can replace a footnote depends on whether one’s concerned with form or function. No, a link doesn’t look like some text at the bottom of a page, usually in a smaller type size. But the function of that block of text is to allow one to look at some supplementary information and then return to the regular flow of the text. A link does that fine. It’s the “fancy” versus “well-formed” tension of another thread.
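For anyone wondering what a link doing a footnote's job looks like in practice, here is a rough Python sketch that builds the pair of links: one in the running text that jumps to the note, and one on the note that jumps back. The function name, the id scheme ("ref1", "note1") and the "Back" label are all assumptions made for illustration; nothing in XHTML or ePub prescribes them.

```python
from html import escape

def linked_note(number, note_text):
    """Return (reference, note) as two HTML fragments that link to each other.

    The reference goes in the running text; the note goes wherever the
    notes are collected, such as the end of the chapter.
    """
    ref = '<a id="ref{n}" href="#note{n}">[{n}]</a>'.format(n=number)
    note = (
        '<p id="note{n}">{n}. {text} '
        '<a href="#ref{n}">Back</a></p>'
    ).format(n=number, text=escape(note_text))
    return ref, note

reference, note = linked_note(1, "Supplementary detail goes here.")
```

The markup does the "look, then return" part of a footnote's function. It still does not tell the reading software that the target is a note, which is the structural gap argued about above.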