The shape of EPUB to come, #2: Hyphenation

By Hadrien Gardeur, Co-Founder of Feedbooks

April 10, 2008

I remember very precisely my first day with an e-ink device. Instantly I realized what a difference it makes to read on a screen that looks like paper. But for some reason, the whole experience still felt like reading on a screen, instead of reading a book. It took me a few days to fully understand this impression: typesetting. I’m used to hyphenation, kerning, widows/orphans etc.—in a book. On a screen, the typesetting is usually very limited. While the screen looked like paper, the text looked like something that a screen displayed.

I managed to avoid this problem very quickly by using PDF files created for the device, but it’s still something that no reflowable format and no reading system has solved yet.

Why should EPUB add support for hyphenation then? Customers expect the same quality of experience with an e-book as with a printed book. Publishers are very picky about typesetting too. So let’s see how EPUB could add support for hyphenation…

Soft Hyphen

According to the W3C:

“In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a user agent as just another character. The soft hyphen tells the user agent where a line break can occur.

Those browsers that interpret soft hyphens must observe the following semantics: If a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen character. For operations such as searching and sorting, the soft hyphen should always be ignored.

In HTML, the plain hyphen is represented by the “-” character (&#45; or &#x2D;). The soft hyphen is represented by the character entity reference &shy; (&#173; or &#xAD;)”

Soft hyphens are already supported with the current OPS specifications for EPUB. I can see a situation where this is useful: you’re using a technical word that might not be available in a hyphenation dictionary, therefore you can manually specify where the word can be hyphenated.
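To make the manual case concrete, here is a minimal Python sketch, not taken from any reading system. The function names and the break positions are hypothetical; the only fixed fact is that the soft hyphen is the Unicode character U+00AD. It marks the allowed break points in a word, strips the marks again for searching, and shows a hyphen only when a break actually happens, following the W3C semantics quoted above.

```python
SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN, written &shy; in XHTML


def add_soft_hyphens(word, break_points):
    """Insert a soft hyphen before each character index in break_points."""
    pieces = []
    previous = 0
    for index in sorted(break_points):
        pieces.append(word[previous:index])
        previous = index
    pieces.append(word[previous:])
    return SOFT_HYPHEN.join(pieces)


def strip_soft_hyphens(text):
    """Remove soft hyphens, since searching and sorting should ignore them."""
    return text.replace(SOFT_HYPHEN, "")


def break_at_soft_hyphen(word, width):
    """Split a word at the last soft hyphen that fits in `width` columns.

    Returns (first_line, remainder). A visible hyphen is added only when
    a break really happens; otherwise the soft hyphens stay invisible.
    """
    visible = strip_soft_hyphens(word)
    if len(visible) <= width:
        return visible, ""
    parts = word.split(SOFT_HYPHEN)
    best = None
    shown = 0
    for i, part in enumerate(parts[:-1]):
        shown += len(part)
        if shown + 1 <= width:  # +1 leaves room for the visible hyphen
            best = i
    if best is None:
        return "", word  # no break fits; the caller has to handle overflow
    head = "".join(parts[:best + 1]) + "-"
    tail = SOFT_HYPHEN.join(parts[best + 1:])
    return head, tail


marked = add_soft_hyphens("hyphenation", [2, 6])
print(break_at_soft_hyphen(marked, 8))
```

The remainder keeps its own soft hyphens, so a very long technical word can be broken again on the next line.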

But using soft hyphens in the whole flow of a book? Bad idea. First of all, you can’t expect all books to be hyphenated this way. It also makes the process of generating a book much more difficult.

The CSS3 way

In the current version of the working draft for the CSS3 Generated Content for Paged Media Module, there’s a section dedicated to hyphenation.

The two most important properties are:

hyphens: used to set how a text is hyphenated. You can disable hyphenation, use the manual setting (certain characters and soft hyphens) or the auto setting that will fully hyphenate the text based on the resources specified.
hyphenate-resource: to list the resources when hyphens is set to auto.

You can also get extra control over the way your text is hyphenated using: hyphenate-before, hyphenate-after, hyphenate-lines & hyphenate-character.
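As a rough illustration of what these extra controls could mean inside a reading system, here is a small Python sketch. It assumes my reading of the draft: hyphenate-before and hyphenate-after are minimum character counts on each side of a break, and hyphenate-lines caps the number of successive hyphenated lines. The function and parameter names are hypothetical, and the default values are placeholders rather than values from the draft.

```python
def allowed_breaks(word, candidates, hyphenate_before=2, hyphenate_after=2):
    """Keep the candidate break indexes that leave enough characters
    before and after the hyphen."""
    return [
        index
        for index in sorted(candidates)
        if index >= hyphenate_before and len(word) - index >= hyphenate_after
    ]


def may_hyphenate_line(previous_hyphenated_lines, hyphenate_lines=None):
    """Return True if one more line may end with a hyphen.

    previous_hyphenated_lines counts the hyphenated lines directly above;
    hyphenate_lines is the maximum allowed run, or None for no limit.
    """
    return hyphenate_lines is None or previous_hyphenated_lines < hyphenate_lines


print(allowed_breaks("hyphenation", [2, 6, 9], hyphenate_before=3, hyphenate_after=3))
```

With three characters required on each side, only the break after "hyphen" survives.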

I love the way you can automatically hyphenate the text using a list of resources: it’s very simple to set up. Just as we can embed fonts, we could also embed hyphenation dictionaries. The only problem in this case is that we may end up with large files. To avoid that, the reading system could ship with a few hyphenation dictionaries for the most common languages. And if you’re using something more exotic (technical words, for example), you could also embed the right resource in your file.

In XSL, you can specify the language too. Specifying the language would be compulsory if you rely on the hyphenation dictionaries embedded with the reading system, but also a very powerful tool if it’s used the right way with hyphenate-resource. If your text uses more than a single language, you could specify multiple hyphenate-resource based on the language.
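One way a reading system might combine these ideas is sketched below in Python. Everything in it is an assumption for illustration: the file paths, the BUILT_IN table and the pick_resource function are hypothetical, and the lookup order (embedded resources first, then the dictionaries shipped with the reading system, with a fallback from a regional tag such as en-GB to en) is just one reasonable policy.

```python
# Hypothetical dictionaries shipped with the reading system.
BUILT_IN = {"en": "builtin/en.dic", "fr": "builtin/fr.dic"}


def pick_resource(lang, embedded, built_in=BUILT_IN):
    """Return the hyphenation resource for a language tag, or None.

    Resources embedded in the book win over built-in ones, and a
    regional tag such as 'en-GB' falls back to its primary language.
    """
    if not lang:
        return None  # without a language, automatic hyphenation is unsafe
    tag = lang.lower()
    candidates = [tag]
    if "-" in tag:
        candidates.append(tag.split("-", 1)[0])
    for candidate in candidates:
        if candidate in embedded:
            return embedded[candidate]
    for candidate in candidates:
        if candidate in built_in:
            return built_in[candidate]
    return None


# Hypothetical book that embeds its own English resource for technical terms.
book_resources = {"en": "OEBPS/hyph/en-technical.dic"}
print(pick_resource("en-GB", book_resources))
print(pick_resource("fr", book_resources))
```

Returning None when no language or resource is known lets the reading system fall back to manual hyphenation only.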

Conclusion

Hyphenation is just a single example of how typesetting support can be improved in EPUB. It is fairly easy to add this sort of support, although it would add some extra processing on the reading system.

Moderator’s note: I heartily agree with Hadrien’s concerns but would be interested in hearing from those who don’t. Why shouldn’t .epub get more typographical capabilities? Genuine bibliophiles love not just well-done hyphenation but also goodies such as kerning. I can enjoy hyphenation when reading .epub files in FBReader, but if there are ways to make hyphenation part of the official spec in a way that improves life for readers and developers alike, then why not? FBReader’s hyphenation is nice but far from perfect. I’m not even sure if Digital Editions, Adobe’s .epub app, offers hyphenation.

On another issue, no, that isn’t an official IDPF logo at the top of Hadrien’s fine article—rather, a suggested image from BookGlutton. What do you think of the possible prototype, though? Got any images of your own? Think the TeleBlog should hold a prototype contest to encourage the IDPF to take official action on a logo? Here’s hoping that IDPF will act on the “Intel Inside”-type idea that I proposed earlier. – D.R.

9 COMMENTS

Jon Noring April 10, 2008 at 9:14 am

Great article, Hadrien!

As a contributor to IDPF’s OEBPS Working Group, I find Hadrien’s article on hyphenation support in EPub intriguing, and I suggest that Hadrien submit a formal request to the OEBPS Working Group asking the group to study it and determine whether to explicitly include recommendations and/or requirements in a future version of the OPS specification.

It is important in this discussion, and any discussion about EPub, to separate the Reading System (or as Lee Passey prefers to put it, the “User Agent”) from the OPS specification. (For those not familiar with the underpinnings of EPub, EPub is defined by the interrelated OPS, OPF and OCF specifications.)

Regarding Hadrien’s first suggestion, using the “soft hyphen” character, the current OPS specification implies support for this character per Unicode and the HTML specification. Thus, the OPS author (EPub author) may include that character in content. So, Hadrien’s recommendation is pointed towards EPub Publication authors.

However, OPS itself says nothing about how Reading Systems are to handle soft hyphens when encountered. Here the recommendations of HTML that Hadrien quoted come into play since OPS supports XHTML. Of course, a Reading System may completely ignore the soft hyphens it encounters in content and not use them in any manner (default rendering is that soft hyphens are invisible and of zero-width.)

So a future version of the OPS specification could add a section discussing the use of the soft hyphen, and include both authoring and Reading System suggestions, recommendations and requirements.

Hadrien’s second comment on the use of the CSS3 properties of hyphens and hyphenate-resource is also interesting. Currently OPS does allow these properties to be used in CSS documents, but does not require Reading Systems to recognize them. Any Reading System may simply ignore CSS properties that it is not required by OPS to recognize.

So, a future version of OPS could also discuss the use of these CSS3 properties. (Note that the OEBPS Working Group was reluctant to “bless” any of the CSS3 properties for the current OPS, since most of them are still in the draft phase at W3C. A Reading System, though, may choose to recognize and use any of them — at least that’s my current understanding without delving back into the subtleties of OPS.)

Alan Wallcraft April 10, 2008 at 9:20 am

FBReader’s automatic hyphenation is based on TeX’s approach, see http://www.tex.ac.uk/cgi-bin/texfaq2html?label=hyphen
It needs to know the language to hyphenate correctly.
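For readers who have not seen the TeX approach, here is a minimal Python sketch of pattern-based hyphenation in that style. It is not FBReader's code. The tiny pattern list is illustrative data in the style of TeX's English patterns, chosen so that one word works; a real pattern file holds thousands of entries per language, which is why the language has to be known. The function names are hypothetical, and the left and right minimums follow TeX's usual English settings of 2 and 3.

```python
import re

# Illustrative patterns only; a real language file is far larger.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and gap weights."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    position = 0
    for char in pattern:
        if char.isdigit():
            weights[position] = int(char)
        else:
            position += 1
    return letters, weights


def hyphenation_points(word, patterns, left_min=2, right_min=3):
    """Return the indexes in `word` before which a hyphen may be placed."""
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[i] is the gap before padded[i]
    for pattern in patterns:
        letters, weights = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, weight in enumerate(weights):
                scores[start + offset] = max(scores[start + offset], weight)
            start = padded.find(letters, start + 1)
    # An odd score allows a break, an even score forbids it.
    return [
        i - 1
        for i in range(len(padded))
        if scores[i] % 2 == 1 and left_min <= i - 1 <= len(word) - right_min
    ]


def hyphenate(word, patterns, mark="-"):
    """Join the word's fragments with `mark` at every allowed break."""
    points = hyphenation_points(word, patterns)
    pieces = [word[a:b] for a, b in zip([0] + points, points + [len(word)])]
    return mark.join(pieces)


print(hyphenate("hyphenation", PATTERNS))
```

Competing patterns are resolved by keeping the highest number at each gap, so a pattern with an even weight can veto a break that a weaker pattern suggested.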

Hadrien April 10, 2008 at 11:12 am

I agree, Jon: the first step would be a recommendation in the OPS specs, with a best practice for authoring and a recommendation for the Reading System.

The TeX approach to hyphenation is usually considered pretty good but not perfect. In the OPF file of an EPUB book, you MUST specify the language of the book using dc:language (at least one element; multiple elements are supported too). Therefore, a Reading System could automatically hyphenate if there are default hyphenation patterns available for this language.
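As a small sketch of that first step, the Python snippet below reads the dc:language elements from an OPF package document using the standard library. The sample package is a made-up minimal one and the function name is hypothetical; the Dublin Core namespace URI is the standard one.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace

# Made-up minimal OPF package, for illustration only.
SAMPLE_OPF = """
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Sample</dc:title>
    <dc:language>en</dc:language>
    <dc:language>fr</dc:language>
  </metadata>
</package>
"""


def book_languages(opf_xml):
    """Return every non-empty dc:language value, in document order."""
    root = ET.fromstring(opf_xml)
    return [
        element.text.strip()
        for element in root.iter("{%s}language" % DC_NS)
        if element.text and element.text.strip()
    ]


print(book_languages(SAMPLE_OPF))
```

A Reading System could then check the first value against the languages it has default patterns for, and skip automatic hyphenation when none match.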

I think that we really need something similar to hyphenate-resource for another reason: for some languages or when you’re using a specialized vocabulary, you’ll most likely use words that the basic patterns won’t be able to hyphenate.

Jon Noring April 10, 2008 at 11:35 am

The primary language is defined in dc:language and, better yet, by the xml:lang attribute applied to the <html> or <body> element of each XHTML content document. Where the content contains words that are not in that primary language, the author should apply xml:lang to them. This way hyphenation engines will know in which dictionary to look.

In addition, this allows text-to-speech engines to likewise know the language of each word.
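A minimal Python sketch of how an engine could resolve the language of each run of text is shown below. It relies only on the standard library and on the fact that xml:lang is inherited by child elements; the sample markup and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up content document with one foreign word marked up.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><body>'
    '<p>Her <span xml:lang="de">Weltanschauung</span> shaped the book.</p>'
    '</body></html>'
)


def text_runs_with_language(element, inherited=None):
    """Yield (language, text) pairs for every text run under `element`.

    An element without its own xml:lang inherits the nearest ancestor's.
    """
    lang = element.get(XML_LANG, inherited)
    if element.text and element.text.strip():
        yield lang, element.text.strip()
    for child in element:
        yield from text_runs_with_language(child, lang)
        # Text after a child belongs to the parent, so it uses `lang`.
        if child.tail and child.tail.strip():
            yield lang, child.tail.strip()


for language, text in text_runs_with_language(ET.fromstring(SAMPLE)):
    print(language, text)
```

Each pair could then be handed to the hyphenation dictionary, or the text-to-speech voice, for that language.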

Hadrien April 10, 2008 at 11:59 am

Yes, you’re right, the xml:lang attribute is probably better suited for this than the Dublin Core element in the OPF file.

Jon Noring April 10, 2008 at 12:26 pm

It is simply good practice, and one I would require had I written the OPS specification, for all XHTML and DTBook documents in an OPS Publication, to include xml:lang on the root element. Doing it here, in addition to doing it in the OPF Package, assures the information is not lost when such documents are repurposed.

Jack Tingle April 10, 2008 at 2:59 pm

Please, don’t include hyphenation! Or any other really cool features. I read on a diverse range of systems, and each feature that gets added is another way for one or another of the systems to screw up. Keep it simple.

Other features you should consider not adding: ligatures, compound characters, accented characters which aren’t part of the normal set, nice looking but non-standard single quotes, double quotes, and apostrophes, to name but a few.

Regards,
Jack Tingle

Aaron S. Miller, CTO of BookGlutton April 10, 2008 at 4:56 pm

CSS is the way to nice typography in epubs. Rather than picking out properties to include, the OPS spec should REQUIRE ALL reading systems to be CSS2 compliant first, and recommend on top of that. This will help usher in all the goodness of CSS3 that is around the corner, and it won’t do any harm to backward compatibility with older versions of OPS. This is contrary to the approach taken right now, which I think favors hardware systems by selectively including certain CSS properties and not others, and by forbidding script interpretation, which is at the heart of cross-browser rendering consistency and which, with monstrosities like Internet Explorer 8 out there, is our only hope of having good-looking books online.

Tamas Simon April 10, 2008 at 5:49 pm

IMO instead of committees we need a good open-source implementation: both a reader and an authoring tool.
This would evolve faster than the standard and would provide a way to try new ideas.

I come from a CORBA background.
Though CORBA is pretty much dead because the OMG, the group overseeing the standard, reacted way too slowly, some gurus started a small company called ZeroC, open-sourced the product, fixed the old specs where they were lacking, and are simply doing the right thing.

I wonder if this will ever happen with EPUBS.

As you know from my posts… I’m not a big fan…
