Epub 3 beefs up metadata but omits semantic enrichment, by Eric Hellman
June 23, 2011 | 9:39 am
Ironic amusement fills me when I hear book industry people say things like “metadata has become cool”, or “context is everything”. Welcome to the 20th century and all that. Meanwhile, in the library industry, metadata has been cool long enough to coat everything with a thick rind of freezer burn.
There’s good news and not-so-good news for ebook metadata. The revision to the EPUB standard, published just a month ago, includes metadata tools that could eventually lead to a new era of metadata cooperation between publishers and the entire book supply chain, including libraries. At the same time, the revision fails to take advantage of ready-made vehicles for semantic enrichment of content, a move that could still provide new types of revenue for publishers while giving libraries new opportunities to remain relevant as books become digital.
Since I’m incurably optimistic, I’ll start with the half-full glass: publication-level metadata. EPUB 3 offers a whole bunch of ways to include publication-level metadata in an EPUB container. As an example, imagine an EPUB 3 edition of “Emma” with this mark-up in its package document (essentially the navigation directory for the book):
<metadata>
  ...
  <meta property="dcterms:identifier" id="pub-id">
    urn:uuid:A1B0D67E-2E81-4DF5-9E67-A64CBE366809
  </meta>
  <link rel="marc21xml-record" href="http://www.archive.org/download/cihm_29722/cihm_29722_marc.xml" />
  <link rel="marc21xml-record" href="/cihm_29722_marc.xml" />
  <link rel="foaf:homepage" href="http://openlibrary.org/books/OL24234129M/Emma" />
  ...
</metadata>
In this example, the first link element points to a MARC 21 XML record (MARC 21 is a blattarian standard for library metadata (look it up)) at the Internet Archive. The second link element points to the same record included in the EPUB container itself. There is also built-in vocabulary that allows the link element to point to ONIX, MODS, and XMP metadata records.
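To make that concrete, here is a rough Python sketch of how a distributor or library system might pull those linked records out of a package document. It is an illustration under assumptions, not a reference implementation: I have added the OPF namespace declaration so the fragment parses on its own, dropped the "..." placeholders, and the function name linked_records is made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace of EPUB package documents (added here so the fragment is well-formed).
OPF_NS = "http://www.idpf.org/2007/opf"

PACKAGE_METADATA = """\
<metadata xmlns="http://www.idpf.org/2007/opf">
  <meta property="dcterms:identifier" id="pub-id">
    urn:uuid:A1B0D67E-2E81-4DF5-9E67-A64CBE366809
  </meta>
  <link rel="marc21xml-record"
        href="http://www.archive.org/download/cihm_29722/cihm_29722_marc.xml" />
  <link rel="marc21xml-record" href="/cihm_29722_marc.xml" />
  <link rel="foaf:homepage"
        href="http://openlibrary.org/books/OL24234129M/Emma" />
</metadata>
"""


def linked_records(metadata_xml, rel):
    """Return the href of every <link> whose rel equals `rel`, in document order."""
    root = ET.fromstring(metadata_xml)
    return [
        link.get("href")
        for link in root.iter(f"{{{OPF_NS}}}link")
        if link.get("rel") == rel
    ]


# Both the remote copy and the copy inside the container come back.
marc_links = linked_records(PACKAGE_METADATA, "marc21xml-record")
```

A consumer could then prefer the copy inside the container and fall back to the remote record, which is exactly the kind of matching-up work a separate metadata feed requires today.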
The meta element can also be used in the EPUB package document’s metadata block. It’s defined quite differently from HTML5’s empty meta element, with an about attribute and allowed text content. In principle, it can be used to encode arbitrary RDF triples, thanks to a prefix extension mechanism borrowed from RDFa which allows EPUB authors to add vocabularies to their documents.
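The prefix mechanism is easier to see in code than in prose. The sketch below assumes a prefix declaration written as whitespace-separated "name: IRI" pairs, in the RDFa style; the declaration string and the two helper functions are hypothetical and only meant to show how a property such as dcterms:identifier turns into a full IRI that can act as the predicate of a triple.

```python
def parse_prefixes(prefix_attr):
    """Turn 'foaf: http://... ex: http://...' into {'foaf': 'http://...', ...}."""
    tokens = prefix_attr.split()
    # Tokens alternate between "name:" and the IRI that the name stands for.
    return {name.rstrip(":"): iri for name, iri in zip(tokens[0::2], tokens[1::2])}


def expand(term, prefixes):
    """Expand 'prefix:reference' to a full IRI; leave anything else unchanged."""
    prefix, sep, reference = term.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + reference
    return term


# Hypothetical declaration, written out in full for the sake of the example.
prefixes = parse_prefixes(
    "dcterms: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/"
)

predicate = expand("dcterms:identifier", prefixes)
homepage = expand("foaf:homepage", prefixes)
```

Terms without a declared prefix, such as marc21xml-record, pass through untouched, so a real implementation would still need to know the built-in vocabulary.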
These capabilities, on their own, could support major changes in the way that books are produced, delivered and accessed. In a publisher workflow, the EPUB file could serve as the carrier for all the components and versions of a book, even bits that today might be left out or lost in the caverns of so-called “content management systems”. A distributor would no longer need to match up content files with records in a separate metadata feed. EPUB books for libraries could be preloaded with cataloging and enrichment data, greatly simplifying the process of making the ebooks accessible in libraries.
Given the great advances for “package-level” metadata, it’s a bit disappointing that semantic mark-up of content documents missed the EPUB 3 boat. The story is a bit complicated, and it’s far from over. Imagine that you want to add mark-up to a book’s citations; perhaps you want to embed identifiers to support library linking systems. Or perhaps you’re a medical publisher and you want to embed machine-readable statements about drugs and diseases in a pharmaceutical textbook. Or perhaps you want to publish a travel guide and you want search engines to pick out the places you’re describing. These applications are not really supported by the current version of EPUB 3.
EPUB content documents have a feature that you might think would do the trick, but doesn’t really. The epub:type attribute supports “semantic inflection” of elements. This attribute can be used to mark a paragraph as a bibliographic citation, for example, and supports many of the requirements imposed by conversion of content from legacy or specialized formats into the HTML5 dialect used by EPUB. It’s an important feature, but not enough to support semantic enrichment.
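Here is a small Python sketch of what epub:type does and does not give you. The content document is invented for illustration, and I am assuming the term biblioentry from the EPUB structural semantics vocabulary; the helper name elements_of_type is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the epub: prefix in EPUB content documents.
EPUB_NS = "http://www.idpf.org/2007/ops"

CONTENT_DOC = """\
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:epub="http://www.idpf.org/2007/ops">
  <body>
    <section epub:type="bibliography">
      <p epub:type="biblioentry">Austen, Jane. Emma.</p>
      <p>An ordinary paragraph with no inflection.</p>
    </section>
  </body>
</html>
"""


def elements_of_type(xhtml, wanted):
    """Return the text of every element whose epub:type lists `wanted`."""
    root = ET.fromstring(xhtml)
    attr = f"{{{EPUB_NS}}}type"
    return [
        "".join(el.itertext()).strip()
        for el in root.iter()
        # epub:type may hold several space-separated terms.
        if wanted in el.get(attr, "").split()
    ]


citations = elements_of_type(CONTENT_DOC, "biblioentry")
```

The attribute tells software that the paragraph is a citation, but all that comes back is a string of text. There is no identifier for the cited work and no statement about it, which is the gap that semantic enrichment is supposed to fill.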
Part of the problem is EPUB 3’s dependence on HTML5, which is not yet a stable spec and is enmeshed in some surprisingly raw W3C politics. W3C has been the home of HTML standards development since the very early stages of the web, and has also been the home of semantic web standards development. HTML5 started outside of W3C in the WHATWG, an initiative to develop HTML in a way that would be backwards compatible with good old-fashioned non-XML HTML. W3C was convinced to fold WHATWG into its development efforts because of WHATWG’s corporate backing. Even so, the WHATWG version of the HTML5 spec drips with sarcasm towards W3C HTML Working Group decisions.
During part of the development of EPUB 3, the HTML5 draft included “Microdata”, a method of embedding semantic mark-up in HTML. RDFa, a standard that competes with Microdata, was developed through W3C channels, and within W3C, it was decided in February of 2010 to move Microdata out of the HTML spec so as to give it equal footing with RDFa. Some participants in the EPUB working group wanted to include RDFa in the standard; others thought this would impose too much of a complexity burden on publisher-implementers. The EPUB draft ended up being released without either RDFa or Microdata.
The recent endorsement of Microdata by the Google-Yahoo-Bing cooperation has changed the competitive landscape for embedded semantics. It’s now apparent that Microdata will get priority implementation in HTML development tools, leaving RDFa as a niche technology. For most use cases of EPUB semantic markup, the differences between RDFa and Microdata are small compared to the advantages of piggybacking on the technology investment supporting website creation.
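For the travel guide case, Microdata markup in a content document might look something like the snippet in this Python sketch. Everything here is an assumption made for illustration: the sentence is invented, the schema.org Place type is my choice of vocabulary, and the ItemPropCollector class is a deliberately naive reader that only handles itemprop elements without nested tags.

```python
from html.parser import HTMLParser

# Invented travel-guide sentence with Microdata attributes (illustrative only).
TRAVEL_SNIPPET = """
<p itemscope itemtype="http://schema.org/Place">
  Start the morning at <span itemprop="name">Jackson Square</span>.
</p>
"""


class ItemPropCollector(HTMLParser):
    """Collect (itemprop, text) pairs from elements that carry an itemprop."""

    def __init__(self):
        super().__init__()
        self.properties = []
        self._open_prop = None  # itemprop currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        itemprop = dict(attrs).get("itemprop")
        if itemprop:
            self._open_prop = itemprop
            self._text = []

    def handle_data(self, data):
        if self._open_prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Naive: assumes the itemprop element contains no nested tags.
        if self._open_prop is not None:
            self.properties.append((self._open_prop, "".join(self._text).strip()))
            self._open_prop = None


collector = ItemPropCollector()
collector.feed(TRAVEL_SNIPPET)
collector.close()
places = collector.properties
```

The same statement could be written in RDFa with different attribute names, which is why I think the choice between the two matters less than getting one of them into the spec.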
According to members of the EPUB working group, it is expected that a dot release will follow relatively quickly behind EPUB 3.0. It seems to me that picking a semantic markup technology for content documents should now not be so hard. If you work for a publishing company that has ever mentioned semantic markup in a product plan, you should probably be making sure that the EPUB working group is aware of your needs. If you are a librarian who can imagine the possibilities of a semantically enriched EPUB collection, you should similarly be making your concerns known.
Although the EPUB working group includes representatives from tools vendors that might conceivably benefit from the adoption of EPUB-only constructs, the group’s track record for adopting wider web standards has been very encouraging. By adopting HTML5 as a stack component, the group has ensured that cheap or free tools to produce and author EPUB 3 content will be readily available.
Once semantic enrichment of ebooks becomes routine, libraries will play a vital role in their use. Libraries provide a copyright-friendly DRM-free community commons in which users can access and build on the information contained in licensed content. (Of course, I see “unglued” books as playing an equally important role in the library commons.)
The EPUB metadata glass is half full, and there’s more wine in the bottle!
Note: This is one thing I’ll be talking about on Saturday at the American Library Association meeting in New Orleans. (The program is somewhat inaccurate; the program will end at 10:30 AM at the latest.) Ross Singer from Talis will lead off with an overview of semantic web technologies in libraries; I’ll follow with discussions of RDFa, the Facebook “Like” button and, of course, EPUB.