XML and the many facets of publishing: Why publishers DO need XML
January 16, 2010 | 3:24 am
By Jean Kaplansky
One minute I was reading Roger Sperberg’s “Why Do Publisher’s Need XML?” post, and the next time I looked, I was typing like a mad woman.
“Wait!” I thought to myself. “He’s only talking about one facet of publishing, but making it read like this is true for all of publishing! What about all of the other acts of publishing that do need XML?”
Huh? Publishing has facets? Well, yeah. I’ve worked in multiple facets of the publishing community: University Press, Journals, Technical Publications, Textbooks for K-12, Higher Education, and Continuing Education, as well as Enterprise Publishing at some very large companies.
I agree completely with Paul Topping’s comments on the Sperberg article. Accessibility and Process-ability are two huge reasons to not write off XML as a fad or lost cause in the publishing community.
The relevance of markup is contextual to the line of business doing the publishing, and to what the business intends to do with the content in the future.
One thing is certain: if you cannot programmatically get to a piece of content, whether by metadata or by walking a markup tree, the potential to reuse that piece of content or to provide any sort of value-added processing is greatly diminished.
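To make "walking a markup tree" concrete, here is a minimal sketch in Python using the standard library's ElementTree module. The element and attribute names (`book`, `chapter`, `sidebar`, `type`) are invented for illustration and do not come from any particular publishing schema; the point is only that once content is tagged by what it is, a few lines of code can pull out every piece of a given kind.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these element and attribute names are assumptions
# made up for this example, not a real publishing DTD or schema.
SAMPLE = """
<book>
  <chapter id="c1">
    <title>Getting Started</title>
    <para>Welcome to the book.</para>
    <sidebar type="tip"><para>Save your work often.</para></sidebar>
  </chapter>
  <chapter id="c2">
    <title>Going Further</title>
    <sidebar type="warning"><para>Back up before upgrading.</para></sidebar>
    <sidebar type="tip"><para>Use keyboard shortcuts.</para></sidebar>
  </chapter>
</book>
""".strip()


def find_sidebars(xml_text, sidebar_type):
    """Walk the tree and return (chapter id, text) for each matching sidebar."""
    root = ET.fromstring(xml_text)
    results = []
    for chapter in root.iter("chapter"):
        for sidebar in chapter.iter("sidebar"):
            # The "type" attribute acts as metadata on the content.
            if sidebar.get("type") == sidebar_type:
                text = "".join(sidebar.itertext()).strip()
                results.append((chapter.get("id"), text))
    return results


# Collect every tip in the book, e.g. to reuse them in a separate product.
tips = find_sidebars(SAMPLE, "tip")
for chapter_id, text in tips:
    print(chapter_id, text)
```

The same tips buried in an untagged word-processor file would have to be found by eye, which is exactly the diminished potential for reuse described above.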
In addition to Paul’s comments regarding accessibility and calculations, I would like to point out that other publishing processes exist that regularly benefit from marked-up content. One example that comes to mind is a pivot table that is not based on numerical data, like the kind of data that is often collected for clinical data reports in large clinical trials.
Another example comes directly from learning content: test questions and answers. Not only can you deliver this content more intelligently with markup, but the IMS QTI standard also defines markup to handle automated scoring of test questions, as well as logical delivery of further test questions based on real-time information submitted by users.
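As a rough illustration of why marked-up questions can be scored automatically, consider the sketch below. It is Python, and the `quiz`, `question`, and `choice` markup is a deliberately simplified, hypothetical format; it is NOT the IMS QTI schema, which is considerably richer. The assumption is simply that each question carries its correct answer as data that a program can read.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified quiz markup invented for this example.
# It is NOT IMS QTI; real QTI items are structured differently.
QUIZ = """
<quiz>
  <question id="q1" answer="b">
    <prompt>What is 2 + 2?</prompt>
    <choice id="a">3</choice>
    <choice id="b">4</choice>
  </question>
  <question id="q2" answer="a">
    <prompt>What is 3 x 3?</prompt>
    <choice id="a">9</choice>
    <choice id="b">6</choice>
  </question>
</quiz>
""".strip()


def score(quiz_xml, responses):
    """Return (number correct, number of questions) for a dict of responses."""
    root = ET.fromstring(quiz_xml)
    correct = 0
    total = 0
    for question in root.findall("question"):
        total += 1
        # An unanswered question yields None, which never matches the key.
        if responses.get(question.get("id")) == question.get("answer"):
            correct += 1
    return correct, total


# One learner's submitted choices, keyed by question id.
result = score(QUIZ, {"q1": "b", "q2": "b"})
print(result)
```

Because the score is computed from the markup, the same information could also drive which question is delivered next, which is the kind of adaptive delivery mentioned above.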
But not everyone needs or wants to consider going to this level of detail if the financial side of things does not make sense, or the need to reuse doesn’t exist (which I find hard to believe these days with the number of different ways to publish a single book), or the need for a process adding additional value to content is just not there.
The business process for publishers who work with narrative content (that is, fiction and other “soft” side subjects; basically, anything that is not related to science or math, or content such as certification test preparation guides) does not currently require identifying content in minute detail to get an acceptable end result: a hardbound, paperback, and/or e-book (which, just to reiterate, are products of the now, but not necessarily the sum total of products that will be available in the future).
The near-future ROI for soft-side publishing does not exceed the costs of converting narrative content from authored originals to XML, cleaning up the content so that it can be consistently processed, and building the automated workflows required to maintain and process the content in the future.
Unless… Unless you want to do something innovative like automatic composition of content that aggregates relevant content from a single source based on audience (think of the original version of Marley & Me vs. the children’s book Bad Dog, Marley!, or a student/instructor version of an English literature anthology), or the ability to single source products with dramatically different presentations (think of Cesar’s Way Deck: 50 Tips for Training and Understanding Your Dog and Cesar’s Way: The Natural, Everyday Guide to Understanding and Correcting Common Dog Problems; the first is a deck of cards, the second, a bound book. Much of the material could have come from the same source.)
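Here is one way the student/instructor anthology idea might look in practice. This is a sketch in Python under stated assumptions: the `anthology` markup and the `audience` attribute are hypothetical names chosen for the example, not the way any of the books mentioned above were actually produced. A single source file holds everything, and each edition is built by dropping the elements meant for a different audience.

```python
import xml.etree.ElementTree as ET

# Hypothetical single-source file. The element names and the "audience"
# attribute are assumptions made for this illustration.
ANTHOLOGY = """
<anthology>
  <poem><title>Ozymandias</title></poem>
  <notes audience="instructor">Discussion prompts for class.</notes>
  <exercise audience="student">Summarize the poem in one sentence.</exercise>
  <answer audience="instructor">A sample summary.</answer>
</anthology>
""".strip()


def build_edition(xml_text, audience):
    """Return a tree with elements targeted at other audiences removed."""
    root = ET.fromstring(xml_text)
    # Snapshot the elements first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            target = child.get("audience")
            # Untagged content is shared; tagged content must match.
            if target is not None and target != audience:
                parent.remove(child)
    return root


student_tags = [child.tag for child in build_edition(ANTHOLOGY, "student")]
instructor_tags = [child.tag for child in build_edition(ANTHOLOGY, "instructor")]
print(student_tags)
print(instructor_tags)
```

Content with no `audience` attribute is treated as shared and appears in every edition, which is a design choice of this sketch rather than a rule from any standard.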
But narrative, soft-side content is only one facet of the publishing community.
Educational publishers are required, by law, to create accessible versions of their products. What has become the de facto standard for accessibility in the U.S.? The DAISY standard, which, you guessed it, has more than one part based on XML.
The FDA has its own standards for the format in which it wants to receive New Drug Application submissions from pharmaceutical companies. The officially approved standards? Yep. XML.
As a good friend and former co-worker once said, “When the federal government stands between your product and the market, you tend to do what they say…”
I could provide additional examples from other industries, but I think you guys get the point. XML is here, and it is useful in all sorts of publishing scenarios – if the business has the requirements to go there. So instead of asking “Why XML?” lots of people are out there asking “Why not XML?”
It’s my experience that people don’t really think about what they can do with XML content until they have some to play with and experiment upon. Experimentation can lead to innovation. Innovation can lead to new ways to realize ROI.
Dismissing a technology because it’s not useful in one facet of publishing, or because the content is authored in Word (or on a typewriter – some professors are really old school), or because it can be expensive, does not mean that there isn’t a community out there that can greatly benefit from the technology right now, as is the case with books converted to the DAISY standard to provide accessibility to information. There’s also an innovative community out there that is thinking about how they will come up with new and different approaches to looking at, analyzing and learning from content, and how this innovation can be turned into published products.
By the way: The normative narrative content model in ePub can be either XHTML or the DAISY DTBook standard. The metadata and processing parts of ePub are either XML-based or XML-related.
Bio: Jean Kaplansky is an avid P- and E- book reader who just happens to also have a history of working in publishing, with publishing tools, or for someone who wants to get something published. Jean is currently a Sr. Consultant for the PTC Arbortext Business Unit. The opinions expressed in this post should in no way be attributed to Jean’s employer.



Comments:
An excellent article, but as a Mac user, I’m in a bind. Macs may be state-of-the-art when it comes to more recent media like digital video, but OS X is between twenty and thirty years out of date in how it deals with text.
Look at how text applications on Macs handle text. The ruler bar and formatting menus are like the WordStar that came with my Kaypro in the early 1980s. Look at the castrated version of RTF that OS X understands. It’s ignorant of named character and paragraph styles, something Word 5.1 had in its RTF in the early 1990s. Named styles, the ability to tag something as a chapter heading independent of any formatting attached, is a rudimentary form of XML. It should have been in OS X as early as 10.3. It’s not even in 10.6.
Mac developers would be delighted to add modern text-handling features to their applications. The developer of Scrivener, a marvelous tool for creating book drafts, has repeatedly stated that he would love to include named styles. It would make it much, much easier to export text to other applications for formatting. But like many other Mac developers, he doesn’t have the time to create features that Apple hasn’t included in its developer tools.
The result is bizarre. More and more Mac applications (including iWork) store data in XML-like formats. But users working with text can’t get at that hidden XML in some user-friendly way.
In short, there are a lot of benefits to formatting books in XML. But most of us aren’t going to take advantage of them if we have to use a specialized XML editor. Tagging text by what it is, not merely what it looks like, has to be built into the core of the operating system. It has to become a well-designed and easy-to-use part of virtually every text application.
A good article, thank you for clearing up some of the questions I had after reading Roger’s article.
I do think though that saying ‘FDA has its own standard, also XML’ is rather like saying ‘Apple has its own programs developed using Objective-C; Microsoft has its own programs developed using C-sharp.’ These different XML variants are not the same, cannot be handled using the same tools in many cases (though the top-notch XML editors can import any DTD or schema you can access) and thus, can hardly be considered all the same, the way vanilla HTML-4.01 is all the same.
I see that there are plug-ins and translators to take in MS-Word or OpenOffice.org files, and output Daisy. But I can’t see these working very well outside of the simple narrative fiction or non-mathematical nonfiction texts; else how to mark a series of paragraphs in your .doc file as ‘Q&A XML form’? or ‘XML Pivot table’? That’s where it starts to get tricky, it seems to me, until all academic publishers agree upon not only a standard XML markup language, but also one or more standard tools, or maybe a Word or OpenOffice.org template. Even then, all the authors would have to use these templates, and understand how to use them.
And how many authors even write in styles, still, to this day, rather than just prettying up their articles and books according to what ‘looks right’ on their own PCs?
Also, your examples of test questions, and the like, must all be coded as such. Doesn’t this require that the text authors have programs capable of authoring Q&A, and other forms? Or are you thinking that publishers will take the Microsoft Word files from authors and pay staff to translate into XML of whatever flavor the publisher uses? And then how much would this cost, how much would it add to the cost of processing the text? And then, what sort of calculations must the publisher make to determine whether these costs would be worthwhile?
@pond – You have many questions of great relevance! Please let the Teleread.org editors know if you would like me to take a shot at answering your questions via email – they can pass your addy on to me.
I was going to start answering here, but the length and details of the answers are probably more than the average telereader wants to know. Either that, or I’m going to have to start a series…
Thanks!