I had a thought today after reading of Barnes & Noble’s new iPad app, which allows customers to loan/borrow purchased books. I haven’t heard whether the annotations go along with the lending, but it strikes me that academics needing to cite locations in ebooks and those interested in annotation technology both need a way to refer to locations within electronic documents.

The problem for academics looking for citation conventions is that we’re all used to page numbers, which give us a way to identify a location manually by flipping through pages (or by hunting for a letter or other archival document within a file folder). Do we really need that sort of human-navigated location specificity? If we can search for text inside a document, we certainly don’t. But the reference format is needed, and I think there would be an easy way to create another convention that would serve both academic purposes and ereader technology:

371324/3/1346372044/139823463

What’s that, you ask?

location/file number (within envelope, 1 if no envelope)/file size/file checksum (using some conventional algorithm)

Given a particular edition (i.e., an uncorrupted file in a recognized format with a known file size and checksum), this would give a precise location. With a different edition, the approximate location within the file and the first part of the quoted passage should be sufficient for finding the passage quickly. Let’s call the four numbers a brief spot location reference and the numbers plus the quotation the spot location reference. What if you’re referring to a passage?

371324-375241/3/1346372044/139823463
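
As a rough illustration, here is a minimal Python sketch of building and parsing such a reference. It assumes the location is a byte offset into the file and uses CRC-32 (via zlib) as a stand-in for the "conventional algorithm"; the post specifies neither, and the function names are hypothetical.

```python
import zlib
from typing import Optional


def brief_reference(data: bytes, start: int, end: Optional[int] = None,
                    file_number: int = 1) -> str:
    """Build location/file number/file size/checksum for one file of an edition.

    Assumptions (not from the post): offsets are byte offsets, and the
    checksum is CRC-32.
    """
    if not 0 <= start < len(data):
        raise ValueError("start offset lies outside the file")
    if end is not None and not start < end <= len(data):
        raise ValueError("end offset must follow start and lie inside the file")
    location = str(start) if end is None else f"{start}-{end}"
    checksum = zlib.crc32(data)
    return f"{location}/{file_number}/{len(data)}/{checksum}"


def parse_brief_reference(ref: str) -> dict:
    """Split a brief reference back into its four numeric parts."""
    location, file_number, size, checksum = ref.split("/")
    start, _, end = location.partition("-")
    return {
        "start": int(start),
        "end": int(end) if end else None,  # None for a single spot
        "file_number": int(file_number),
        "size": int(size),
        "checksum": int(checksum),
    }
```

A reader could compare the size and checksum against the file in hand to decide whether the location can be trusted exactly or only as an approximation.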

I know I’ll be torn limb-from-limb by my fellow historians, until I point out the following:

371324-375241/3/1346372044/139823463/
When Patto/d her hat./
This passage shows the protagonist’s commitment to blah blah blah yadda yadda yadda./
Sherman Dorn/20100527080312-0500

That’s the range reference, the first and last ten characters of the (theoretical) passage, the annotation text, the annotation author, and the timestamp of the annotation. And there, ladies and gentlemen, is a format for annotating electronic materials. It does not require changing the EPUB format, just tracking a file of annotations and having ereader software that can put each annotation in the right place (using the start and end of the passage for disambiguation). The annotations can be shared, accumulated, analyzed, etc.
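
To make the idea concrete, here is a minimal Python sketch of how ereader software might read one annotation record and find the passage again in a different edition. It is only a sketch under stated assumptions: fields are separated by "/" and never contain a "/" themselves, the stored location is a character offset used as a hint, and the function names are hypothetical.

```python
from datetime import datetime


def parse_annotation(record: str) -> dict:
    """Parse one annotation record in the slash-separated layout above.

    Assumption: no field (including the annotation text) contains "/".
    """
    fields = "".join(record.splitlines()).split("/")
    if len(fields) != 9:
        raise ValueError("expected 9 slash-separated fields")
    location, file_number, size, checksum, head, tail, note, author, stamp = fields
    start, _, end = location.partition("-")
    return {
        "start": int(start),
        "end": int(end) if end else None,
        "file_number": int(file_number),
        "size": int(size),
        "checksum": int(checksum),
        "head": head,   # first ten characters of the passage
        "tail": tail,   # last ten characters of the passage
        "note": note,
        "author": author,
        "timestamp": datetime.strptime(stamp, "%Y%m%d%H%M%S%z"),
    }


def locate(text: str, head: str, tail: str, approx_start: int = 0):
    """Return (start, end) of the passage in text, or None if not found.

    When the opening characters occur more than once, the occurrence
    closest to approx_start is tried first.
    """
    starts = []
    pos = text.find(head)
    while pos != -1:
        starts.append(pos)
        pos = text.find(head, pos + 1)
    for start in sorted(starts, key=lambda s: abs(s - approx_start)):
        tail_at = text.find(tail, start)
        if tail_at != -1:
            return start, tail_at + len(tail)
    return None
```

If the file size and checksum match the edition in hand, the stored offsets can be used directly; the search is the fallback for a different edition.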

There may be important reasons why this wouldn’t work, but I can’t think of them at the moment.

Editor’s Note: This is reprinted, with permission, from Sherman Dorn’s blog. PB

9 COMMENTS

  1. I have been working on a similar problem from a different angle. Perhaps the two might be combined?

    I start off with a unique ID for each edition (have a look at http://wwui.org/cgi-bin/wwui.cgi, which generates unique IDs based on IP, date, and time, in an adapted base32).

    I then generate the ID plus an element sequence for each major item (tables, paragraphs, etc.).

    Now when some paragraphs are copied they come over with their reference information.

    I thought this would be useful for academic referencing, plus things like email quoting so that a simple search would place the quote back into its context.

    There is no reason why a relative reference system (like xpath) and an absolute one (para-reference) could not be used in different contexts.

    Thanks from a fellow historical researcher..

  2. Citations also need to be human readable, easy to remember and not overly long. Suppose that a print page has on average 40 lines. Maybe each page of 40 lines can be divided into 4 quadrants (A, B, C, D) to denote the relative position on the page (A=top, B=middle-top, C=middle-lower, D=bottom), so a 500 page book has only 2000 contexts. There could be an optional number after the quadrant designation to refer to the word number in that quadrant.

    So a typical reference would be 200-A (the top quadrant of the virtual page 200) or 58-B.21 (page 58, middle-top quadrant, 21st word).

    Obviously, this isn’t perfect. It doesn’t take into account how images and tables would figure into the page definition. Also, you’d have to decide what constitutes 40 lines of text (because if the font size is increased or decreased, obviously the number of lines on a page will vary). It needs to be easy for an ebook reader to calculate, plus (ideally) there should be built-in guides whose visibility can be toggled, which allow the reader to see where he or she is. Also, there would need to be a manual override whereby a publisher could hardcode the page breaks manually (I know HTML5 allows this).

    Anyway, I would love there to be a standard for this, and in fact, now that we have so many ebook readers, I have to wonder why a standard for annotations/citations hasn’t been developed. I expect the textbook market is clamoring for it.

  4. Interesting.

    @Robert Nagle
    The problem, of course, is that the “page” is a unit that exists only in PDF files, at best. In EPUB or MOBI/PRC and so on, a change of spacing, font size or other parameters throws page-based citation into a total muddle.

    So we are on the way towards citation that is machine-readable rather than readable by humans. I don’t see myself typing the citation mentioned above by hand, so there would have to be a mode of automated transfer from the document I want to cite to the document I need to put the citation in. Now how to deal with this? ;o)

  5. Chapter and verse (or section & paragraph number, etc.) is a time-tested and easily workable solution. The other easy-to-implement approach is percentage through the text. Either one works, can be used today, and serves the purpose. In a world with full-text search, I’m not sure I see the need for a more complex system than either of these.

    Regards,
    Jack Tingle

  6. Jack Tingle

    It is easy enough to make the citation system suit the writing. That is not the problem; the problem is linking internal IDs to an edition ID in a coherent manner.

    Hence id=“XXXXXXXXXXXXX(edition ID).I.344” can become title=“l.344” or “Act I line 344”.

    Let the writing structure and its history dictate the actual implementation, but if I use line 344, I should be able to get “Shakespeare, W., Hamlet” in the citation even if the fragment was posted to me by email from a friend.

    For that to work, any identifiable fragment should have a meaningful identity embedded in its elements. In other words, we really need a system that includes a register of editions.
