In a series of blog posts yesterday whose tone can only be described as “gleeful,” the Authors Guild has been showing that specific books aren’t orphans. So far, they’ve found copyright owners or literary agents for J.R. Salamanca’s The Lost Country, Albert Bandura’s Adolescent Aggression, and James Gould Cozzens’s Confusion. They didn’t track down Walter Lippmann’s The Communist World and Ours, but it appears that someone else did. The legwork involved wasn’t particularly intensive: some Google searches, some queries of standard copyright-related databases, and some phone calls.

This would be a dog-bites-man story, except for the fact that all of these books were on HathiTrust’s list of orphan works candidates. Oops. All of these books had gone through HathiTrust’s workflow, which was supposed to carry out “due diligence” to determine whether these works were likely to be orphans.

Once is a mistake, twice is bad luck, and three times is a sign of a broken process. The Authors Guild’s experiment demonstrates that HathiTrust’s orphan-tagging workflow cannot be relied on to identify genuinely orphan works with enough confidence to be usable. Out of 166 books originally on the list, at least four have been identified as non-orphans. A false positive rate of at least 2.4% isn’t going to be acceptable.
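For anyone who wants to check the arithmetic, here is a minimal sketch. The function name is mine, not anything HathiTrust or the Authors Guild uses, and the result is only a floor: it counts the four books identified so far and assumes nothing about the other 162.

```python
def false_positive_floor(known_non_orphans: int, candidates: int) -> float:
    """Lower bound on the share of listed candidates that are not orphans.

    Hypothetical helper for illustration: only books already shown to have
    a locatable rights holder are counted, so the true share can be higher.
    """
    if candidates <= 0:
        raise ValueError("candidates must be positive")
    return known_non_orphans / candidates


# Figures from the post: 4 known non-orphans out of 166 listed candidates.
floor = false_positive_floor(4, 166)
print(f"{floor:.1%}")  # prints 2.4%
```

Every additional rights holder who turns up pushes the number higher; nothing can push it lower.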

The workflow itself isn’t described in much detail, despite HathiTrust’s promise to “post as much of the project’s internal documentation as appropriate on this page.” It calls for:

  • A check that the book is not available on Amazon or Bookfinder.
  • A check that the author isn’t on the “live list.”
  • “Look for copyright holder contact information.”
  • “Attempt email contact.”
  • “Attempt phone contact.”

Whatever those last three steps comprise, they aren’t working. Whatever databases HathiTrust is checking for contact information aren’t sufficient.

On Twitter, Justin Grimes referred to these findings as “The ‘one example’ rule for invalidating arguments.” It’s true that these are individual books, not necessarily representative of the broader corpus of books scanned by Google and held by HathiTrust libraries. But this was also a sample chosen by HathiTrust itself. This was the libraries’ chance to put their best foot forward, to show that their process could be trusted, to show that there are real orphans out there. The results were not reassuring.

Legally, there are reasons why these non-orphans may not matter much in this case. Paul Aiken, Executive Director of the Authors Guild, has said that the lawsuit is primarily about the large-scale digitization (millions of books), not the much smaller Orphan Works Project (hundreds). The Authors Guild may have a hard time making legal claims specifically about the Project, for procedural reasons I’ll get into in future posts. Still, these discoveries are, as Eric Hellman said in a comment, “Major egg on the elephant’s face!”

And, looking to the broader picture, these revelations will discredit other efforts to make genuine orphan works more accessible. No one will ever be able to make the orphan works argument again without opponents bringing up the HathiTrust orphans that weren’t. Copyright owners will always regard such efforts with suspicion, as a pretext just for distributing the books, copyright be damned. And the idea of a “diligent search” sounds a lot less reassuring now that HathiTrust’s initial searches have been shown to be ineffective in multiple cases. The title of this post may be an exaggeration, but not by much.

I hope to update this post to deal with any responses from HathiTrust and the libraries, and with further developments.

Reprinted under Creative Commons Attribution License from The Laboratorium