Why You (A Humanist) Should Care About Optical Character Recognition
Yesterday David Smith and I announced the release of “A Research Agenda for Historical and Multilingual Optical Character Recognition,” a report funded by the Andrew W. Mellon Foundation and conducted in consultation with the NEH’s Office of Digital Humanities and the Library of Congress. These groups realized that many of the digital humanities projects they support struggle with similar issues related to the quality of their source text data. They asked us to survey the current state of OCR for historical and multilingual documents, and to recommend concrete steps that researchers, implementors, and funders could take to make progress in creating more reliable and representative OCR data over the next five to ten years.
We were fortunate to work with a generous, brilliant community of scholars from the humanities, libraries, computer science, and industry to research this topic and craft a series of recommendations we hope could indeed move this area forward dramatically. If there’s an idealism to some of our recommendations, that’s because in the course of researching and writing this report we became convinced that concerted work on OCR, particularly in certain historical or language collections, could have dramatic results. Please read the report—it was a labor of love for us and so many who helped us, and we think there are lots of inspiring ideas from the community reflected there. The report came out yesterday and we’re already seeing the beginnings of a broader conversation that illuminates work we (perhaps inevitably) missed in our initial research. As I wrote on Twitter, I’ll be thrilled if the report facilitates needed but missed discussions between research communities, including those discussions we missed in the report itself.
This post doesn’t attempt to recap the entire report. Instead, here I want to address my humanities colleagues who might be uncertain why a report about something called optical character recognition should matter to them. In my 2017 Book History article, “‘Q i-jtb the Raven’: Taking Dirty OCR Seriously,” I argued that OCR constitutes an important intellectual subject for book historians, but here I make a more direct and practical argument that humanities scholars should be aware of and even participate in current research aimed at improving OCR algorithms. As we articulate in the report, the expertise of humanities scholars is desperately needed, in collaboration with computer scientists, if we want to make substantive progress improving OCR for the kinds of materials we most care about. Below, I answer a series of questions I hope will explain (in non-technical language) what OCR is, why it should matter to humanists, and what we might contribute to its future.
What is OCR?
When someone scans a page from a book, newspaper, or other textual source, the computer does not initially recognize that the image includes text. The scan is essentially a picture of the page: an image file such as those created when you take a snapshot on your phone’s camera. A human looking at that picture would see readable text in the image, but would not be able to use their computer’s search functions to find those same words. A researcher hoping to do more complex forms of data mining would be out of luck, at least if they hoped to do something with the text of the page—count word frequencies, build a topic model, &c. &c.—as opposed to its visual elements.
Optical character recognition (OCR) software is a type of artificial intelligence software designed to mimic the functions of the human eye and brain and discern which marks within an image represent letterforms or other markers of written language. OCR scans an image for semantically-meaningful material and transcribes what language it finds into text data. Typically OCR is used in situations where manual transcription would be too costly or time consuming—a subjective designation to be sure—such as when a large corpus has been scanned. Relative to manual transcription, OCR is a quick and affordable means for creating computable text data from a large collection of images.
Who Uses OCR?
You, probably, perhaps without realizing. If you ever search in a large-scale digital archive such as Google Books, HathiTrust, EEBO, ECCO, or a historical newspaper archive such as Chronicling America, then your searches are operating on OCR-derived text data. That text data is often hidden in various ways by the interfaces of digital archives (a practice I question strongly), but nonetheless if you use these kinds of archives (as I would argue most humanities scholars do, at least occasionally) then you rely on the output of OCR algorithms. In other words, OCR is a fundamental element of our digital research infrastructure that’s also easily overlooked because we tend to focus on the images of historical pages rather than the underlying text data that helped us find them. There are many historical texts that have been hand-transcribed and encoded through a standard such as TEI, including digital editions like those of the Women Writers Project, but a substantial percentage (in fact a large majority, by sheer numbers of pages) of digitized humanities texts are constituted through OCR. Put simply, even if you’ve never heard of OCR it may nonetheless be important or even essential to your research and teaching.
Isn’t That a Good Thing?
Mostly, yes, I would argue. OCR has enabled a proliferation of research across media that were previously much trickier to delve into. The use of historical newspapers, to cite an example close to my heart, has exploded in scholarship of the past two decades, largely due to the kinds of access enabled by searchable text data, which can make vast collections navigable, tractable. The access provided by search is not an unmitigated good (see Ian Milligan’s article “Illusionary Order” for a good primer on its potential dangers) but it is, to my mind, an overall good.
But it is also true that OCR can produce, in a beautiful term I’ve borrowed from computer science, “errorful” data. The admittedly simplified version is this: OCR was largely developed to process typewritten, English-language, mid-twentieth-century business documents. With that kind of input, OCR is remarkably reliable, transcribing with accuracy in the upper 90 percent range. Turn an OCR engine toward historical documents, however, with distinct typography, complex layouts, torn pages, smeared ink, and any number of features those OCR engines were not trained to discern, and the reliability of OCR transcription declines precipitously. A famous example of this is the German Fraktur typeface (what we sometimes call blackletter in English), which was used in most German books and newspapers through the early 20th century and which OCR engines have not historically been well trained to recognize or transcribe. For multilingual documents and languages other than English, particularly those that do not use Latin script, the problem becomes even more acute and error rates significantly higher.
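For readers who want to see what an accuracy figure like this means in practice, one common way to quantify OCR quality is the character error rate: the number of single-character edits needed to turn the OCR output into a correct transcription, divided by the length of the correct transcription. The short Python sketch below is only an illustration under that assumption. It is not the measure used by any particular archive or OCR engine, and the function names are my own. The sample strings are the “quoth the raven” example discussed in the next paragraph.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the fewest single-character insertions,
    deletions, or substitutions that turn `reference` into `hypothesis`."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a character of the reference
                current[j - 1] + 1,      # add a character from the hypothesis
                previous[j - 1] + cost,  # keep a match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edits needed, as a share of the correct transcription's length."""
    if not reference:
        raise ValueError("the reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    truth = "quoth the raven"        # what the page actually says
    ocr_output = "q i-jtb the raven"  # what the OCR engine produced
    print(f"character error rate: {character_error_rate(truth, ocr_output):.2f}")
```

Here five edits are needed against fifteen correct characters, so about a third of the phrase is wrong at the character level, even though two of its three words came through intact. Accuracy is often reported as one minus this rate, though the rate itself can exceed 100 percent when the OCR output contains a great deal of spurious text.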
What does this mean for humanities research? Well, it could mean that when you search for a given word or phrase in a large-scale archive, you miss potential matches because instead of transcribing “quoth the raven” the OCR engine transcribed “q i-jtb the raven.” In that example, drawn from my Book History article, the error resulted from uneven inking in the original newspaper that made the “u” and “o” letterforms look just different enough from those letters in its training data that the OCR engine misrecognized them. For some tasks, even fairly high error rates may be acceptable: a word that’s truly a keyword in a particular document is likely to be repeated, and the more often a word is repeated the more likely it is that some instances will be transcribed correctly. One major problem we outline in our report is that we do not have a clear sense of what thresholds are acceptable for various tasks. How good must OCR be for us to rely on the results of a keyword search? Of a topic model? We have largely relied on instinct in making these sorts of judgments, but we might be able to do better.
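The intuition about repeated keywords can be made concrete with a little arithmetic. The Python sketch below assumes, purely for illustration, that each occurrence of a word is transcribed correctly with the same probability and independently of the others. That is a strong simplification, since a damaged or badly inked page tends to garble many words at once. The 60 percent accuracy figure and the function name are hypothetical, not numbers or methods from the report.

```python
def chance_of_at_least_one_hit(word_accuracy: float, occurrences: int) -> float:
    """Probability that at least one of `occurrences` copies of a word is
    transcribed correctly, if each copy is correct with probability
    `word_accuracy` and errors are independent (a simplifying assumption)."""
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    if occurrences < 0:
        raise ValueError("occurrences must not be negative")
    # Every copy fails with probability (1 - word_accuracy); the search
    # misses the document only if all copies fail.
    return 1.0 - (1.0 - word_accuracy) ** occurrences


if __name__ == "__main__":
    hypothetical_accuracy = 0.6  # assumed for illustration only
    for count in (1, 2, 5, 10):
        chance = chance_of_at_least_one_hit(hypothetical_accuracy, count)
        print(f"{count:2d} occurrence(s): {chance:.3f} chance a search finds the document")
```

Under these assumptions a word that appears only once is found just as often as the OCR is accurate, while a word repeated several times is found far more reliably. What this sketch cannot answer is the question the report raises: what accuracy is good enough for a given task.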
So It’s Useless Then?
No. For me, it is so essential that our conclusions about OCR not end with throwing up our hands in despair. My own scholarly interest in OCR largely began from a creeping frustration with how the technology was invoked in humanities presentations. It’s become a conference trope to display a slide showing the terrible OCR underlying a given archive as a kind of public shrug, “what can you do, amiright?” Everyone laughs and moves on without thinking all that hard about the OCR or the assumptions baked into that moment of collective apathy. Rarely does the presenter engage with precisely why and how that particular OCR is useless: what are you seeking to learn from it? What level of reliability would you need to learn what you want to learn? What tasks might be possible with existing OCR?
It’s even less common for the presenter to consider what might be done to improve the situation: that’s the task, such presentations seem to concede, for computer scientists and corporations, and they don’t care about our concerns. I don’t want to claim that every corporation managing an OCR archive cares deeply about the concerns of humanities researchers, particularly those working in areas or fields they might consider niche. But if researching and writing this report taught me anything, it’s that there is significantly more opportunity for interdisciplinary cooperation around these issues than we typically credit, and more goodwill about our concerns and areas of interest than we imagine in our moments of conference cynicism. This work has made me hopeful, not just that OCR can be improved but that humanities scholars can be central contributors to its future.
In the recommendations of the report, you will see that the most pressing research in the field will require extensive development of training corpora: accurate transcriptions of materials in a given domain that can be used to train an OCR system to recognize the textually meaningful elements in images drawn from that domain. In other words, the most pressing OCR research will require the expertise of humanities domain specialists: book and textual historians, scholars of languages and writing systems, scholars of particular genres or historical periods. In response to the report, Michael Gossett wrote on Twitter, “One sees an entire future of collaborative scholarship here.” I agree strongly. There is enormous potential in OCR research for meaningful, important collaboration across a range of fields, but particularly across the humanities, libraries, and computer science. For me this is the central reason humanists should care about OCR: not to bemoan its current state but to help imagine its future.