Why You (A Humanist) Should Care About Optical Character Recognition
Yesterday David Smith and I announced the release of "A Research Agenda for Historical and Multilingual Optical Character Recognition," a report funded by the Andrew W. Mellon Foundation and conducted in consultation with the NEH's Office of Digital Humanities and the Library of Congress. These groups realized that many of the digital humanities projects they support struggle with similar issues related to the quality of their source text data. They asked us to survey the current state of OCR for historical and multilingual documents, and to recommend concrete steps that researchers, implementers, and funders could take to make progress in creating more reliable and representative OCR data over the next five to ten years.
We were fortunate to work with a generous, brilliant community of scholars from the humanities, libraries, computer science, and industry to research this topic and craft a series of recommendations we hope could indeed move this area forward dramatically. If there's an idealism to some of our recommendations, that's because in the course of researching and writing this report we became convinced that concerted work on OCR, particularly in certain historical or language collections, could have dramatic results. Please read the report: it was a labor of love for us and so many who helped us, and we think there are lots of inspiring ideas from the community reflected there. The report came out yesterday and we're already seeing the beginnings of a broader conversation that illuminates work we (perhaps inevitably) missed in our initial research. As I wrote on Twitter, I'll be thrilled if the report facilitates needed discussions between research communities, including those we missed in the report itself.
This post doesn't attempt to recap the entire report. Instead, I want to address my humanities colleagues who might be uncertain why a report about something called optical character recognition should matter to them. In my 2017 Book History article, "'Q i-jtb the Raven': Taking Dirty OCR Seriously," I argued that OCR constitutes an important intellectual subject for book historians, but here I make a more direct and practical argument: humanities scholars should be aware of, and even participate in, current research aimed at improving OCR algorithms. As we articulate in the report, the expertise of humanities scholars is desperately needed, in collaboration with computer scientists, if we want to make substantive progress improving OCR for the kinds of materials we most care about. Below, I answer a series of questions that I hope will explain (in non-technical language) what OCR is, why it should matter to humanists, and what we might contribute to its future.
What is OCR?
When someone scans a page from a book, newspaper, or other textual source, the computer does not initially recognize that the image includes text. The scan is essentially a picture of the page: an image file such as those created when you take a snapshot with your phone's camera. A human looking at that picture would see readable text in the image, but would not be able to use their computer's search functions to find those same words. A researcher hoping to do more complex forms of data mining would be out of luck, at least if they hoped to do something with the text of the page (count word frequencies, build a topic model, &c. &c.) as opposed to its visual elements.
Optical character recognition (OCR) software is a form of artificial intelligence designed to mimic the functions of the human eye and brain and discern which marks within an image represent letterforms or other markers of written language. OCR scans an image for semantically meaningful material and transcribes whatever language it finds into text data. Typically OCR is used in situations where manual transcription would be too costly or time consuming (a subjective designation, to be sure), such as when a large corpus has been scanned. Relative to manual transcription, OCR is a quick and affordable means of creating computable text data from a large collection of images.
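For readers who want to see what this looks like in practice, here is a minimal sketch of "running OCR" on a single scanned page, assuming the open-source Tesseract engine plus the pytesseract and Pillow Python libraries are installed; the file name is only an illustration, not anything from the report.

```python
# A minimal sketch: turn a page image into plain text with Tesseract.
# Assumes Tesseract, pytesseract, and Pillow are installed; the file
# name "newspaper_page.png" is hypothetical.
from PIL import Image
import pytesseract

page = Image.open("newspaper_page.png")              # the scan: just pixels, no text yet
text = pytesseract.image_to_string(page, lang="eng")  # transcribe the image into text data
print(text[:500])                                     # first few hundred recognized characters
```

The point of the sketch is simply that the input is an image and the output is searchable, computable text; everything interesting (and everything that can go wrong) happens inside that one transcription step.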
Who Uses OCR?
You, probably, perhaps without realizing it. If you ever search in a large-scale digital archive such as Google Books, HathiTrust, EEBO, ECCO, or a historical newspaper archive such as Chronicling America, then your searches are operating on OCR-derived text data. That text data is often hidden in various ways by the interfaces of digital archives (a practice I question strongly here), but nonetheless if you use these kinds of archives, as I would argue most humanities scholars do at least occasionally, then you rely on the output of OCR algorithms. In other words, OCR is a fundamental element of our digital research infrastructure that is also easily overlooked, because we tend to focus on the images of historical pages rather than the underlying text data that helped us find them. Some historical texts have been hand-transcribed and encoded in a standard such as TEI, as in digital editions like those of the Women Writers Project, but a substantial percentage (in fact a large majority, by sheer number of pages) of digitized humanities texts are constituted through OCR. Put simply, even if you've never heard of OCR, it may nonetheless be important or even essential to your research and teaching.
Isn't That a Good Thing?
Mostly, yes, I would argue. OCR has enabled a proliferation of research across media that were previously much trickier to delve into. The use of historical newspapers, to cite an example close to my heart, has exploded in scholarship of the past two decades, largely due to the kinds of access enabled by searchable text data, which can make vast collections navigable, tractable. The access provided by search is not an unmitigated good (see Ian Milligan's article "Illusionary Order" for a good primer on its potential dangers) but it is, to my mind, an overall good.
But it is also true that OCR can produce, in a beautiful term I've borrowed from computer science, "errorful" data. The admittedly simplified version is this: OCR was largely developed to process typewritten, English-language, mid-twentieth-century business documents. With that kind of input, OCR is remarkably reliable, transcribing with accuracy in the upper 90 percents. Turn an OCR engine toward historical documents, however, with their distinct typography, complex layouts, torn pages, smeared ink, and any number of other features those engines were not trained to discern, and the reliability of OCR transcription declines precipitously. A famous example is the German Fraktur typeface (what we sometimes call blackletter in English), which was used in most German books and newspapers through the early twentieth century and which OCR engines have not historically been well trained to recognize or transcribe. For multilingual documents and languages other than English, particularly those that do not use Latin script, the problems become even more acute and the error rates significantly higher.
What does this mean for humanities research? Well, it could mean that when you search for a given word or phrase in a large-scale archive, you miss potential matches because instead of transcribing "quoth the raven" the OCR engine transcribed "q i-jtb the raven." In that example, drawn from my Book History article, the error resulted from uneven inking in the original newspaper, which made the "u" and "o" letterforms look just different enough from those letters in its training data that the OCR engine misrecognized them. For some tasks, even fairly high error rates may be acceptable: a word that's truly a keyword in a particular document is likely to be repeated, and the more often a word is repeated the more likely it is that some instances will be transcribed correctly. One major problem we outline in our report is that we do not have a clear sense of what error thresholds are acceptable for various tasks. How good must OCR be for us to rely on the results of a keyword search? Of a topic model? We have largely relied on instinct in making these sorts of judgments, but we might be able to do better.
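As a rough illustration of how such thresholds could be made measurable rather than instinctive, here is a small sketch that computes a character error rate: the edit distance between an OCR transcription and a hand-made ground truth, divided by the length of the ground truth. The strings come from the "q i-jtb the raven" example above; the function is a plain edit-distance calculation, not a metric proposed in the report or tied to any particular OCR engine.

```python
# A rough sketch of measuring OCR quality as character error rate (CER):
# edits needed to turn the OCR output into the true text, divided by the
# length of the true text. Strings are from the "quoth the raven" example.
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (0 if chars match)
        prev = curr
    return prev[-1]

ground_truth = "quoth the raven"
ocr_output = "q i-jtb the raven"
cer = edit_distance(ocr_output, ground_truth) / len(ground_truth)
print(f"character error rate: {cer:.2f}")
```

A number like this is only a starting point, of course: the open question the report raises is which values of it (or of task-specific measures like search recall) are "good enough" for a keyword search, a topic model, or a close reading.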
So It's Useless Then?
No. For me, it is essential that our conclusions about OCR not end with throwing up our hands in despair. My own scholarly interest in OCR largely began from a creeping frustration with how the technology was invoked in humanities presentations. It's become a conference trope to display a slide showing the terrible OCR underlying a given archive as a kind of public shrug: "what can you do, amiright?" Everyone laughs and moves on without thinking all that hard about the OCR or the assumptions baked into that moment of collective apathy. Rarely does the presenter engage with precisely why and how that particular OCR is useless: what are you seeking to learn from it? What level of reliability would you need to learn what you want to learn? What tasks might be possible with existing OCR?
It's even less common for the presenter to consider what might be done to improve the situation: that's the task, such presentations seem to concede, for computer scientists and corporations, and they don't care about our concerns. I don't want to claim that every corporation managing an OCR archive cares deeply about the concerns of humanities researchers, particularly those working in areas or fields they might consider niche. But if researching and writing this report taught me anything, it's that there is significantly more opportunity for interdisciplinary cooperation around these issues than we typically credit, and more goodwill about our concerns and areas of interest than we imagine in our moments of conference cynicism. This work has made me hopeful, not just that OCR can be improved but that humanities scholars can be central contributors to its future.
In the recommendations of the report, you will see that the most pressing research in the field will require extensive development of training corpora: accurate transcriptions of materials in a given domain that can be used to train an OCR system to recognize the textually meaningful elements in images drawn from that domain. In other words, the most pressing OCR research will require the expertise of humanities domain specialists: book and textual historians, scholars of languages and writing systems, scholars of particular genres or historical periods. In response to the report, Michael Gossett wrote on Twitter, "One sees an entire future of collaborative scholarship here." I agree strongly. There is enormous potential in OCR research for meaningful, important collaboration across a range of fields, but particularly across the humanities, libraries, and computer science. For me this is the central reason humanists should care about OCR: not to bemoan its current state but to help imagine its future.
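For the curious, here is a hypothetical sketch of what such a training corpus often looks like on disk: individual line images paired with expert transcriptions, a convention used (with variations) by trainable engines such as Tesseract and Kraken. The directory and file names are illustrative only, and the expert transcription in each .gt.txt file is exactly the kind of domain knowledge humanities scholars can supply.

```python
# A hypothetical layout for OCR training data: each line image (*.png)
# sits next to a ground-truth transcription (*.gt.txt) made by a human
# expert. Directory and file names are illustrative, not from the report.
from pathlib import Path

corpus = Path("fraktur_newspaper_corpus")
pairs = []
for image in sorted(corpus.glob("*.png")):
    gt = image.with_name(image.stem + ".gt.txt")   # matching expert transcription
    if gt.exists():
        pairs.append((image, gt.read_text(encoding="utf-8").strip()))

print(f"{len(pairs)} line-image/transcription pairs ready for training")
```

Producing those transcriptions accurately, for Fraktur newspapers, early modern books, or non-Latin scripts, is precisely where book historians, paleographers, and language specialists become indispensable collaborators.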