A C19 Reprint Discovery Engine (or, Where I Think This Hawthorne Stuff May Eventually Go)

Things are moving for “The Celestial Railroad” project. Last year’s work was slow, which can be forgiven, I hope, as I was a brand-new faculty member, but this year I have two undergraduate assistants helping me transcribe and encode the hundreds of paratexts: the texts that introduced, commented upon, quoted, or invoked what may have been Hawthorne’s most popular early story. We’re building the archive in the background of this website, and I hope to publish most of the “Celestial Railroad” reprints and paratexts this summer.

Which leads me to the question, “What’s next?” I’ve been thinking quite a bit about this over the past year, and have considered a few possible directions for this research. Exploring the extensive reprinting history I’ve uncovered for this one story—a non-canonical story by a hyper-canonical author—has convinced me that similar textual narratives must exist for many stories and poems—both by canonical and by forgotten authors.

Once I publish my Hawthorne research this summer, I want to start working on something much bigger: a reprint discovery engine for nineteenth-century periodicals archives. I imagine a tool not unlike the Google Ngram Viewer, but focused on textual reprint and reference. This project would likely start by investigating a database like the Library of Congress’ “Chronicling America” collection, which is open and includes “an extensive application programming interface (API) which you can use to explore all of our data in many ways.”

I imagine the reprint discovery tool developing in two stages:

In its first stage, the tool likely would require base texts for each inquiry. Users would enter, say, the text of Poe’s “Purloined Letter” and the tool would automatically break the short story into n-grams (sequences of words or letters). Then the tool would query a periodical archive for each n-gram sequence. Why so many queries? As I found with the Hawthorne project, simple title searches are insufficient, as reprints were often untitled or retitled by newspaper and magazine editors. In addition, title searches won’t return quotations from or references to the base text in other kinds of articles, such as the sermons or religious articles I found that quoted just a line or two from “The Celestial Railroad.” The tool should allow readers to tweak the length of the n-gram sequences on the fly (in my OS X-bound imagination, I see a slider) so that an inquiry could be broadened or narrowed based on the results returned. Such a tool would allow users to discover not only reprints of their chosen text, but also the paratexts essential to understanding the reception history of the story or poem.
In the tool’s second stage, I would hope to automate the first part of the reprint discovery process: the discovery of base texts. The problem with the tool I’ve outlined in stage 1 is that it would likely only be used for texts scholars already find interesting—stories or poems that scholars suspect are worth searching periodicals archives for, because they have some sense of an existing history of widespread reprinting and/or reference. If the tool itself could dig into the archive in search of base texts, however, then we might discover texts that were widely reprinted and referenced but have since fallen out of our cultural memory. Such a tool could generate significant new scholarship, as important new texts and authors resurfaced and demanded further study. How might this work technically? I’m not certain. Perhaps the tool would crawl through the entire archive database, breaking the archive itself into n-grams and then looking for matches. I’ll need a programmer to tell me whether that’s in the realm of possibilities, or whether there’s another approach that would be more fruitful.
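
As a rough illustration of both stages, here is a minimal Python sketch. It is only one possible approach under simplifying assumptions, not a design for the real tool. The function names (`tokenize`, `word_ngrams`, `shared_ngrams`) and the idea of holding archive pages in a dictionary are hypothetical choices made for the example. It uses word-level n-grams and exact matching, and it does not contact any archive: for stage 1, each phrase returned by `word_ngrams` would still have to be submitted to an archive's search interface, which is not shown here.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (punctuation dropped)."""
    return re.findall(r"\w+", text.lower())


def word_ngrams(text, n):
    """Stage 1: break a base text into overlapping word n-grams.

    Each n-gram is returned as a space-joined phrase that could serve
    as one search query. `n` plays the role of the slider setting.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = tokenize(text)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shared_ngrams(pages, n, min_pages=2):
    """Stage 2: find n-grams that occur on at least `min_pages` pages.

    `pages` maps a hypothetical page identifier to that page's text.
    Returns a dict mapping each shared n-gram to the set of page ids
    that contain it.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for gram in word_ngrams(text, n):
            index[gram].add(page_id)
    return {gram: ids for gram, ids in index.items() if len(ids) >= min_pages}


if __name__ == "__main__":
    # Stage 1: phrases to search for, given a base text and n = 3.
    for phrase in word_ngrams("We took the celestial railroad to the city", 3):
        print(phrase)

    # Stage 2: look for text shared between pages, with no base text supplied.
    sample_pages = {
        "paper-a": "We took the celestial railroad to the city",
        "paper-b": "A sermon on the Celestial Railroad to the city gates",
        "paper-c": "Corn prices rose again this week",
    }
    for gram, ids in sorted(shared_ngrams(sample_pages, 3).items()):
        print(gram, sorted(ids))
```

Raising `n` narrows an inquiry to longer exact passages, and lowering it broadens the inquiry at the cost of more coincidental matches. Two caveats apply to this sketch: an in-memory index like this one would probably not scale to an entire newspaper archive, and exact matching is likely to miss reprints whose digitized text contains OCR errors or editorial changes, so a real tool would presumably need some form of approximate matching.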

This all leads me to three questions for the digital humanities community:

First, am I missing an existing tool that would enable this sort of discovery? I don’t want to spend time figuring out how to reinvent a tool that already exists (or nearly exists, and merely wants tweaking).
Second, does the tool I’ve described sound useful and compelling? Does this meet a need for scholars in literature, history, religious studies, and/or periodical studies? If you were reviewing this grant proposal, what would you say about the tool’s “potential contributions to the field”?
Third, would you be interested in collaborating to build such a tool? I will, of course, list this project on DHCommons, but if you’re reading this and thinking either “this idea perfectly dovetails with my own research project” or “I could write an algorithm to do that in an afternoon,” please send me an email!

There are, of course, many more possibilities growing from such a tool. As I mentioned in my last post, thinking about nineteenth-century reprinting culture immediately leads to geospatial questions. Perhaps this reprint discovery tool could map search results, so that users could navigate results geographically. Indeed, such a tool might help untangle the complicated web of nineteenth-century reprinting culture, visualizing relationships between publications that frequently borrowed from one another and suggesting relationships scholars had not previously spotted. Perhaps I will speculate on geospatial possibilities in another post. For now, if you have ideas or suggestions for a C19 reprint discovery engine, please share them in the comments.