The following was presented at the MLA 2020 Convention in the panel “Architextures of Knowledge,” sponsored by the Society for Textual Scholarship.

Recurated and Reedited

Jerome McGann begins his 2014 book, A New Republic of Letters, channeling an imperative he first articulated at least thirteen years earlier: “Here is surely a truth now universally acknowledged: that the whole of our cultural inheritance has to be recurated and reedited in digital forms and institutional structures.” McGann despairs that current digital archives are, by and large, being created by for-profit entities, using errorful automated processes such as optical character recognition (OCR), and by workers trained in the technical aspects of digitization, but not in the historical and cultural dimensions of the artifacts being digitized. What is required, McGann argues, is a sustained, collective scholarly intervention:

Digitizing the archive is not about replacing it. It’s about making it usable for the present and the future. To do that we have to understand, as best we can, how it functioned—how it made meanings—in the past. A major task lying before us—its technical difficulties are great—is to design a knowledge and information network that integrates, as seamlessly as possible, our paper-based inheritance with the emerging archive of born-digital materials (21-22).1

What does it mean to make the archive “usable for the present and the future,” and how do notions of usability change due to digitization? The massive scale and cost of the effort McGann advocates seem impossible in the current age of educational austerity—which is particularly felt in the humanities—and yet the imperative to design multiple paths into both “our paper-based” and “born digital” inheritances seems clear.

In the longer version of this paper, I do some work here to dispel the notion that scale and cost are challenges unique to the digital age, pointing out that the majority of our paper-based inheritance was never curated or edited prior to the digital age. We can discuss this claim more in the Q&A if folks would like. The fact is that we will not have the time and labor required to digitize all books to the standards we might wish, a task even less imaginable for materials such as newspapers, pamphlets, tracts, and periodicals. In such a reality, digital scholarly editing remains an essential endeavor, but it must proceed in parallel with other experiments that meet the mass digitized archive where it is, leverage the unique affordances of digital media to explore its contours, and identify areas of scholarly interest from a well of largely undifferentiated content.

Toward Speculative Bibliography

This paper proposes speculative bibliography as a complementary, experimental approach to the digitized archive, in which textual associations are constituted propositionally, iteratively, and (sometimes) temporarily, as the result of probabilistic computational models. A speculative bibliography enacts a scholarly theory of the text, reorganizing the archive to model a particular idea of textual relation or interaction. Many computational processes already create data we might identify as speculative bibliographies: algorithms that detect the relative prevalence of “topics” across documents, identify sequences of duplicate text in a corpus, or more simply list texts that share search terms.

Our work on the Viral Texts project at Northeastern University, for instance, uses text mining techniques to trace the reprinting of material in nineteenth-century newspapers across the globe.[^ViralTexts] To simplify things just a bit, our method posits that the editorial practice of reprinting can be modeled in this way:

“if passages of text from separate pages contain at least five matching phrases of five-words length and their words overlap each other by at least 80%, they should be considered ‘the same’ and clustered together.”

Essentially, we have developed a textual model that is agnostic on questions of author, title, genre, or similar categories which are largely absent from either the nineteenth-century newspaper page or the metadata of twenty-first century digitized newspaper archives. To write that another way, it is a method that does, I believe, account for how the nineteenth-century newspaper “made its meanings,” as well as the ways that current digital newspaper archives mask those meanings. This kind of pattern matching, identifying overlapping strings of characters across tens of millions of newspaper pages, would be far beyond the capabilities of a human researcher or the operational capacities of an individual archive. The project relies on digital archives that federate newspapers physically distributed around the world, which means an analog effort along these same lines would require more time than any researcher possesses, an incredibly intricate indexing system, and enormous geographic mobility. Speculative bibliography recognizes that a unique affordance of digital media is the ability to rapidly reorganize or reconfigure its contents and seeks to identify meaningful patterns for exploration within collections that are often messy and unevenly described. In the Viral Texts model, textual relationships are determined by the formal structures internal to the texts themselves, but our algorithm is nonetheless a bibliographic argument.
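The matching rule quoted above can be sketched in a few lines of Python. This is a minimal illustration, not the Viral Texts implementation: the function names (`shingles`, `word_overlap`, `same_text`) are hypothetical, and the sketch assumes that "overlap by at least 80%" means the share of the shorter passage's distinct words that also appear in the other passage. The sample passages are invented for the example.

```python
def shingles(text, n=5):
    """Return the set of n-word phrases in a passage."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def word_overlap(a, b):
    """Share of the shorter passage's distinct words found in the other.

    Assumption: this is only one possible reading of "overlap by 80%".
    """
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / min(len(words_a), len(words_b))


def same_text(a, b, n=5, min_phrases=5, min_overlap=0.8):
    """Apply the rule: enough shared n-word phrases AND enough word overlap."""
    shared = shingles(a, n) & shingles(b, n)
    return len(shared) >= min_phrases and word_overlap(a, b) >= min_overlap


# Invented passages, used only to exercise the rule.
original = ("the snow the beautiful snow filling the sky and the earth "
            "below over the housetops over the street")
reprint = ("oh the snow the beautiful snow filling the sky and the earth "
           "below over the housetops")
unrelated = ("congress met yesterday to debate the tariff bill and the "
             "president sent a message")

print(same_text(original, reprint))
print(same_text(original, unrelated))
```

Note that nothing in the rule consults author, title, or genre: the decision rests entirely on the words on the page. A version meant for millions of OCR-derived pages would also need indexing and some tolerance for transcription errors, both omitted here.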

Why “Bibliography”?

To oversimplify just a bit, we might describe bibliography as a system for modeling textual relationships. The bibliographer decides that these texts belong together because they share certain metadata (e.g. author, genre, era of publication) while these others might belong together because they share formal material features (e.g. octavo format, dos-à-dos binding). In many bibliographic traditions, these relationships are mapped out quite methodically and procedurally (dare I write algorithmically?), which is perhaps why bibliography can seem “dry as dust” to outside observers. However, bibliographers share a conviction Jonathan Senchyne summarizes beautifully in his new book that “Material textuality means…the material presence of something is itself figurative and demands close reading” alongside a text’s linguistic content (6).2

The constellation I want to gather under the sign of speculative bibliography comprises computational and probabilistic methods that map relationships among documents, that sort and organize the digital archive into related sets. I employ bibliography to insist that such methods belong to the textual systems they transform, and should themselves be objects of research and scrutiny, described with the same rigor that bibliographers and book historians describe historical technologies of knowledge production.

Earlier in the digital age, G. Thomas Tanselle proposed a definition for what he called an “electronic edition” of a text:

in order to include modern methods of book production which do not involve actual type setting, an edition should be defined as all copies resulting from a single job of typographical composition. Thus whether printed from type (set by hand or by machine), or plates, or by means of a photographic or electronic process, all copies that derive from the same initial act of assembling the letterforms belong to the same edition (18).3

Where previous bibliographers had focused on composition as the setting of metal type (whether moveable characters of cold type or a line of hot type), Tanselle recognized that typing at a computer keyboard was also an act of inscription, committing a particular arrangement of typographic characters (“assembling the letterforms”) to memory. In a recent article in Book History, I pivot from Tanselle to argue that humanities scholars need to take optical character recognition (OCR) seriously as a material and cultural artifact. OCR software scans digitized page images, attempting to recognize the letterforms on the images and transcribe them into a text file. I argue that we might consider OCR a species of compositor, setting type in a language it can see but not comprehend, and thus that OCR data derived from a historical text is a new edition (a copy “resulting from a single job of typographical composition”) of that text.4 It is a kind of offset composition, in which the programmer sets the rules for recognition that the program will follow to create many editions.

This paper expands that frame further to think about code—such as that underlying reprint detection or classification—as another job of typographical composition that inscribes a theory of textual relationship, at least when its objects are bibliographic or textual. That last caveat is important, because just as we wouldn’t claim all written analysis as bibliographic, we shouldn’t claim all code as bibliographic. Annette Vee points to the double valence of code when she writes that

Programming has a complex relationship with writing; it is writing…because it is symbols inscribed on a surface and designed to be read. But programming is also not writing, or rather, it is something more than writing. The symbols that constitute computer code are designed not only to be read but also to be executed, or run, by the computer.

Vee continues in a line that echoes (though I suspect accidentally) Tanselle, writing that “programming is the authoring of processes to be carried out by the computer” (20).5 Dennis Tenen argues something similar when he writes “Unlike figurative description, machine control languages function in the imperative. They do not stand for action; they are action” (94).6 A program becomes speculative bibliography when its action operationalizes a theory of textual relationship. If OCR is a species of compositor, such algorithms might be species of editor, set loose with one unwavering principle of selection apiece. Speculative bibliography is action and documentation.

Consider a classification algorithm. We use these in the Viral Texts project to sort millions of individual reprints into generic categories: fiction, news, poetry, science writing, domestic writing, etc. Classification is typically a supervised task, in which researchers tag a set of documents—the training dataset—as belonging to the various genres they hope to identify. From this training data, different classification algorithms can be used to compare unknown texts against known. Some classification algorithms use words to determine belongingness: e.g. domestic fiction will use words like “eye,” “heart,” “mother,” or “tear” in much higher proportion than we would expect from random chance, while news articles will disproportionately use words like “ship,” “president,” “bill,” or “debate.” I’m making these lists up, because in reality they depend entirely on a specific research corpus and researchers’ initial genre classifications, but word-based classification works roughly in this way. There are also other classification methods that use topic or vector space models to establish relationships among the words in different texts.
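To make the word-based approach concrete, here is a minimal sketch of a supervised classifier in the naive Bayes family, written in plain Python. It is an illustration under stated assumptions, not the project's classifier: the training sentences are invented (built from the made-up word lists above), the function names are hypothetical, and the model ignores how common each genre is in the corpus.

```python
import math
from collections import Counter


def train(tagged):
    """Count words per genre from (genre, text) pairs tagged by a researcher."""
    counts = {}
    for genre, text in tagged:
        counts.setdefault(genre, Counter()).update(text.lower().split())
    vocab = {word for c in counts.values() for word in c}
    return counts, vocab


def score(text, counts, vocab):
    """Log-likelihood of the text under each genre's word distribution."""
    scores = {}
    for genre, c in counts.items():
        total = sum(c.values()) + len(vocab)  # add-one smoothing
        scores[genre] = sum(
            math.log((c[word] + 1) / total)
            for word in text.lower().split()
            if word in vocab  # words never seen in training are skipped
        )
    return scores


def classify(text, counts, vocab):
    """Return the genre whose word distribution best fits the text."""
    scores = score(text, counts, vocab)
    return max(scores, key=scores.get)


# Invented training data; a real training set would hold hundreds of texts.
tagged = [
    ("fiction", "her mother wiped a tear from her eye and her heart ached"),
    ("fiction", "his heart sank as a tear fell from the eye of his mother"),
    ("news", "the president signed the bill after long debate"),
    ("news", "the ship arrived as the debate over the bill continued"),
]
counts, vocab = train(tagged)
print(classify("a tear in the eye of a mother", counts, vocab))
print(classify("the president and the ship", counts, vocab))
```

The editorial judgment lives in `tagged`: change the genre labels or the texts chosen to represent them, and every later classification changes with them.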

I want to consider these processes as displaced forms of editing that operate with less precision but greater speed and scale than solely human endeavor: a kind of offset editing. Editors create models of textual relationship through tagging their training data and then operationalize those models across a wider textual field than they could edit alone. In our case we spend a few weeks manually tagging several hundred texts per genre in order to classify millions of unknown texts in an afternoon. Importantly, these methods are not binary, but probabilistic across all genres for which we create a training set, so that one text might be classified as 79% likely to be poetry and 65% likely to be religious. If we later seek out popular newspaper poetry or religion (or religious poetry) we would find such a text. Of course, with such a method there are many false positives or false negatives: texts that a human reader would recognize as an account of a cricket match, for instance, but that the classifier identifies as poetry, or texts that a human observer would recognize as poetry but that a classifier fails to identify as likely to be such. If such methods are bibliographic because they posit textual relationships and paths through mass digital archives, they are speculative because the paths they posit are probabilistic, experimental, and iterative.
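The practical consequence of probabilistic, non-exclusive labels can be shown with a small query sketch. It assumes each text already carries an independent probability per genre, so the values need not sum to one. The `Scored` class, the `find` function, the titles, and every number other than the 79% and 65% mentioned above are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    title: str
    probs: dict  # genre -> independent probability between 0 and 1


def find(texts, genres, threshold=0.5):
    """Titles whose probability meets the threshold for every requested genre."""
    return [
        t.title
        for t in texts
        if all(t.probs.get(g, 0.0) >= threshold for g in genres)
    ]


# Invented examples, including a cricket report misread as poetry.
texts = [
    Scored("hymn", {"poetry": 0.79, "religion": 0.65}),
    Scored("cricket report", {"poetry": 0.58, "news": 0.40}),
    Scored("shipping news", {"news": 0.91, "poetry": 0.05}),
]

print(find(texts, ["poetry"]))
print(find(texts, ["poetry", "religion"]))
```

Raising or lowering `threshold` trades false positives against false negatives, which is one reason the resulting sets are best read as propositions to test rather than settled categories.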

Why “Speculative”?

With speculative bibliography, I seek an intellectual frame that recognizes the practical, theoretical, and historiographical potential of exploratory computation without resorting to dehistoricized, idealized notions of “big data,” to negotiate a middle ground between strong theories of “distant reading” or “cultural analytics” on the one hand and the “scholarly edition of a literary system” more recently advocated by Katherine Bode. I am fully convinced by Bode’s argument that,

Contrary to prevailing opinion, distant reading and close reading are not opposites. These approaches are united by a common neglect of textual scholarship: the bibliographic and editorial approaches that scholars have long depended on to negotiate the documentary record (19).7

Bode rightly points out that most “data-rich literary history” projects fail to fully delineate “the broader relationship between the literature of the past and the disciplinary infrastructure used to investigate it” (52). She advocates instead for the “scholarly edition of a literary system” in which

A curated dataset replaces the curated text…In the form of bibliographical and textual data, it manifests—demonstrates and, specifically, publishes—the outcome of the sequence of production and reception, including the current moment of interpretation, described in the critical apparatus. The model it provides is stable: it is published and accessible for all to use, whether for conventional or computational literary history. But that stability does not belie or extinguish its hypothetical character. Rather than showing a literary system, it presents an argument about the existence of literary works in the past based on the editor’s interpretation of the multiple transactions by which documentary evidence of the past is transmitted for the present (53).

Bode models this in the carefully-curated datasets she compiles about Australian newspaper fiction, which I would point to as an exemplar for computational work going forward.

However, I do want to carve out space for approaches to the digital archive that are bibliographically informed while being experimental, exploratory, even playful. The term “speculation” has a long history in the digital humanities, as a term that pairs the technical and ludic. In an essay from the 2004 volume that named the field of digital humanities, Johanna Drucker and Bethany Nowviskie write about the tensions inherent in the term “speculative computing”:

Speculative computing is a technical term…It refers to the anticipation of probable outcomes along possible forward branches in the processing of data. Speculation is used to maximize efficient performance…Logic-based, and quantitative, the process is pure techne, applied knowledge, highly crafted, and utterly remote from any notion of poiesis or aesthetic expression. Metaphorically, speculation invokes notions of possible worlds spiraling outward from every node in the processing chain, vivid as the rings of radio signals in the old RKO studios film logo. To a narratologist, the process suggests the garden of forking paths, a way to read computing as a tale structured by nodes and branches.

For Drucker and Nowviskie, speculative computing is evocative almost despite itself, “conjuring images of unlikely outcomes and surprise events, imaginative leaps across the circuits that comprise the electronic synapses of digital technology.” Prediction is interpretation in this framework: a model of thought instantiated in code. For the digital humanities, this idea is important because “Speculative approaches make it possible for subjective interpretation to have a role in shaping the processes, not just the structures, of digital humanities.”8

In the longer version of this paper, I here turn to Nowviskie’s more recent work, drawing on Afrofuturism, to imagine “speculative collections” that “activate imaginations—both their users’ imaginations and those of the expert practitioners who craft and maintain them?”9 as well as Lauren Klein’s mandate for scholars of early American literary history to develop “A speculative aesthetics of early American literature” (439).10 We can delve into these in the Q&A if folks are interested.

What I am calling speculative bibliography is speculative in both senses Nowviskie and Drucker elicit. It is an anticipatory processing of bibliographic data in order to maximize possible paths of discovery—not all operations produce meaningful literary-historical insights, but some do—and it is also an imaginative act that asks how the archive might make meanings if differently arranged and juxtaposed.

Speculative Editions

In the Viral Texts project, we reorder the historical newspaper archive to see not individual newspapers over time, but the tendrils of textual repetition, quotation, circulation, and theft that linked papers and readers together. Since the beginning of the project we have wrestled with how best to make our data available to other scholars to argue with or against. On the one hand, when we publish an article, its arguments are based on particular texts: the hundreds of reprints of the poem “Beautiful Snow,” for instance, or of a listicle outlining the habits of successful young businessmen. The reprints from which we developed our argument were identified by a particular run of our algorithm and exist as data in a spreadsheet, itself a historically-specific textual artifact. We recognize that scholars reading our 2016 American Periodicals article should be able to refer to the 276 reprints of “Beautiful Snow” we consulted when writing it, and we dutifully published a spreadsheet alongside that article that includes all of the texts we cite in it. Ultimately, however, our argument is not about any particular reprint of “Beautiful Snow,” but about the event of that poem across the country and the world: the many reprints, the readers who loved it, the poets who parodied it and parodied its parodies, the editors who debated its authorship. And our picture of that event continually shifts as we refine our reprint detection methods and add new historical newspaper data to our inquiries. I want readers to find those 276 reprints from 2016, of course, but I also want them to find the 291 reprints we know of in 2020, or the 500 (he wrote optimistically) we will know of in 2022.

As we develop the Viral Texts project book (Going the Rounds, University of Minnesota Press), we are experimenting with an approach that weaves together textual editing and computational speculation. For each work that we write about (by which I mean a set of witnesses that we would identify as reprints of the same work), we create a clean transcription from one witness we specifically reference. These transcriptions, which we refer to as “anchor texts,” become stable points of reference incorporated into each new iteration of our algorithm: a seed around which reprints of that particular text will be clustered in subsequent analyses. In each new dataset we create, these transcriptions are the reference points that allow researchers to quickly home in on familiar texts, while allowing textual clusters to shift as we experiment with the parameters of our algorithm or expand the source data we analyze. Thus when the book is published, readers will be able to find the texts on which its arguments are based in our public database, but also to see how our picture of nineteenth-century newspaper reprinting is evolving in real time.
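One simple way to picture the role of an anchor text is as a lookup key into whatever clusters a given run produces. The sketch below is a simplification under stated assumptions: it locates anchors after clustering rather than seeding the clustering itself, the function names and cluster identifiers are hypothetical, and the texts are invented.

```python
def shingles(text, n=5):
    """Return the set of n-word phrases in a passage."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def best_cluster(anchor, clusters, n=5):
    """Id of the cluster sharing the most n-word phrases with the anchor.

    Returns None when no cluster shares any phrase with the anchor.
    """
    anchor_phrases = shingles(anchor, n)
    best_id, best_shared = None, 0
    for cluster_id, witnesses in clusters.items():
        shared = max(
            (len(anchor_phrases & shingles(w, n)) for w in witnesses),
            default=0,
        )
        if shared > best_shared:
            best_id, best_shared = cluster_id, shared
    return best_id


def locate_anchors(anchors, clusters, n=5):
    """Map each anchor name to its cluster in one run of the algorithm."""
    return {name: best_cluster(text, clusters, n) for name, text in anchors.items()}


# Invented anchor transcription and one run's clusters (ids are arbitrary).
anchors = {
    "Beautiful Snow": "oh the snow the beautiful snow filling the sky "
                      "and the earth below",
}
run = {
    "c02": ["the president signed the bill after a long debate in congress"],
    "c17": ["the snow the beautiful snow filling the sky and the earth"],
}
print(locate_anchors(anchors, run))
```

Because cluster identifiers such as `c17` are arbitrary and may change from run to run, the anchor name is what stays constant for a reader returning to the data.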

By providing a stable bibliographic reference point within a speculative computing environment, these transcriptions also enable us to better understand and critique the effects made by changes to our algorithm. From the computer science perspective of our project, we have always wrestled with the lack of “ground truth”—a well-known and described dataset in which to test whether a method is returning reliable results. No index or hand-tagged archive of nineteenth-century newspaper reprinting exists—even at a relatively small scale—that we could use to ensure we are finding the reprints we should be finding before applying our methods across a larger, unknown collection. Even from the CS side, then, our work is speculative, and we have relied on estimates of recall due to the fundamental incompleteness and uncertainty of historical data. Anchor text transcriptions allow us to directly compare textual clusters from experiment to experiment and see precisely how changed parameters affect our results.
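The comparison described here can be sketched as set arithmetic over the witnesses attached to each anchor in two runs. This is an illustrative sketch, not the project's evaluation code: `compare_runs` and the witness identifiers are hypothetical, and Jaccard similarity is one reasonable summary measure among several.

```python
def compare_runs(old, new):
    """Report, per anchor, how its cluster changed between two runs.

    old and new map an anchor name to the set of witness ids clustered with it.
    """
    report = {}
    for name in old.keys() | new.keys():
        before, after = old.get(name, set()), new.get(name, set())
        union = before | after
        report[name] = {
            "kept": len(before & after),
            "gained": sorted(after - before),
            "lost": sorted(before - after),
            # Jaccard similarity: 1.0 means the two clusters are identical.
            "jaccard": len(before & after) / len(union) if union else 1.0,
        }
    return report


# Invented witness ids for two runs with different parameters.
run_2016 = {"Beautiful Snow": {"w1", "w2", "w3"}}
run_2020 = {"Beautiful Snow": {"w2", "w3", "w4", "w5"}}
print(compare_runs(run_2016, run_2020))
```

The `lost` list is often the more instructive half of such a report, since a witness that drops out after a parameter change is either a false positive corrected or a true reprint newly missed, and only reading it will say which.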

By taking a speculative approach to building bibliographies, the Viral Texts project puts into practice a method for identifying sets of formally-related texts worthy of study, by virtue of their duplication, from the massive archive of nineteenth-century newspapers. To expand our scope a bit, we might imagine other algorithms trained to recognize particular historical fonts, or to identify woodcuts within a collection of historical books. As a complement to Bode’s “scholarly edition of a literary system,” I would propose the speculative edition: all texts associated through a single computational model. We should not take the probabilities underlying speculative bibliography as the truth about literary history, but as propositions that demand testing and argumentation. Given the scope of our digitized collections, however, I would argue that many branching, speculative bibliographies will be necessary to identify fruitful paths of scholarly inquiry, and must proceed in dialogue with the careful editing and curation undertaken by textual scholars. Too often we separate our data from our analysis of the data, as if the one exists simply to illuminate the other. To resist this impulse, we need to integrate our computational experiments into our archival interfaces, provided as alternative paths into and through material.

  1. McGann, Jerome. A New Republic of Letters: Memory and Scholarship in the Age of Digital Reproduction. Cambridge: Harvard University Press, 2014. 

  2. Senchyne, Jonathan. The Intimacy of Paper in Early and Nineteenth-Century American Literature. Amherst: University of Massachusetts Press, 2019. 

  3. Tanselle, G. Thomas. “The Bibliographical Concepts of ‘Issue’ and ‘State.’” The Papers of the Bibliographical Society of America 69, no. 1 (1975): 17–66. 

  4. Cordell, Ryan. “‘Q i-Jtb the Raven’: Taking Dirty OCR Seriously.” Book History 20 (2017): 188–225. 

  5. Vee, Annette. Coding Literacy. Cambridge: MIT Press, 2017. 

  6. Tenen, Dennis. Plain Text: The Poetics of Computation. Palo Alto: Stanford University Press, 2017. 

  7. Bode, Katherine. A World of Fiction: Digital Collections and the Future of Literary History. Ann Arbor: University of Michigan Press, 2018. 

  8. Drucker, Johanna, and Bethany Nowviskie. “Speculative Computing: Aesthetic Provocations in Humanities Computing.” In A Companion to Digital Humanities, edited by Susan Schreibman, Ray Siemens, and John Unsworth. Blackwell Companions to Literature and Culture. Oxford: Blackwell Publishing Professional, 2004.

  9. Nowviskie, Bethany. “Speculative Collections.” In Bethany Nowviskie, 2016.

  10. Klein, Lauren F. “Speculative Aesthetics.” Early American Literature 51, no. 2 (July 13, 2016): 437–45.