Mr. Penumbra, Distant Reading, and Cheating at Scholarship

My Technologies of Text course is capping this semester reading Robin Sloan’s novel, Mr. Penumbra’s 24-Hour Bookstore, which Matt Kirschenbaum deemed “the first novel of the digital humanities” last year. Mr. Penumbra is a fine capstone because it thinks through so many of our course themes: the (a)materiality of reading, the book (and database) as physical objects, the relationship between computers and previous generations of information technology, &c. &c. &c. I will try not to spoil much of the book here, but I will of necessity give away some details from the end of the first chapter. So if you’ve not yet read it: go thou and do so.

Rereading the book for class, I was struck by one exchange between the titular Mr. Penumbra—bookstore owner and leader of a group of very close readers—and the narrator, Clay Jannon—a new bookstore employee curious about the odd books the store’s odd club members check out. In an attempt to understand what the club members are up to, Clay scans one of the store’s logbooks, which records the comings and goings of club members, the titles of the books they checked out, and when they borrowed each one. When he visualizes these exchanges over time within a 3D model of the bookstore itself, visual patterns of borrowing emerge, which seem, when compiled, to reveal an image of a man’s face. When Clay shows this visualization to Mr. Penumbra, they have an interesting exchange that ultimately hinges on methodology:

Half-smiling, he holds his glasses at an angle and peers down at the screen. His face goes slack, and then he says, quietly: “The Founder.” He turns to me. “You solved it.” He claps a hand to his forehead and his face splits into a giddy smile. “You solved it already! Look at him! Right there on the screen! [...] “How did you do it?” he continues. He’s so proud, like I’m his grandson and I just hit a home run, or cured cancer. “I must see your notes! Did you use Euler’s method? Or the Brito inversion? There is no shame in that, it clears away much of the confusion early on...” “Mr. Penumbra,” I say, triumph in my voice, “I scanned an old logbook [...] because Google has this machine, it’s superfast, and Hadoop, it just goes—I mean, a thousand computers, like that!” I snap for emphasis. I don’t think he has any idea what I’m talking about. “Anyway, the point is, we just pulled out the data. Automatically.”

At first Mr. Penumbra is quiet, but then he responds to Clay’s news:

“Oh, yes, I know,” he says sharply, and his eyes flash at me. “I see it now. You cheated—would that be fair to say? And as a result, you have no idea what you have accomplished.” I look down at the desk. That would be fair to say. When I look back up at Penumbra, his gaze has softened. “And yet...you did it all the same.” He turns and wanders into the Waybacklist. “How curious.” “Who is it?” I ask suddenly. “Whose face?” “It is the Founder,” Penumbra says, running a long hand up along one of the shelves. “The one who waits, hiding. He vexes novices for years. Years! And yet you revealed him in—what? A single month?” Not quite: “Just one day.” Penumbra takes a sharp breath. His eyes flash again. They are pulled wide and, reflecting the light from the windows, they crackle electric blue in a way I’ve never seen. He gasps, “Incredible.”

As I read this conversation, I was immediately reminded of so many exchanges I’ve seen at conferences about projects that use computational methods—whether text mining, network graphs, or geospatial visualizations—to expose patterns in literary-historical texts. When I talk about our Viral Texts project, for instance, I typically begin by describing the archival challenge: in brief, there are just so many nineteenth-century periodicals that no scholar can read them all. I then discuss how we’ve leveraged the pattern-finding powers of the computer to begin addressing this problem, automatically—there’s that word from Mr. Penumbra—uncovering more than 40,000 reprinted texts in one large-scale digital newspaper archive and using that data to visualize the spread of texts around the country or the strength of connections among publications.

At the risk of sounding uncharitable here (a risk I hope to address in the following paragraphs, so please stick with me), the response to this work from scholars in my discipline can often sound not unlike Mr. Penumbra’s initial response to Clay’s visualization: “I see it now. You cheated…[a]nd as a result, you have no idea what you have accomplished.” Often such responses come, as one might expect, from scholars who spent significant time researching nineteenth-century reprinting in the archive, reading newspapers one by one and taking careful notes. That two junior scholars—one a few measly years out of graduate school—are claiming to have identified 40,000 reprinted texts in a single year’s work, and without setting foot in an actual newspaper archive, seems a lot like cheating.

If someone actually articulated such a caricatured version of our work—and I am deliberately overstating things here to cast a more subtle problem into sharper relief—I could quibble with details of that caricature. I brought to the project a disciplinary understanding of reprinting that shaped the early development of the duplicate-detection algorithm. We used known sets of widely reprinted texts—typically drawn from the incredible work of book historians, bibliographers, and literary critics—to ensure we were capturing reprints we would expect to find, as well as new reprintings. We continually tweak the algorithm based on what it fails to find. We’re still not great, for instance, at identifying reprinted lyric poems, because such texts simply don’t include enough 5-grams (sequences of 5 words) to be identified using our current algorithm. Working through such problems and perfecting our computational methods requires that we draw on literary-historical knowledge and literary-historical methods. Finally, I have spent a good deal of time in physical archives, actually reading newspapers.
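To make the 5-gram problem concrete, here is a minimal sketch in Python. It is not the Viral Texts algorithm, only an illustration of the general idea of matching texts by shared five-word sequences. The function names and the sample sentences are invented for the example, and I assume a very crude tokenizer (lowercased runs of letters and apostrophes).

```python
import re

def five_grams(text, n=5):
    """Return the set of n-word sequences found in a text (n=5 by default)."""
    # Assumption: a crude tokenizer that lowercases and keeps letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_ngrams(a, b, n=5):
    """Return the n-grams that two texts have in common."""
    return five_grams(a, n) & five_grams(b, n)

# Hypothetical snippets standing in for two newspapers that print the same passage.
paper_a = "the quick brown fox jumps over the lazy dog near the river bank"
paper_b = "we reprint that the quick brown fox jumps over the lazy dog today"

# A hypothetical four-word line of verse: too short to contain any 5-gram at all.
short_poem = "roses are red today"

print(len(shared_ngrams(paper_a, paper_b)))  # five-word sequences in common
print(len(five_grams(short_poem)))           # nothing to match against
```

The nine-word passage the two snippets share yields five overlapping 5-grams, which is plenty of evidence for a match. The four-word line yields none, and even a full short lyric offers only a handful, which is why a method built on this kind of evidence struggles with brief poems.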

But Mr. Penumbra’s comments do get at a central methodological difference that I think is worth attending to more closely. Because Mr. Penumbra is right: perhaps not that Clay cheated, despite Clay’s own concession to this charge, but that Clay’s methodology for finding the Founder did not help him understand what he has accomplished. The pattern Clay uncovers in his visualization is “actually” embedded in codes, which are contained in the books the club members check out. The club members read the books (or perhaps more accurately, they study the books, which are not written to be read as narrative), decipher one part of the code, and then move on to the next book. Illuminating the entire pattern takes years of study, but along the way the club members are also initiated into the Unbroken Spine, which is the name of this monkish order of bibliophiles and code breakers. To become full members of the Unbroken Spine, these readers must understand the codes, which is to understand the books, which is to understand the Unbroken Spine’s history and purpose, and so forth. By contrast, Clay does not read the books or crack the code within them. Instead he works with the Unbroken Spine’s metadata, “not reading” the books but tracking the readers of those books. He comes to the correct conclusion, a fact Mr. Penumbra acknowledges with his “you did it all the same,” by piggybacking on the Unbroken Spine members’ years of difficult labor. And even after he has found the answer in his visualization, Clay does not understand the pieces that constitute that answer. He has looked into a few of the books, and knows they are a code, but he couldn’t crack the code in even one of them if asked.

Of course, I read the Unbroken Spine throughout Mr. Penumbra as an unsubtle metaphor for humanities scholars, reading closely over many years in search of greater understanding: of a historical period, of a genre, of a social movement, &c. And this leads me to two ideas about computational work in literary history that this exchange in Mr. Penumbra suggests. First, I often comment that one of my favorite things about digital humanities projects is the way they “make good” on decades—or even better, centuries—of fastidious record-keeping, particularly in libraries. I get excited when a scholar figures out a way to visualize corpora using an obscure metadata field recorded by generations of librarians and largely ignored until: wow! I’m thinking here of my colleague Benjamin Schmidt’s work visualizing American shipping from a data set compiled and translated through a century of new storage technologies and largely used in environmental research. These eureka moments excite me, but I can understand a more cynical reading, as the work of centuries and generations is distilled into a one-minute video.

Perhaps more to Mr. Penumbra’s point, however, computational methods can reverse the usual order of scholarly investigation in the humanities. Had I gone into the physical archive to study reprinting, I would have read these reprinted texts as I identified them, along with a host of texts which were not reprinted. The act of discovery would have been simultaneously an act of understanding. I would spend years in archives reading, and would emerge ready to build an argument about both the form and content of nineteenth-century reprinting practices.

Computational approaches are often more exploratory, or perhaps screwmeneutical, at least at the beginning of projects. We begin with a big question—can we identify duplicated text within this huge corpus of unstructured text data?—and we try one approach, then another, in an attempt to answer that question. We tweak this parameter and that to see what emerges. When something interesting happens we follow that line for a while. And new questions suggest themselves as we work.

But in our case, all that exploratory work preceded the bulk of the reading the project has required and will require. Of course we were checking our results along the way, reading this or that cluster of reprinted text to see if the method was working, but it wasn’t until we’d isolated a substantial corpus of reprinted texts that the reading began in earnest. Now that we have identified 40,000+ reprints from before the Civil War, I’m spending significant time with those reprints, thinking about the genres that seem to be most widely reprinted, the ways these texts reflect (or don’t) our ideas about literary production and popular culture in the antebellum period, and studying the ways individual texts changed as they were reprinted across the country. The project’s research assistants are now annotating the text clusters, giving them titles; identifying authors; and assigning tags based on the topics, genres, and ideas reflected in each piece.

In many ways, then, our methodology separated the act of discovery from the act of understanding. We quite quickly uncovered patterns of reprinting in our corpus, and now that the algorithm works well we can even more quickly apply it to new corpora, as we are hoping to do in the near future. And we have been able to create some exciting and suggestive visualizations from those findings, visualizing reprints as signals of influence between publications, for instance, in a network graph. But really making sense of these findings will be the work of years, not days.
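For readers curious what "reprints as signals of influence" might look like as data, here is a rough sketch in Python. It is an illustration under assumptions of my own, not a description of how our graphs are built: I assume each cluster of reprints is a list of (publication, date) pairs, that dates are ISO-formatted strings, and that an earlier printing counts as a possible source for every later one. The publication names and dates are placeholders.

```python
from collections import Counter
from itertools import combinations

def influence_edges(clusters):
    """Count, for each (earlier, later) pair of publications, the clusters linking them."""
    edges = Counter()
    for printings in clusters:
        # ISO dates (YYYY-MM-DD) sort chronologically when sorted as strings.
        ordered = sorted(printings, key=lambda p: p[1])
        for (src, d1), (dst, d2) in combinations(ordered, 2):
            # Assumed rule: an edge runs from the earlier printing to the later one.
            if src != dst and d1 < d2:
                edges[(src, dst)] += 1
    return edges

# Hypothetical clusters: each inner list is one text and the papers that printed it.
clusters = [
    [("Paper A", "1845-01-03"), ("Paper B", "1845-01-10"), ("Paper C", "1845-02-01")],
    [("Paper A", "1846-05-02"), ("Paper B", "1846-05-20")],
]

edges = influence_edges(clusters)
for (src, dst), weight in sorted(edges.items()):
    print(f"{src} -> {dst}: {weight}")
```

The resulting weighted edge list is the kind of input a graph-drawing tool expects. The weights say only that two papers printed the same texts in a certain order, and working out what that ordering means is exactly the slower work of understanding described above.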

Ultimately, I think Mr. Penumbra’s comments get at a central challenge for computational work in the humanities, both for those evaluating computational work from the outside and for those doing computational work. It seems clear to me how “distant reading” methods could seem like “cheating,” using a machine to bypass some of the work and time typically required to analyze large swaths of historical or literary material: “I mean, a thousand computers, like that!” But of course, if the question at the heart of the analysis is good, and the methods uncover real and substantive results, they shouldn’t be dismissed on essentially moral grounds, because the researchers didn’t work hard enough. At the same time, those undertaking such projects should recognize when their methods do lead to gaps in understanding because they invert the typical order of humanities scholarship. In Clay’s case, it is only after he creates his visualization of the Unbroken Spine’s Founder—in other words, only after he solves the bigger puzzle—that he begins to understand the details of the group and its mission, and eventually to contribute to that mission. Perhaps this is a model for some DH projects, which tell a truth, but tell it slant. In my case, I am striving to be more transparent about both what we have learned and what we are still learning in the Viral Texts project. And even if the computational work stopped tomorrow, we would have far more to learn than we have yet learned. Understanding is always a little bit out of reach, whether or not you work with a computer.