Mr. Penumbra, Distant Reading, and Cheating at Scholarship
My Technologies of Text course is capping this semester reading Robin Sloan's novel, Mr. Penumbra's 24-Hour Bookstore, which Matt Kirschenbaum deemed "the first novel of the digital humanities" last year. Mr. Penumbra is a fine capstone because it thinks through so many of our course themes: the (a)materiality of reading, the book (and database) as physical objects, the relationship between computers and previous generations of information technology, &c. &c. &c. I will try not to spoil much of the book here, but I will of necessity give away some details from the end of the first chapter. So if you've not yet read it: go thou and do so.
Rereading the book for class, I was struck by one exchange between the titular Mr. Penumbra (bookstore owner and leader of a group of very close readers) and the narrator, Clay Jannon (a new bookstore employee curious about the odd books the store's odd club members check out). In an attempt to understand what the club members are up to, Clay scans one of the store's logbooks, which records the comings and goings of club members, the titles of the books they checked out, and when they borrowed each one. When he visualizes these exchanges over time within a 3D model of the bookstore itself, visual patterns of borrowing emerge, which seem, when compiled, to reveal an image of a man's face. When Clay shows this visualization to Mr. Penumbra, they have an interesting exchange that ultimately hinges on methodology:
Half-smiling, he holds his glasses at an angle and peers down at the screen. His face goes slack, and then he says, quietly: "The Founder." He turns to me. "You solved it." He claps a hand to his forehead and his face splits into a giddy smile. "You solved it already! Look at him! Right there on the screen! [...] "How did you do it?" he continues. He's so proud, like I'm his grandson and I just hit a home run, or cured cancer. "I must see your notes! Did you use Euler's method? Or the Brito inversion? There is no shame in that, it clears away much of the confusion early on..." "Mr. Penumbra," I say, triumph in my voice, "I scanned an old logbook [...] because Google has this machine, it's superfast, and Hadoop, it just goes—I mean, a thousand computers, like that!" I snap for emphasis. I don't think he has any idea what I'm talking about. "Anyway, the point is, we just pulled out the data. Automatically."
At first Mr. Penumbra is quiet, but then he responds to Clay's news:
"Oh, yes, I know," he says sharply, and his eyes flash at me. "I see it now. You cheated—would that be fair to say? And as a result, you have no idea what you have accomplished." I look down at the desk. That would be fair to say. When I look back up at Penumbra, his gaze has softened. "And yet...you did it all the same." He turns and wanders into the Waybacklist. "How curious." "Who is it?" I ask suddenly. "Whose face?" "It is the Founder," Penumbra says, running a long hand up along one of the shelves. "The one who waits, hiding. He vexes novices for years. Years! And yet you revealed him in—what? A single month?" Not quite: "Just one day." Penumbra takes a sharp breath. His eyes flash again. They are pulled wide and, reflecting the light from the windows, they crackle electric blue in a way I've never seen. He gasps, "Incredible."
As I read this conversation, I was immediately reminded of so many exchanges I've seen at conferences about projects that use computational methods (whether text mining, network graphs, or geospatial visualizations) to expose patterns in literary-historical texts. When I talk about our Viral Texts project, for instance, I typically begin by describing the archival challenge: in brief, there are just so many nineteenth-century periodicals that no scholar can read them all. I then discuss how we've leveraged the pattern-finding powers of the computer to begin addressing this problem, automatically (there's that word from Mr. Penumbra) uncovering more than 40,000 reprinted texts in one large-scale digital newspaper archive and using that data to visualize the spread of texts around the country or the strength of connections among publications.
At the risk of sounding uncharitable here (a risk I hope to address in the following paragraphs, so please stick with me), the response to this work from scholars in my discipline can often sound not unlike Mr. Penumbra's initial response to Clay's visualization: "I see it now. You cheated…[a]nd as a result, you have no idea what you have accomplished." Often such responses come, as one might expect, from scholars who spent significant time researching nineteenth-century reprinting in the archive, reading newspapers one by one and taking careful notes. That two junior scholars, one a few measly years out of graduate school, are claiming to have identified 40,000 reprinted texts in a single year's work, and without setting foot in an actual newspaper archive, seems a lot like cheating.
If someone actually articulated such a caricatured version of our work (and I am deliberately overstating things here to cast a more subtle problem into sharper relief), I could quibble with the details of that caricature. I brought to the project a disciplinary understanding of reprinting that shaped the early development of the duplicate-detection algorithm. We used known sets of widely reprinted texts, typically drawn from the incredible work of book historians, bibliographers, and literary critics, to ensure we were capturing the reprints we would expect to find, as well as new reprintings. We continually tweak the algorithm based on what it fails to find. We're still not great, for instance, at identifying reprinted lyric poems, because such texts simply don't include enough 5-grams (sequences of five words) to be identified using our current algorithm. Working through such problems and perfecting our computational methods requires that we draw on literary-historical knowledge and literary-historical methods. Finally, I have spent a good deal of time in physical archives, actually reading newspapers.
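To make the lyric-poem problem concrete, here is a minimal sketch of 5-gram (shingle) matching; it illustrates the general technique of n-gram overlap, not the Viral Texts algorithm itself, and the function names are my own invention:

```python
def ngrams(text, n=5):
    """Return the set of overlapping word n-grams ("shingles") in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_ngrams(a, b, n=5):
    """Count the 5-grams two texts share: a crude signal of reprinting.

    A reprint-detection pass might flag a pair of newspaper pages when
    this count clears some threshold. Very short texts, like lyric
    poems, yield few 5-grams to begin with, so even a near-verbatim
    reprint can fall below the threshold -- the failure described above.
    """
    return len(ngrams(a, n) & ngrams(b, n))
```

Note that a text shorter than five words produces no 5-grams at all, and a single variant word in a short text removes several shingles at once, which is why brief, frequently varied texts are hard to catch with a threshold tuned for longer prose.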
But Mr. Penumbra's comments do get at a central methodological difference that I think is worth attending to more closely. Because Mr. Penumbra is right: perhaps not that Clay cheated, despite Clay's own concession to this charge, but that Clay's methodology for finding the Founder did not help him understand what he had accomplished. The pattern Clay uncovers in his visualization is "actually" embedded in codes, which are contained in the books the club members check out. The club members read the books (or perhaps more accurately, they study the books, which are not written to be read as narrative), decipher one part of the code, and then move on to the next book. Illuminating the entire pattern takes years of study, but along the way the club members are also initiated into the Unbroken Spine, the name of this monkish order of bibliophiles and code breakers. To become full members of the Unbroken Spine, these readers must understand the codes, which is to understand the books, which is to understand the Unbroken Spine's history and purpose, and so forth. By contrast, Clay does not read the books or crack the code within them. Instead he works with the Unbroken Spine's metadata, "not reading" the books but tracking the readers of those books. He comes to the correct conclusion, a fact Mr. Penumbra acknowledges with his "you did it all the same," by piggybacking on the Unbroken Spine members' years of difficult labor. And even after he has found the answer in his visualization, Clay does not understand the pieces that constitute that answer. He has looked into a few of the books, and knows they are a code, but he couldn't crack the code in even one of them if asked.
Of course, I read the Unbroken Spine throughout Mr. Penumbra as an unsubtle metaphor for humanities scholars, reading closely over many years in search of greater understanding: of a historical period, of a genre, of a social movement, &c. And this leads me to two ideas about computational work in literary history that this exchange in Mr. Penumbra suggests. First, I often comment that one of my favorite things about digital humanities projects is the way they "make good" on decades (or even better, centuries) of fastidious record-keeping, particularly in libraries. I get excited when a scholar figures out a way to visualize corpora using an obscure metadata field recorded by generations of librarians and largely ignored until: wow! I'm thinking here of my colleague Benjamin Schmidt's work visualizing American shipping from a data set compiled and translated through a century of new storage technologies and largely used in environmental research. These eureka moments excite me, but I can understand a more cynical reading, as the work of centuries and generations is distilled into a one-minute video.
Perhaps more to Mr. Penumbra's point, however, computational methods can reverse the usual order of scholarly investigation in the humanities. Had I gone into the physical archive to study reprinting, I would have read these reprinted texts as I identified them, along with a host of texts which were not reprinted. The act of discovery would have been simultaneously an act of understanding. I would spend years in archives reading, and would emerge ready to build an argument about both the form and content of nineteenth-century reprinting practices.
Computational approaches are often more exploratory, or perhaps screwmeneutical, at least at the beginning of projects. We begin with a big question (can we identify duplicated text within this huge corpus of unstructured text data?) and we try one approach, then another, in an attempt to answer it. We tweak this parameter and that to see what emerges. When something interesting happens we follow that line for a while. And new questions suggest themselves as we work.
But in our case, all that exploratory work preceded the bulk of the reading the project has required and will require. Of course we were checking our results along the way, reading this or that cluster of reprinted text to see if the method was working, but it wasn't until we'd isolated a substantial corpus of reprinted texts that the reading began in earnest. Now that we have identified 40,000+ reprints from before the Civil War, I'm spending significant time with those reprints, thinking about the genres that seem to be most widely reprinted, the ways these texts reflect (or don't) our ideas about literary production and popular culture in the antebellum period, and studying the ways individual texts changed as they were reprinted across the country. The project's research assistants are now annotating the text clusters: giving them titles, identifying authors, and assigning tags based on the topics, genres, and ideas reflected in each piece.
In many ways, then, our methodology decoupled the act of discovery from the act of understanding. We quite quickly uncovered patterns of reprinting in our corpus, and now that the algorithm works well we can even more quickly apply it to new corpora, as we are hoping to do in the near future. And we have been able to create some exciting and suggestive visualizations from those findings, visualizing reprints as signals of influence between publications, for instance, in a network graph. But really making sense of these findings will be the work of years, not days.
Ultimately, I think Mr. Penumbra's comments get at a central challenge for computational work in the humanities, both for those evaluating computational work from the outside and for those doing it. It seems clear to me how "distant reading" methods could seem like "cheating," bypassing some of the work and time typically required to analyze large swaths of historical or literary material using a machine: "I mean, a thousand computers, like that!" But of course, if the question at the heart of the analysis is good, and the methods uncover real and substantive results, they shouldn't be dismissed on essentially moral grounds: that the researchers didn't work hard enough. At the same time, those undertaking such projects should recognize when their methods do lead to gaps in understanding because they invert the typical order of humanities scholarship. In Clay's case, it is only after he creates his visualization of the Unbroken Spine's Founder (in other words, only after he solves the bigger puzzle) that he begins to understand the details of the group and its mission, and eventually to contribute to that mission. Perhaps this is a model for some DH projects, which tell a truth, but tell it slant. In my case, I am striving to be more transparent about both what we have learned and what we are still learning in the Viral Texts project. And even if the computational work stopped tomorrow, we would have far more to learn than we have yet learned. Understanding is always a little bit out of reach, whether or not you work with a computer.