On Ignoring Encoding

Lately we've seen a spate of articles castigating the digital humanities—perhaps most prominently, Adam Kirsch's piece in the New Republic, "Technology Is Taking Over English Departments: The False Promise of the Digital Humanities." I don't plan in this post to take on the genre or refute the criticisms of these pieces one by one; Ted Underwood and Glen Worthey have already made better global points than I could muster. My biggest complaint about the Kirsch piece—and the larger genre it exemplifies—would echo what many others have said: these pieces purport to critique a wide field in which their authors seem to have done very little reading. Also, as Roopika Risam notes, many of these pieces conflate "digital humanities" with the DH that happens in literary studies, leaving digital history, archeology, classics, art history, religious studies, and the many other fields that contribute to DH out of the narrative. In this way these critiques echo conversations happening within the DH community about its diverse genealogies, such as Tom Scheinfeldt's "The Dividends of Difference," Adeline Koh's "Niceness, Building, and Opening the Genealogy of the Digital Humanities," or Fiona M. Barnett's "The Brave Side of Digital Humanities."

Even taken as critiques of only digital literary studies, however, pieces such as Kirsch's problematically conflate "big data" or "distant reading" with "the digital humanities," seeing large-scale or corpus-level analysis as the primary activity of the field rather than one activity of the field, and explicitly excluding DH's traditions of encoding, archive building, and digital publication. I have worked and continue to work in both these DH traditions, and have been struck by how reliably one is recognized—to be denounced—while the other is ignored or disregarded. The formula for denouncing DH seems at this point well established, though the precise order of its elements sometimes shifts from piece to piece:

  1. Juxtapose Aiden and Michel's "culturomics" claims with the stark limitations of the Ngrams viewer.
  2. Cite Stephen Ramsay's "Who's in and Who's Out," specifically the line "Do you have to know how to code? I’m a tenured professor of digital humanities and I say 'yes.'" Bemoan the implications of this statement.
  3. Discuss Franco Moretti on "distant reading." Admit that Moretti is the most compelling of the DH writers, but remain dissatisfied with the prospects for distant reading.

These critiques are worth airing, though they're not particularly surprising—if only because the DH community has been debating these ideas in books, blog posts, and journal articles for a long while now. Matt Jockers' Macroanalysis alone could serve as a useful introduction to the contours of this debate within the field.

More problematically, however, by focusing on Ramsay and Moretti, these pieces ignore the field-constitutive work of scholars such as Julia Flanders, Bethany Nowviskie, and Susan Schreibman. This vision of DH is all Graphs, Maps, Trees and no Women Writers Project. All coding and no encoding.

When Kirsch gestures towards encoding, the gesture simply dismisses its importance or pertinence to the larger discussion of digital humanities. He claims, for instance,

Within this range of approaches, we can distinguish a minimalist and a maximalist understanding of digital humanities. On the one hand, it can be simply the application of computer technology to traditional scholarly functions, such as the editing of texts. An exemplary project of this kind is the Rossetti Archive created by Jerome McGann, an online repository of texts and images related to the career of Dante Gabriel Rossetti: this is essentially an open-ended, universally accessible scholarly edition. (my italics)

For Kirsch, digital humanities equals big data, so digital humanities work that's not about big data isn't digital humanities, but "simply" textual scholarship masking itself as digital humanities. In a few lines, Kirsch invokes and trivializes—through words such as "simply" and "essentially"—what is arguably the longest-standing and most influential thread of digital humanities' history in literary studies: the preservation, annotation, and representation of historical-literary works for the new medium of our time. Under the banner of "encoding," I mean to write not only of TEI markup, but of a wider range of practices that have focused on digital preservation and publication. Alongside the TEI, then, we might think of Neatline, which Bethany Nowviskie argues "was carefully designed by humanities scholars and DH practitioners to emphasize what we found most humanistic about interpretive scholarship, and most compelling about small data in a big data world." Even more recently, we might think of Andrew Stauffer's Book Traces project, which aims to crowdsource the identification of unique physical books "in danger of being discarded as libraries go digital," a project that would seem at odds with a purely techno-solutionist version of DH. And while I speak primarily from the archiving and encoding tradition in digital literary studies, I suspect archive building has occupied a similar primary space in the genealogy of digital history. I don't have the numbers for this, but I strongly suspect that far more hours of labor and even, yes, far more financial support have gone into encoding and archival projects than into data analysis over the past decades of DH history. Certainly many, many, many DH "origin stories" begin, "I got a job as a graduate student doing encoding for project X, Y, or Z."

Perhaps more importantly, however, this evolving, amorphous, decades-long work has significantly reshaped the horizons of literary-historical research for countless colleagues and students, both within and without DH. As Amy Earhart and others have shown, this collective project has not always opened the canon in the ways we might hope, and those engaged in this work must do more to make digital publication a space for the recovery of lost or underrepresented voices. We remain far from Jerome McGann's oft-rearticulated vision "that the whole of our cultural inheritance has to be recurated and reedited in digital forms." And as Lisa Spiro and Jane Segal show, the true impact of this digital archival work is often elided when our colleagues use digital archives for teaching and research—and use them they do—but cite those materials as if they visited the physical archive. Nevertheless, the very idea of archival access means something different today than it did a few short decades ago, and the work that produced this new reality is a primary foundation of the digital humanities. Moreover, decades of conversations and collaborations around archives and encoding led to the development of standards that resonate far beyond research universities. This work has not been "simply" the application of computer technology to traditional scholarly functions; something like the TEI is one of the best examples of humanistic scholarship applied to computer technology. If you believe that encoding is simply the mechanical application of tags to documents, I encourage you to attend a WWP seminar or workshop, where you will be swiftly disabused of that notion. A project like the Rossetti Archive is not simply or essentially "an open-ended, universally accessible scholarly edition"; it is "an open-ended, universally-accessible scholarly edition!!!!" which is a thing that did not exist before humanities computing/digital humanities. Now so many exist that we often, to our discredit, treat them as passé.

At an event here at Northeastern University last spring, Matthew Jockers and Julia Flanders were kind enough to stage a "debate" about scale and analysis in the digital humanities. The organizers of this symposium asked Julia and Matt to stage this exchange as a debate in large part to highlight what we saw as a false dichotomy between big and small data in DH work. Julia and Matt's conversation—which I still chastise myself for failing to record—was one of the best articulations I've seen or read of the two poles of inquiry between which much DH work proceeds. I simply cannot read this exchange and see a field unreflective about its methods or unaware of both its potential and its limitations. To sample only one exchange:

Exciting, indeed! And you can expect a call from me next Monday. . . But you know, it occurs to me that you and I have been drinking out of the same kool aid firehose for a good number of years. It might be worthwhile to pause here and acknowledge a few of the real challenges associated with this kind of work. I worry a lot, for example, about how even our big data corpora are still really small, at least when it comes to making claims about “Literature” with a capital “L.”
One thing we wrestle with at the WWP is the problem of what our collection really represents. Back when the project was first envisioned, we thought that we could actually capture all of the extant women’s writing in English before 1830, so representativeness wasn’t so much of a problem. But (I guess we should be glad) that turned out to be wildly wrong—there were orders of magnitude more eligible texts than we had imagined, far more than we’ll ever likely capture before the heat death of the universe at the rate we’re going.
So now, when we offer tools for text analysis that operate on the whole collection, we have the question of what this collection can actually tell us: about genre, about authorship, about periodization, about anything. It’s a mid-size collection, about 350 texts from a wide range of genres, topics, periods, etc., and clearly there’s some very useful information to be gained from studying it, but precisely what kinds of conclusions can one draw? I like very much Steve Ramsay’s idea that the point of such tools is to permit exploration, to pique our interest and prompt further discovery, but if we were to provide tools for statistical analysis, I think they could easily be misleading given the nature of the sample.
That said, I think representativeness is a very vexed question for any collection—even if one is acutely aware of the problem, as the corpus linguists are, it seems that the best one can do is be very, very transparent about one’s collection development strategy, and hope that the user reads the documentation. But both of these conditions seem fragile... and as text analysis tools become more novice-friendly, I think they’re more likely to be used in a novice way. So how do you handle this?
At some point during my work on the 19th century novel, I had to make a decision to quit collecting texts and start analyzing them. How I got to that point is another matter, but when I began the project I had 950 books and when I made that decision to quit collecting I had 4,700 books. I mined that data and I wrote the last two chapters of my book. About the time I was getting ready to submit the final manuscript, I discovered that there were not 4,700 books. There were actually 3,346. It turned out that the materials my colleagues and I had collected included many multi-volume novels that had not been stitched together and also a good number of duplicates that we had acquired from different sources. When I sorted this all out, I had 3,346 books, and I ended up having to completely rewrite those last two chapters.

This is not the DH that gets quoted in pieces like Kirsch's: not the scholar who analyzes thousands of books computationally and the scholar who encodes the minute details of individual texts, engaged in sincere and generative dialogue about the affordances and limitations of their respective approaches. Far easier to cite the field's most grandiose claims and be done with it. But this dialogue, too, is DH—and not a minor or marginal part of the field.

Textual encoding has never been as sexy as text analysis, at least for those looking at DH work from outside the field. In many ways, encoding inherited the stigma of scholarly editing, which has in English Departments long been treated as a lesser activity than critique—though critique depends on careful scholarly editing, as text analysis depends on digitization and encoding. You may find encoding or archival metadata development boring or pedantic—certainly some do—but you cannot pretend that encoding is less a part of the digital humanities than coding. Indeed, for me and many others, one of the earliest appeals of DH was that the field attempts to make more transparent the relationships among preservation, presentation, access, and interpretation. In short, any vision of digital humanities that excludes or dismisses the close and careful work of digital preservation, editing, and publication is simply false.