The Scissors, the Paste-Pot, and the Large Language Model
The text below was developed over a few speaking engagements this past year, including UICB’s Annual Nancy Brownell Lecture on the History of the Book, a talk for Freie Universität’s Dahlem Humanities Center and the Bloomsbury Chapter Stevenson Lecture. This research and my previous post, “Toward a Bibliography of AI Systems,” are both part of new work seeking to apply bibliographical and book historical insights to generative AI, and will be developed further toward more formal publication.
I. Introduction
“Wanted—a printer says a contemporary. Wanted—a mechanical curiosity, with brain and fingers; a thing that will set up so many type a day—a machine that will think and act, but still a machine; a being who takes the most systematic and monotonous drudgery—yet one that the ingenuity of man has never supplanted, mechanically; that’s a printer.” This article, which imagines a printer as an intelligent machine, was reprinted in at least 91 newspapers across the United States between 1860 and 1892. In the US at this time, newspaper content was explicitly not protected by intellectual property law, and editors swapped papers for common use through what were called “exchanges,” taking advantage of favorable postal rates to trade their papers with each other for reciprocal use. Compositors filled daily, weekly, or more occasional issues with selections from the other newspapers on their exchange lists, as well as from the magazines, books, and other media that their editors sifted through for content. The hybridity of the nineteenth-century paper reflects its diverse sources. A piece like “Wanted—A Printer” jostled for attention on the same page as short or serialized fiction; poetry, alongside hard news; travel and general interest pieces; practical information and advice; jokes, anecdotes, and miscellaneous facts; domestic, philosophical, and spiritual advice; and opinion alongside sometimes trite, sometimes profound aphorisms. Like a twenty-first century social media feed, the newspaper was a single medium for nearly every genre of content, often presented without evident organization.
This text is typical of the selections we discover through the Viral Texts project, which I co-founded with David Smith at Northeastern University eight years ago and continue to work on at the University of Illinois. I will not belabor a discussion of our text mining methods today, but for those curious we link to several articles on our website at https://viraltexts.org that describe these methods in detail, as does chapter two of our book in progress, which is available in draft form from the University of Minnesota Press. In brief, we have developed a suite of tools for fuzzy duplicate-text detection to identify clusters of likely-reprinted texts in digitized collections of historical newspapers. Growing from early experiments on pre-Civil War American newspapers held in the Library of Congress’ Chronicling America collection, Viral Texts now mines data from more than 20 large-scale historical periodicals archives, including papers from North America, Europe, and Australia and from the nineteenth and early-twentieth centuries. We use this data to explore different facets of nineteenth-century periodical culture, from the popular genres and authors that primarily circulated through newspaper reprinting, to the topics that generated the most discourse in the paper, to the ways reprinting can help us understand the flow of information across the country in the period.
In this talk, I seek to connect this messy textual ecosystem to twenty-first century textual technologies such as large language models like ChatGPT. In a 2023 Book History article, Sarah Bull links the rise of “book-making” in the nineteenth century, as a category separate from authorship, to “the birth of the content generator” and, eventually, the ideas of textual content that underlie twenty-first century technologies like LLMs. By separating “two discrete categories of labor, the intellectual/creative and the bodily/mechanical,” Bull argues, nineteenth-century writers, editors, and publishers “popularized the idea that all text and anything that could be rendered into text was content—alienable filler for a medium, in this case, the book—and that this material could be generated pretty much automatically.”1 I will argue that the newspaper exchange system offers a similar analog, as a method of textual production founded on the selection, adaptation, and recontextualization of existing prose and poetry.
As most newspapers were produced in small, local print shops by a few compositors and editors, that text was processed by the very “machine that will think and act, but still a machine” cited in my opening example. If we consider the full network of papers that constituted the exchange system—and the millions of densely-composed pages being exchanged through it each year—the nineteenth-century newspaper constituted perhaps the largest text generation platform in human history, and it offers both lessons and warnings for our current discussions of large language models. Like LLMs, nineteenth-century newspapers blended original and unoriginal writing in ways that were difficult for their readers to immediately apprehend, and that reality led to similar—and warranted—anxieties about veracity, attribution, and reliability to what we face today.
II. Scissors & Paste Pots
Nineteenth-century newspaper editors frequently discussed their exchanges and the importance of reprinting to their medium. For instance, on June 6, 1888, the editor of Raleigh, North Carolina’s News & Observer praised the new woodcut adorning the front page of a neighboring paper from Durham. “We have the Daily Tobacco Plant,” they wrote, “with a beautifully engraved colored head piece in which appropriately appear the scissors and the paste pot and which makes it the handsomest daily in America, so far as we know, or for that matter the world.” Looking at the Daily Tobacco Plant, we see the details praised by the News & Observer: to the left, the editor’s ink-pot and quill sit atop a pile labeled “clippings”; the editor’s scissors run across the page, their handle looped through the capital “D” in “Daily”; while the paste-pot sits on the other side, waiting to paste up chosen clippings, so the paper’s compositors can set them in type for reprinting. By noting that these icons “so appropriately appear” as the headpiece for a daily newspaper, the News & Observer highlights the material reality of newspapers across much of the US during the period: they comprised primarily reprinted material from other newspapers, rather than original writing, and were lauded as much for canny selection as for literary skill.
In the Viral Texts Project, we take our website’s own head piece from an illustration that appeared in Harper’s Weekly in January of 1874, which depicts the editor of the fictional Podonk Weekly Bugle considering whether to accept two chickens as payment for a year’s subscription when, as the accompanying article notes, “payment of a little hard cash” would be more useful. While it is condescending to its subject—as the title Podonk Weekly Bugle immediately suggests—the article nonetheless praises the small, rural papers that made up the majority of the US newspaper network and were “the only means by which many of [their] readers receive intelligence of what is going on in the great world outside.” In the illustration, we see again the scissors and paste pot, on the editor’s desk, sitting atop clippings from other newspapers that he is considering for inclusion in the Bugle. In his garbage can are clippings that did not make the cut—pun intended—while behind the subscriber we see a compositor using a pasted-up clipping to set type to reprint a chosen selection.
In a popular selection that made the rounds in the 1860s, often called out as from “an exchange,” an “editor of a country newspaper” rebuffs charges of poverty by declaring, among other things, “We have a good office, a paste-pot.” Another selection provided a recipe for making paste, and noted “Next to scissors, paste is an invaluable editorial assistant” before praising the creator of this new recipe as “a Godfrey, a Franklin, a Fulton, a Davy, or a Morse.” While certainly some contemporaries criticized “scissors and paste” journalism, the frequency with which practitioners highlighted—even richly illustrated—the scissors and the paste pot signals they valued selection and aggregation as distinct editorial and literary practices. Reprinting was certainly widespread. Using our reprinting data from the Viral Texts project, we estimate that papers in the period averaged more than 50% reprinted content, with some considerably more or less, and this is a conservative estimate based only on detected reprints.
Nineteenth-century editors’ attitude toward text reuse is exemplified in a selection that circulated in the last decade of the century, though often abbreviated from the version I cite here, which insists that “an editor’s selections from his contemporaries” are “quite often the best test of his editorial ability, and that the function of his scissors are not merely to fill up vacant spaces, but to reproduce the brightest and best thoughts…from all sources at the editor’s command.” While noting that sloppy or lazy selection will produce “a stupid issue,” this piece claims that just as often “the editor opens his exchanges, and finds a feast for eyes, heart and soul…that his space is inadequate to contain.” This piece ends by insisting “a newspaper’s real value is not the amount of original matter it contains, but the average quality of all the matter appearing in its columns whether original or selected.” Looking at reprints of this piece, we see another key element of newspaper exchanges: texts were not only copied verbatim, but were modified as they circulated. Sometimes texts were shortened, and only particular sections circulated widely. Details were changed to fit a particular paper’s geographic location, audience, political stance, or special interests. Material might be added to contextualize or editorialize.
III. Large Language Models
We can see in the nineteenth-century newspaper exchanges a massive system for recycling and remediating culture. I do not wish to slip into hyperbole or anachronism, and will not claim historical newspapers as a precise analogue for twenty-first century AI or large language models. But it is striking how often metaphors drawn from earlier media appear in our attempts to understand and explain these new technologies. The most famous critique of LLMs, Bender et al’s “On the Dangers of Stochastic Parrots,” describes an LLM as “a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data.”2 Ted Chiang turns to more recent copying technologies, comparing ChatGPT to a model of Xerox photocopiers that stored data in a compressed JPEG format, which subtly blurred images—an effect magnified when copies were made of copies.3 A widely-shared and cited 2022 tweet from Cosmo Wenman—who describes himself as an artist and “open access activist”—looks even further back to characterize AI image synthesis as:
like movable type for visual concepts. It is going to facilitate a radical expansion and diffusion of the power to communicate with visual rhetoric unlike anything the world has ever seen.
Whether we are talking about ChatGPT for written language or something like Stable Diffusion for images, such arguments emphasize how these systems allow users to abstract an overall idea into discrete, atomized concepts that can be iteratively combined, like sorts from a typecase. To generate this image, for instance, we might imagine that I drew from my conceptual typecase the idea of “a woodcut,” “a robot,” “setting movable type,” and “in a historical print shop” to produce this hybrid image. The metaphor claims these systems make concepts, rather than the alphabet, almost infinitely recombinable.
Importantly, large language models do not work like simpler autocomplete algorithms, which suggest the next word in a string based on the probability of one word following another. First, LLMs also include “attention,” which means these models are trained to understand words’ place within much longer strings of text—what literary scholars might call context—so they can, as Stephen Wolfram explains, do things like “captur[e] the way, say, verbs can refer to nouns that appear many words before them in a sentence.”4 Second, as Ouyang et al show, models like ChatGPT have been trained on additional, human-derived metadata, such as that generated by labelers who evaluated the helpfulness (or unhelpfulness) of responses to their prompts.5 While in theory LLMs remix their source data more fully than newspaper editors cutting and pasting texts wholesale, we have seen in discussions of plagiarism that these systems’ “probabilistic information about how [words] combine” often leads them to combine words in precisely the way they were combined in a text from their training data. This effect is so strong that a team of researchers was able to query which in-copyright works are present in GPT4’s training data by asking ChatGPT to complete sentences from those works with character names the model could only know if that work was present in the training data.6
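For contrast, the simpler kind of autocomplete described above can be sketched in a few lines. This is only an illustration of next-word suggestion from word-pair counts, not of how an LLM works; the function names and the toy training sentence (loosely echoing “Wanted—A Printer”) are invented for the example.

```python
from collections import Counter, defaultdict

def train_bigram_counts(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1
    return following

def suggest_next(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Toy training text, invented for illustration only.
sample = "wanted a printer wanted a machine that will think wanted a printer"
counts = train_bigram_counts(sample)
print(suggest_next(counts, "a"))
```

Because such a model only ever looks one word back, it has no access to the longer context that attention provides, which is the distinction the paragraph above draws.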
This aspect of large language models points to a salient comparison with nineteenth-century newspaper reprinting, which is the fuzzy application of copyright and intellectual property to the medium. The content in US newspapers was not explicitly protected by copyright until 1909, though some publishers began affixing copyright notices to particular articles (or even whole issues) in the late nineteenth century. As Will Slauter demonstrates, however, most editors “understood that newspapers were mutually dependent” and did not seek to enforce charges of plagiarism against each other.7 While there was a general expectation (and sometimes social pressure) to cite source papers when selecting, those citations were uneven at best, and even citations were reprinted in ways that could be unreliable, such that a mistaken citation could get picked up by the exchange system and repeated. Drawing on our reprinting data from Viral Texts, we can say with confidence that 1. explicit citation was not as common as injunctions from editors might lead us to believe, 2. lack of citation is never a guarantee of originality, and 3. even explicit citation is often incorrect. Here we might think of the hallucinated citations often created by ChatGPT, which similarly “work” when verification will prove too time consuming.
The fuzziness of newspaper copyright, however, led to problems when material from media like books, which were protected by copyright, found their way into newspapers, where the dynamics of the exchange system would quickly lead to their wide distribution. As Meredith McGill has shown, authors such as Nathaniel Hawthorne and Charles Dickens complained loudly about being widely read and wildly popular in the newspapers without benefiting financially, while other authors like Fanny Fern sought to take advantage of reprinting to build a reputation they could capitalize on when selling books.8 The distributed nature of newspaper exchanges made it difficult to assert even clear-cut cases of plagiarism from books, so while many texts that began in books were reprinted in newspapers, few legal actions were taken. In 2024, the list of lawsuits against generative AI seems to grow daily, with organizations such as the Authors Guild and the New York Times arguing their protected works have been used without permission as training data and can be reflected back in the model’s generative output. AI companies and proponents argue that their technology is an example of “transformative use,” drawing on (among other things) a landmark 2015 finding in favor of Google Books, against the Authors Guild, which found “Google’s making of a digital copy to provide a search function is a transformative use, which augments public knowledge by making available information about Plaintiffs’ books without providing the public with a substantial substitute for matter protected by the Plaintiffs’ copyright interests in the original works or derivatives of them.”9
By characterizing Google Books—and, as AI defenders are now arguing, services like ChatGPT—as information architecture that “augments public knowledge,” these companies echo the arguments offered for newspaper exchange in the Post Office Act of 1792, which codified and underwrote the exchange system by allowing “every printer of newspapers” to “send one paper to each and every other printer of newspapers within the United States, free of postage;” made mailing newspapers cheap for other purposes; and made it illegal for postal employees to interfere with newspaper mailing. These laws were advocated by, among others, the US’ first Postmaster General, Benjamin Franklin, specifically to ensure that information—or, more accurately to the period, “intelligence”—would circulate widely.
The most important contrast between the nineteenth-century newspaper exchange system and twenty-first century tech systems is that their commercial contexts are inverted. The Post Office Act of 1792 ensured that the postal system would not be a commercially-driven enterprise. The exchange system allowed small, local publications to work collectively to gather and distribute information, and prevented government interference in that cooperative arrangement. In the late nineteenth century the growth of wire services and corporate consolidation of newspaper publication led to louder calls—and eventually successful lobbying—for explicit protection of newspaper content as intellectual property. In other words, the exchange system ended in large part because newspapers became primarily corporate rather than civic media, and journalism became both a profession and a category of authorship to be protected as it became a salable commodity.
By contrast, AI systems largely began as corporate products. AI scholar and game designer Mike Cook suggests that much of our collective anxiety about AI is not rooted in the technology itself but instead rooted in open access materials or protected intellectual property being repackaged for corporate profit, or being used to harm workers. Writing about responses to “an AI tool that could automatically colour in anime line art,” Cook argues, “what I saw…wasn’t that they didn’t like the technology itself, but rather that they were worried it would be used to put people out of work, because capitalism inevitably will see this as a route to increased profit” and become “a weapon against us.”10 The metaphor of generative AI as “movable type for concepts” recalls Elyse Graham’s writing on “The Printing Press as Metaphor,” which highlights the commercial imperatives underlying such rhetoric:
Another context that illuminates the marketplace value of the printing press metaphor is that of Silicon Valley. The distinctive incentive structures of Silicon Valley—start-up culture, venture capital financing, payment in stock options, IPOs—are geared toward the rhetoric of revolutionary change. For a company seeking investors to claim that the near future will be radically changed from the present, and that this change will be predicated on the autonomous force of changing technologies, implies that the future is manageable and claims ownership of that terrain. It recommends investors to the company and consumers to its products. To claim that technological change is gradual, iterative, unpredictable, and never fully complete, and that the future of technology will depend at least as much on social and cultural factors as on machinery viewed in isolation, would obviously be a weaker sell.11
In reality, generative AI is more like movable type where the precise location of different characters in the drawer is obscured, or setting type in a language you do not speak fluently. The resulting composition may turn out as you expect, but there may also have been words lost in translation—misheard, misspelled, or just missed. Of course, you may simply be reproducing—without knowledge or attribution—someone else’s work, or unknowingly perpetuating misinformation. While nineteenth-century editors sometimes commented on the veracity of the stories they selected and reprinted, they more often simply marked them as “going the rounds,” a signal of circulation rather than veracity, which was difficult to assure in an environment where texts could be and were adapted to every political, geographical, social, or rhetorical end. As we argue in the introduction to our book, which is titled Going the Rounds, that phrase “sometimes licensed editors to circulate material of dubious quality while washing their hands of responsibility for its veracity.” We might compare this to the term “hallucinations” used for AI systems in 2024. While the discussion of “hallucinations” does acknowledge some limitations for generative AI, the phrase also displaces responsibility by positing a temporarily discombobulated subjectivity rather than a flawed technical pipeline.
IV. Mixed Media Metaphors
In my own robot compositor example, no combination of “a woodcut,” “a robot,” “setting movable type,” and “in a historical print shop” produced anything close to what my imagination wanted, but instead generated vaguely woodcut-like robots in generic workshop settings. To produce my desired “robot compositor” image, I instead started with a historical woodcut as the seed image, which perhaps puts us closer to the realm of pastiche—or reprinting, with a difference—than “movable type for images.”
As Graham argued, this “gradual, iterative, unpredictable, and never fully complete” vision would certainly be “a weaker sell” than “movable type for concepts.” In many ways this image depicts the “machine that will think and act, but still a machine” imagined in the “Wanted—A Printer” selection with which I began this piece. It also bears an uncanny resemblance to “The New Steam Compositor” that appeared in Andrew W. Tuer’s 1884 book of printing jokes, Quads for Authors, Editors, and Devils, an illustration intended to satirize the rhetoric of technological progress in the late nineteenth century, and the ways that progress would in reality be felt by laborers.
While much of our attention in Viral Texts centers editors and writers, the compositor—the person who would reset chosen selections in type for each local newspaper—helps us draw a more nuanced line between the nineteenth-century print shop and the twenty-first century language model. In a selection from the Freeman’s Champion (Prairie City, Kansas) the compositor is described as an automaton who “must remember the impressions his eye caught” of the copy before him and “proceed mechanically to pick up the individual letters of which every word is composed” without distraction, a daily reality which takes a physical toll and “brings the compositor to a premature old age.” An article on “Fast Type Setting” in the Conservative newspaper of M’connelsville, Ohio recommends a series of physical habits that will help a compositor “not make any false motions” and perform their duty with robotic precision and speed. An earlier piece in the Arkansas Advocate (Little Rock, Arkansas Territory) outlines the “Miseries of a Compositor,” noting that “We hear a good deal of the miseries of editors” but far less “of their humble coadjutors, the compositors.” This piece insists, “The employment of a compositor is of a two-fold nature, mechanical and mental” and notes that the compositor “has a certain number of squares expected from him as a day’s work…whether his copy be clear or obscure, legible or illegible, punctuated or not.”
In contrast to many of the above examples, an obituary in the Port Tobacco Times for Charles W. Alcott, described as “a model compositor,” praises the fact “that when a piece of manuscript of poor chirography was handed him, he had the intelligence to discover its defects, and supply any omissions which the writer may have made,” going on to bemoan that most “compositors pursue their vocation as mere automatons—picking up type mechanically as it were—and never pausing to exercise the reasoning faculties.” In a widely reprinted joke, which appeared, among many other places, in the Opelousas Journal, editors are lampooned for thinking typesetting required no skill and could be done without compositors, a mistake revealed in the horrendous typesetting in the selection. Here, then, the compositor’s humanity and intelligence are praised, though as other pieces from the Benton Record (Benton, Montana Territory) and Morning Appeal (Carson City, NV) show, for many in the print industry the very phrase “intelligent compositor” was a joke and scapegoat. The “intelligent compositor,” in these tellings, is someone who exercises too much editorial discretion and corrects copy beyond what the author would wish or the text would warrant, to the point of spoiling a text’s meaning or a joke’s punchline by “fixing” regional dialect or an orthographic pun. In this reprinted joke from the Wheeling Daily Intelligencer, for instance, a woman fainted “dead away” reading “Babies are fashionable this season.” The mistake is “all the fault of the intelligence compositor,” who was supposed to set, “Rubies are fashionable this season.”
In both twenty-first century rhetoric about artificial intelligence and nineteenth-century jokes about printers or compositors, we can trace anxiety about—even contrasting desires for—the line between “the mechanical and the mental.” Nineteenth-century editors wanted compositors who worked tirelessly and automatically, but also compositors who exercised discretion, even ingenuity. If we want to draw a direct comparison between large language models and the print shop, the LLM is perhaps more a species of compositor than an editor, as its textual resetting is more granular, individual words and phrases rather than—usually, at least—whole snippets. Likewise if we cast a system like ChatGPT as a “mechanical” tool, we deflect or flatten questions about biases in its training data, intellectual property theft, or the ethics of its use in academic or professional writing. Ultimately, AI systems remix existing cultural artifacts, rather than producing new ideas. This is a common critique but also a means of locating these technologies within the continuum book historians and bibliographers study, and in fact places them in dialogue with a range of historical media that are also based in overlay, pastiche, and remix, such as commonplace books, scrapbooks, palimpsests, or of course newspaper reprinting. None of these practices are identical to the species of text reuse produced by an LLM, but the practices share family resemblances.
V. LLM Analyses
In this final section, I want to suggest a few ways the family resemblances between reprinted nineteenth-century newspaper texts and large language models might facilitate scholarly analysis, both of and with LLMs. Much has been written about the lackluster output of services like ChatGPT for student (and other) writing. As models designed to produce text probabilistically, based on existing text in their training data, language models tend to generate average or mundane prose, at least thus far. I want to suggest, however, that for historians, literary scholars, and other humanities researchers, this quality of LLMs can be a boon to tasks also based on normative patterns across large-scale historical newspaper collections, such as genre classification, textual segmentation, or topic identification. As Ted Underwood argues, “Historians are already in the habit of finding meaning in genres, nursery rhymes, folktale motifs, ruins, political trends, and other patterns that never had a single author with a clear purpose.” When considered not as objective representations of all language, but instead as highly contextual “models of culture,” LLMs become not unknowable oracles but instead bound, describable, and comparable cultural artifacts that we can both study and use.
At the core of many of the debates around AI models of culture—whether text, image, audio, or video—is anxiety about corporate control, proprietary software, and “black-box” systems. Such machine learning models are trained on massive amounts of data, but OpenAI and other companies typically do not disclose the exact composition of their models’ training data, making it difficult for users to ascertain the reliability, representativeness, or limitations of the models, or their relevance to different domains. That the companies building these models wish to obscure their provenance and functions, however, does not mean those things are unavailable for study and analysis. In a talk last year, I discussed a series of approaches based on bibliography and data archaeology for reverse engineering AI systems.
Throughout this talk I have suggested that the historical newspaper reprinting we study in Viral Texts might help illuminate LLMs, and in that earlier talk I discuss how the reprint detection methods we use for tracing historical reprinting offer an approach to reverse engineering an LLM like GPT. I will not repeat that information here, but it suggests that since LLM-generated prose is, as Bender et al argue, “stitched together” from existing prose, methods used to study historical text reuse can help us understand the seams in that stitching.
In newer experiments, I draw on the LLM not to generate new text, but instead to compare and annotate, two of the “basic functions common to scholarly activity across disciplines” that John Unsworth named “scholarly primitives.” These are both tasks of pattern recognition that might be productively automated. In the case of Viral Texts our initial reprint-detection methods are largely content-agnostic, seeking only to identify passages of text that are duplicated across pages. This generates many millions of prospective reprints, which we then sift through in various ways in search of meaningful texts and patterns which we can analyze with both computational and humanistic methods. In experiments with project RAs Jonathan D. Fitzgerald and Avery Blankenship, I have experimented with methods for computational genre analysis to identify texts useful for particular studies of poetry, or fiction, or news, and so on. The scale of our data makes a computational approach to genre necessary.
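To make “content-agnostic” duplicate detection concrete, here is a minimal sketch of one common approach: compare overlapping word sequences (“shingles”) between pages and flag pairs that share many of them. This is an assumption-laden illustration, not the Viral Texts pipeline itself; the function names, the 5-word window, the 0.2 threshold, and the sample pages are all hypothetical.

```python
import re

def shingles(text, n=5):
    """Return the set of overlapping n-word sequences in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(text_a, text_b, n=5):
    """Jaccard similarity of the two texts' n-word shingle sets."""
    a, b = shingles(text_a, n), shingles(text_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def likely_reprints(pages, threshold=0.2, n=5):
    """Return (id, id, score) for page pairs whose overlap meets the threshold."""
    ids = sorted(pages)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            score = overlap_score(pages[first], pages[second], n)
            if score >= threshold:
                pairs.append((first, second, round(score, 2)))
    return pairs

# Invented sample pages; paper_b imitates punctuation changes and an OCR-style misspelling.
pages = {
    "paper_a": "Wanted a printer says a contemporary a machine that will think and act but still a machine",
    "paper_b": "WANTED, a printer, says a contemporary; a machine that will think and act, but stil a machine",
    "paper_c": "Next to scissors paste is an invaluable editorial assistant in every country office",
}
pairs = likely_reprints(pages)
print(pairs)
```

Nothing in this comparison knows whether a passage is a poem, a joke, or an advertisement, which is why a detection pass of this kind yields so many candidates that must then be sifted. Comparing every pair of pages, as this sketch does, would not scale to millions of pages; it is shown only to make the idea legible.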
These experiments were largely supervised genre classification, where we train a model for a genre of interest by hand-tagging a corpus of texts from that genre. The model learns from the training data the linguistic or structural features of the genre, and when applied to an unknown text will assign a probability that the unknown text belongs to the trained genre. This is a well-established method for genre detection, but requires significant time and labor creating the training set and is difficult to adjust on the fly, as (for example) new genres come to light. While GPT-4 was not trained primarily on nineteenth-century newspaper writing, it was, I would argue, trained on similar, modern genres. And like the output of an LLM, the point of genre is to group similar texts together. Though scholars disagree about the precise boundaries, every genre label is an argument about patterns, similarities, and commonalities. In other words, the normative features of an LLM are precisely why it might work well for genre analysis at scale, at least so long as the genre categories are relatively broad.
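The supervised workflow described above (hand-tag examples, train, then assign a probability to an unknown text) can be sketched with a very small classifier. The text does not say which classifier the project used; the multinomial Naive Bayes below is a stand-in chosen for brevity, and the class name, training snippets, and labels are invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class GenreClassifier:
    """Tiny multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.word_counts = {}               # genre label -> Counter of word frequencies
        self.doc_counts = Counter(labels)   # genre label -> number of training texts
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def predict_proba(self, text):
        """Return a dict mapping each trained genre to a probability for `text`."""
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, counts in self.word_counts.items():
            denominator = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            log_scores[label] = score
        # Convert log scores to probabilities that sum to one.
        peak = max(log_scores.values())
        exponentiated = {l: math.exp(s - peak) for l, s in log_scores.items()}
        norm = sum(exponentiated.values())
        return {l: v / norm for l, v in exponentiated.items()}

# Invented, hand-tagged training snippets.
train_texts = [
    "the rose blooms sweet beneath the moon my heart doth sing",
    "o gentle night thy stars do weep and sing of love",
    "congress passed the bill and the senate will vote on tuesday",
    "the governor signed the railroad bill after the vote",
]
train_labels = ["poem", "poem", "news", "news"]

model = GenreClassifier().fit(train_texts, train_labels)
probs = model.predict_proba("the senate will vote on the bill")
print(max(probs, key=probs.get))
```

The sketch also shows why the method is slow to adapt: adding a new genre means hand-tagging new examples and retraining before the model can say anything about it.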
I wrote a script that prompts GPT4 to assign a genre to each text provided from a pre-determined list of possible genres.12 I use a pre-determined list to prevent it responding in a verbose way that would be difficult to work with computationally, and also to prevent too long a list of niche or subgenres, though it could be interesting to see what genres it would assign without such a constraint. It should be said that even with these constraints, the API occasionally returned a genre not in my list, or a response with additional prose (e.g. “the genre of the provided text would”) that was not requested.
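A constrained-label workflow of this kind might look roughly like the sketch below. It is not the script referenced above: the genre list is an assumption pieced together from labels mentioned elsewhere in this piece, the prompt wording is invented, and `ask_model` is a hypothetical stand-in for whatever function actually sends a prompt to a model and returns its reply. The point is the validation step, which handles the two failure modes just described (off-list genres and unrequested prose).

```python
import re

# Assumed genre list for illustration; the actual list used in the experiment is not given here.
GENRES = ["advertisement", "political", "advice", "poem",
          "religious", "humor", "literary", "news"]

def build_prompt(text, genres=GENRES):
    """Build a prompt that asks for exactly one genre from a fixed list."""
    options = ", ".join(genres)
    return (
        "Assign exactly one genre to the newspaper text below. "
        f"Answer with a single word from this list: {options}.\n\n"
        f"Text: {text}"
    )

def normalize_label(response, genres=GENRES):
    """Map a raw model response to a listed genre, or None if there is no unique match."""
    cleaned = response.strip().strip(".\"'").lower()
    if cleaned in genres:
        return cleaned
    # Response contains extra prose: accept it only if exactly one listed genre appears.
    words = set(re.findall(r"[a-z]+", cleaned))
    mentioned = [g for g in genres if g in words]
    return mentioned[0] if len(mentioned) == 1 else None

def label_text(text, ask_model, genres=GENRES):
    """Prompt the model (via the caller-supplied `ask_model`) and validate its answer."""
    return normalize_label(ask_model(build_prompt(text, genres)), genres)

# A fake model that answers verbosely, standing in for a real API call.
def fake_model(prompt):
    return "The genre of the provided text would be: Poem."

print(label_text("O gentle night, thy stars do weep", fake_model))
```

Returning `None` for anything that cannot be mapped to a single listed genre keeps the results computationally tractable while leaving a record of which texts need a second look.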
For the most part, however, in this experiment GPT-4’s labels tend to align with those a human researcher would likely make. Hand-checking a list of 300 random clusters assigned genres by GPT-4, I strongly disagreed with only 3 of its assignments, and I might have quibbled with another 6 or so: texts that crossed genres, whose labels I would not call wrong. As with standard genre classification experiments where you might ask more than one expert to label each text, to find points of agreement or divergence, I could imagine comparing human and LLM-derived labels, or even running multiple models and keeping labels only where they agree. Expanding the inquiry a bit, I used GPT-4 to label genre for the 1000 largest clusters in our Viral Texts data. Though we work to filter out advertisements, they inevitably leak through, so I am not surprised it was the biggest category. We might also not be surprised to see “political” in the second spot. From here, however, we see some of the unique features of nineteenth-century newspapers, as “advice,” “poem,” “religious,” “humor,” and “literary” occupy the next five positions, ahead even of “news.”
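The arithmetic behind the hand check, and the idea of keeping only labels on which several labelers agree, can be sketched in a few lines of Python. The function names are hypothetical, the figure of 6 quibbles is the approximate count given above, and the label lists at the end are invented placeholders, not project data.

```python
from collections import Counter

def agreement_rates(n_checked, n_strong_disagree, n_quibble):
    """Return (strict, lenient) agreement as fractions of the checked sample.

    Strict counts quibbles as disagreements; lenient counts them as agreements.
    """
    strict = (n_checked - n_strong_disagree - n_quibble) / n_checked
    lenient = (n_checked - n_strong_disagree) / n_checked
    return strict, lenient

def consensus_labels(label_sets):
    """Keep a label only where every labeler (human or model) assigned the same one.

    `label_sets` holds one list of labels per labeler, all in the same text order.
    """
    return [
        labels[0] if len(set(labels)) == 1 else None
        for labels in zip(*label_sets)
    ]

def genre_counts(labels):
    """Tally labels from most to least frequent, skipping unresolved (None) entries."""
    return Counter(label for label in labels if label is not None).most_common()

# 300 clusters checked, 3 strong disagreements, roughly 6 quibbles.
strict, lenient = agreement_rates(300, 3, 6)
print(f"strict: {strict:.1%}, lenient: {lenient:.1%}")

# Invented labels from two hypothetical labelers over the same three texts.
agreed = consensus_labels([["poem", "news", "humor"], ["poem", "advice", "humor"]])
print(genre_counts(agreed))
```

On these figures, agreement falls between 291 and 297 of 300 clusters, depending on whether cross-genre quibbles count as errors.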
I do think these experiments benefit from the generic similarities between newspapers and the web, and similar experiments on texts more divergent from LLMs’ training data would likely see less reliable results. Currently scholars are building large language models trained on period- or genre-specific data, which will be better suited to these kinds of tasks. But even without fine-tuning, these results are promising. Importantly, my goal with these experiments is not to prove a theory or build a model of genre. These labels will not contribute to claims about where the line between humor and anecdote actually can or should be drawn. In the nineteenth-century paper, anecdotes are often humorous, so this line is not easily determined. Instead, my aims are practical. We have an enormous well of data about nineteenth-century newspaper reprinting, out of which we often want to identify and more closely explore particular subsets. An initial pass labeling by genre would be an enormous boon to such exploration, allowing us, for instance, to more easily filter to a subset of clusters likely to be poems, fiction, jokes, and so on. These results will prompt me to experiment more with LLMs as an exploratory tool for large historical datasets. In particular, I hope to run similar experiments with models fine-tuned on historical newspaper data to see if they are both more reliable and more amenable to scholarly nuance. Comparison among models will be essential to this work moving forward, ideally among both corporate models and alternatives managed by non-profit or academic organizations.
Which brings me to another, similar experiment, which takes advantage of LLMs’ usefulness in summarizing provided texts, as Underwood has shown. The Viral Texts project is currently collaborating with Washington University’s Racial Violence Archive on “The Virality of Racial Terror in US Newspapers, 1863-1921,” which seeks “to trace the circulation of reports about anti-Black violence in US newspapers in the nineteenth and early twentieth centuries.” One of our initial goals is to develop methods to automatically identify whether stories in our historical newspaper data are likely to be about an incident of racial violence. In some ways this is a similar problem to genre detection, but it is complicated by the fact that many historical newspaper archives, such as Chronicling America, do not divide their data by article, but instead by page, which means a report about an incident of racial violence may be nested within a much larger text that includes many other topics and genres. We are approaching this problem from several methodological angles (most of which I cannot describe today), but in one very preliminary but intriguing experiment, I used GPT-4 to summarize newspaper pages, identifying any sections of text that seemed to report an incident of racial violence. In this case, I used pages from the New National Era, a newspaper published in Washington, D.C. by J. Sella Martin and Frederick Douglass in the 1870s, advocating for Reconstruction, reporting on the US Black community and its efforts toward equality, documenting continued prejudice and violence against the Black community, and sharing general news.
GPT-4 does a reasonably good job identifying passages within a larger text that may address incidents of racial violence. It identifies two such texts in the 24 April 1873 issue, for instance. It first highlights the article “Louisiana,” which describes “The brutal and cowardly massacre in Grant Parish of colored men by white men in obedience to the teachings of such negro-hating journals as the New York Tribune.” Elsewhere on that same page, the article “Shall the Doctrine be Universally Applied?” quotes the Boston Globe’s argument for retributive violence against Native Americans and asks whether “the whole rebel population” of the South will be similarly punished for their unrelenting violence against Black Americans, which the article lists only in part. In some ways GPT-4 is perhaps too broad in its interpretation of this task, selecting in other instances text a historian might categorize as discrimination, but not violence. Likely this particular task would benefit greatly from fine-tuning or a custom language model, as the subtleties of nineteenth-century newspaper language can make this information retrieval task more challenging for a model trained on the twenty-first-century internet. However, if the goal for this work is to process thousands or even millions of newspaper pages and identify smaller, discrete articles worthy of closer scholarly attention by a human expert, these results are, to my mind, promising.
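One way to turn page-level responses into rows for later expert review is sketched below in Python. The text does not say what output format the prompt requested, so everything here is an assumption: the sketch supposes the model was asked to reply with a JSON list of objects holding a `title` and a verbatim `quote`, and the function names and the `page_id` value are hypothetical. The one substantive design choice is to check each quoted passage against the page text, since a model may paraphrase or invent wording.

```python
import json
import re

def squash(text):
    """Collapse whitespace and lowercase so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def extract_flagged_passages(page_id, page_text, raw_response):
    """Turn one model reply into rows, one per flagged passage.

    Assumes (hypothetically) that the reply is a JSON list of objects with
    "title" and "quote" keys. Replies that do not parse yield no rows.
    """
    try:
        items = json.loads(raw_response)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    page_norm = squash(page_text)
    rows = []
    for item in items:
        if not isinstance(item, dict):
            continue
        quote = item.get("quote", "")
        if not isinstance(quote, str) or not quote.strip():
            continue
        rows.append({
            "page_id": page_id,
            "title": item.get("title", ""),
            "quote": quote,
            # True only if the quoted words really occur on the page.
            "found_on_page": squash(quote) in page_norm,
        })
    return rows

# A tiny stand-in for a page, with a line break inside the passage.
page = ("LOUISIANA. The brutal and cowardly massacre in Grant Parish\n"
        "of colored men by white men ... GENERAL NEWS. Other items follow.")
reply = json.dumps([
    {"title": "Louisiana",
     "quote": "The brutal and cowardly massacre in Grant Parish of colored men by white men"},
])
for row in extract_flagged_passages("example-page", page, reply):
    print(row["title"], row["found_on_page"])
```

Rows whose `found_on_page` value is false are not necessarily wrong, but they are the first ones a human reader should check against the page image.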
Conclusions
It is not lost on me that in this final experiment I am asking GPT-4 to act as a kind of exchanges editor: to comb through a digital pile of newspaper pages, identify and cut out texts of interest to my research, and paste them into a dataframe for further analysis. In comparing historical practices of unoriginal writing and contemporary AI, I do not seek to collapse historical distinctions or wave away concerns about the material and ethical implications of AI. Instead, I seek to cut through some of the marketing hype and position AI technologies as material-cultural phenomena which we can apprehend and critique—and teach our students to apprehend and critique—by adapting our existing qualitative and quantitative toolkits.
I would argue that our path forward with large language models and other AI models will rest in pedagogy. As communities of scholars and practitioners assemble more robust understandings of these tools, their training data, and the workings of their generative algorithms, we can help students dissect those same workings, evaluate the ways in which such tools might (or might not) assist their work, and advocate for needed changes and regulation. Looking toward longer histories of unoriginal writing can help contextualize debates that seem novel and thus unknowable. Generative AI is not quite movable type for concepts, but there is a link between these systems and the printing house we can use to cut through the hype.
1. Bull, Sarah. 2023. “Content Generation in the Age of Mechanical Reproduction.” Book History, 325.
2. Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. FAccT ’21. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922.
3. Chiang, Ted. 2023. “ChatGPT Is a Blurry JPEG of the Web.” The New Yorker, February 9, 2023. https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web.
4. Wolfram, Stephen. 2023. “What Is ChatGPT Doing … and Why Does It Work?” February 14, 2023. https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/.
5. Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” arXiv. https://doi.org/10.48550/arXiv.2203.02155.
6. Chang, Kent K., Mackenzie Cramer, Sandeep Soni, and David Bamman. 2023. “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4.” arXiv. https://doi.org/10.48550/arXiv.2305.00118.
7. Slauter, Will. 2019. Who Owns the News?: A History of Copyright. Stanford University Press.
8. McGill, Meredith L. 2007. American Literature and the Culture of Reprinting, 1834-1853. University of Pennsylvania Press.
9. Authors Guild v. Google, Inc. 2015. United States Court of Appeals for the Second Circuit.
10. Cook, Mike. 2024. “You Don’t Hate AI. You Hate Capitalism.” mike cook on cohost, May 11, 2024. https://cohost.org/mtrc/post/5864615-you-don-t-hate-ai-y.
11. Graham, Elyse. 2016. “The Printing Press as Metaphor.” Digital Humanities Quarterly 10 (3). http://www.digitalhumanities.org/dhq/vol/10/3/000264/000264.html.
12. Note that this script is pretty rough-and-ready, and assumes you are working with data in the format of our Viral Texts clusters. It also requires an OpenAI API key. I’ve added a spreadsheet of our larger clusters so readers can see the typical format and try to reproduce these results. Also note that the Sys.sleep commands in the loops here are important, as they introduce a pause between API requests, so that you don’t flood OpenAI’s servers and get kicked off.