What Has the Digital Meant to American Periodicals Scholarship?

Note: this is a peer reviewed, uncopyedited post-print of an article that appears in American Periodicals 26.1 (2016), which is now available at Project Muse. It is part of a phenomenal forum on Digital Approaches to Periodical Studies that includes essential pieces (in order of appearance) by Elizabeth Hopwood, Benjamin Fagan, Kim Gallon, Jeffrey Drouin, and Amanda Gailey.

What has digitization meant for periodical studies, and what might it mean in the future? We should first consider how the digital archive changes notions of access, both political and practical. James Mussell notes that “the conditions that permitted newspapers and periodicals” to become the central medium of discourse in the nineteenth century—“their seriality, abundance, ephemerality, diversity, heterogeneity—posed problems for those who wanted to access their contents” in print forms.[1] The periodicals archive is vast and largely unindexed. In ways so basic and fully transformative that we easily overlook them, digitization and its attendant technology, keyword search, have already changed periodicals scholarship entirely, allowing researchers to easily identify topics of interest across swathes of newspapers, magazines, and related materials, and to just as easily incorporate those media as evidence for historical, literary, or other claims. As Ted Underwood reminds us, “[a]lgorithmic mining of large electronic databases has been quietly central to the humanities for two decades. We call this practice ‘search,’ but ‘search’ is a deceptively modest name for a complex technology that has come to play an evidentiary role in scholarship.”[2] Though other forms of computational analysis will certainly influence periodicals research in the near future, the most dramatic methodological shift has already happened.


Reprinting, Circulation, and the Network Author in Antebellum Newspapers

This is a pre-print version of this article. The final, edited version appears in American Literary History 27.3 (August 2015). An accompanying methods paper co-written by me, David Smith, and Abby Mullen can be found on the Viral Texts Project site.

I. Introduction[1]

When Louis F. Anderson took over the editorship of the Houma Ceres in 1856, he admitted that he was “not…very distinguished as a ‘knight of the gray goose quill,’” but assured his new readers that “our pen will not lead us into difficulty” because “our ‘principal assistant,’ the scissors, will be called into frequent requisition—believing as we do, that a good selection is always preferable to a bad editorial” (June 28, 1856).[2] Thus, Anderson sums up a set of attitudes toward the production, authorship, and circulation of newspaper content within a system founded on textual borrowing. In the antebellum US context, circulation often substituted for authorship; the authority of the newspaper rested on networks of information exchange that underlay its production. “Nothing but a newspaper can drop the same thought into a thousand minds at the same moment,” Alexis de Tocqueville writes, describing circulation as a technology—like the rail and telegraph—compressing space and time, linking individuals around the nation by “talk[ing] to you briefly every day of the common weal” (111). In both examples, the newspaper’s primary value stems from whom and how it connects.

Two (of Three) Ways of Looking at C19 Newspaper Exchange Networks

I wrote the following as part of my preparation for the second meeting of the NHC Summer Institute in Digital Textual Studies next week. The post assumes a modest working understanding of network graphs and their terminology. For a primer on humanities network analysis, see the links for my network analysis workshop or, more specifically, Scott Weingart’s ongoing series Demystifying Networks, beginning, appropriately enough, with his introduction, followed by his second post about degree and possibly his post on communities.

Introduction

In previous work in American Literary History, I argued that reprinted nineteenth-century newspaper selections should be considered as authored by the network of periodical exchanges. Such texts were assemblages, defined by circulation and mutability, that cannot cohere around a single, stable author. As part of this argument, I demonstrated how social network analysis (SNA) methods might employ large-scale data about reprinting to illuminate lines of influence among newspapers during the period. In that early network modeling, I represented individual newspapers from our reprinting data—at the time drawn primarily from the Library of Congress’ Chronicling America collection—as nodes, connected by edges that represented texts printed in common between papers. Those edges were weighted by frequency of shared reprints. The working assumptions behind those models were these: 1.) the fact that two newspapers reprint this or that text in common says very little about their relationship, or lack thereof, during the period, and 2.) that when two newspapers printed hundreds, thousands, or even tens of thousands of texts in common, this fact is a strong signal of a potential relationship between them.

A selection from a single cluster in the Viral Texts data. Each line represents a specific reprint from the larger cluster, which is identified by the ID in the first column. You can browse the cluster data I used for these experiments. These are themselves experimental clusters, generated by a new version of the reprint-detection algorithm, and are not yet suitable for formal publication.

Our data about reprinting in the Viral Texts Project is organized around “clusters”: these are, essentially, enumerative bibliographies of particular texts that circulated in nineteenth-century newspapers, derived computationally through a reprint detection algorithm that we describe more fully in previous publications.1 From these chronologically-ordered lists of witnesses, we derive network structures by tallying how often publications appear in the same clusters. When two publications appear together in a particular cluster, they are considered linked, with an edge of weight 1. Each subsequent time those same publications appear together in other clusters, the weight of their edge increases by 1; ten shared reprints results in a weight of 10, one hundred shared reprints in a weight of 100. Thus the final network data shows strong links between publications that often print the same texts and weaker links between publications that occasionally print the same texts.
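This tallying logic can be sketched in a few lines of base R; the cluster IDs and paper titles below are invented for illustration:

```r
# Toy sketch of the cluster-to-edge logic described above: each co-appearance
# of two papers in a cluster adds 1 to the weight of their shared edge.
# Cluster IDs and titles here are invented.
witnesses <- data.frame(
  cluster = c(1, 1, 2, 2, 3, 3),
  title   = c("Paper A", "Paper B", "Paper A", "Paper B", "Paper A", "Paper C")
)
# self-join on cluster to pair up co-appearing papers
pairs <- merge(witnesses, witnesses, by = "cluster")
# keep one (undirected) edge per pair of distinct papers
pairs <- pairs[pairs$title.x < pairs$title.y, ]
# tally shared clusters per pair: this count is the edge weight
edges <- aggregate(cluster ~ title.x + title.y, data = pairs, FUN = length)
names(edges)[3] <- "weight"
```

Here Paper A and Paper B co-appear in clusters 1 and 2, so their edge weight is 2, while Paper A and Paper C share only cluster 3, for a weight of 1.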

This method works reasonably well for ascertaining potential lines of influence among nineteenth-century periodicals. That two newspapers happen to share a few texts in common says very little: the mechanics of nineteenth-century reprinting were dynamic and varied enough that nearly any two newspapers were bound to occasionally print the same texts, whether or not those particular newspapers had any direct or near-removed relationship. The general network model we have used to this point does assume, however, that when two newspapers share hundreds, thousands, or even tens of thousands of texts in common, these alignments can be strong indicators that allow us to hypothesize a close relationship between them. However, we cannot conclude direct influence from this network model, if only because our population of newspapers is incomplete, even fragmentary: limited to those newspapers which have been digitized and whose data we can access. Our text mining draws on several thousand newspapers, but these represent a small fraction of the papers that would have been extant in the nineteenth century, which means there are potentially far more nodes missing from our data than nodes present. Thus we cannot draw firm conclusions directly from our network graphs; instead, we treat them as indicators which can direct more focused research. When we see a strong link between two papers in the model, this points us back to the archive to discern what the nature of that link might be: are these papers politically affiliated? Geographically close? Were their editors friends, or relatives?

This iterative process from the model to the archive and back has been enlightening and useful, though from the beginning it has been clear that refinements of our data and model are both possible and necessary. For instance, how does time play into these network models? The world of nineteenth-century newspapers was incredibly volatile, with new publications constantly appearing and old ones disappearing. In addition, even ongoing papers were constantly changing their names (sometimes due to mergers or schisms), swapping their political affiliations, and sometimes even moving locations. Which is to say, we would expect the dynamics of our network model to shift dramatically over the nineteenth century as particular papers waxed and waned in influence, or as relationships among those papers shifted.

Recently, then, I have begun working to generate more nuanced network data from our reprinting evidence, taking into account a range of variables that might influence how we understand the relationships among newspapers within clusters. These variables include textual overlap, time lag, and geographical distance, measured between each possible pair of reprints within each cluster. Considering these modifications in turn, we produce not one network model, but instead multiple models of nineteenth-century newspaper network relations that can be compared and contrasted. Each network variable used for our models incorporates specific domain knowledge about the operations of nineteenth century reprinting—in other words, insights drawn from literary-historical periodicals scholarship—toward a more subtle and specific SNA for understanding historical network operations. Of course, each of these models is a simplification of the phenomena it represents, honed on a particular facet of the newspaper exchanges at the expense of others. By considering the network through multiple facets we can gain a fuller—though never complete—picture of how nineteenth century newspaper reprinting evidences historical network relationships.

The experiments that follow are just that: experiments. These are not yet polished network analyses, but attempts to push against the limitations of our initial models in intellectually generative ways. I hope these experiments will be suggestive of ways the domain knowledge of literary historians might more forcefully shape network investigations. If I have made any grievous omissions or errors I will beg the indulgence of the blog medium and ask for suggestions toward improvement. In particular, I am sure there is literature in network science that treats these questions—though perhaps not in the historically inflected ways necessary here—and I would appreciate suggestions about where I should be looking for other models.

The Data

This investigation begins from the cluster data generated by the Viral Texts Project’s reprint-detection algorithm. In the early stages of the project, we used the Library of Congress’ Chronicling America newspaper archive, from which we discovered approximately 1.8 million clusters of reprinted texts, varying in size from clusters of 2 witnesses—which comprise a very long tail in the dataset—to those involving 100+ witnesses. These early results are presented in an alpha database and, perhaps more usefully for others interested in data mining or visualization, on the Viral Texts Github. More recently, however, as part of my work on an ACLS Digital Innovation Fellowship, we have expanded our corpora toward a wider, international investigation of nineteenth-century reprinting. Our current source corpora include Cornell University’s and the University of Michigan’s Making of America magazine and journal archives; ProQuest’s American Periodicals Series Online; Gale’s 19th Century U.S. Newspapers; the Australian National Library’s Trove Historical Newspapers; Gale’s 19th Century British Library Newspapers; and German-language newspapers from the State Libraries of Berlin, Munich, and Hamburg, as well as the Austrian National Library’s Austrian Newspapers Online.

This much expanded corpus results in much expanded datasets of reprint clusters: indeed, our current data on reprinting is orders of magnitude bigger than the clusters uncovered in Chronicling America alone. For these network experiments, however, I focus on a subset of that data related to a separate paper I’m writing about the influence of the early Scientific American in nineteenth-century periodical exchanges. From its founding in 1845 Scientific American was closely aligned with the newspapers, exchanging with them more frequently than most other magazines did. Perhaps more importantly, Scientific American exchanged in both directions with newspapers: it was both a frequent source of popularly-reprinted selections and a frequent reprinter of popular selections from other newspapers. Between its founding and the end of the nineteenth century, Scientific American reprinted at least 59,159 texts in common with its newspaper contemporaries. Interestingly, it was our early network models that led me to be interested in Scientific American in the first place. When we first began incorporating the magazines and journals from Making of America into our study, the majority of those publications clustered together quite closely in the resulting network graphs. The one exception was Scientific American, whose frequent and heavily-weighted edges with newspapers caused it to cluster with newspapers more strongly than with magazines and journals. Going into our computational analysis I was not thinking about Scientific American at all, but when I noticed its close affinity to our newspapers through reprinting I began looking at it more closely. I won’t write more about the literary-historical aspects of this investigation here, instead saving that for another article focused on both formal and topical affiliations between the early Scientific American and contemporary newspapers.

The experiments I describe below draw on the subset of 59,159 clusters that include Scientific American. Practically, this is a manageable, corpus-scale dataset that allows me to test ideas relatively quickly: refine, then iterate. Perhaps more importantly, however, this dataset creates an ego network, the term for a network focused on a particular node. With ego network data we can expect one constant across the graphs we generate: given the exclusive focus on clusters that include Scientific American, we can expect Scientific American to be the node with the highest degree and centrality in each network graph, however we modify our calculations of weight between nodes. Though Scientific American’s precise measurements will change as we adjust the edge weights between nodes to account for lag, textual similarity, or geographic distance, the network statistics derived for other nodes will change more drastically, yielding three distinct graphs that can be usefully compared and contrasted.

From Clusters to Edges

To move from cluster data—essentially computationally-derived enumerative bibliographies, in which the details of each observed reprinting are listed on a separate line—to network edge data—in which each line lists a potential alignment of newspaper pairs, based on a shared text—requires some processing, which I typically do using the R programming language. The first few steps of this process are the same for two of the three investigations (and mostly the same, save one additional step described below, for the third):

library(dplyr)
#import a folder of CSV files into one data frame
files <- dir("./", pattern = "\\.csv$", full.names = TRUE)
SciAm <- do.call(rbind, lapply(files, read.csv))
#Ensure that R reads the date field as a date rather than a text string
SciAm$date <- as.Date(SciAm$date, "%Y-%m-%d")
#in the data Scientific American is sometimes represented with an extra space; this fixes it
SciAm$title[grepl("Scientific American", SciAm$title)] <- "Scientific American"

#extracting network and text data
SciAmSimple <- select(SciAm, cluster, date, title, series, text)
SciAmPairs <- full_join(SciAmSimple, SciAmSimple, by = "cluster")
#select only edges moving between an older and newer reprinting
SciAmDirected <- filter(SciAmPairs, date.x < date.y)

I am still transitioning to R, and no doubt someone more experienced with the language could condense these operations to 1-2 lines. However, these modest steps accomplish quite a lot. First, our reprint-detection algorithm does not export a single CSV (Comma Separated Value) file of cluster data, partly because such a file would be too large to work with easily. In this case, however, I have already filtered my data once, so that it comprises only clusters that include Scientific American. These clusters can be read into R as a single data frame. The first few lines of this code, then, cycle through the folder containing the cluster CSVs and read them into one data frame. The next few lines clean up the data a bit, as the comments in the code block above explain.

The next three steps move us from essentially bibliographic data—that is, cluster data organized as separate lines for each observed reprint—toward network data—that is, data organized around relationships. To start this process, we simplify the larger data frame to just those columns necessary either for establishing which reprints indicate potential network relations or for nuancing those relationships. These are, briefly:

  • cluster: the ID number of each reprint cluster, assigned by the reprint detection algorithm. We will use this column to determine which reprints listed here should be considered edges, or potential links, between newspapers.
  • date: the date of each individual reprint identified. We will use this column to determine how much time passed between each pair of shared reprints in the data set.
  • title: the newspaper title that printed each identified reprint. These are human readable and will eventually label the nodes in our network data.
  • series: a unique identifier for each publication in our data. In most cases these are drawn from the metadata provided by the archive from which the publication was accessed, but in a few cases we had to create identifiers for publications in archives without clear IDs in the metadata. We will use these IDs to join our reprint data with a gazetteer created by Viral Texts RA Thanasis Kinias, in order to determine the geographic location of each reprint.
  • text: the OCR text data of each identified reprint. We will use this to determine the edit distance between pairs of reprinted texts.
A small example of pairwise data generated by the code above. Each line represents one pair of shared texts between publications. With a few column name changes, such reprinting data becomes network data, with source and target nodes.

The full_join in our code joins our cluster data to itself using the cluster column, which means that it creates a data frame with one line for every potential pair of reprints within each cluster: in other words, one line for each potential edge in our network. Each of these lines includes all the columns in the data set for both of the two reprints it now represents; we have created a wide representation of our clusters, a process that dramatically expands the data frame’s width and length. A cluster of only 5 reprints, for instance, can be paired in 20 unique ways (1-2, 1-3, 1-4, 1-5, 2-1, 2-3, 2-4, 2-5, 3-1, 3-2, 3-4, 3-5, 4-1, 4-2, 4-3, 4-5, 5-1, 5-2, 5-3, 5-4), and many of our clusters are significantly larger than 5. The final step in the code above helps pare down the data set a bit and is rooted in the historical phenomenon we are modeling: exchanges among nineteenth-century newspapers. If our model must assume that a shared text between two newspapers is a signal of potential influence, however small, we can also assume such influence only runs forward in time. That is, we likely should not assume that the Daily Dispatch printing a text before the Sunbury American indicates any influence, however small, from the American to the Dispatch, but we might assume the reverse, particularly if Dispatch —> American proves to be a trend, as we will investigate below. The final line of code above thus filters the data to only those lines in which the date of the first reprint is less than—or, earlier in time than—the date of the second reprint.
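To make the arithmetic concrete, here is a minimal base-R sketch of the self-join and date filter on an invented five-witness cluster; merge stands in for dplyr’s full_join:

```r
# A toy five-witness cluster with distinct dates (titles and dates invented)
toy <- data.frame(
  cluster = 1,
  title = paste0("Paper ", 1:5),
  date = as.Date("1847-04-10") + 0:4
)
# self-join: every ordered pair, including self-pairs (5 x 5 = 25 rows)
pairs <- merge(toy, toy, by = "cluster")
# directed filter: keep only earlier -> later pairs
directed <- pairs[pairs$date.x < pairs$date.y, ]
nrow(pairs)     # 25
nrow(directed)  # 10 (half of the 20 distinct ordered pairs)
```

With all five dates distinct, the date filter keeps exactly one direction for each of the 10 unordered pairs, discarding the self-pairs and the backward-in-time duplicates.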

From here, we could easily use R to tally raw weight for each edge.

SciAmEdges <- SciAmDirected %>% group_by(title.x,title.y) %>%
  summarize(weight = n())

In short, we would combine each line that lists a given combination of two titles into a single line, tallying a new weight column that increases by 1 for each observation of that combination. So, if in our data frame above there are 42 instances in which the Dispatch printed a text that the American later reprinted, this would result in a single line in our new data frame:

Source | Target | Weight
Daily Dispatch | Sunbury American | 42

These raw weights give us a baseline from which to test the different lenses I will describe below. One practical reason to strive for better optics, however, is that our data often results in the hairiest of “hairball” network graphs.

An unfiltered detail of the Scientific American ego network graph. The density of edges makes it difficult to read: this is, in network parlance, a “hairball.”

Because our data is drawn from found connections between papers, our graphs tend to be quite densely connected, even more densely than I expected going into the Viral Texts Project. That is, I expected to find more distinct communities sharing texts that did not circulate more widely; in contrast, we are finding that reprinted texts quite often diffused across the whole exchange system.

Indeed, though we can generate visual graphs in Gephi using our modified network outputs below, we will ultimately be most interested in the modified weight values themselves, and in how those modifications change the network statistics for our nodes. Which nodes appear to be most central (or have the highest degree, etc.) when we adjust our weights by time lag, and how do those compare with the nodes that appear to be most central when we adjust our weights by geographic distance, or by the edit distance between the versions of the texts they published? Can we triangulate from these three models of our network to discern the links that truly seem indicative of historical connections rather than data artifacts?

Weighing by Publication Lag

"A Curious Calculation," as it appeared in the Sunbury American and Shamokin Journal (15 May 1847).
“A Curious Calculation,” as it appeared in the Sunbury American and Shamokin Journal (15 May 1847).

While texts circulated among newspapers over many years in the nineteenth century—indeed, sometimes for decades—they were not typically reprinted steadily throughout those years. Instead, we observe peaks of circulation, fallowness, and recirculation as texts moved in and out of the exchange network. Consider cluster 5701252, a scientific-religious tidbit that tries to explain just how big a billion truly is: “For to count a billion, [Adam, having started when created] would require 9,512 years, 34 days, 5 hours, and 20 minutes.” This piece was first printed in Scientific American on April 10, 1847 and reprinted in at least 175 other publications around the world through at least December 22, 1899. If we visualize those reprintings through a histogram, we can see the cycles of its publication over time: the largest peak following its initial publication, another around 1855, and another (following a gap of several years) around 1874.

Reprints of Cluster 5701252, or “A Curious Calculation,” plotted over time.
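A histogram like the figure above can be generated with base R’s hist; the dates below are simulated stand-ins for a real cluster’s witnesses:

```r
# Sketch of plotting reprintings over time; dates are simulated stand-ins
# for the witnesses of a real cluster
set.seed(1847)
dates <- as.Date("1847-04-10") + c(
  sample(0:365, 60, replace = TRUE),     # the large initial wave
  sample(2900:3300, 25, replace = TRUE), # a revival around 1855
  sample(9800:10200, 15, replace = TRUE) # another around 1874
)
h <- hist(dates, breaks = "years", freq = TRUE,
          main = "Reprints over time (simulated)", xlab = "Year")
```

In practice the dates vector would be drawn from the date column of a single cluster’s rows in the data frame built earlier.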

This is a typical pattern for the reprints we have identified: cycles of attention and inattention as a given text moved through exchanges, was forgotten, and was then revived by an editor years later, to echo again through the exchanges, sometimes as if a new text altogether and sometimes with a memory of its earlier circulation. Ideally, we would want to account for these temporal clusterings in our network models, treating jointly reprinted texts nearby in time as stronger potential signals of connections among newspapers than texts reprinted at a long temporal distance. The latter might yet be signals of connection, but our understanding of nineteenth century exchange dynamics suggests it is less likely to be so than those shared texts nearer in time.

In other words, we want to model the network to account for those likelihoods without entirely discounting connections that span larger time gaps. We should not discard edges with a long time lag, but we should treat them as less important individually (though they might yet be important in aggregate). To do this, we can use the following:

SciAmDirected$lag <- SciAmDirected$date.y - SciAmDirected$date.x
SciAmDirected$lagWeight <- 1 / as.numeric(SciAmDirected$lag)
SciAmEdges <- SciAmDirected %>% group_by(title.x,title.y) %>%
  summarize(meanLag = mean(lag), Weight = sum(lagWeight), rawWeight = n())

Here, rather than each shared reprint between two publications increasing the weight of their shared edge by 1, it increases by 1 divided by the lag—measured in days—between each text’s publication in its source and target papers. In other words, the longer the lag in any given pair, the smaller the weight increase for the overall connection. If two publications share a great many texts, even despite a large lag in most instances, the weight of their shared edge will still increase, and the signal of a potential connection will remain. However, it will increase less than it would were the same publications frequently reprinting texts in common near the same time. When we sum up all the edges to create this network model, we retain both a raw weight—the total number of shared reprints for any two publications—as well as a weight adjusted for lag. Looking at the strongest twenty edges as sorted by raw and lag-adjusted weights, we can see how this shifts our view of the network.
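As a quick worked example of this adjustment, consider two papers sharing three texts, reprinted at lags of 1, 10, and 100 days (values invented):

```r
# Worked example of the lag adjustment: three shared texts at lags of
# 1, 10, and 100 days (invented values)
lags <- c(1, 10, 100)
rawWeight <- length(lags)   # 3: each shared text counts fully
lagWeight <- sum(1 / lags)  # 1.11: long-lag pairs contribute much less
```

The raw weight of 3 becomes a lag-adjusted weight of 1.11: the long-lag pair still contributes, but only one hundredth as much as a next-day reprinting, which is exactly the behavior described above.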

As nodes in the network this process produces, these publications will also score higher on network measures such as degree and centrality. When we factor in publication lag, we might note that publications in New York and New England occupy a good many of the top connections with the New York-based Scientific American: though not all, as the tantalizing (for different reasons) examples of the Milwaukee Weekly Sentinel and Sydney Morning Herald (Australia) show. In the latter case, we can see a stark difference between the raw weight (3967 shared texts in the data set) and the lag-adjusted weight (71.2152435), thanks to a mean lag of 815.18 days between texts being published in Scientific American and appearing in the Sydney Morning Herald. Nonetheless, these publications share so many texts that their connection bears out even when we adjust for lag, which to my mind is a strong indicator of a more than casual link between these two publications that demands further study.

When weighing by lag, the following newspapers have the highest degree, which measures all incoming and outgoing edges for each node in the network:

Publication Degree
Scientific American 1034
Milwaukee Weekly Sentinel 702
The Cleveland Morning Herald 565
The Universal Gazette 561
Vermont Chronicle 552
Bangor Weekly Courier 495
The Vermont Freeman 489
Evening Bulletin 488
New York Evangelist 484
Maine Farmer 477

When sorted by out degree, the picture changes slightly. Out degree measures only outgoing links, which in this context might signal papers that were frequent sources, rather than receivers, of reprinted texts:

Publication Out Degree
Scientific American 542
Milwaukee Weekly Sentinel 395
The Universal Gazette 338
The Cleveland Morning Herald 334
North American and United States Gazette 300
Vermont Chronicle 295
Bangor Weekly Courier 291
New York Evangelist 270
The Weekly Star 270
The Vermont Freeman 261

To cite one more metric, we might look at betweenness centrality, which measures how frequently a node appears on the shortest path between the other nodes in the network. The top nodes based on this measure are:

Publication Betweenness Centrality
Scientific American 226084.5378
Milwaukee Weekly Sentinel 47904.08621
Evening Bulletin 39383.88192
The Universal Gazette 33650.84385
The Sydney Morning Herald 28940.72233
The Weekly Star 27086.2622
Vermont Chronicle 25770.31755
Boston Investigator 19331.98607
The Wheeling Daily Intelligencer 19040.22511
The Cleveland Morning Herald 17782.64068
Yorkville Enquirer 17640.67485

Were this a full network analysis, we might dig further into these statistics to ascertain why these publications score as they do. We might investigate, for example, whether papers such as the Sydney Morning Herald or Yorkville Enquirer are hubs connecting Australian, US, and/or UK newspapers in our corpora, which might account for their high betweenness scores. For these preliminary experiments, though, we might instead ask how these measurements for a network privileging temporal lag compare with the same measurements for a network that privileges another factor, such as geographic distance.

The Scientific American ego network, adjusted for lag, results in more separation and an easier-to-read visualization.
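For readers following along in R, degree and out degree can be tallied directly from an edge list like the ones produced above; betweenness centrality requires shortest-path computation and is better left to a dedicated package such as igraph, or to Gephi. A base-R sketch with an invented edge list:

```r
# Tallying degree and out degree from a toy directed edge list
# (node names invented; each row is one source -> target edge)
edges <- data.frame(
  source = c("A", "A", "B", "C"),
  target = c("B", "C", "C", "A")
)
outDegree <- table(edges$source)               # outgoing edges per node
degree <- table(c(edges$source, edges$target)) # all incident edges per node
```

Here node A has out degree 2 (edges to B and C) and total degree 3 (those two outgoing edges plus the incoming edge from C).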

Weighing by Geographical Distance

In some ways, geographic distance has not been as limiting a factor as I expected when we began work on the Viral Texts Project. Newspaper texts circulated around the globe, and far more quickly than I would have anticipated. The “Curious Calculation” article described above was first printed in Scientific American on April 10, 1847, appeared in a number of UK papers as early as July 3, and appeared in Bell’s Life in Sydney and Sporting Reviewer—the earliest Australian reprinting we have identified—on December 4. These are certainly longer spans than we are accustomed to in the internet age, but given the miles and leagues these texts had to travel in the mid-nineteenth century these speeds are impressive.

Nevertheless we do frequently see in our network graphs, as we would expect, more frequent and stronger connections between newspapers nearby geographically than those farther dispersed. In this experiment I privilege and even accentuate those effects by weighing the graph’s edges by physical distance. Here is the pertinent (and no doubt painfully messy) code:

library(dplyr)
library(geosphere)

SciAmSimpleGeo <- select(SciAm, cluster, date, series, title)
#import gazetteer file
gazetteer <- select(read.csv(file = "./geoData/vgaz_out_sorted.csv", header = TRUE), series, latitude, longitude)
#join gazetteer with clusters
SciAmGeo <- left_join(SciAmSimpleGeo, gazetteer, by = "series")
#Temporary: omit lines with missing lat/long data
SciAmGeo_complete <- na.omit(SciAmGeo)

#create pairwise data with lat/longs
SciAmGeoPairs <- full_join(SciAmGeo_complete, SciAmGeo_complete, by = "cluster")
#select only edges moving between an older and newer reprinting
SciAmGeoDirected <- filter(SciAmGeoPairs, date.x < date.y)

#calculate geographical distance between papers in each edge
SciAmGeoDirected$edgeDist <- distHaversine(matrix(c(SciAmGeoDirected$longitude.x, SciAmGeoDirected$latitude.x), ncol = 2), matrix(c(SciAmGeoDirected$longitude.y, SciAmGeoDirected$latitude.y), ncol = 2))
#An edgeDist value of 0 (as when 2 papers are printed in the same city) confuses the calculation, so we need to replace those 0 values with 1 so their weights will be unaffected by distance in the adjusted calculation
SciAmGeoDirected$edgeDist[SciAmGeoDirected$edgeDist==0]<-1

#adjust weight for distance
SciAmGeoDirected$distWeight <- 1 / as.numeric(SciAmGeoDirected$edgeDist)
SciAmGeoEdges <- SciAmGeoDirected %>% group_by(title.x,title.y) %>%
  summarize(meanDist = mean(edgeDist), distWeight = sum(distWeight), rawWeight = n())
SciAmGeoEdges$distEffect <- SciAmGeoEdges$rawWeight / SciAmGeoEdges$distWeight

I won’t belabor my explanation this time, as most of these steps echo those above. There are important differences worth explaining, however. First, this code makes use of a gazetteer prepared by Viral Texts RA Thanasis Kinias, which includes the latitude and longitude of most publications in our study (we are currently identifying and adding the few that are missing). I merge that gazetteer with the cluster data by publication IDs, so that each line in the data frame includes the geographic location of the reprinting it describes. Next, this code makes use of the geosphere R library to calculate the physical distance between each pair of reprintings. This distance is calculated as the crow flies, and so is a rather blunt measure. A more sophisticated version of this experiment might attempt to incorporate what we know about postal roads, railroads, or other communications technologies, but for now a raw measure of distance will suffice. As we did with lag in the first experiment, we modify the weight of each edge, dividing the raw weight of 1 for each instance of a given pairing by the geographic distance between the edge’s two publications.
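To make that adjustment concrete, here is a minimal sketch in base R. The haversine formula is implemented by hand (geosphere’s distHaversine performs the same calculation with the same default earth radius), and the coordinates for New York and Boston are approximate; the final line shows a single reprinting’s raw weight of 1 being divided by the distance between its two publications:

```r
# Haversine distance in meters between two points given as lon/lat in degrees,
# using the same earth radius (6378137 m) that geosphere's distHaversine assumes
haversine <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  toRad <- pi / 180
  dlat <- (lat2 - lat1) * toRad
  dlon <- (lon2 - lon1) * toRad
  a <- sin(dlat / 2)^2 + cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# Approximate coordinates: New York City and Boston
edgeDist <- haversine(-74.01, 40.71, -71.06, 42.36)  # roughly 300 kilometers, in meters
distWeight <- 1 / edgeDist  # one reprinting's raw weight of 1, adjusted for distance
```

An edge between two papers printed in the same city would have a distance of 0, which is why the full script above replaces zero distances with 1 before dividing.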

The resulting network weighed by distance looks quite different from that produced by privileging lag, and its statistics are likewise distinct. In quick succession, here are the top nodes by degree, out degree, and betweenness centrality:

Publication | Degree
Scientific American | 1673
Boston Investigator | 1516
Manchester Times | 1400
New York Evangelist | 1380
The Universal Gazette | 1373
Vermont Chronicle | 1348
Preston Chronicle | 1342
Trewman’s Exeter Flying Post | 1335
Hampshire/Portsmouth Telegraph | 1328
Bristol Mercury | 1327

Publication | Out Degree
The Universal Gazette | 806
Scientific American | 748
Vermont Chronicle | 741
Bristol Mercury | 721
Liverpool Mercury | 716
Raleigh Register, and North-Carolina Weekly Advertiser | 707
Caledonian Mercury | 705
Boston Investigator | 701
Manchester Times | 698
Preston Chronicle | 696

Publication | Betweenness Centrality
Scientific American | 79946.95562
The Universal Gazette | 48036.30927
Hampshire/Portsmouth Telegraph | 34428.81814
Aberdeen Journal | 34312.00833
Boston Investigator | 26266.5412
Liverpool Mercury | 22961.1539
Caledonian Mercury | 19717.94059
Vermont Chronicle | 19436.25648
Bristol Mercury | 18786.61772
Trewman’s Exeter Flying Post | 18635.27305

Immediately apparent is the spread of this graph: the network is less tightly clustered than previous iterations, separating (as we would expect) into more distinct geographic groups that mostly align with the different, largely nationally organized, corpora from which we are drawing in the Viral Texts Project. What perhaps becomes most interesting, then, are those nodes that sit between those geographic communities, which we might hypothesize served as hubs for the international exchange of information. The Aberdeen Journal, for instance, is grouped into a community with other UK newspapers using Gephi’s modularity algorithm, but it also has quite strong ties (despite the long distances) with Australian papers, and clusters near them in the graph. Perhaps unsurprisingly, then, the Aberdeen Journal has one of the highest betweenness centrality measures in the graph, as it may have served as a hub for the movement of texts between the UK and Australian newspaper communities. We would need to do more work with the paper itself to substantiate this hypothesis (or disprove it), but I offer this as one example of how network models point toward new literary-historical research questions.

If our goal is to triangulate among the different network models in our experiments here, we might focus our attention on those publications that appear on both lists: e.g. the Boston Investigator, the New York Evangelist, or the Vermont Chronicle. Certainly publications important in both graphs have a higher likelihood of being important to our overall understanding of the relationships they represent. However, we might just as easily ask about the differences. What does it mean when a publication appears as important in a graph weighted by time and less important in a graph weighted by geographic space? These kinds of questions should guide our refinements to these methods in the coming months.

Weighing by Approximate String Matching

This final investigation is still in progress. I will write a followup post when I have something more concrete to say about how string matching—a kind of computational bibliographic exercise, of collation—might nuance our calculations of weight as lag and distance do above. As a preview, in this experiment I will be estimating the distance between our observed reprinted texts using Levenshtein distance as implemented by the R package stringdist.
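Though that followup deserves its own post, base R can already sketch the idea: adist computes the same Levenshtein (edit) distance that stringdist implements, counting the single-character insertions, deletions, and substitutions separating two witnesses of a text. The weighting scheme below, which divides a raw weight by one plus the edit distance, is only a hypothetical placeholder for whatever calculation that experiment settles on:

```r
# Two witnesses of the same refrain: a clean transcription and a noisy OCR rendering
clean <- "Quoth the Raven, 'Nevermore.'"
ocr   <- "Q i-jtb the Raven, 'Nevermore'"

# Levenshtein distance via base R's adist(); stringdist::stringdist(clean, ocr,
# method = "lv") would return the same value
textDist <- adist(clean, ocr)[1, 1]

# A hypothetical adjustment: divide an edge's raw weight of 1 by (1 + edit
# distance), so nearly identical reprintings weigh more than heavily altered ones
textWeight <- 1 / (1 + textDist)
```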

Concluding Questions

In closing I offer no conclusions, only questions and provocations. One of our greatest challenges in modeling the networks uncovered through reprinting in Viral Texts has been that our population is so sparse: far more of the network remains invisible than visible, even when drawing from many thousands of digitized newspapers. We must find ways to reflect those gaps in our knowledge in our models of the system, and to supplement our computational work with the insights of literary historical scholarship that can contextualize—and even inform, as I’ve tried to show here—the graphs and visualizations that computation produces. Much, much more remains to be done to refine these models and develop effective methods for bringing them into generative conversation. For one, I hope to develop a clear method and rationale for calculating the effects of these modifications: which nodes are most dramatically affected by adjusting for lag? For distance? For textual difference? From there, we need a clear way to articulate how such faceted graphs intersect, where they diverge, and what those points of intersection and divergence mean for our understanding of the historical relationships they evidence.

Notes


  1. For more on our reprint detection methods, see our articles in American Literary History (August 2015), Reprinting, Circulation, and the Network Author in Antebellum Newspapers and Computational Methods for Uncovering Reprinted Texts in Antebellum Newspapers. Browse the cluster data I used for these experiments. These are themselves experimental clusters using a new version of the reprint-detection algorithm, and are not yet suitable for formal publication.

Network Analysis Workshop

I regularly run workshops on humanities network analysis. For participants, I’ve compiled some starting instructions, sample data files, and suggested reading below.

Recommended Reading

Tools for Network Analysis

There are many options at various skill levels for humanists interested in network analysis. Here are just a few:

  • If you’re looking for an especially straightforward platform for basic network analyses, you might check out Palladio, which adapts the platform designed for Stanford’s Mapping the Republic of Letters project for other scholars’ use. Martin Düring’s tutorial at the Programming Historian focuses on extracting network data from unstructured text and visualizing it in Palladio, and Miriam Posner’s “Getting Started with Palladio” introduces the tool’s network functionalities (along with much else).
  • You can also create basic network graphs using Fusion Tables.
  • If you are running Windows with Microsoft Excel installed, NodeXL aims to make generating network graphs from an Excel spreadsheet as easy as creating a pie chart. Unfortunately NodeXL is incompatible with Mac versions of Excel.
  • And of course, if you’re comfortable with programming languages there are plenty of methods for generating network graphs by hand. Taylor Arnold and Lauren Tilton write about using R for network analysis in Humanities Data in R and Lincoln Mullen has a growing resource in Digital History Methods in R, including an in-progress chapter on networks.

This Workshop: Gephi

For this workshop, we will be using Gephi, one of the most widely-used tools for network analysis and visualization. You will need to download and install the application before we can get started. If you find it runs slowly (or not at all) you might need to update Java on your system.

Workshop Data

Sample data can be found in this folder. You can download them all as a zip file or download files separately as we need them.

‘Q i-jtb the Raven’: Taking Dirty OCR Seriously

The following is a talk I will deliver on January 9, 2016 for the Bibliography and Scholarly Editing Forum’s panel at MLA 2016. It is part of a longer article in progress.

On November 28, 1849, the Lewisburg Chronicle, and the West Branch Farmer published one of the most popular poems of the nineteenth century, Edgar Allan Poe’s “The Raven.”

The November 28, 1849 Lewisburg Chronicle, and the West Branch Farmer

The Lewisburg Chronicle’s “Raven” is one version among many printed after Poe’s death in 1849—“By Edgar A. Poe, dec’d”—interesting as a small signal of the poem’s circulation and reception. It is just such reprinting that we are tracing in the Viral Texts project, in which we use computational methods to automatically surface patterns of reprinting across nineteenth-century newspaper archives.

And so this version of the poem also becomes interesting as a digitized object in the twenty-first century, in which at least one iteration of the poem’s famous refrain is rendered by optical character recognition as, “Q i-jtb the Raven, ‘Nevermore’” (OCR is a term for computer programs that identify machine-readable words from a scanned page image, and is the source for most of the searchable data in large-scale digital archives). What is this text—this digital artifact I access in 2016? Where did it come from, and how did it come to be?

Clouds Like the Locusts

He would never suggest the immigrants should be prevented from coming to America, would the famous preacher. To say that, of course, would be un-American. This is a nation of immigrants, after all: a free market of ideas political and religious. Though the famous preacher must bravely say what, after all, must be said: these immigrants are different. Their minds are shackled to institutions too unlike our own. They are “un-accustomed to self-government” and would only be pawns for those seeking to undermine our democracy. Indeed, the very tenets of these immigrants’ faith virtually force them to do their clerics’ bidding and be “easily embodied and wielded by sinister design.” Speaking bluntly (though of course objectively, and resignedly), the famous preacher notes their religion is fundamentally “adverse to liberty.” These immigrants could simply never assimilate to American culture. It’s almost unfair of us to let them try, isn’t it? And while he would never write anything remotely prejudiced, would the famous preacher, isn’t it concerning how the laws of a foreign religion seem to be taking over America? It happened in Boston, he heard. And to be historical for a moment, the famous preacher muses, “the world has never witnessed such a rush of dark minded population from one country to another.” The famous preacher means “dark minded” as “ignorant” or “malicious,” of course: which are just facts, not bigotry. But really aren’t these immigrants “Clouds like the locusts of Egypt…rising from the hills and plains” of foreign lands “to settle down upon our fair fields?” I’m just saying, the famous preacher insists, I’m just saying.

These Catholics have got to be stopped.

Scale as Deformance

When I was ten years old my parents bought me a microscope set for Christmas. I spent the next weeks eagerly testing everything I could under its lens, beginning with the many samples provided in the box. I could not bring myself to apply the kit’s scalpel to the fully-preserved butterfly—which is intact still in the microscope box in my parents’ attic—but soon I had exhausted all of the pre-made slides: sections of leaves, insect wings, crystalline minerals, scales from fish or lizard skin. The kit also included the supplies to create new slides. I wanted to see blood—my blood. And so with my mom’s help I pricked the tip of my finger with a very thin needle, so I could squeeze a single drop of blood onto the thin glass slide. I remember how it smeared as I applied the plastic coverslip to the top of the slide, and I remember the sense of wonder as I first saw my own blood through the microscope’s lens. Gone was the uniform red liquid, replaced by a bustling ecosystem of red and white cells, walls and enormous spaces where none had been when I was looking with my unaided eye.

Looking at my blood through a microscope, I learned something new and true about it, but that micro view was not more true than familiar macro images. My blood is red and white cells jostling in clear plasma; my blood is also a red liquid that will run in bright-red rivulets from a pin-prick, or clot in dun-red patches over a wound. At micro-scales beyond the power of my children’s microscope, we could focus on the proteins that comprise the membrane of a red blood cell; at even more macro-scales we might consider a blood bank, organizing bags of blood by type for use in emergency rooms.


Going Viral in Nineteenth-Century Newspapers

The following is an excerpt from my article “Viral Textuality in Nineteenth-Century US Newspaper Exchanges,” which is forthcoming in Veronica Alfano and Andrew Stauffer (eds.), Virtual Victorians: Networks, Connections, Technologies, May 2015, Palgrave Macmillan. Reproduced with the permission of Palgrave Macmillan. The article draws on the findings of the Viral Texts Project at Northeastern University.

[…]

In the Rossetti Archive, Jerome McGann seeks to represent the “social text” by including all editions of a given work in an online archive, rather than simply the “Reading Text” and “Variorum Text” of the standard critical edition. However, even the social text model remains focused on discrete works—books, most often, though also stories or poems—that can be collated and compared as distinct entities. Virality is messier, aligning fragmentary texts and textual echoes not only through books but also through ephemeral and hybrid media; the latter of these is exemplified by the nineteenth-century newspaper. The “viral text” of a particular poem would include official and unofficial reprintings, but also parodies, quotations, reviews, paraphrases, allusions, and more—what Julia Flanders has named “reception items.” A theory of viral textuality must wrestle with unusually capacious ideas of “the text,” including in its purview the continually shifting penumbrae of readers’ responses that testify to that text’s life within culture(s).

For this reason, virality proves especially useful for thinking about how texts circulated in the increasingly complex mass media ecology of the United States during the nineteenth century. During this time, newspapers and magazines proliferated, and this rapid expansion of the print sphere was accelerated by a system of content sharing among publications. The periodical press in the United States depended on “exchanges,” through which editors subscribed to each other’s publications (paying little to no postage for the privilege), and borrowed content promiscuously from each other’s subscriptions. Texts of all kinds were reprinted—typically without authors’ or publishers’ permission—across books, newspapers, and magazines. Content shared through the exchange system was not protected under intellectual property law. Instead, periodical texts were considered common property for reprinting, with or without modification—much as articles, music videos, and other content are shared online today among blogs and social media sites. And as is the case today, antebellum content creators reacted in disparate ways to these sharing practices. Some writers and editors compared reprinting to theft, decrying a system that popularized writers’ work without supporting them financially. Others exploited the reprinting system in order to build a reputation that could be leveraged toward paid literary employment.

The spread of “viral” content in nineteenth-century newspapers depended on a range of factors, from the choices of editors to the preferences of readers to the material requirements of composing a given day’s issue. The frequently reprinted listicle “Editing a Paper,” for instance, lays out the dilemma that faced nineteenth-century editors considering whether and how much to reprint:

If we publish telegraph reports, people will say they are nothing but lies.
If we omit them, they will say we have no enterprise, or suppress them for political effect . . .
If we publish original matter, they find fault with us for not giving selections.
If we publish selections, folks say we are lazy for not writing more and giving them what they have not read before in some other paper.
(11 July 1863)

The first reprinting of “Editing a Paper” identified by the Viral Texts project appears in the Big Blue Union of Marysville, Kansas, but even here an editorial preface claims that the list has been “going the rounds of the papers. If we knew in what paper it first appeared,” the editor continues, “it would afford us pleasure to give the writer due credit.” This piece and its preface illustrate much about editors’ and, presumably, readers’ attitudes toward reprinting, and how those attitudes might line up with modern ideas of viral media.

Considering nineteenth-century newspaper snippets as “viral media” allows us to frame their spread in terms of “rhetorical velocity,” a term first developed by Jim Ridolfo and Dànielle Nicole DeVoss to describe online composition practices in which writers take reuse and remixing as a given and compose with an eye toward facilitating such reinterpretive acts. Such writers take as their primary assumption that a piece will be recomposed by others—reprinted or otherwise remediated. Ridolfo and DeVoss propose that “when academics uphold distinctions between author and producer, we are left in an uncomplicated, often acontextual space that does not provide the tools we need to best negotiate the ways in which production and authorship become more slippery in digital spaces and within remix projects.” They argue, “The term rhetorical velocity means a conscious rhetorical concern for distance, travel, speed, and time, pertaining specifically to theorizing instances of strategic appropriation by a third party.” In other words, “rhetorical velocity” posits “the text” through multiple dimensions, charting its uses and movements—both social and geographic—alongside its evolving content. What’s more, a piece need not be consciously crafted for a wide audience to have rhetorical velocity; if it is compelling, concise, and easily modified, then it can go viral with or without its creator’s knowledge.

While Ridolfo and DeVoss refer specifically to composing practices online, the frame of rhetorical velocity offers insight into widely reprinted newspaper content during the nineteenth century. Nineteenth-century editors relied on the exchange system to provide engaging content, and they in turn composed (or solicited) original pieces with an eye toward their readers and those of the papers with which they exchanged. In the first post–Civil War issue of the Pulaski Citizen, for instance, editor Luther W. McCord apologizes for the sorry state of “The News” in the paper because “we have no exchanges yet, from which to make up our news items. Our readers can readily appreciate,” the squib continues, “the impossibility of making an interesting paper without something to make it of.” McCord then assures readers that they “hope to have a full list of exchanges by next week and, per consequence, a more readable number of the Citizen” (January 5, 1866). This apology echoes a common notion among editors in the period: newspapers that aggregated content from exchanges were of higher and more consistent quality than newspapers written entirely by locals. In other words, McCord assumes that his primary job will be selecting and propagating writing from elsewhere—contributing to the rhetorical velocity of content written for a distributed network, not for individual newspapers.

We must therefore assume that newspaper editors and writers were concerned with the rhetorical velocity of what they published; a newspaper whose content was regularly reprinted in other newspapers would soon be added to more exchanges, as editors further down the line sought the source of the pieces they encountered in intermediary papers. This would, in turn, increase the popular newspaper’s circulation and subscription fees. Indeed, when considering nineteenth-century newspaper snippets, we might speak of “composing for recomposition” in a more technical way, using “composition” not only in its modern sense, as a near synonym for “writing,” but also as a printers’ term of art. As scholars such as Ellen Gruber Garvey have shown, texts were reprinted in newspapers to help editors compose entire daily or weekly newspapers with small staffs. “By yoking together scattered producers who shared labor and resources by sending their products to one another for free use,” the network of newspapers sustained the proliferation of its medium. In other words, reprinting existed in large part to meet the material needs of publication. Many of the changes introduced into texts as they circulated through the newspaper network—a line removed here, two lines added there—were motivated by these practical considerations, as a given newspaper’s compositors shaped exchange content to fill empty spaces on a nearly composed page. It seems reasonable to presume that as a newspaper’s compositors prepared their pages each day or week, they expected—perhaps even hoped—that other compositors in their exchange networks would later recompose their texts, extending the texts’ rhetorical velocity to reach distant audiences.

[…]

To read the rest (along with fantastic work by other C19 scholars), preorder Virtual Victorians.

How Not to Teach Digital Humanities

The following is a talk I’ve revised over the past few years. It began with a post on “curricular incursion”, the ideas of which developed through a talk at DH2013 and two invited talks, one at the University of Michigan’s Institute for the Humanities in March 2014 and another at the Freedman Center for Digital Scholarship’s “Pedagogy and Practices” Colloquium at Case Western Reserve University in November 2014. I’ve embedded a video from the latter presentation at the bottom of the article. I am hoping to revise just a bit more for the forthcoming Debates in Digital Humanities Series, and so welcome your comments and suggestions on this draft.

“À l’ École,” Villemard (1910)

In late summer of 2010, I arrived on the campus of St. Norbert College in De Pere, Wisconsin. I was a newly-minted assistant professor, brimming with optimism, and the field with which I increasingly identified my work—this “digital humanities”—had just been declared “the first ‘next big thing’ in a long time” by William Pannapacker in his Chronicle of Higher Education column. “We are now realizing,” Pannapacker had written of the professors gathered at the Modern Language Association’s annual convention, “that resistance is futile.” So of course I immediately proposed a new “Introduction to Digital Humanities” course for upper-level undergraduates at St. Norbert. My syllabus was, perhaps, hastily constructed—patched together from “Intro to DH” syllabi in a Zotero group—but surely it would pass muster. They had hired me, after all; surely they were keen to see digital humanities in the curriculum. In any case, how could the curricular committee reject “the next big thing,” particularly when resistance was futile?

But reject it they did. They wrote back with concerns about the “student constituency” for the course, its overall theme, my expected learning outcomes, the projected enrollment, the course texts, and the balance between theoretical and practical instruction in the day-to-day operations of the class.

  1. What would be the student constituency for this course? It looks like it will be somewhat specialized and the several topics seems to suggest graduate student level work. Perhaps you could spell out the learning objectives and say more about the targeted students. There is a concern about the course having sufficient enrollment.
  2. The course itself could be fleshed out more. Is there an implied overall theme relating to digital technology other than “the impact of technology on humanities research and pedagogy”? Are there other texts and readings other than “A Companion to Digital Studies”? How much of the course will be “learning about” as distinct from “learning how to”?

My initial reaction was umbrage; I was certain my colleagues’ technological reticence was clouding their judgement. But upon further reflection—which came through developing, revising, and re-revising this course from their feedback, and learning from students who have taken each version of the course—I believe they were almost entirely right to reject that first proposal.

As a result of these experiences, I’ve been thinking more and more about the problem of “digital humanities qua digital humanities,” particularly amidst the accelerated growth of undergraduate classes that explicitly engage with digital humanities methods. In the first part of this talk, I want to outline three challenges I see hampering truly innovative digital pedagogy in humanities classrooms. To do so, I will draw on my experiences at two very different campuses—the first a small, relatively isolated liberal arts college and the second a medium-sized research university—as well as those of colleagues in a variety of institutions around the country.

As an opening gambit, I want to suggest that undergraduate students do not care about digital humanities. I want to suggest further that their disinterest is right and even salutary, because what I really mean is that undergrads do not care about DH qua DH. In addition, I don’t think most graduate students in literature, history, or other humanities fields come to graduate school primarily invested in becoming “digital humanists,” though there are of course exceptions.

“Many Facts in Small Compass”: Information Literature in C19 Newspapers (MLA15 Talk)

slide 1

Ryan Cordell, Northeastern University

MLA 2015 | Vancouver, BC

Download talk slides.

slide 2

My remarks today will be drawn from my work on the Viral Texts project at Northeastern University. In brief, I’m working with a colleague in computer science to automatically identify the most frequently-reprinted texts in digitized archives of nineteenth-century newspapers. We have thus far drawn from the Library of Congress’ Chronicling America collection, but are currently expanding the corpora from which we are drawing to include magazines, as well as a broader selection of American and transatlantic newspapers. We have identified nearly half a million reprinted texts from the LoC’s nineteenth-century holdings. The majority of these were reprinted only a few times, but a significant minority were reprinted in 50, 100, or even 200 newspapers from this one archive.

We went into this project in search of the literature, such as newspaper poetry, that flourished in a print culture founded on textual sharing and through a deeply hybrid and intertextual medium. In the broadest sense, I hoped to expand our ideas of which writers resonated with nineteenth-century readers and create new bibliographies of popular but critically-overlooked literature.

slide 3

On this front the project has been promising. For every reprinted Longfellow poem we find many more by authors such as Elizabeth Akers Allen, Isabella Banks, Charles Monroe Dickinson, Colonel Theodore O’Hara, Emily Rebecca Page, Nancy Priest Wakefield, or John Whitaker Watson—or, perhaps even more likely, by an anonymous author. Such poems circulated within a system of exchanges and selection—newspaper editors cut, pasted, and recomposed content from their exchange partners and sent their papers to be similarly aggregated elsewhere.

But recognizably literary genres have been only a small part of the project. One of the most dramatic outcomes of this work thus far has been to highlight the importance of understudied genres of everyday reading and writing within the ecology of nineteenth-century print culture. These species of writing include political news, travel accounts, squibs, scientific reports, inspirational or religious exhortations, temperance narratives, vignettes, self-help guides, trivia, recipes, and even, to borrow a modern Internet term, listicles, all of which were juxtaposed with poems, stories, and news on the page of the nineteenth-century paper. As a general (and perhaps unsurprising) rule, the most frequently-reprinted pieces are concise, quotable, and widely relatable texts that would have been easy to recontextualize for different newspapers and new audiences—and that could easily fit gaps in the physical newspaper pages, as editors and compositors needed.

slide 4

My remarks today focus on those genres we might categorize as “information literature”: lists, tables, recipes, scientific reports, trivia columns, and so forth. I want to separate these from news itself, which is certainly a kind of information genre, but which I would mark as stylistically and operationally distinct from the other genres I’ve listed. Here’s one example of information literature, a list of supposed “facts,” primarily about human lives and demographics, which was published under many names in at least 120 different newspapers between 1853 and 1899 (which is approximately one quarter of the nineteenth-century newspapers in Chronicling America).