Many of my classes, particularly my graduate Humanities Data Analysis course, introduce students to humanistic applications of text and data analysis. One of the greatest challenges for preparing these classes has been finding or creating useful humanities data sets to analyze: small and manageable enough to work with in the space of a class session, but complex enough to yield results that will help students see the potential for data analyses in relation to research questions they want to answer. Learning exploratory data analysis, for instance, requires tabular data that can be queried and transformed. In the past I’ve followed the lead of Lincoln Mullen and Ben Schmidt and used census data for these particular lessons: it’s easy to see how investigating and visualizing these historical records might yield insights pertinent to humanities work, and there are enough categories to give us room to explore.

In looking for a second option for those classes, however, I found myself experimenting with the Library of Congress’ U.S. Newspaper Directory, which lists all the papers the LoC knows of from 1690 to the present, including basic metadata for each, such as when a given paper was founded and when it ended. In some ways this metadata can be misleading, or at least can highlight the different priorities of library catalogers and humanities researchers. For instance, the LoC creates a new catalog entry each time a newspaper changes its name, even slightly, and newspapers in the nineteenth century changed their names all the time. These name changes can signal a new editor, a merger of two papers (or a schism of one), or a switch in political affiliation, or they can signal simply that the name was changed to something more current or locally salient. These shifts can be meaningful, but often they are not essential for a researcher interested in the dynamics of a particular paper over time. In our attempts to map networks of information exchange during the period, for instance, we have found the division of papers into so many distinct units unhelpful, as the network dynamics of a particular paper often persist through small name changes. Thus we have experimented with various ways of inferring newspaper “families” under which we can group related papers for the purposes of network analysis and related methods.

As I experimented with the U.S. Newspaper Directory data, however, I began to wonder whether all of those shifting newspaper titles might be collectively meaningful, pointing to larger trends in newspaper naming that would reflect shifts in the medium, or in the medium’s imagination of itself, over time. My experiments here are computationally quite simple, focused primarily on word frequencies in newspaper titles over time. I’m not entirely certain there’s anything new in the analyses below, either, though I’m aware it can be hard to separate what makes sense from what we already knew. Ultimately, I mostly still think this will be an excellent dataset for the classroom, and some of the analyses below good first steps in teaching students to explore such a dataset. But there are some suggestive trends that, if nothing else, evidence at scale large shifts in how the newspaper thought of itself through the nineteenth century.

We first import a few necessary R packages and a CSV of historical newspaper metadata scraped from the LoC’s newspaper directory (to be honest, I don’t remember when or by whom, as it’s been several years). If we look at a list of the ten most-used words in new newspaper titles by decade, we can already begin to see some interesting patterns: the importance of German words in the titles of new papers founded between the 1730s and 1750s, for instance, or the explicit identification with political parties in newspaper titles starting in the first decades of the 19th century.

library(tidyverse)
library(tidytext)
library(plotly)
# load the newspapers table, then derive the starting and ending decade of each paper (for grouping by decade later on)
papers <- read_csv("./data/US-Newspapers.csv") %>%
  select(title, state, city, start, end, frequency, language) %>%
  # 9999 appears to be a placeholder year: drop papers whose start is 9999, and recode an end of 9999 as 2014
  filter(start != 9999) %>%
  mutate(end = replace(end, end == 9999, 2014)) %>%
  # round each year down to its decade, e.g. 1857 becomes 1850
  mutate(startDecade = floor(start / 10) * 10,
         endDecade = floor(end / 10) * 10) %>%
  distinct()
## Parsed with column specification:
## cols(
##   title = col_character(),
##   X2 = col_character(),
##   state = col_character(),
##   city = col_character(),
##   start = col_integer(),
##   end = col_integer(),
##   frequency = col_character(),
##   language = col_character()
## )
# create a small manual list of stopwords

# as_data_frame() is deprecated; build the one-column tibble directly
stopWords <- tibble(word = c("the", "an", "and", "der",
                             "die", "das", "und", "of",
                             "in", "aus", "dem", "or"))


titleWords <- papers %>%
  unnest_tokens(word, title) %>%
  anti_join(stopWords) %>%
  group_by(startDecade, word) %>%
  summarize(count = n()) %>%
  arrange(startDecade,desc(count)) %>%
  top_n(10)
## Joining, by = "word"
## Selecting by count
titleWords
## # A tibble: 328 x 3
## # Groups:   startDecade [33]
##    startDecade word        count
##          <dbl> <chr>       <int>
##  1       1680. affairs         1
##  2       1680. english         1
##  3       1680. new             1
##  4       1680. present         1
##  5       1680. state           1
##  6       1690. occurrences     1
##  7       1690. publick         1
##  8       1700. boston          2
##  9       1700. letter          2
## 10       1700. news            2
## # ... with 318 more rows

It’s a bit easier to suss out interesting trends if we plot this data. Here I plot the top five words used in newspaper titles each decade by percentage (i.e. how often was a word used in proportion to the number of new papers founded in a given decade) rather than raw count. This method paints with a bit of a broad stroke and smooths out potentially meaningful differences in how many newspapers were founded in given decades, but it’s useful for making initial comparisons between title word use in different decades.
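
As a rough illustration of that proportion, the sketch below works through it on a handful of invented titles. The `toyPapers` table and its titles are made up for the example and are not drawn from the LoC directory, and the whitespace split is only a simple stand-in for `unnest_tokens`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (invented titles, not from the LoC directory)
toyPapers <- tibble(
  title = c("Boston Gazette", "Salem Gazette",
            "Daily News", "Weekly News", "Evening News", "Daily Herald"),
  startDecade = c(1800, 1800, 1850, 1850, 1850, 1850)
)

# Denominator: how many new papers were founded in each decade
toyNewPapers <- toyPapers %>%
  count(startDecade, name = "newPapers")

# Numerator: how often each word occurs in that decade's new titles
toyShares <- toyPapers %>%
  mutate(word = strsplit(tolower(title), "\\s+")) %>%  # one list of words per title
  unnest(cols = word) %>%                               # one row per word
  count(startDecade, word, name = "count") %>%
  left_join(toyNewPapers, by = "startDecade") %>%
  mutate(percentage = count / newPapers) %>%
  arrange(startDecade, desc(percentage))

toyShares
```

In this toy data “news” occurs in three of the four titles founded in the 1850s, so its share is 0.75. Because the numerator counts word occurrences rather than titles, a word that appears twice in a single title would be counted twice.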

titleWords <- papers %>%
  unnest_tokens(word, title) %>%
  anti_join(stopWords) %>%
  group_by(startDecade, word) %>%
  summarize(count = n()) %>%
  mutate(percentage = count / sum(count)) %>%
  mutate(decadeCount = length(startDecade)) %>%
  arrange(startDecade,desc(count))
## Joining, by = "word"
# count the new papers founded in each decade; this is the denominator for the percentages
newPapers <- papers %>%
  group_by(startDecade) %>%
  summarise(newPapers = n())

titleWords <- titleWords %>%
  left_join(newPapers, by = "startDecade") %>%
  mutate(percentage = count/newPapers) %>%
  arrange(startDecade, desc(percentage))

plot <- ggplot(titleWords %>%
                 filter(startDecade >= 1800 & startDecade <= 1950) %>%
                 # titleWords is still grouped by startDecade, so this keeps the top five words per decade;
                 # naming percentage stops top_n() from defaulting to the last column of the table
                 top_n(5, percentage) %>%
                 filter(percentage >= .02)) +
  aes(x = startDecade, y = percentage, color = word) +
  geom_line() +
  geom_point(size = .3) +
  ggtitle("Most Used Words in New Newspaper Titles by Decade, 1800-1950") +
  # the legend belongs to the color aesthetic, so it is labeled with color rather than fill
  labs(x = "Decades", y = "Percentage of Titles", color = "Word",
       caption = "The top words used in the titles of new newspapers by decade, 1800-1950") +
  theme(plot.title = element_text(family = "Trebuchet MS", color = "#666666", face = "bold", size = 18, hjust = 0.5),
        axis.title = element_text(family = "Trebuchet MS", color = "#666666", face = "bold", size = 14),
        legend.title = element_text(family = "Trebuchet MS", color = "#666666", face = "bold", size = 14),
        legend.background = element_rect(color = "#efefef"),
        plot.caption = element_text(family = "Trebuchet MS", color = "#666666", size = 10, hjust = 0.5, margin = margin(15, 0, 15, 0)),
        axis.text = element_text(family = "Trebuchet MS", color = "#aaaaaa", face = "bold", size = 10),
        panel.background = element_rect(fill = "white"),
        panel.grid.major = element_line(color = "#efefef"),
        axis.ticks = element_line(color = "#efefef"))
ggplotly(plot)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Several trends stand out in this graph. First, we can see the precipitous drop in the words “gazette” and “advertiser” in new newspaper titles over the first half of the nineteenth century. In the eighteenth century, the word “gazette” in a newspaper’s title often (though not necessarily) signaled its status as a government-authorized publication, which is precisely not the way American newspapers evolved during the nineteenth century. Increasingly papers were partisan organs, which we can see in the graph in words such as “republican” and “democrat,” but they were largely not government-sponsored publications. And while newspapers in the nineteenth century continued to carry advertisements, in some cases even more advertisements than earlier counterparts explicitly named “Advertiser,” that function was less and less their rhetorical purpose as the century progressed.

Around 1850 we see the simultaneous rise of “weekly” and “daily” in the title words of new papers. This is not because papers were weekly and became daily (that progressivist narrative would make sense on its face but does not describe the messier reality) but because the range of newspaper formats and frequencies exploded during the period, particularly after the introduction of wood pulp paper and the steam press, largely in the 1840s. Many urban papers started printing daily and weekly editions (which are listed in the LoC’s directory as separate papers), while others began incorporating their frequencies into their titles. Less obvious in the graph, but an important part of this same trend, is “evening,” which rises into the top title words in the 1850s. Primarily in urban centers, some papers began printing morning and evening editions, with the evening edition most often marked in the title. These temporal words in titles, then, point to the shifting temporalities of the newspaper medium around the middle of the century, as well as to a quirk in the LoC’s data, which treats morning, evening, daily, or weekly editions of the same newspaper as separate, though linked, entities.
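
One rough way to gauge how many of a decade’s new titles carry such temporal markers is to flag the titles that contain one and compute a share per decade. The sketch below does this on invented titles; `toyTitles` and the `temporalWords` list are assumptions made for illustration, not part of the LoC data or of the analysis above.

```r
library(dplyr)

# Hypothetical toy data (invented titles, not from the LoC directory)
toyTitles <- tibble(
  title = c("Salem Gazette", "Boston Advertiser",
            "Daily Herald", "Weekly Herald", "Evening Star", "Boston Journal"),
  startDecade = c(1830, 1830, 1850, 1850, 1850, 1850)
)

# Assumed list of temporal markers; extend as needed
temporalWords <- c("daily", "weekly", "evening", "morning")

temporalShare <- toyTitles %>%
  mutate(
    words = strsplit(tolower(title), "\\s+"),
    # TRUE when at least one word in the title is a temporal marker
    hasTemporal = vapply(words, function(w) any(w %in% temporalWords), logical(1))
  ) %>%
  group_by(startDecade) %>%
  summarize(
    newPapers = n(),
    temporalTitles = sum(hasTemporal),
    share = temporalTitles / newPapers
  )

temporalShare
```

Unlike a word count, this measure counts each title once however many temporal words it contains. Because the directory lists editions separately, a paper with both a daily and a weekly edition would still contribute two flagged titles.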

Most striking in the graph is the steep rise of “news” as a word used in the titles of newspapers; it first appears in the 1840s and is used in more than 10% of new papers’ titles in the decade 1900-1910, and in more than 20% of new titles in the 1940s. I won’t comment on the twentieth century, but this rise at the end of the nineteenth is striking in and of itself. To close this post, I’ll read a bit farther into the growth of “news” than the evidence likely warrants. In brief, the latter half of the nineteenth century sees newspapers moving, first gradually and then rapidly, away from explicit partisan alignment and toward what we would recognize as an ideal of journalistic impartiality. That move is coupled with increased striving toward objectivity in reporting. As scholars such as Viral Texts RA Jonathan Fitzgerald have shown, these ideals of impartiality and objectivity were often more rhetorical than real, but they were nonetheless increasingly valued as ideals as the century progressed.

The names of new newspapers are one place those journalistic ideals are expressed and codified. By using the relatively neutral “news” in their titles, editors signaled that their papers carried neither partisan politicking nor sensational humbug, but instead unpretending reports of what happened. This claim is subtle, embedded in a small word, but its growing appeal over nearly a century testifies to its perceived effectiveness.