Fair is foul, and foul is fair: a tidytext sentiment analysis of Shakespeare’s tragedies

Sentiment analysis can be used for many purposes and applied to all kinds of texts. In this exploratory analysis, we’ll use a tidytext approach to examine the use of sentiment words in the tragedies written by William Shakespeare. I’ve previously used Python for scraping and mining texts, but I recently stumbled upon the tidytext R package by Julia Silge and David Robinson, as well as their excellent book and resource on combining tidytext with other tidy tools in R. This approach has made my life so much easier!

Loading packages

if(!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, tidytext, jtools, gridExtra, zoo, data.table, gutenbergr, glue)  # glue is only needed for the .txt preparation code at the end

Customizing a ggplot2 theme

Because we’ll create several ggplots to visualize sentiments in Shakespeare’s tragedies, it’ll be worth storing some plotting options in a customized theme. Rather than starting from scratch, we’ll use the APA theme from the jtools package and override some of the theme settings.

my_theme <- function() {
  theme_apa(legend.pos = "none") +
    theme(panel.background = element_blank(),
          plot.background = element_rect(fill = "antiquewhite1"),
          panel.border = element_blank(),                     # facet border
          strip.background = element_blank(),                 # facet title background
          plot.margin = unit(c(.5, .5, .5, .5), "cm")) 
}
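
To preview the theme, we can apply it to any throwaway plot (toy data here, purely for illustration):

ggplot(mtcars, aes(wt, mpg)) +   # built-in toy data, just to check the theme
  geom_point() +
  my_theme()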

Loading our data

When I first started working on Shakespeare’s plays, I downloaded the texts as .txt files and cleaned them in several steps. At the bottom of this page, I provide some code showing how I did it.

A more efficient way to gather public domain literary texts, such as Shakespeare’s plays, is to use David Robinson’s gutenbergr package. Let’s first see what’s available by Shakespeare.

shakespeare <- gutenberg_works(author == "Shakespeare, William") 

shakespeare
## # A tibble: 79 x 8
##    gutenberg_id title author gutenberg_autho… language gutenberg_books…
##           <int> <chr> <chr>             <int> <chr>    <chr>           
##  1         1041 Shak… Shake…               65 en       <NA>            
##  2         1045 Venu… Shake…               65 en       <NA>            
##  3         1500 King… Shake…               65 en       <NA>            
##  4         1501 Hist… Shake…               65 en       <NA>            
##  5         1502 The … Shake…               65 en       <NA>            
##  6         1503 The … Shake…               65 en       <NA>            
##  7         1504 The … Shake…               65 en       <NA>            
##  8         1505 The … Shake…               65 en       <NA>            
##  9         1507 The … Shake…               65 en       <NA>            
## 10         1508 The … Shake…               65 en       <NA>            
## # ... with 69 more rows, and 2 more variables: rights <chr>,
## #   has_text <lgl>

We then collect the IDs for the plays that we want and check that they match before we download the texts.

IDs = shakespeare[c(15,23,33,34,53,54,55,56,57,58),]$gutenberg_id

shakespeare %>% 
  filter(gutenberg_id %in% IDs)
## # A tibble: 10 x 8
##    gutenberg_id title author gutenberg_autho… language gutenberg_books…
##           <int> <chr> <chr>             <int> <chr>    <chr>           
##  1         1513 Rome… Shake…               65 en       <NA>            
##  2         1522 Juli… Shake…               65 en       <NA>            
##  3         1533 Macb… Shake…               65 en       Opera           
##  4         1534 Anto… Shake…               65 en       <NA>            
##  5         2259 Cori… Shake…               65 en       <NA>            
##  6         2260 Titu… Shake…               65 en       <NA>            
##  7         2262 Timo… Shake…               65 en       <NA>            
##  8         2265 Haml… Shake…               65 en       Best Books Ever…
##  9         2266 King… Shake…               65 en       Banned Books fr…
## 10         2267 Othe… Shake…               65 en       <NA>            
## # ... with 2 more variables: rights <chr>, has_text <lgl>

It looks like we have the right texts. Let’s download the texts, store them in a tibble and look at the data.

plays = gutenberg_download(IDs,meta_fields = "title")

plays
## # A tibble: 46,795 x 3
##    gutenberg_id text                       title           
##           <int> <chr>                      <chr>           
##  1         1513 ROMEO AND JULIET           Romeo and Juliet
##  2         1513 ""                         Romeo and Juliet
##  3         1513 by William Shakespeare     Romeo and Juliet
##  4         1513 ""                         Romeo and Juliet
##  5         1513 ""                         Romeo and Juliet
##  6         1513 ""                         Romeo and Juliet
##  7         1513 ""                         Romeo and Juliet
##  8         1513 PERSONS REPRESENTED        Romeo and Juliet
##  9         1513 ""                         Romeo and Juliet
## 10         1513 Escalus, Prince of Verona. Romeo and Juliet
## # ... with 46,785 more rows

Creating a data frame of sentiment words

We see that the text variable contains one line of text per row. Given this format, we can create a new data frame with a row for each word token that is also found in the Bing lexicon of sentiment words. In this lexicon, sentiment words are simply assigned a value of positive or negative. Have a look at the other options with ?get_sentiments. We will try the nrc lexicon later on.
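
Before joining, it can be helpful to peek at the lexicon itself to see what we are matching against:

get_sentiments("bing") %>% head()              # a two-column tibble: word and sentiment

get_sentiments("bing") %>% count(sentiment)    # how many words carry each label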

sentiments = plays            %>% 
  group_by(title)             %>%
  mutate(line = row_number()) %>%      # we will use line numbers later
  unnest_tokens(word, text)   %>%      # tokenize words
  #anti_join(stop_words) %>%           # in case we would like to remove stop words
  inner_join(get_sentiments("bing"))   # keep only words found in the Bing lexicon

sentiments
## # A tibble: 16,562 x 5
## # Groups:   title [?]
##    gutenberg_id title             line word       sentiment
##           <int> <chr>            <int> <chr>      <chr>    
##  1         1513 Romeo and Juliet    50 dignity    positive 
##  2         1513 Romeo and Juliet    51 fair       positive 
##  3         1513 Romeo and Juliet    52 grudge     negative 
##  4         1513 Romeo and Juliet    52 break      negative 
##  5         1513 Romeo and Juliet    53 unclean    negative 
##  6         1513 Romeo and Juliet    54 fatal      negative 
##  7         1513 Romeo and Juliet    56 overthrows negative 
##  8         1513 Romeo and Juliet    57 death      negative 
##  9         1513 Romeo and Juliet    57 strife     negative 
## 10         1513 Romeo and Juliet    58 fearful    negative 
## # ... with 16,552 more rows

We could also further subset our data frame by omitting so-called “stop words”, as defined in the stop_words data frame that ships with tidytext. However, the output below, which shows the sentiment words in our data that also appear among the stop words, suggests this may not be ideal. Note that the choice is important and will likely affect the analysis: most of the words in question have a positive valence in the lexicon.

table(sentiments$word[sentiments$word %in% stop_words$word])
## 
## available      best    better     clear    enough      good     great 
##        12       159       151         7        87       762       264 
##  greatest important      like     right  sensible     sorry     thank 
##        15         1       503        81         6        26        21 
##    useful   welcome      well   willing    wonder      work 
##         1       121       606         9        28        25
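
For reference, here is a sketch of what dropping these words would look like, had we decided otherwise (sentiments_no_stop is simply a name chosen for illustration):

sentiments_no_stop <- sentiments %>%
  anti_join(stop_words, by = "word")   # drop every token that matches a stop word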

With that in mind, we can now plot the number of positive and negative words in each play.

sentiments         %>% 
  count(sentiment) %>%
  ggplot(aes(x = sentiment, y = n, fill = title)) + 
  geom_bar(stat = "identity") +
  facet_wrap(~title) +
  scale_fill_viridis_d(option = "B") +
  my_theme()

For the figure below, we’ll use the nrc lexicon, which associates words with basic emotions.

plays                               %>% 
  group_by(title)                   %>%
  mutate(line = row_number())       %>%
  unnest_tokens(word, text)         %>%
  #anti_join(stop_words)            %>%
  inner_join(get_sentiments("nrc")) %>% 
  count(sentiment) %>%
  ggplot(aes(x = sentiment, y = n, fill = sentiment)) + 
  geom_bar(stat = "identity") +
  facet_wrap(~title) +
  scale_fill_viridis_d(option = "B") +
  my_theme() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Using our first data frame created with the Bing lexicon, let’s now rank the plays according to their ratio of negative words.

sentiments                       %>% 
  count(sentiment)               %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(ratio = negative / (negative + positive)) %>%
  ungroup()                      %>%
  ggplot(aes(x = reorder(title, ratio), y = ratio, fill = ratio)) + 
  geom_bar(stat = "identity") +
  scale_fill_viridis_c(option = "B", direction = -1) +
  my_theme() + 
  labs(title = "Plays ranked by ratio of negative sentiment words",
       y     = "ratio negative words",
       x     = "plays") +
  coord_flip()

Sentiments over time

It might also be interesting to examine the ebb and flow of sentiments as each play unfolds. To do so, we can use integer division to count the number of positive and negative words in each chunk of text. Here, we’ll use chunks of 100 lines each and subtract the number of negative words from the number of positive words.
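
As a quick illustration of how integer division maps line numbers to chunk indices:

c(1, 99, 100, 199, 200) %/% 100   # lines 1-99 fall in chunk 0, lines 100-199 in chunk 1, etc.
## [1] 0 0 1 1 2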

sentiments %>%
  # count number of positive and negative words for each chunk of 100 lines
  count(title, index = line %/% 100, sentiment) %>%  
  spread(sentiment, n, fill = 0)                %>%                 
  mutate(sentiment = positive - negative)       %>%
  ggplot(aes(index, sentiment, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title,scales = "free_x") +
  scale_fill_viridis_c(option = "B") +
  my_theme()

It might also be useful to further examine the “trend” by adding a line indicating the rolling mean sentiment score. As an example, let’s try this with Romeo and Juliet, using a rolling mean with a window of 5 indices.
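
Note that rollmean from the zoo package computes a centered mean and, with fill = NA, pads both ends so the result has the same length as the input:

rollmean(1:7, 5, fill = NA)   # only positions 3 to 5 have a full window of 5
## [1] NA NA  3  4  5 NA NA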

sentiments %>%
  filter(title == "Romeo and Juliet")           %>%
  count(title, index = line %/% 100, sentiment) %>%
  spread(sentiment, n, fill = 0)                %>%
  mutate(sentiment = positive - negative)       %>%
  mutate(rollMean = rollmean(sentiment, 5, fill = NA)) %>%             # zoo package
  ggplot(aes(index, sentiment, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  geom_line(aes(x = index, y = rollMean),size = 4, colour = "antiquewhite1") +
  geom_line(aes(x = index, y = rollMean), size = 1) +
  scale_fill_viridis_c(option = "B") +
  labs(title = "Romeo and Juliet") +
  my_theme()

Though it requires a bit more work, it might be worthwhile to indicate where the individual acts begin. First, we will create a data frame containing the lines in which the individual acts begin. We find the relevant lines with a regular expression that looks for the word “act” (or, in some cases, “actus”) at the start of a line. Each hit then increments a cumulative sum, which labels every row with its act number.
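
To see what the pattern does and does not match, here is a quick check on a few made-up heading lines (hypothetical examples, not taken from the plays):

str_detect(c("ACT I.", "Actus Primus.", "The act of mercy"),
           regex("^act |^actus ", ignore_case = TRUE))   # only lines *beginning* with the keyword match
## [1]  TRUE  TRUE FALSE

With the pattern confirmed, we can build the data frame: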

acts = plays                                  %>% 
  filter(title == "Romeo and Juliet")         %>%
  mutate(line = row_number())                 %>%
  mutate(act = cumsum(str_detect(text, regex("^act |^actus ", ignore_case = T)))) %>% 
  ungroup()                                   %>%
  unnest_tokens(word, text)                   %>%
  #anti_join(stop_words) %>%
  inner_join(get_sentiments("bing"))          %>%
  count(act, index = line %/% 100, sentiment) %>%
  mutate(new_act = act != shift(act, 1))      %>%   # logical vector indicating where acts begin (shift is from data.table)
  spread(sentiment, n, fill = 0)              %>%
  mutate(sentiment = positive - negative)     %>%
  filter(new_act == T, act > 0)               %>%   # keep only indices where acts begin, remove indices before act 1
  select(index,act)

acts
## # A tibble: 5 x 2
##   index   act
##   <dbl> <int>
## 1     0     1
## 2    13     2
## 3    25     3
## 4    38     4
## 5    45     5

We’ll also create a new data frame, using the same steps that prepared the data for the previous plot.

sentiments2 = plays                       %>% 
  filter(title == "Romeo and Juliet")     %>%
  mutate(line = row_number())             %>%
  ungroup()                               %>%
  unnest_tokens(word, text)               %>%
  #anti_join(stop_words)                  %>%
  inner_join(get_sentiments("bing"))      %>%
  count(index = line %/% 100, sentiment)  %>%
  spread(sentiment, n, fill = 0)          %>%
  mutate(sentiment = positive - negative)

Let’s then plot the same data only with lines and text indicating the beginning of each act.

sentiments2 %>% 
  ggplot(aes(index, sentiment, fill = sentiment)) +
  geom_vline(data = acts[-1,],
             aes(xintercept = index - 0.5),linetype = "dashed") +
  geom_col(show.legend = FALSE,position = "dodge") +
  scale_fill_viridis_c(option = "B") + 
  labs(title = "Romeo and Juliet") +
  annotate("text", x = acts$index + 2, y = 22, label = paste0("act ", acts$act), size = 5) +
  my_theme()

Finding the most common sentiment words

Focusing now on the words themselves, we can easily get an overview of the most common positive and negative words. In this case, we select the ten most common words for each category.

word_counts <- sentiments %>%
  ungroup()               %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment)     %>%
  top_n(10)               %>%
  ungroup()               %>%
  arrange(sentiment, -n)  %>%
  mutate(order = rev(row_number()))       # we need the order variable to ensure proper ranking of words

Recall that we chose to keep the sentiment words that matched stop words. In the plot below, we clearly see the consequence of our choice. The top four spots among positive words are populated by stop words.

word_counts %>%
  ggplot(aes(order, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  scale_x_continuous(
    breaks = word_counts$order,
    labels = word_counts$word,
    expand = c(0,0)) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip() +
  scale_fill_viridis_d(option = "B", begin = .2, end = .8) +
  my_theme()

Let’s do the same for each play and store individual plots in a list using a for loop. Note that if there are ties for tenth place, all tied words are plotted.
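
The ties come from top_n, which keeps every row tied with the cutoff value. A minimal illustration with toy data:

tibble(word = c("a", "b", "c", "d"), n = c(9, 5, 5, 1)) %>%
  top_n(2, n)   # three rows survive because two words tie at n = 5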

titles = unique(sentiments$title)
plots = list()
for (i in titles){
  word_counts = sentiments %>%
    filter(title == i)     %>%
    count(word, sentiment, sort = TRUE) %>%
    ungroup()              %>%
    group_by(sentiment)    %>%
    top_n(10)              %>%
    ungroup()              %>%
    arrange(sentiment, -n) %>%
    mutate(order = rev(row_number()))
  
  plots[[i]] = word_counts %>%
    ggplot(aes(order, n, fill = sentiment)) +
    geom_col(show.legend = FALSE) +
    scale_x_continuous(
      breaks = word_counts$order,
      labels = word_counts$word,
      expand = c(0,0)) +
    facet_wrap(~sentiment, scales = "free_y") +
    labs(y = "Contribution to sentiment",x = NULL) +
    ggtitle(i) +
    coord_flip() +
    scale_fill_viridis_d(option = "B",begin = .2,end = .8) +
    my_theme()
  
    #print(plots[[i]])              # in case we want to create individual plots                 
}

Let’s create a grid and plot the first nine plays.

grid.arrange(grobs = plots[1:9], ncol = 3)

Conclusion

The aim of this analysis was simply to illustrate the ease with which one can quantify and explore texts with the tidytext package in combination with other tidy tools. In future posts, I will apply other text mining methods (e.g. word frequency, n-gram analysis and topic modeling) to Shakespeare’s tragedies.

Preparing data from .txt files

Before using gutenbergr to grab the texts, I started out with a .txt file containing the text of each play. Of course, we may want to use the same text mining methods on many other kinds of texts that are not in the Gutenberg Project catalogue. The following code shows how one can take a collection of .txt files and produce a data frame structured in the same way as before.

##### manual cleaning
playList = list.files(pattern = "txt")        # create a vector of document file names

df = tibble()
for (play in playList){
  
  text = glue(read_file(play))                # read the file (glue package)
  text = str_trim(gsub("[A-Z]{2,}","",text))  # remove words in all caps (e.g. speaker names and headings)
  text = tolower(text)                        # all words to lowercase
  #text = removeWords(text,stopwords("en"))   # remove stop words (tm package)
  tokens = tibble(text = text) %>%            # tokenize words
    unnest_tokens(word, text)
  
  sentiments = tokens                  %>%
    inner_join(get_sentiments("bing")) %>%
    count(sentiment)                   %>%
    spread(sentiment, n, fill = 0)     %>% 
    mutate(sentiment = positive - negative) 
  
  playDF = tibble(Play = play, Sentiments = names(sentiments), Values = t(sentiments)[, 1])
  df = rbind(df, playDF)
}