This homework set provides practice with word frequency counts and introductory analysis of texts from Project Gutenberg, an online repository of freely available books.

Feel free to ask questions, or to ask for help with errors, in the Slack channel or during our weekly synchronous meeting.

For these problems, we will be using the gutenbergr library, which can be installed using the following syntax:

install.packages("gutenbergr")
library(gutenbergr)

Here is a quick introduction to some of the functionality included in gutenbergr. The following command provides a data frame of the texts that are available to download:

gutenberg_metadata
## # A tibble: 51,997 x 8
##    gutenberg_id title  author  gutenberg_autho… language gutenberg_books… rights
##           <int> <chr>  <chr>              <int> <chr>    <chr>            <chr> 
##  1            0  <NA>  <NA>                  NA en       <NA>             Publi…
##  2            1 "The … Jeffer…             1638 en       United States L… Publi…
##  3            2 "The … United…                1 en       American Revolu… Publi…
##  4            3 "John… Kenned…             1666 en       <NA>             Publi…
##  5            4 "Linc… Lincol…                3 en       US Civil War     Publi…
##  6            5 "The … United…                1 en       American Revolu… Publi…
##  7            6 "Give… Henry,…                4 en       American Revolu… Publi…
##  8            7 "The … <NA>                  NA en       <NA>             Publi…
##  9            8 "Abra… Lincol…                3 en       US Civil War     Publi…
## 10            9 "Abra… Lincol…                3 en       US Civil War     Publi…
## # … with 51,987 more rows, and 1 more variable: has_text <lgl>

This data frame can be wrangled using the dplyr functions that we learned last week. For example, we can search for all of the books on Project Gutenberg which were authored by Arthur Conan Doyle.

library(tidyverse)
gutenberg_metadata %>%
  filter(author == "Doyle, Arthur Conan") %>%
  select(gutenberg_id, title)
## # A tibble: 123 x 2
##    gutenberg_id title                                                           
##           <int> <chr>                                                           
##  1          108 "The Return of Sherlock Holmes"                                 
##  2          126 "The Poison Belt"                                               
##  3          139 "The Lost World"                                                
##  4          221 "The Return of Sherlock Holmes"                                 
##  5          244 "A Study in Scarlet"                                            
##  6          290 "The Stark Munro Letters\r\nBeing series of twelve letters writ…
##  7          294 "The Captain of the Polestar, and Other Tales"                  
##  8          355 "The Parasite: A Story"                                         
##  9          356 "Beyond the City"                                               
## 10          423 "Round the Red Lamp: Being Facts and Fancies of Medical Life"   
## # … with 113 more rows
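
As an aside, gutenbergr also provides gutenberg_works(), which accepts filtering conditions directly and, by default, restricts the results to English-language works for which text is actually available. A quick sketch of equivalent usage:

gutenberg_works(author == "Doyle, Arthur Conan") %>%
  select(gutenberg_id, title)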

The command gutenberg_download() can be used to download full texts from the Project Gutenberg website. The gutenbergr package is mentioned in Text Mining with R, but the default mirror for the package is no longer operational, so the syntax from the textbook will produce an error. We can point gutenberg_download() at a working mirror through its mirror argument. Let’s download The Adventures of Sherlock Holmes, whose gutenberg_id is 1661.

sherlock <- gutenberg_download(1661, mirror = "http://mirrors.xmission.com/gutenberg/")
sherlock
## # A tibble: 12,648 x 2
##    gutenberg_id text                               
##           <int> <chr>                              
##  1         1661 "THE ADVENTURES OF SHERLOCK HOLMES"
##  2         1661 ""                                 
##  3         1661 "by"                               
##  4         1661 ""                                 
##  5         1661 "SIR ARTHUR CONAN DOYLE"           
##  6         1661 ""                                 
##  7         1661 ""                                 
##  8         1661 ""                                 
##  9         1661 "   I. A Scandal in Bohemia"       
## 10         1661 "  II. The Red-headed League"      
## # … with 12,638 more rows
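
One more side note before the problems: gutenberg_download() also accepts a vector of IDs, and its meta_fields argument can attach metadata columns such as the title. This is not required for the problems below, but it can save a step when combining books. A sketch:

gutenberg_download(c(1661, 108), meta_fields = "title",
                   mirror = "http://mirrors.xmission.com/gutenberg/")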

  1. Use the unnest_tokens() function from the tidytext library to create a data frame named tidy_sherlock which is tokenized by word. As a reminder, you can use the assignment operator (<-) to store the result of a series of commands in a new object, and you can print the contents of an object simply by typing its name. The result should look like this:

## # A tibble: 105,426 x 2
##    gutenberg_id word      
##           <int> <chr>     
##  1         1661 the       
##  2         1661 adventures
##  3         1661 of        
##  4         1661 sherlock  
##  5         1661 holmes    
##  6         1661 by        
##  7         1661 sir       
##  8         1661 arthur    
##  9         1661 conan     
## 10         1661 doyle     
## # … with 105,416 more rows

Solution:

library(tidytext)

tidy_sherlock <- sherlock %>%
  unnest_tokens(word,text)

tidy_sherlock
## # A tibble: 105,426 x 2
##    gutenberg_id word      
##           <int> <chr>     
##  1         1661 the       
##  2         1661 adventures
##  3         1661 of        
##  4         1661 sherlock  
##  5         1661 holmes    
##  6         1661 by        
##  7         1661 sir       
##  8         1661 arthur    
##  9         1661 conan     
## 10         1661 doyle     
## # … with 105,416 more rows

  2. Remove any stop words and create a bar graph of the most commonly used words in The Adventures of Sherlock Holmes. (Note: there is more than one dictionary of stop words. The data frame stop_words is available when you load the tidytext library, but you can also use get_stopwords() to access stop word dictionaries from other sources and in other languages; a short sketch follows.) What are the most commonly used words?
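
For instance, here is a quick sketch of get_stopwords(); the source and language values shown are standard options from the underlying stopwords package:

get_stopwords()                  #default: English words from the Snowball lexicon
get_stopwords(source = "smart")  #a larger English lexicon
get_stopwords(language = "es")   #Spanish stop words

Any of these can be passed to anti_join() in place of stop_words.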

Solution:

tidy_sherlock <- tidy_sherlock %>%
  anti_join(stop_words)  #remove rows whose word appears in the stop word list

tidy_sherlock %>%
  count(word) %>%        #count the occurrences of each word
  arrange(desc(n)) %>%   #sort from most to least frequent
  slice(1:30) %>%        #keep the top 30 words
  ggplot() + 
  geom_col(aes(n, reorder(word, n), fill = word), show.legend = FALSE) + 
  labs(x = "Count", y = NULL)

The most common words are “holmes”, “time”, “door”, “matter”, etc.

  3. Repeat Problems 1 and 2, but for The Return of Sherlock Holmes, which has gutenberg_id equal to 108. It would be a good idea to give the text data frame and tidied data frame different names (for example, return_sherlock and tidy_return_sherlock) because we will be using both books in the next problem.

Solution:

return_sherlock <- gutenberg_download(108, mirror = "http://mirrors.xmission.com/gutenberg/")

tidy_return_sherlock <- return_sherlock %>%
  unnest_tokens(word,text) %>%
  anti_join(stop_words)

tidy_return_sherlock %>%
  count(word) %>%
  arrange(desc(n)) %>%
  slice(1:30) %>%
  ggplot() + 
  geom_col(aes(n,reorder(word,n), fill = word), show.legend = FALSE) + 
  labs(x = "Count", y = NULL)

The most common words are “holmes”, “watson”, “sir”, “time”, etc. Perhaps Watson is more prominent in the sequel?

In the next several problems, we will compare the relative frequencies of words in The Adventures of Sherlock Holmes and The Return of Sherlock Holmes.

  4. Use the bind_rows() and mutate() functions to concatenate the tidy_sherlock and tidy_return_sherlock data frames and add a column named title which identifies the book that each word came from. Your result should look similar to what appears below. If you want to make sure that you have words from both texts in the resulting data frame, try using the head() and tail() functions (a minimal sketch follows).
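
For example, a minimal sketch, assuming you store the combined data frame in an object hypothetically named both_books:

both_books <- bind_rows(mutate(tidy_sherlock, title = "The Adventures of Sherlock Holmes"),
                        mutate(tidy_return_sherlock, title = "The Return of Sherlock Holmes"))
head(both_books)  #the first rows come from The Adventures of Sherlock Holmes
tail(both_books)  #the last rows come from The Return of Sherlock Holmes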

Solution:

bind_rows(mutate(tidy_sherlock, title = "The Adventures of Sherlock Holmes"),
          mutate(tidy_return_sherlock, title = "The Return of Sherlock Holmes"))
## # A tibble: 67,793 x 3
##    gutenberg_id word       title                            
##           <int> <chr>      <chr>                            
##  1         1661 adventures The Adventures of Sherlock Holmes
##  2         1661 sherlock   The Adventures of Sherlock Holmes
##  3         1661 holmes     The Adventures of Sherlock Holmes
##  4         1661 sir        The Adventures of Sherlock Holmes
##  5         1661 arthur     The Adventures of Sherlock Holmes
##  6         1661 conan      The Adventures of Sherlock Holmes
##  7         1661 doyle      The Adventures of Sherlock Holmes
##  8         1661 scandal    The Adventures of Sherlock Holmes
##  9         1661 bohemia    The Adventures of Sherlock Holmes
## 10         1661 ii         The Adventures of Sherlock Holmes
## # … with 67,783 more rows

  5. Pipe (%>%) the previous data frame into the mutate command mutate(word = str_extract(word, "[a-z']+")). As mentioned in the textbook, this regular expression strips out the underscores that Project Gutenberg texts use to mark emphasis (italics), keeping just the word itself; a one-line illustration follows the expected output. Then use the count() function to count the number of times that each word appears in each book. Again, your output should match what is seen below.

## # A tibble: 14,795 x 3
##    title                             word           n
##    <chr>                             <chr>      <int>
##  1 The Adventures of Sherlock Holmes a              1
##  2 The Adventures of Sherlock Holmes abandoned      3
##  3 The Adventures of Sherlock Holmes abandons       1
##  4 The Adventures of Sherlock Holmes abbots         1
##  5 The Adventures of Sherlock Holmes aberdeen       2
##  6 The Adventures of Sherlock Holmes abhorrent      1
##  7 The Adventures of Sherlock Holmes abiding        1
##  8 The Adventures of Sherlock Holmes abjure         1
##  9 The Adventures of Sherlock Holmes abnormal       1
## 10 The Adventures of Sherlock Holmes abnormally     1
## # … with 14,785 more rows
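
To see what the regular expression does, here is a one-line check (str_extract() comes from the stringr package, which loads with the tidyverse):

str_extract("_holmes_", "[a-z']+")
## [1] "holmes"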

Solution:

bind_rows(mutate(tidy_sherlock, title = "The Adventures of Sherlock Holmes"),
          mutate(tidy_return_sherlock, title = "The Return of Sherlock Holmes")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(title,word)
## # A tibble: 14,795 x 3
##    title                             word           n
##    <chr>                             <chr>      <int>
##  1 The Adventures of Sherlock Holmes a              1
##  2 The Adventures of Sherlock Holmes abandoned      3
##  3 The Adventures of Sherlock Holmes abandons       1
##  4 The Adventures of Sherlock Holmes abbots         1
##  5 The Adventures of Sherlock Holmes aberdeen       2
##  6 The Adventures of Sherlock Holmes abhorrent      1
##  7 The Adventures of Sherlock Holmes abiding        1
##  8 The Adventures of Sherlock Holmes abjure         1
##  9 The Adventures of Sherlock Holmes abnormal       1
## 10 The Adventures of Sherlock Holmes abnormally     1
## # … with 14,785 more rows

  6. Make a new column named proportion which divides each word's count by the total number of words in its book, and use the select() function to drop the n column. (Hint: you will want to group_by() title before computing the proportions.) Below is an example of what your result should look like; a toy example of the group-wise calculation follows it.

## # A tibble: 14,795 x 3
##    title                             word       proportion
##    <chr>                             <chr>           <dbl>
##  1 The Adventures of Sherlock Holmes a           0.0000312
##  2 The Adventures of Sherlock Holmes abandoned   0.0000935
##  3 The Adventures of Sherlock Holmes abandons    0.0000312
##  4 The Adventures of Sherlock Holmes abbots      0.0000312
##  5 The Adventures of Sherlock Holmes aberdeen    0.0000623
##  6 The Adventures of Sherlock Holmes abhorrent   0.0000312
##  7 The Adventures of Sherlock Holmes abiding     0.0000312
##  8 The Adventures of Sherlock Holmes abjure      0.0000312
##  9 The Adventures of Sherlock Holmes abnormal    0.0000312
## 10 The Adventures of Sherlock Holmes abnormally  0.0000312
## # … with 14,785 more rows

Solution:

bind_rows(mutate(tidy_sherlock, title = "The Adventures of Sherlock Holmes"),
          mutate(tidy_return_sherlock, title = "The Return of Sherlock Holmes")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(title,word) %>%
  group_by(title) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  ungroup()
## # A tibble: 14,795 x 3
##    title                             word       proportion
##    <chr>                             <chr>           <dbl>
##  1 The Adventures of Sherlock Holmes a           0.0000312
##  2 The Adventures of Sherlock Holmes abandoned   0.0000935
##  3 The Adventures of Sherlock Holmes abandons    0.0000312
##  4 The Adventures of Sherlock Holmes abbots      0.0000312
##  5 The Adventures of Sherlock Holmes aberdeen    0.0000623
##  6 The Adventures of Sherlock Holmes abhorrent   0.0000312
##  7 The Adventures of Sherlock Holmes abiding     0.0000312
##  8 The Adventures of Sherlock Holmes abjure      0.0000312
##  9 The Adventures of Sherlock Holmes abnormal    0.0000312
## 10 The Adventures of Sherlock Holmes abnormally  0.0000312
## # … with 14,785 more rows

  7. Our goal is to create a column for each book whose values are the proportions of each word. We will use the pivot_wider() function, which you saw used in the chapter if you are working from the electronic text. (If you are using a physical copy of the text, you instead saw the functions gather() and spread() used to pivot data frames between wide and long formats; pivot_wider() is a more up-to-date replacement for spread().) pivot_wider() takes two key arguments: names_from identifies the column whose values should become the new column names, and values_from identifies the column whose values will fill those new columns. A toy example follows the expected output below. Pipe (%>%) the data frame from the previous problem into this function:

pivot_wider(names_from = title, values_from = proportion)

The resulting data frame should look like this:

## # A tibble: 10,644 x 3
##    word       `The Adventures of Sherlock Holmes` `The Return of Sherlock Holme…
##    <chr>                                    <dbl>                          <dbl>
##  1 a                                    0.0000312                      0.0000840
##  2 abandoned                            0.0000935                      0.0000280
##  3 abandons                             0.0000312                     NA        
##  4 abbots                               0.0000312                     NA        
##  5 aberdeen                             0.0000623                     NA        
##  6 abhorrent                            0.0000312                      0.0000560
##  7 abiding                              0.0000312                     NA        
##  8 abjure                               0.0000312                     NA        
##  9 abnormal                             0.0000312                     NA        
## 10 abnormally                           0.0000312                      0.0000280
## # … with 10,634 more rows
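
And here is pivot_wider() on a small made-up long-format frame; note how a word that is missing from one title becomes NA, just as in the output above:

tribble(~word, ~title, ~proportion,
        "cat", "A",    0.75,
        "dog", "A",    0.25,
        "cat", "B",    0.50,
        "fox", "B",    0.50) %>%
  pivot_wider(names_from = title, values_from = proportion)
#"cat" gets a value in both new columns; "dog" and "fox" get NA where they did not appear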

Solution:

bind_rows(mutate(tidy_sherlock, title = "The Adventures of Sherlock Holmes"),
          mutate(tidy_return_sherlock, title = "The Return of Sherlock Holmes")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(title,word) %>%
  group_by(title) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  ungroup() %>%
  pivot_wider(names_from = title, values_from = proportion)
## # A tibble: 10,644 x 3
##    word       `The Adventures of Sherlock Holmes` `The Return of Sherlock Holme…
##    <chr>                                    <dbl>                          <dbl>
##  1 a                                    0.0000312                      0.0000840
##  2 abandoned                            0.0000935                      0.0000280
##  3 abandons                             0.0000312                     NA        
##  4 abbots                               0.0000312                     NA        
##  5 aberdeen                             0.0000623                     NA        
##  6 abhorrent                            0.0000312                      0.0000560
##  7 abiding                              0.0000312                     NA        
##  8 abjure                               0.0000312                     NA        
##  9 abnormal                             0.0000312                     NA        
## 10 abnormally                           0.0000312                      0.0000280
## # … with 10,634 more rows

  8. Finally, we are ready to visualize the word frequencies. We will make a scatter plot similar to the one created in the textbook, using the word frequencies from each book as the coordinates. A geom_text() layer labels the points with their associated words. First, load the scales library, and then pipe (%>%) the data frame from the previous problem into the ggplot() command below. The axis scales are logarithmic so that the scatter plot is less crowded. Which words were used more often in The Adventures of Sherlock Holmes? What about in The Return of Sherlock Holmes?

  ggplot(aes(`The Adventures of Sherlock Holmes`,`The Return of Sherlock Holmes`)) + 
  #We are using geom_jitter() rather than geom_point() so that the points are not plotted on top of one another
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  #geom_abline() adds the diagonal line. Words close to the diagonal are used equally frequently in the books
  geom_abline(lty = 2) + 
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) + 
  scale_y_log10(labels = percent_format()) +
  #Note: no color aesthetic is mapped in this plot, so this gradient scale has no visible effect
  scale_color_gradient(limits = c(0,0.001), low = "darkslategray4", high = "gray75") +
  ggtitle("Word Frequencies Comparison")

Solution:

library(scales)

bind_rows(mutate(tidy_sherlock, title = "The Adventures of Sherlock Holmes"),
          mutate(tidy_return_sherlock, title = "The Return of Sherlock Holmes")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(title,word) %>%
  group_by(title) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  ungroup() %>%
  pivot_wider(names_from = title, values_from = proportion) %>%
  ggplot(aes(`The Adventures of Sherlock Holmes`,`The Return of Sherlock Holmes`)) + 
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_abline(lty = 2) + 
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) + 
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0,0.001), low = "darkslategray4", high = "gray75") +
  ggtitle("Word Frequencies Comparison")

The Adventures of Sherlock Holmes uses the words “assizes”, “lip”, “angel”, “advertisement”, “st”, etc. more often. The Return of Sherlock Holmes uses the words “peter”, “hopkins”, “document”, “alert”, “appeal”, etc. more often.

  9. Try repeating these exercises with two other books from Project Gutenberg!
