Note: We wrote two assignments for this week. This assignment continues using texts from the Project Gutenberg library, while the second assignment is a tutorial for conducting sentiment analysis on recent tweets. Both assignments focus on sentiment analysis, but the texts we analyse are quite different – feel free to do both assignments, or to choose just one. To access the Twitter Search API (required for the Twitter sentiment tutorial), you’ll need an existing Twitter account or to create one. If you are opposed to creating a Twitter account, please feel free to skip that assignment.
This homework assignment is intended as an opportunity to practice sentiment analysis, as introduced in Chapter 2 of Text Mining with R: A Tidy Approach. We’ll explore works by two very different authors (Charles Dickens and Edgar Rice Burroughs) from the Project Gutenberg library.
In order to complete this assignment you’ll need to load the following libraries into an R Markdown document or an R script: `gutenbergr`, `tidyverse`, `tidytext`, `reshape2`, and `wordcloud`.
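A minimal setup chunk might look like this (assuming each package is already installed):

```r
library(gutenbergr)  # gutenberg_metadata and gutenberg_download()
library(tidyverse)   # dplyr, stringr, ggplot2, and friends
library(tidytext)    # unnest_tokens(), get_sentiments(), stop_words
library(reshape2)    # acast(), used when building comparison clouds
library(wordcloud)   # wordcloud() and comparison.cloud()
```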
Recall that we can filter the `gutenberg_metadata` data frame to consider only our author(s) of interest and then select the `gutenberg_id` and `title` columns to see which book ids we will need. Use that technique to identify the works of both Dickens, Charles and Burroughs, Edgar Rice.
```r
gutenberg_metadata %>%
  filter(str_detect(author, "Burroughs, Edgar Rice")) %>%
  select(gutenberg_id, author, title)

## # A tibble: 57 x 3
##    gutenberg_id author                title
##           <int> <chr>                 <chr>
##  1           62 Burroughs, Edgar Rice A Princess of Mars
##  2           64 Burroughs, Edgar Rice The Gods of Mars
##  3           68 Burroughs, Edgar Rice Warlord of Mars
##  4           72 Burroughs, Edgar Rice Thuvia, Maid of Mars
##  5           78 Burroughs, Edgar Rice Tarzan of the Apes
##  6           81 Burroughs, Edgar Rice The Return of Tarzan
##  7           85 Burroughs, Edgar Rice The Beasts of Tarzan
##  8           90 Burroughs, Edgar Rice The Son of Tarzan
##  9           92 Burroughs, Edgar Rice Tarzan and the Jewels of Opar
## 10           96 Burroughs, Edgar Rice The Monster Men
## # ... with 47 more rows
```
```r
gutenberg_metadata %>%
  filter(str_detect(author, "Dickens, Charles")) %>%
  select(gutenberg_id, title)

## # A tibble: 164 x 2
##    gutenberg_id title
##           <int> <chr>
##  1           46 "A Christmas Carol in Prose; Being a Ghost Story of Christmas"
##  2           98 "A Tale of Two Cities"
##  3          564 "The Mystery of Edwin Drood"
##  4          580 "The Pickwick Papers"
##  5          588 "Master Humphrey's Clock"
##  6          644 "The Haunted Man and the Ghost's Bargain"
##  7          650 "Pictures from Italy"
##  8          653 "The Chimes\r\nA Goblin Story of Some Bells That Rang an Old Ye~
##  9          675 "American Notes"
## 10          676 "The Battle of Life"
## # ... with 154 more rows
```
Both of these authors wrote a lot – Burroughs has 57 works in Project Gutenberg (some are duplicates), while Dickens has 164. Let’s narrow this down so that we include just the 15 novels Dickens wrote and the 5 books Burroughs wrote in his Barsoom series. The Gutenberg ids for Dickens’ novels are 98, 580, 564, 730, 766, 700, 917, 968, 967, 1023, 1400, 963, 786, 821, 883. The Gutenberg ids for Burroughs’ Barsoom series are 62, 64, 68, 72, and 1153.
Download these texts into two data frames named `dickens_novels` and `burroughs_barsoom_novels`. As a reminder, you can do this with the `gutenberg_download()` function, whose first argument is a vector of text ids and whose second argument is a mirror site to use for the download request – we will use `mirror = "http://mirrors.xmission.com/gutenberg/"`. You can print out the head of each data frame if you would like confirmation that you have pulled the works correctly.

```r
dickens_novels <- gutenberg_download(
  c(98, 580, 564, 730, 766, 700, 917, 968, 967, 1023, 1400, 963, 786, 821, 883),
  mirror = "http://mirrors.xmission.com/gutenberg/")

burroughs_barsoom_novels <- gutenberg_download(
  c(62, 64, 68, 72, 1153),
  mirror = "http://mirrors.xmission.com/gutenberg/")
```
```r
dickens_novels %>% head()

## # A tibble: 6 x 2
##   gutenberg_id text
##          <int> <chr>
## 1           98 "A TALE OF TWO CITIES"
## 2           98 ""
## 3           98 "A STORY OF THE FRENCH REVOLUTION"
## 4           98 ""
## 5           98 "By Charles Dickens"
## 6           98 ""
```
```r
burroughs_barsoom_novels %>% head()

## # A tibble: 6 x 2
##   gutenberg_id text
##          <int> <chr>
## 1           62 "[Frontispiece: With my back against a golden throne, I fought o~
## 2           62 "again for Dejah Thoris]"
## 3           62 ""
## 4           62 ""
## 5           62 ""
## 6           62 ""
```
Next, tidy each of these data frames with `unnest_tokens()`. Remember that we will want to group by book (`gutenberg_id`), we may want to track line numbers or chapters as shown in the textbook, and we can use informative names here by creating new objects with `tidy_` in the name. In general, if you are doing some chapter tracking, you will benefit from looking through the original text data frames we downloaded to see the structure of the chapter indicators. The regular expression you write must catch these chapter indicators – luckily the one we used earlier, `"^chapter [\\divxlc]"`, searches for lines beginning with (`^`) the word chapter followed by either a decimal digit (`\\d`) or a character representing a Roman numeral. Remember to set the `ignore_case` argument to `TRUE`, and this regular expression will work for all of these texts.
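As a quick sanity check (an illustrative snippet – the example strings are made up), you can test the pattern directly:

```r
# Only lines that *begin* with "chapter" plus a digit or Roman numeral match
str_detect(c("Chapter 1", "CHAPTER XII", "in this chapter we meet Scrooge"),
           regex("^chapter [\\divxlc]", ignore_case = TRUE))
## [1]  TRUE  TRUE FALSE
```

With the pattern confirmed, we can tidy both sets of novels: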
```r
tidy_dickens <- dickens_novels %>%
  group_by(gutenberg_id) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

tidy_barsoom <- burroughs_barsoom_novels %>%
  group_by(gutenberg_id) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
```
```r
tidy_dickens

## # A tibble: 3,879,523 x 4
##    gutenberg_id linenumber chapter word
##           <int>      <int>   <int> <chr>
##  1           98          1       0 a
##  2           98          1       0 tale
##  3           98          1       0 of
##  4           98          1       0 two
##  5           98          1       0 cities
##  6           98          3       0 a
##  7           98          3       0 story
##  8           98          3       0 of
##  9           98          3       0 the
## 10           98          3       0 french
## # ... with 3,879,513 more rows
```
```r
tidy_barsoom

## # A tibble: 345,393 x 4
##    gutenberg_id linenumber chapter word
##           <int>      <int>   <int> <chr>
##  1           62          1       0 frontispiece
##  2           62          1       0 with
##  3           62          1       0 my
##  4           62          1       0 back
##  5           62          1       0 against
##  6           62          1       0 a
##  7           62          1       0 golden
##  8           62          1       0 throne
##  9           62          1       0 i
## 10           62          1       0 fought
## # ... with 345,383 more rows
```
Now let’s take two sentiments from the `nrc` sentiment dictionary, `joy` and `fear`, and see which words each writer uses most often with these sentiments. First, create the `nrc_joy` and `nrc_fear` objects by getting sentiments from the `"nrc"` dictionary and filtering to include only the `"joy"` and `"fear"` sentiments respectively. Then take each of the tidy text data frames we built earlier, inner join it with your restricted sentiment dictionary, and get sorted word counts to identify the most common joy and fear words for each author (you’ll run four separate blocks of code, one for each author and sentiment combination). Report your results as either a table or a bar graph.
```r
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

nrc_fear <- get_sentiments("nrc") %>%
  filter(sentiment == "fear")
```

```r
tidy_dickens %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE) %>%
  slice(1:30) %>%
  ggplot() +
  geom_col(aes(x = reorder(word, n), y = n)) +
  labs(x = "", y = "Count", title = "Most Common Joyful Words",
       subtitle = "Charles Dickens Novels") +
  coord_flip()

## Joining, by = "word"
```

```r
tidy_dickens %>%
  inner_join(nrc_fear) %>%
  count(word, sort = TRUE) %>%
  slice(1:30) %>%
  ggplot() +
  geom_col(aes(x = reorder(word, n), y = n)) +
  labs(x = "", y = "Count", title = "Most Common Fearful Words",
       subtitle = "Charles Dickens Novels") +
  coord_flip()

## Joining, by = "word"
```

```r
tidy_barsoom %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE) %>%
  slice(1:30) %>%
  ggplot() +
  geom_col(aes(x = reorder(word, n), y = n)) +
  labs(x = "", y = "Count", title = "Most Common Joyful Words",
       subtitle = "Barsoom Novels") +
  coord_flip()

## Joining, by = "word"
```

```r
tidy_barsoom %>%
  inner_join(nrc_fear) %>%
  count(word, sort = TRUE) %>%
  slice(1:30) %>%
  ggplot() +
  geom_col(aes(x = reorder(word, n), y = n)) +
  labs(x = "", y = "Count", title = "Most Common Fearful Words",
       subtitle = "Barsoom Novels") +
  coord_flip()

## Joining, by = "word"
```
We’ll do some analysis on each book next. Currently our data frames list the numeric `gutenberg_id` but not a book title, so let’s join the book titles onto our tidy data frames. The code below may be new to you – a `left_join()` is similar to an inner join in that we are combining information from two tables. The difference is that with a `left_join()` we start with the table on the left and add information from the table on the right wherever matches exist (here, where the `gutenberg_id`s match); with an `inner_join()`, if no match were found, the corresponding row in the initial table would be dropped from the resulting data frame. Also, notice that we aren’t joining all of the information from the `gutenberg_metadata` data frame – we are only including the `gutenberg_id` column (to make the matches) and the `title` column (the information we want to add). Ask any questions you have on Slack!
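To see the difference concretely, here is a tiny toy illustration (the `books` and `titles` data frames are made up for this example and are not part of the assignment):

```r
books  <- tibble(gutenberg_id = c(1, 2), word = c("alpha", "beta"))
titles <- tibble(gutenberg_id = 1, title = "Book One")

left_join(books, titles, by = "gutenberg_id")   # keeps the id-2 row, title is NA
inner_join(books, titles, by = "gutenberg_id")  # drops the id-2 row entirely
```

Now let’s join the titles onto our real data frames: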
```r
tidy_barsoom <- tidy_barsoom %>%
  left_join(gutenberg_metadata %>% select(gutenberg_id, title))

tidy_dickens <- tidy_dickens %>%
  left_join(gutenberg_metadata %>% select(gutenberg_id, title))

tidy_barsoom

## # A tibble: 345,393 x 5
##    gutenberg_id linenumber chapter word         title
##           <int>      <int>   <int> <chr>        <chr>
##  1           62          1       0 frontispiece A Princess of Mars
##  2           62          1       0 with         A Princess of Mars
##  3           62          1       0 my           A Princess of Mars
##  4           62          1       0 back         A Princess of Mars
##  5           62          1       0 against      A Princess of Mars
##  6           62          1       0 a            A Princess of Mars
##  7           62          1       0 golden       A Princess of Mars
##  8           62          1       0 throne       A Princess of Mars
##  9           62          1       0 i            A Princess of Mars
## 10           62          1       0 fought       A Princess of Mars
## # ... with 345,383 more rows
```
Notice that we now have a new column for `title` in our data frame.
Next, let’s trace how sentiment shifts across the narrative arc of each novel. Using the `bing` lexicon, we count positive and negative words in 150-line chunks of each Dickens novel and plot the net sentiment:

```r
tidy_dickens %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, index = linenumber %/% 150, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative) %>%
  ggplot() +
  geom_col(aes(x = index, y = net_sentiment, fill = title), show.legend = FALSE) +
  facet_wrap(~title, ncol = 3, scales = "free_x")

## Joining, by = "word"
```
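Here `%/%` is integer division, so `linenumber %/% 150` assigns each line to a 150-line chunk. A quick illustration:

```r
# Consecutive line numbers fall into fixed-width bins of 150 lines each
c(1, 149, 150, 299, 300) %/% 150
## [1] 0 0 1 1 2
```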
The Barsoom novels are shorter, so for them we use 60-line chunks:

```r
tidy_barsoom %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, index = linenumber %/% 60, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative) %>%
  ggplot() +
  geom_col(aes(x = index, y = net_sentiment, fill = title), show.legend = FALSE) +
  facet_wrap(~title, ncol = 2, scales = "free_x")

## Joining, by = "word"
```
We can also ask which words contribute most to the negative sentiment in each Barsoom book. Note the `reorder_within()` and `scale_x_reordered()` pair from tidytext, which keeps the bars sorted within each facet:

```r
tidy_barsoom %>%
  inner_join(get_sentiments("bing")) %>%
  filter(sentiment == "negative") %>%
  group_by(title) %>%
  count(word) %>%
  top_n(15) %>%
  ungroup() %>%
  mutate(title = as.factor(title),
         word = reorder_within(word, n, title)) %>%
  ggplot() +
  geom_col(aes(x = word, y = n, fill = title), show.legend = FALSE) +
  labs(x = "", y = "", title = "Most Common Negative Words",
       subtitle = "Barsoom Series") +
  coord_flip() +
  scale_x_reordered() +
  facet_wrap(~title, ncol = 2, scales = "free")

## Joining, by = "word"
## Selecting by n
```
Now let’s move to word clouds. Build a word cloud of the most common words for each author, first removing stop words with an `anti_join()` against `stop_words`. Once you have a word cloud for each author, can you generate a separate word cloud for each individual text?
```r
tidy_dickens %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 150, scale = c(4, 0.05)))

## Joining, by = "word"
```

```r
tidy_barsoom %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, scale = c(4, 0.05)))

## Joining, by = "word"
```
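For an individual text, one possible sketch is to filter to a single book before counting – here A Princess of Mars, whose Gutenberg id is 62:

```r
tidy_barsoom %>%
  filter(gutenberg_id == 62) %>%   # or filter(title == "A Princess of Mars")
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, scale = c(4, 0.05)))
```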
Next, build comparison clouds. You can use the `bing` sentiment dictionary to construct a comparison of positive and negative tokens for each author, or use `nrc` and compare two different sentiments. Again, generate two word clouds, one for each author.
```r
tidy_dickens %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment %in% c("fear", "joy")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("purple", "gold"), max.words = 150)

## Joining, by = "word"
```

```r
tidy_barsoom %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment %in% c("fear", "joy")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("purple", "gold"), max.words = 150)

## Joining, by = "word"
```
Finally, keep in mind that the `bing` and `nrc` sentiment dictionaries were validated on modern text (like tweets). Are there words which have positive or negative connotations in their current usage but had a neutral connotation when these books were written? Try recreating the comparison clouds from the previous question with these words filtered out.
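One possible sketch: collect the suspect words in a vector and drop them before counting. The words below are placeholders – substitute the ones you actually find when inspecting the clouds above:

```r
# Hypothetical examples of words whose modern sentiment may not match
# their 19th-century usage; replace with your own list.
modern_only <- c("gay", "count", "miss")

tidy_dickens %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment %in% c("fear", "joy"),
         !word %in% modern_only) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("purple", "gold"), max.words = 150)
```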
Nice work! In completing this assignment you used a search technique to identify works by particular authors included in Project Gutenberg, downloaded a selection of those works, and conducted side-by-side sentiment analyses comparing the selected works of each author. Even with just these first two chapters behind us, we’ve gained some really useful tools for analysing the tone of a corpus and for comparing authors or individual works. What we’ve done here is not only applicable to literature, either: we could run similar analyses on legal documents, emails, text messages, tweets, and more. Check out this week’s last supplement to learn how to connect to the Twitter Search API and to pull and analyse recent tweets.