Note: We wrote two assignments for this week. This assignment continues using texts from the Project Gutenberg library, while the second assignment is a tutorial for conducting sentiment analysis on recent tweets. Both assignments focus on sentiment analysis, but the texts we analyse are quite different – feel free to do both assignments, or to choose just one. To access the Twitter Search API (required for the Twitter sentiment tutorial), you’ll need an existing Twitter account or to create one. If you are opposed to creating a Twitter account, please feel free to skip that assignment.
This homework assignment is intended as an opportunity to practice sentiment analysis, as introduced in Chapter 2 of Text Mining with R: A Tidy Approach. We’ll explore works by two very different authors (Charles Dickens and Edgar Rice Burroughs) from the Project Gutenberg library.
In order to complete this assignment you’ll need to load the following libraries into an R Markdown document or an R script: `gutenbergr`, `tidyverse`, `tidytext`, `reshape2`, and `wordcloud`.
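A minimal setup chunk might look like this (assuming each package is already installed):

```r
library(gutenbergr)  # gutenberg_metadata and gutenberg_download()
library(tidyverse)   # dplyr, stringr, ggplot2, and friends
library(tidytext)    # unnest_tokens(), get_sentiments(), stop_words
library(reshape2)    # acast(), used when building comparison clouds
library(wordcloud)   # wordcloud() and comparison.cloud()
```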
Recall that we can filter the `gutenberg_metadata` data frame to consider only our author(s) of interest and then select the `gutenberg_id` and `title` columns to see which book ids we will need. Use that technique to identify the works of both Dickens, Charles and Burroughs, Edgar Rice.
```r
gutenberg_metadata %>%
  filter(str_detect(author, "Burroughs, Edgar Rice")) %>%
  select(gutenberg_id, author, title)

## # A tibble: 57 x 3
##    gutenberg_id author                title
##           <int> <chr>                 <chr>
##  1           62 Burroughs, Edgar Rice A Princess of Mars
##  2           64 Burroughs, Edgar Rice The Gods of Mars
##  3           68 Burroughs, Edgar Rice Warlord of Mars
##  4           72 Burroughs, Edgar Rice Thuvia, Maid of Mars
##  5           78 Burroughs, Edgar Rice Tarzan of the Apes
##  6           81 Burroughs, Edgar Rice The Return of Tarzan
##  7           85 Burroughs, Edgar Rice The Beasts of Tarzan
##  8           90 Burroughs, Edgar Rice The Son of Tarzan
##  9           92 Burroughs, Edgar Rice Tarzan and the Jewels of Opar
## 10           96 Burroughs, Edgar Rice The Monster Men
## # ... with 47 more rows
```
```r
gutenberg_metadata %>%
  filter(str_detect(author, "Dickens, Charles")) %>%
  select(gutenberg_id, title)

## # A tibble: 164 x 2
##    gutenberg_id title
##           <int> <chr>
##  1           46 "A Christmas Carol in Prose; Being a Ghost Story of Christmas"
##  2           98 "A Tale of Two Cities"
##  3          564 "The Mystery of Edwin Drood"
##  4          580 "The Pickwick Papers"
##  5          588 "Master Humphrey's Clock"
##  6          644 "The Haunted Man and the Ghost's Bargain"
##  7          650 "Pictures from Italy"
##  8          653 "The Chimes\r\nA Goblin Story of Some Bells That Rang an Old Ye~
##  9          675 "American Notes"
## 10          676 "The Battle of Life"
## # ... with 154 more rows
```
Both of these authors wrote a lot – Burroughs has 57 works in Project Gutenberg (some are duplicates), while Dickens has 164. Let’s narrow this down so that we include just the 15 novels Dickens wrote and the 5 books Burroughs wrote in his Barsoom series. The Gutenberg ids for Dickens’ novels are 98, 580, 564, 730, 766, 700, 917, 968, 967, 1023, 1400, 963, 786, 821, 883. The Gutenberg ids for Burroughs’ Barsoom series are 62, 64, 68, 72, and 1153.
Download these texts into two data frames named `dickens_novels` and `burroughs_barsoom_novels`. As a reminder, you can do this with the `gutenberg_download()` function, whose first argument is a vector of text ids and whose second argument is a mirror site to use for the download request – we will use `mirror = "http://mirrors.xmission.com/gutenberg/"`. You can print out the head of each data frame if you would like confirmation that you have pulled the works correctly.

```r
dickens_novels <- gutenberg_download(
  c(98, 580, 564, 730, 766, 700, 917, 968, 967, 1023, 1400, 963, 786, 821, 883),
  mirror = "http://mirrors.xmission.com/gutenberg/")

burroughs_barsoom_novels <- gutenberg_download(
  c(62, 64, 68, 72, 1153),
  mirror = "http://mirrors.xmission.com/gutenberg/")
```
```r
dickens_novels %>% head()

## # A tibble: 6 x 2
##   gutenberg_id text
##          <int> <chr>
## 1           98 "A TALE OF TWO CITIES"
## 2           98 ""
## 3           98 "A STORY OF THE FRENCH REVOLUTION"
## 4           98 ""
## 5           98 "By Charles Dickens"
## 6           98 ""
```
```r
burroughs_barsoom_novels %>% head()

## # A tibble: 6 x 2
##   gutenberg_id text
##          <int> <chr>
## 1           62 "[Frontispiece: With my back against a golden throne, I fought o~
## 2           62 "again for Dejah Thoris]"
## 3           62 ""
## 4           62 ""
## 5           62 ""
## 6           62 ""
```
Next, tidy each of these data frames with `unnest_tokens()`. Remember that we will want to group by book (`gutenberg_id`), we may want to track line numbers or chapters as shown in the textbook, and we can use informative names here by creating new objects with `tidy_` in the name. In general, if you are doing some chapter tracking, you will benefit from looking through the original text data frames we downloaded to see the structure of the chapter indicators. The regular expression you write must catch these chapter indicators – luckily the one we used earlier, `"^chapter [\\divxlc]"`, searches for lines beginning with (`^`) the word chapter followed by either a decimal digit (`\\d`) or a character representing a Roman numeral. Remember to set the `ignore_case` argument to `TRUE`, and this regular expression will work for all of these texts.
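As a quick sanity check (an illustrative snippet – the example strings are made up), you can test the pattern directly:

```r
# Only lines that *begin* with "chapter" plus a digit or Roman numeral match
str_detect(c("Chapter 1", "CHAPTER XII", "in this chapter we meet Scrooge"),
           regex("^chapter [\\divxlc]", ignore_case = TRUE))
## [1]  TRUE  TRUE FALSE
```

With the pattern confirmed, we can tidy both sets of novels: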
```r
tidy_dickens <- dickens_novels %>%
  group_by(gutenberg_id) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

tidy_barsoom <- burroughs_barsoom_novels %>%
  group_by(gutenberg_id) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
```
```r
tidy_dickens

## # A tibble: 3,879,523 x 4
##    gutenberg_id linenumber chapter word
##           <int>      <int>   <int> <chr>
##  1           98          1       0 a
##  2           98          1       0 tale
##  3           98          1       0 of
##  4           98          1       0 two
##  5           98          1       0 cities
##  6           98          3       0 a
##  7           98          3       0 story
##  8           98          3       0 of
##  9           98          3       0 the
## 10           98          3       0 french
## # ... with 3,879,513 more rows
```
```r
tidy_barsoom

## # A tibble: 345,393 x 4
##    gutenberg_id linenumber chapter word
##           <int>      <int>   <int> <chr>
##  1           62          1       0 frontispiece
##  2           62          1       0 with
##  3           62          1       0 my
##  4           62          1       0 back
##  5           62          1       0 against
##  6           62          1       0 a
##  7           62          1       0 golden
##  8           62          1       0 throne
##  9           62          1       0 i
## 10           62          1       0 fought
## # ... with 345,383 more rows
```
Now let’s take two sentiments from the `nrc` sentiment dictionary, `joy` and `fear`, and see which words each writer uses most often with these sentiments. First, create the `nrc_joy` and `nrc_fear` objects by getting sentiments from the `"nrc"` dictionary and filtering to include only the `"joy"` and `"fear"` sentiments respectively. Then take each of the tidy text data frames we built earlier, inner join it with your restricted sentiment dictionary, and get sorted word counts to identify the most common joy and fear words for each author (you’ll run four separate blocks of code, one for each author and sentiment combination). Report your results as either a table or a bar graph.
```r
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

nrc_fear <- get_sentiments("nrc") %>%
  filter(sentiment == "fear")
```

```r
tidy_dickens %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE) %>%
  slice(1:30) %>%
  ggplot() +
  geom_col(aes(x = reorder(word, n), y = n)) +
  labs(x = "", y = "Count", title = "Most Common Joyful Words",
       subtitle = "Charles Dickens Novels") +
  coord_flip()

## Joining, by = "word"
```

```r
tidy_dickens %>%
  inner_join(nrc_fear) %>%
  count(word, sort = TRUE) %>%
  slice(1:30) %>%
  ggplot() +
  geom_col(aes(x = reorder(word, n), y = n)) +
  labs(x = "", y = "Count", title = "Most Common Fearful Words",
       subtitle = "Charles Dickens Novels") +
  coord_flip()

## Joining, by = "word"
```

```r
tidy_barsoom %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE) %>%
  slice(1:30) %>%
  ggplot() +
  geom_col(aes(x = reorder(word, n), y = n)) +
  labs(x = "", y = "Count", title = "Most Common Joyful Words",
       subtitle = "Barsoom Novels") +
  coord_flip()

## Joining, by = "word"
```

```r
tidy_barsoom %>%
  inner_join(nrc_fear) %>%
  count(word, sort = TRUE) %>%
  slice(1:30) %>%
  ggplot() +
  geom_col(aes(x = reorder(word, n), y = n)) +
  labs(x = "", y = "Count", title = "Most Common Fearful Words",
       subtitle = "Barsoom Novels") +
  coord_flip()

## Joining, by = "word"
```
We’ll do some analysis on each book next. Currently our data frames list the numeric `gutenberg_id` but not a book title, so let’s join the book titles onto our tidy data frames. The code below may be new to you – a `left_join()` is similar to an inner join in that we are combining information from two tables. The difference is that with a `left_join()` we start with the table on the left and add information from the table on the right wherever matches exist (here, where the `gutenberg_id`s match); with an `inner_join()`, if no match were found, the corresponding row in the initial table would be dropped from the resulting data frame. Also, notice that we aren’t joining all of the information from the `gutenberg_metadata` data frame – we are only including the `gutenberg_id` column (to make the matches) and the `title` column (the information we want to add). Ask any questions you have on Slack!
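To see the difference concretely, here is a tiny toy illustration (the `books` and `titles` data frames are made up for this example and are not part of the assignment):

```r
books  <- tibble(gutenberg_id = c(1, 2), word = c("alpha", "beta"))
titles <- tibble(gutenberg_id = 1, title = "Book One")

left_join(books, titles, by = "gutenberg_id")   # keeps the id-2 row, title is NA
inner_join(books, titles, by = "gutenberg_id")  # drops the id-2 row entirely
```

Now let’s join the titles onto our real data frames: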
```r
tidy_barsoom <- tidy_barsoom %>%
  left_join(gutenberg_metadata %>% select(gutenberg_id, title))

tidy_dickens <- tidy_dickens %>%
  left_join(gutenberg_metadata %>% select(gutenberg_id, title))

tidy_barsoom

## # A tibble: 345,393 x 5
##    gutenberg_id linenumber chapter word         title
##           <int>      <int>   <int> <chr>        <chr>
##  1           62          1       0 frontispiece A Princess of Mars
##  2           62          1       0 with         A Princess of Mars
##  3           62          1       0 my           A Princess of Mars
##  4           62          1       0 back         A Princess of Mars
##  5           62          1       0 against      A Princess of Mars
##  6           62          1       0 a            A Princess of Mars
##  7           62          1       0 golden       A Princess of Mars
##  8           62          1       0 throne       A Princess of Mars
##  9           62          1       0 i            A Princess of Mars
## 10           62          1       0 fought       A Princess of Mars
## # ... with 345,383 more rows
```
Notice that we now have a new column for `title` in our data frame.
Next, let’s trace how sentiment shifts across the narrative arc of each novel. Using the `bing` lexicon, we count positive and negative words in 150-line chunks of each Dickens novel and plot the net sentiment:

```r
tidy_dickens %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, index = linenumber %/% 150, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative) %>%
  ggplot() +
  geom_col(aes(x = index, y = net_sentiment, fill = title), show.legend = FALSE) +
  facet_wrap(~title, ncol = 3, scales = "free_x")

## Joining, by = "word"
```
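Here `%/%` is integer division, so `linenumber %/% 150` assigns each line to a 150-line chunk. A quick illustration:

```r
# Consecutive line numbers fall into fixed-width bins of 150 lines each
c(1, 149, 150, 299, 300) %/% 150
## [1] 0 0 1 1 2
```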
The Barsoom novels are shorter, so for them we use 60-line chunks:

```r
tidy_barsoom %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, index = linenumber %/% 60, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative) %>%
  ggplot() +
  geom_col(aes(x = index, y = net_sentiment, fill = title), show.legend = FALSE) +
  facet_wrap(~title, ncol = 2, scales = "free_x")

## Joining, by = "word"
```
We can also ask which words contribute most to the negative sentiment in each Barsoom book. Note the `reorder_within()` and `scale_x_reordered()` pair from tidytext, which keeps the bars sorted within each facet:

```r
tidy_barsoom %>%
  inner_join(get_sentiments("bing")) %>%
  filter(sentiment == "negative") %>%
  group_by(title) %>%
  count(word) %>%
  top_n(15) %>%
  ungroup() %>%
  mutate(title = as.factor(title),
         word = reorder_within(word, n, title)) %>%
  ggplot() +
  geom_col(aes(x = word, y = n, fill = title), show.legend = FALSE) +
  labs(x = "", y = "", title = "Most Common Negative Words",
       subtitle = "Barsoom Series") +
  coord_flip() +
  scale_x_reordered() +
  facet_wrap(~title, ncol = 2, scales = "free")

## Joining, by = "word"
## Selecting by n
```
Now let’s move to word clouds. Build a word cloud of the most common words for each author, first removing stop words with an `anti_join()` against `stop_words`. Once you have a word cloud for each author, can you generate a separate word cloud for each individual text?
```r
tidy_dickens %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 150, scale = c(4, 0.05)))

## Joining, by = "word"
```

```r
tidy_barsoom %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, scale = c(4, 0.05)))

## Joining, by = "word"
```
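For an individual text, one possible sketch is to filter to a single book before counting – here A Princess of Mars, whose Gutenberg id is 62:

```r
tidy_barsoom %>%
  filter(gutenberg_id == 62) %>%   # or filter(title == "A Princess of Mars")
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, scale = c(4, 0.05)))
```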
Next, build comparison clouds. You can use the `bing` sentiment dictionary to construct a comparison of positive and negative tokens for each author, or use `nrc` and compare two different sentiments. Again, generate two word clouds, one for each author.
```r
tidy_dickens %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment %in% c("fear", "joy")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("purple", "gold"), max.words = 150)

## Joining, by = "word"
```

```r
tidy_barsoom %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment %in% c("fear", "joy")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("purple", "gold"), max.words = 150)

## Joining, by = "word"
```
Finally, keep in mind that the `bing` and `nrc` sentiment dictionaries were validated on modern text (like tweets). Are there words which have positive or negative connotations in their current usage but had a neutral connotation when these books were written? Try recreating the comparison clouds from the previous question with these words filtered out.
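One possible sketch: collect the suspect words in a vector and drop them before counting. The words below are placeholders – substitute the ones you actually find when inspecting the clouds above:

```r
# Hypothetical examples of words whose modern sentiment may not match
# their 19th-century usage; replace with your own list.
modern_only <- c("gay", "count", "miss")

tidy_dickens %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment %in% c("fear", "joy"),
         !word %in% modern_only) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("purple", "gold"), max.words = 150)
```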
Nice work! In completing this assignment you used a search technique to identify works by particular authors included in Project Gutenberg, downloaded a selection of those works, and conducted side-by-side sentiment analyses comparing the selected works of each author. Even with just these first two chapters behind us, we’ve gained some really useful tools for analysing the tone of a corpus and for comparing authors or individual works. What we’ve done here is not only applicable to literature, either: we could run similar analyses on legal documents, emails, text messages, tweets, and more. Check out this week’s last supplement to learn how to connect to the Twitter Search API and to pull and analyse recent tweets.