This homework set gives you practice computing word frequency counts and carrying out an introductory analysis of texts from Project Gutenberg, an online repository of freely available books.
Feel free to ask questions or request help with errors in the Slack channel or during our weekly synchronous meeting.
For these problems, we will be using the gutenbergr library, which can be installed using the following syntax:
install.packages("gutenbergr")
library(gutenbergr)
Here is a quick introduction to some of the functionality included in gutenbergr. The following command provides a data frame of the texts that are available to download:
gutenberg_metadata
## # A tibble: 51,997 x 8
## gutenberg_id title author gutenberg_autho… language gutenberg_books… rights
## <int> <chr> <chr> <int> <chr> <chr> <chr>
## 1 0 <NA> <NA> NA en <NA> Publi…
## 2 1 "The … Jeffer… 1638 en United States L… Publi…
## 3 2 "The … United… 1 en American Revolu… Publi…
## 4 3 "John… Kenned… 1666 en <NA> Publi…
## 5 4 "Linc… Lincol… 3 en US Civil War Publi…
## 6 5 "The … United… 1 en American Revolu… Publi…
## 7 6 "Give… Henry,… 4 en American Revolu… Publi…
## 8 7 "The … <NA> NA en <NA> Publi…
## 9 8 "Abra… Lincol… 3 en US Civil War Publi…
## 10 9 "Abra… Lincol… 3 en US Civil War Publi…
## # … with 51,987 more rows, and 1 more variable: has_text <lgl>
This data frame can be wrangled using the dplyr functions that we learned last week. For example, we can search for all of the books on Project Gutenberg that were authored by Arthur Conan Doyle.
library(tidyverse)
gutenberg_metadata %>%
  filter(author == "Doyle, Arthur Conan") %>%
  select(gutenberg_id, title)
## # A tibble: 123 x 2
## gutenberg_id title
## <int> <chr>
## 1 108 "The Return of Sherlock Holmes"
## 2 126 "The Poison Belt"
## 3 139 "The Lost World"
## 4 221 "The Return of Sherlock Holmes"
## 5 244 "A Study in Scarlet"
## 6 290 "The Stark Munro Letters\r\nBeing series of twelve letters writ…
## 7 294 "The Captain of the Polestar, and Other Tales"
## 8 355 "The Parasite: A Story"
## 9 356 "Beyond the City"
## 10 423 "Round the Red Lamp: Being Facts and Fancies of Medical Life"
## # … with 113 more rows
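If the exact author string is not known, matching a pattern in the title is one alternative to the equality filter above. The sketch below is only an illustration: it runs on a three-row stand-in table (`toy_metadata` is a hypothetical object, not the real `gutenberg_metadata`), so it needs no download.

```r
library(dplyr)
library(stringr)

# Hypothetical three-row stand-in for gutenberg_metadata
toy_metadata <- tibble(
  gutenberg_id = c(108L, 1661L, 11L),
  title = c("The Return of Sherlock Holmes",
            "The Adventures of Sherlock Holmes",
            "Alice's Adventures in Wonderland"),
  author = c("Doyle, Arthur Conan", "Doyle, Arthur Conan", "Carroll, Lewis")
)

# str_detect() keeps every row whose title contains the pattern
holmes_books <- toy_metadata %>%
  filter(str_detect(title, "Sherlock Holmes")) %>%
  select(gutenberg_id, title)

holmes_books
```

The same `filter()` call should work on the full metadata table, although titles there can contain `NA` values, which `filter()` drops.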
The command gutenberg_download() can be used to download full texts from the Project Gutenberg website. The gutenbergr package is mentioned in Text Mining with R, but the default mirror for the package is no longer operational, so using the syntax from the textbook will result in an error. We can manually set the gutenberg_download() function to use an updated mirror. Let’s download The Adventures of Sherlock Holmes, whose gutenberg_id is 1661.
gutenberg_download(1661, mirror = "http://mirrors.xmission.com/gutenberg/")
## # A tibble: 12,648 x 2
## gutenberg_id text
## <int> <chr>
## 1 1661 "THE ADVENTURES OF SHERLOCK HOLMES"
## 2 1661 ""
## 3 1661 "by"
## 4 1661 ""
## 5 1661 "SIR ARTHUR CONAN DOYLE"
## 6 1661 ""
## 7 1661 ""
## 8 1661 ""
## 9 1661 " I. A Scandal in Bohemia"
## 10 1661 " II. The Red-headed League"
## # … with 12,638 more rows
Use the unnest_tokens() function from the tidytext library to create a data frame named tidy_sherlock which is tokenized by word. As a reminder, you can use the arrow operator (<-) to store the result of a series of commands into a new object, and you can print out the contents of an object just by calling its name. The result should look like this:
## # A tibble: 105,426 x 2
## gutenberg_id word
## <int> <chr>
## 1 1661 the
## 2 1661 adventures
## 3 1661 of
## 4 1661 sherlock
## 5 1661 holmes
## 6 1661 by
## 7 1661 sir
## 8 1661 arthur
## 9 1661 conan
## 10 1661 doyle
## # … with 105,416 more rows
Solution:
## # A tibble: 105,426 x 2
## gutenberg_id word
## <int> <chr>
## 1 1661 the
## 2 1661 adventures
## 3 1661 of
## 4 1661 sherlock
## 5 1661 holmes
## 6 1661 by
## 7 1661 sir
## 8 1661 arthur
## 9 1661 conan
## 10 1661 doyle
## # … with 105,416 more rows
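To see what the tokenization step does without downloading anything, here is a minimal sketch on a few lines of text. The object names `toy_text` and `toy_tidy` are hypothetical; in the assignment the input would be the data frame returned by gutenberg_download().

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the downloaded book: one row per line of text
toy_text <- tibble(
  gutenberg_id = 1661L,
  text = c("THE ADVENTURES OF SHERLOCK HOLMES", "by", "SIR ARTHUR CONAN DOYLE")
)

# unnest_tokens(output column, input column) splits each line into words
# and converts them to lowercase by default
toy_tidy <- toy_text %>%
  unnest_tokens(word, text)

toy_tidy
```

Each word ends up in its own row, and the other columns (here gutenberg_id) are carried along unchanged.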
Remove the stop words from tidy_sherlock. (stop_words is available when you load the tidytext library, but you can also use get_stopwords() to access stop word dictionaries from other sources and other languages.) What are the most commonly used words?
Solution:
tidy_sherlock <- tidy_sherlock %>%
  anti_join(stop_words)
tidy_sherlock %>%
  count(word) %>%
  arrange(desc(n)) %>%
  slice(1:30) %>%
  ggplot() +
  geom_col(aes(n, reorder(word, n), fill = word), show.legend = FALSE) +
  labs(x = "Count", y = NULL)
The most common words are “holmes”, “time”, “door”, “matter”, etc.
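The stop word removal and counting steps can be checked by hand on a small example. This is a sketch with made-up data: `toy_words` and `toy_stop` are hypothetical objects, and `toy_stop` stands in for the much longer stop_words table. It also shows two optional shortcuts: naming the join column with `by = "word"`, and `count(word, sort = TRUE)`, which gives the same ordering as count() followed by arrange(desc(n)).

```r
library(dplyr)

# Hypothetical tokenized text and a tiny stop word list
toy_words <- tibble(word = c("the", "holmes", "door", "the", "holmes", "of", "holmes"))
toy_stop <- tibble(word = c("the", "of"))

toy_counts <- toy_words %>%
  anti_join(toy_stop, by = "word") %>%  # drop rows whose word is in toy_stop
  count(word, sort = TRUE)              # count and sort by descending n

toy_counts
```

Here "holmes" is counted three times and "door" once, while "the" and "of" are removed.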
Repeat the steps above for The Return of Sherlock Holmes, which has gutenberg_id equal to 108. It would be a good idea to give the text data frame and tidied data frame different names (for example, return_sherlock and tidy_return_sherlock) because we will be using both books in the next problem.
Solution:
return_sherlock <- gutenberg_download(108, mirror = "http://mirrors.xmission.com/gutenberg/")
tidy_return_sherlock <- return_sherlock %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
tidy_return_sherlock %>%
  count(word) %>%
  arrange(desc(n)) %>%
  slice(1:30) %>%
  ggplot() +
  geom_col(aes(n, reorder(word, n), fill = word), show.legend = FALSE) +
  labs(x = "Count", y = NULL)
The most common words are “holmes”, “watson”, “sir”, “time”, etc. Perhaps Watson was more important in the later collection?
In the upcoming problem, we will compare the relative frequencies of words in The Adventures of Sherlock Holmes and The Return of Sherlock Holmes.
Use the bind_rows() and mutate() functions to concatenate the tidy_sherlock and tidy_return_sherlock data frames and add a column named title that identifies the book each word came from. Your result should look similar to what appears below. If you want to make sure that you have words from both texts in the resulting data frame, try using the head() and tail() functions.
Solution:
bind_rows(mutate(tidy_sherlock, title = "The Adventures of Sherlock Holmes"),
          mutate(tidy_return_sherlock, title = "The Return of Sherlock Holmes"))
## # A tibble: 67,793 x 3
## gutenberg_id word title
## <int> <chr> <chr>
## 1 1661 adventures The Adventures of Sherlock Holmes
## 2 1661 sherlock The Adventures of Sherlock Holmes
## 3 1661 holmes The Adventures of Sherlock Holmes
## 4 1661 sir The Adventures of Sherlock Holmes
## 5 1661 arthur The Adventures of Sherlock Holmes
## 6 1661 conan The Adventures of Sherlock Holmes
## 7 1661 doyle The Adventures of Sherlock Holmes
## 8 1661 scandal The Adventures of Sherlock Holmes
## 9 1661 bohemia The Adventures of Sherlock Holmes
## 10 1661 ii The Adventures of Sherlock Holmes
## # … with 67,783 more rows
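As a possible alternative to calling mutate() on each data frame, bind_rows() can create the label column itself through its `.id` argument, taking the labels from the argument names. The sketch below uses two tiny hypothetical data frames (`toy_adventures`, `toy_return`) in place of the tidied books.

```r
library(dplyr)

# Hypothetical stand-ins for tidy_sherlock and tidy_return_sherlock
toy_adventures <- tibble(word = c("holmes", "door"))
toy_return <- tibble(word = c("holmes", "watson", "sir"))

# The argument names become the values of the new "title" column
toy_books <- bind_rows(
  "The Adventures of Sherlock Holmes" = toy_adventures,
  "The Return of Sherlock Holmes" = toy_return,
  .id = "title"
)

head(toy_books, 2)  # rows from the first book
tail(toy_books, 2)  # rows from the second book
```

Note that this places title as the first column, so the column order differs from the mutate() version.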
Pipe %>% the previous data frame into a mutate command, mutate(word = str_extract(word, "[a-z']+")). As mentioned in the textbook, this regular expression removes the underscores that are used for emphasis in the text. Then use the count() function to count the number of times that each word appears in each book. Again, your output should match what is seen below.
## # A tibble: 14,795 x 3
## title word n
## <chr> <chr> <int>
## 1 The Adventures of Sherlock Holmes a 1
## 2 The Adventures of Sherlock Holmes abandoned 3
## 3 The Adventures of Sherlock Holmes abandons 1
## 4 The Adventures of Sherlock Holmes abbots 1
## 5 The Adventures of Sherlock Holmes aberdeen 2
## 6 The Adventures of Sherlock Holmes abhorrent 1
## 7 The Adventures of Sherlock Holmes abiding 1
## 8 The Adventures of Sherlock Holmes abjure 1
## 9 The Adventures of Sherlock Holmes abnormal 1
## 10 The Adventures of Sherlock Holmes abnormally 1
## # … with 14,785 more rows
Solution:
bind_rows(mutate(tidy_sherlock, title = "The Adventures of Sherlock Holmes"),
          mutate(tidy_return_sherlock, title = "The Return of Sherlock Holmes")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(title, word)
## # A tibble: 14,795 x 3
## title word n
## <chr> <chr> <int>
## 1 The Adventures of Sherlock Holmes a 1
## 2 The Adventures of Sherlock Holmes abandoned 3
## 3 The Adventures of Sherlock Holmes abandons 1
## 4 The Adventures of Sherlock Holmes abbots 1
## 5 The Adventures of Sherlock Holmes aberdeen 2
## 6 The Adventures of Sherlock Holmes abhorrent 1
## 7 The Adventures of Sherlock Holmes abiding 1
## 8 The Adventures of Sherlock Holmes abjure 1
## 9 The Adventures of Sherlock Holmes abnormal 1
## 10 The Adventures of Sherlock Holmes abnormally 1
## # … with 14,785 more rows
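To make the effect of the regular expression concrete, here is a small sketch applied to a handful of hypothetical tokens (`raw_tokens` is made up for illustration). str_extract() returns the first run of lowercase letters and apostrophes in each string, or NA when there is none.

```r
library(stringr)

# Hypothetical tokens, including one wrapped in emphasis markers
raw_tokens <- c("_any_", "holmes", "don't", "1661", "1st")

cleaned <- str_extract(raw_tokens, "[a-z']+")
cleaned
```

The emphasis markers around "any" are stripped, "1661" becomes NA because it contains no letters, and "1st" is reduced to "st".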
Add a column named proportion which divides each word frequency by the total number of words in each book, and use the select() function to drop the n column. Below is an example of what your result should look like.
## # A tibble: 14,795 x 3
## title word proportion
## <chr> <chr> <dbl>
## 1 The Adventures of Sherlock Holmes a 0.0000312
## 2 The Adventures of Sherlock Holmes abandoned 0.0000935
## 3 The Adventures of Sherlock Holmes abandons 0.0000312
## 4 The Adventures of Sherlock Holmes abbots 0.0000312
## 5 The Adventures of Sherlock Holmes aberdeen 0.0000623
## 6 The Adventures of Sherlock Holmes abhorrent 0.0000312
## 7 The Adventures of Sherlock Holmes abiding 0.0000312
## 8 The Adventures of Sherlock Holmes abjure 0.0000312
## 9 The Adventures of Sherlock Holmes abnormal 0.0000312
## 10 The Adventures of Sherlock Holmes abnormally 0.0000312
## # … with 14,785 more rows
Solution:
bind_rows(mutate(tidy_sherlock, title = "The Adventures of Sherlock Holmes"),
          mutate(tidy_return_sherlock, title = "The Return of Sherlock Holmes")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(title, word) %>%
  group_by(title) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  ungroup()
## # A tibble: 14,795 x 3
## title word proportion
## <chr> <chr> <dbl>
## 1 The Adventures of Sherlock Holmes a 0.0000312
## 2 The Adventures of Sherlock Holmes abandoned 0.0000935
## 3 The Adventures of Sherlock Holmes abandons 0.0000312
## 4 The Adventures of Sherlock Holmes abbots 0.0000312
## 5 The Adventures of Sherlock Holmes aberdeen 0.0000623
## 6 The Adventures of Sherlock Holmes abhorrent 0.0000312
## 7 The Adventures of Sherlock Holmes abiding 0.0000312
## 8 The Adventures of Sherlock Holmes abjure 0.0000312
## 9 The Adventures of Sherlock Holmes abnormal 0.0000312
## 10 The Adventures of Sherlock Holmes abnormally 0.0000312
## # … with 14,785 more rows
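The grouped calculation is easy to verify on a toy table. In this sketch, `toy_counts` is a hypothetical count table with shortened titles "A" and "B". Because of group_by(title), sum(n) is the total for each book rather than for the whole table.

```r
library(dplyr)

# Hypothetical word counts for two books, "A" and "B"
toy_counts <- tibble(
  title = c("A", "A", "B", "B", "B"),
  word = c("holmes", "door", "holmes", "watson", "sir"),
  n = c(3L, 1L, 2L, 1L, 1L)
)

toy_props <- toy_counts %>%
  group_by(title) %>%                  # sum(n) is now computed per title
  mutate(proportion = n / sum(n)) %>%
  ungroup() %>%
  select(-n)

toy_props
```

Book "A" has 4 words in total, so "holmes" gets 3/4 = 0.75. The proportions within each book add up to 1.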
Next, reshape the data frame with the pivot_wider() function, which you saw used in the chapter if you are working from the electronic text. If you are using a physical copy of the text, you saw the functions gather() and spread() used to pivot data frames between wide and long format; pivot_wider() is a more up-to-date version of the spread() function used in the textbook. The arguments of pivot_wider() include names_from, which should be set equal to the column containing the values that you would like to become the names of the new columns. The column containing the values which will fill the new columns is identified by values_from. Pipe %>% the data frame from the previous problem into this function:
pivot_wider(names_from = title, values_from = proportion)
The resulting data frame should look like this:
## # A tibble: 10,644 x 3
## word `The Adventures of Sherlock Holmes` `The Return of Sherlock Holme…
## <chr> <dbl> <dbl>
## 1 a 0.0000312 0.0000840
## 2 abandoned 0.0000935 0.0000280
## 3 abandons 0.0000312 NA
## 4 abbots 0.0000312 NA
## 5 aberdeen 0.0000623 NA
## 6 abhorrent 0.0000312 0.0000560
## 7 abiding 0.0000312 NA
## 8 abjure 0.0000312 NA
## 9 abnormal 0.0000312 NA
## 10 abnormally 0.0000312 0.0000280
## # … with 10,634 more rows
Solution:
bind_rows(mutate(tidy_sherlock, title = "The Adventures of Sherlock Holmes"),
          mutate(tidy_return_sherlock, title = "The Return of Sherlock Holmes")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(title, word) %>%
  group_by(title) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  ungroup() %>%
  pivot_wider(names_from = title, values_from = proportion)
## # A tibble: 10,644 x 3
## word `The Adventures of Sherlock Holmes` `The Return of Sherlock Holme…
## <chr> <dbl> <dbl>
## 1 a 0.0000312 0.0000840
## 2 abandoned 0.0000935 0.0000280
## 3 abandons 0.0000312 NA
## 4 abbots 0.0000312 NA
## 5 aberdeen 0.0000623 NA
## 6 abhorrent 0.0000312 0.0000560
## 7 abiding 0.0000312 NA
## 8 abjure 0.0000312 NA
## 9 abnormal 0.0000312 NA
## 10 abnormally 0.0000312 0.0000280
## # … with 10,634 more rows
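The reshaping step, and the reason NA values appear in the output, can be seen in a small sketch. `toy_props` is a hypothetical long-format table with shortened titles "A" and "B".

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format proportions: one row per (title, word) pair
toy_props <- tibble(
  title = c("A", "A", "B", "B"),
  word = c("holmes", "door", "holmes", "watson"),
  proportion = c(0.75, 0.25, 0.5, 0.5)
)

# One row per word, one column per title
toy_wide <- toy_props %>%
  pivot_wider(names_from = title, values_from = proportion)

toy_wide
```

A word that occurs in only one book, such as "door" here, has no value to place in the other book's column, so that cell is NA. This is also why the wide table has fewer rows than the long one.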
In the scatter plot below, the geom_text() layer labels the points with their associated words. First, load the scales library, and then pipe %>% the data frame from the previous problem into the ggplot() command below. The axis scales are logarithmic so that the scatter plot is less crowded. What words were more often used in The Adventures of Sherlock Holmes? What about in The Return of Sherlock Holmes?
ggplot(aes(`The Adventures of Sherlock Holmes`, `The Return of Sherlock Holmes`)) +
  # We are using geom_jitter() rather than geom_point() so that the points are not plotted on top of one another
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  # geom_abline() adds the diagonal line. Words close to the diagonal are used about equally often in both books
  geom_abline(lty = 2) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  ggtitle("Word Frequencies Comparison")
Solution:
library(scales)
bind_rows(mutate(tidy_sherlock, title = "The Adventures of Sherlock Holmes"),
          mutate(tidy_return_sherlock, title = "The Return of Sherlock Holmes")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(title, word) %>%
  group_by(title) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  ungroup() %>%
  pivot_wider(names_from = title, values_from = proportion) %>%
  ggplot(aes(`The Adventures of Sherlock Holmes`, `The Return of Sherlock Holmes`)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_abline(lty = 2) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  ggtitle("Word Frequencies Comparison")
The Adventures of Sherlock Holmes uses the words “assizes”, “lip”, “angel”, “advertisement”, “st”, etc. more often. The Return of Sherlock Holmes uses the words “peter”, “hopkins”, “document”, “alert”, “appeal”, etc. more often.
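Reading these words off a crowded plot can be difficult. One possible complement, which is not part of the assignment, is to rank the words by the log ratio of their proportions in the two books. The sketch below uses made-up proportions and hypothetical column names (`adventures_prop`, `return_prop`) in place of the two title columns.

```r
library(dplyr)

# Hypothetical wide-format proportions; NA means the word is absent from that book
toy_wide <- tibble(
  word = c("holmes", "assizes", "hopkins", "door"),
  adventures_prop = c(0.010, 0.0004, NA, 0.002),
  return_prop = c(0.012, NA, 0.0009, 0.002)
)

ranked <- toy_wide %>%
  filter(!is.na(adventures_prop), !is.na(return_prop)) %>%  # keep words found in both books
  mutate(log_ratio = log2(adventures_prop / return_prop)) %>%
  arrange(desc(log_ratio))

ranked
```

A positive log_ratio means the word is relatively more frequent in the first book, a negative value means it is more frequent in the second, and 0 corresponds to a point on the dashed diagonal. Words found in only one book are dropped here, just as they cannot be placed on the logarithmic axes.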