This homework assignment is intended as an opportunity to practice locating terms which are important in identifying individual texts from a corpus (via tf-idf), as introduced in Chapter 3 of Text Mining with R, A Tidy Approach. We’ll look at articles from a relatively famous dataset often referred to as the 20 Newsgroups dataset and was first assembled by Ken Lang. Some of the code to create this week’s visuals gets lengthy – post in Slack if you have questions about any of it.
In order to complete this assignment you’ll need to load the following libraries into an R Markdown document or an R Script: tidyverse
, tidytext
, tm
, and wordcloud
.
After opening your Markdown or script file and loading the required libraries, let’s load the newsgroups data. You can do this with the following code:
<- read_csv("http://ssc.wisc.edu/~ahanna/20_newsgroups.csv") newsgroups
## Warning: Missing column names filled in: 'X1' [1]
##
## -- Column specification --------------------------------------------------------
## cols(
## X1 = col_double(),
## target = col_double(),
## text = col_character()
## )
glimpse(newsgroups)
## Rows: 11,314
## Columns: 3
## $ X1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ target <dbl> 9, 4, 11, 4, 0, 4, 5, 5, 13, 12, 17, 11, 10, 6, 5, 10, 7, 2, 11~
## $ text <chr> "From: cubbie@garnet.berkeley.edu ( ~
Now that you have the dataset, conduct some initial exploration. What features (variables/columns) does the dataset contain? Can you determine what the contents of each column actually corresponds to? You may notice that we don’t have all 20,000 of the messages from the original newsgroup dataset – we have just over 11,000 of them.
You may want to apply some of the techniques we learned in previous chapters to better understand each of the message boards. For example, we could build a wordcloud of the most common words appearing on message board 0 as follows:
%>%
newsgroups filter(target == 0) %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
with(wordcloud(word, n, max.words = 150, scale = c(4, 0.05)))
## Joining, by = "word"
Looks like we found the atheism message board! Take some time to explore the others. You could generate word-clouds for each board very efficiently by making use of a for
loop if you’d like, or you can generate them “by hand” if you change the filter()
manually.
Now that you’ve got a better idea of the contents of some of these message boards, let’s see if we have good intuition about the most important words in identifying which message board a message was posted to. Before writing and running the code, make some guesses.
newsgroups
dataset and tokeninze it with unnest_tokens()
X1
column.bind_tf_idf()
on the word
, target
, and n
columns to compute and attach the tf_idf
-related variables.tf_idf
score, and use slice_max()
to keep the top 10 or so tokens (according to tf_idf
score, not count).ungroup()
the data frame and use mutate()
to transform the target
variable to a categorical variable (use as.factor()
) and use reorder_within(word, tf_idf, target)
to reorder the word
variable by tf_idf
score within each target
grouping.ggplot()
and faceting as you’ve done in previous weeks to produce a barplot of the tokens with the largest tf_idf
score within each message board.%>%
newsgroups unnest_tokens(word, text) %>%
select(-X1) %>%
group_by(target) %>%
count(word) %>%
bind_tf_idf(word, target, n) %>%
arrange(desc(tf_idf)) %>%
slice_max(tf_idf, n = 10) %>%
ungroup() %>%
mutate(target = as.factor(target), word = reorder_within(word, tf_idf, target)) %>%
ggplot() +
geom_col(aes(x = word, y = tf_idf, fill = target), show.legend = FALSE) +
scale_x_reordered() +
labs(x = "", y = "", title = "Highest `tf-idf`", subtitle = "Newsgroups Message Boards") +
coord_flip() +
facet_wrap(~target, ncol = 2, scales = "free")
The tf_idf
score depends on both the frequency of a token within the document (or document-group as we’ve forced here) in question but also whether or not that token is present within the other documents in the corpus. This means that the tf_idf
scores and rankings will change if we consider a different corpus. For example, it looks like message boards 0, 15, and 19 all deal with some aspect of religion. Reproduce your plot analysis from above, but this time filter to include only these three message boards (hint: you can use either the %in%
operator or a series of or |
statements to help you). What happens to your results?
%>%
newsgroups filter(target %in% c(0, 15, 19)) %>%
unnest_tokens(word, text) %>%
select(-X1) %>%
group_by(target) %>%
count(word) %>%
bind_tf_idf(word, target, n) %>%
arrange(desc(tf_idf)) %>%
slice_max(tf_idf, n = 10) %>%
mutate(word = reorder(word, tf_idf)) %>%
ungroup() %>%
ggplot() +
geom_col(aes(x = word, y = tf_idf, fill = as.factor(target)), show.legend = FALSE) +
scale_x_reordered() +
labs(x = "", y = "", title = "Highest `tf-idf`", subtitle = "Newsgroups Message Boards") +
coord_flip() +
facet_wrap(~target, ncol = 2, scales = "free")
I hope you found that interesting. This is a great sneak peak into the challenges that come with applying machine learning algorithms to different datasets. In some scenarios model building is more challenging than others. Much of what we’ve been doing so far with our pursuit of NLP is engineering features from text (word counts, sentiment scores, ratios, and now tf_idf
score). The hope is that these features give us insights into things we might care about – the particular text, the author and their style – in particular, we hope that these insights help us differentiate texts or authors from one another. When we explored tf_idf
scores across all of the message boards in the corpus, it looked like the results were promising. At least the tokens for the message boards on automotives and motorcycles were quite different from those tokens most often associated with the message boards on sports. We gave ourselves a more difficult dataset once we increased the homogeneity of the topics discussed, by filtering down to only those message boards on topics adjacent to religion. It looks like the identities of the users who posted in each earned the highest tf_idf
scores because the actual conversation content was so similar across the three boards. Leveraging the users to differentiate the boards provides much less insight than actually considering the content of the posts. It is worth noting, however, that we still would have had trouble taking a post with a religious slant and assigning it to the correct message board with the tf_idf
information from the larger corpus though.
Okay, one final thing before we go. This chapter also discussed an interesting phenomenon, called Zipf’s Law. The law states that the relative frequency of a word within a document is typically proportional to the rank of that word (by frequency). Let’s explore Zipf’s Law and several works of Shakespeare.
<- gutenberg_download(seq(2242, 2258, by = 1), mirror = "http://mirrors.xmission.com/gutenberg/")
some_shakespeare_plays
<- some_shakespeare_plays %>%
some_shakespeare_plays left_join((gutenberg_metadata %>% select(gutenberg_id, title)))
## Joining, by = "gutenberg_id"
some_shakespeare_plays
## # A tibble: 65,911 x 3
## gutenberg_id text title
## <int> <chr> <chr>
## 1 2242 "Executive Director's Notes:" A Midsummer Night~
## 2 2242 "" A Midsummer Night~
## 3 2242 "In addition to the notes below, and so you ~ A Midsummer Night~
## 4 2242 "the spelling errors introduced by the print~ A Midsummer Night~
## 5 2242 "been corrected, here are the first few line~ A Midsummer Night~
## 6 2242 "are presented herein:" A Midsummer Night~
## 7 2242 "" A Midsummer Night~
## 8 2242 " Barnardo. Who's there?" A Midsummer Night~
## 9 2242 " Fran. Nay answer me: Stand & vnfold" A Midsummer Night~
## 10 2242 "your selfe" A Midsummer Night~
## # ... with 65,901 more rows
play_word_counts
by using unnest_tokens()
and then counting by title
and word
. Remember to set sort = TRUE
when you count (or use arrange()
after you’ve counted).play_words_total
by starting with play_word_counts
, then grouping by title
, and using summarize(total = sum(n))
as was done in the textbook.left_join()
play_words_total
onto play_word_counts
. You can overwrite the play_word_counts
object with this result.freq_by_rank
by starting with play_word_counts
, group by title
, then mutate two new columns rank
equal to row_number()
and then term_frequency
equal to n/total
, and then ungroup()
.freq_by_rank
into ggplot()
and build a geom_line()
layer with the aesthetics x = rank
, y = term_frequency
, and color = title
. You can also edit the size
and alpha
parameters as done in the book if you like. You’ll want to set show.legend = FALSE
as well. Finally, add the scale_x_log10()
and scale_y_log10()
layers.<- some_shakespeare_plays %>%
play_word_counts unnest_tokens(word, text) %>%
count(title, word, sort = TRUE)
<- play_word_counts %>%
play_words_total group_by(title) %>%
summarize(total = sum(n))
<- play_word_counts %>%
play_word_counts left_join(play_words_total)
## Joining, by = "title"
<- play_word_counts %>%
freq_by_rank group_by(title) %>%
mutate(rank = row_number(), term_frequency = n/total) %>%
ungroup()
%>%
freq_by_rank ggplot() +
geom_line(aes(x = rank, y = term_frequency, color = title), show.legend = FALSE) +
scale_x_log10() +
scale_y_log10() +
labs(x = "Token Rank", y = "Token Rel. Frequency", title = "Zipf's Law", subtitle = "Several Shakespeare Plays")
I don’t see any clear evidence in favor of the authorship question – at least through the lens of Zipf’s Law. Maybe we can get David to discuss the origin of this question, some of the evidence, and whether there is a consensus within the community of Shakespearean Scholars!