This homework assignment is an opportunity to practice converting between tidy and non-tidy formats for text data, and to use the non-tidy formats as input for modeling functions. As a reminder, the tidy format is very useful for EDA and data visualization, but it is not the shape expected by machine learning algorithms. It is often the case that we want to predict genre or authorship, recommend titles, etc. – in these cases, the document-term matrix can be a more appropriate representation of our text data. This is because, in all of these scenarios, the observation is not the particular token; it is the full text which we wish to classify or recommend.
This week we will work with movie descriptions, each labeled by their primary genres. We’ll start with an exploratory analysis and do some comparison of tidy and untidy organizations of our data. Next we’ll use what was covered in Chapter 6 to create a topic model based on the movie descriptions. We will see how well these models correspond to the primary genres for our movies.
Open up a new R Script or RMarkdown file and load the tidyverse, tidytext, tm, and topicmodels libraries. Remember to install any packages you’ve never used before by running install.packages("PACKAGE_NAME").
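For reference, that setup step might look like the following (the install.packages() call is commented out so it only runs when you actually need it):

```r
# Install once if needed, then load the libraries for this assignment.
# install.packages(c("tidyverse", "tidytext", "tm", "topicmodels"))
library(tidyverse)
library(tidytext)
library(tm)
library(topicmodels)
```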
You’ll read in the labeled data from our GitHub Repo using the following commands. Note that mutate() is adding a new column called movie_id and select() is just reordering the columns.
movies <- read_csv("https://raw.githubusercontent.com/agmath/FacultyUpskilling/main/2021_NLP/data/train.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## genre = col_character(),
## overview = col_character()
## )
movies <- movies %>%
  mutate(movie_id = row_number()) %>%
select(movie_id, genre, overview)
movies
## # A tibble: 4,294 x 3
## movie_id genre overview
## <int> <chr> <chr>
## 1 1 Drama "While serving time for insanity at a state mental hospital~
## 2 2 Drama "A re-imagined account of the early life of Maria Anna 'Nan~
## 3 3 Drama "A falsely accused nobleman survives years of slavery to ta~
## 4 4 Action "A detective specializing in missing children is on a madca~
## 5 5 Horror "An insatiable great white shark terrorizes the townspeople~
## 6 6 Drama "This is a story of a man in free fall. On the road to rede~
## 7 7 Action "Engineer Johnny Munroe is enlisted to build a railroad tun~
## 8 8 Comedy "Two air traffic controllers (John Cusack, Billy Bob Thornt~
## 9 9 History "\"Selma,\" as in Alabama, the place where segregation in t~
## 10 10 Comedy "Top student Jane Ryan heads to Manhattan for a college-sch~
## # ... with 4,284 more rows
Now that we have our data, let’s use our familiar workflow of unnesting tokens to create a tidy version of our data frame. Use unnest_tokens() to unnest the overview column.
movies_tidy <- movies %>%
  unnest_tokens(word, overview)
movies_tidy
## # A tibble: 224,827 x 3
## movie_id genre word
## <int> <chr> <chr>
## 1 1 Drama while
## 2 1 Drama serving
## 3 1 Drama time
## 4 1 Drama for
## 5 1 Drama insanity
## 6 1 Drama at
## 7 1 Drama a
## 8 1 Drama state
## 9 1 Drama mental
## 10 1 Drama hospital
## # ... with 224,817 more rows
You can see that the resulting data frame looks quite familiar. We have columns tracking the genre and movie_id, plus a new column containing each individual word from each description. This is our familiar tidy text data frame. Let’s do something we’ve done before and remove stop_words by using an anti_join(). You can overwrite your movies_tidy data frame here.
movies_tidy <- movies_tidy %>%
  anti_join(stop_words)
## Joining, by = "word"
movies_tidy
## # A tibble: 107,629 x 3
## movie_id genre word
## <int> <chr> <chr>
## 1 1 Drama serving
## 2 1 Drama time
## 3 1 Drama insanity
## 4 1 Drama mental
## 5 1 Drama hospital
## 6 1 Drama implacable
## 7 1 Drama rabble
## 8 1 Drama rouser
## 9 1 Drama randle
## 10 1 Drama patrick
## # ... with 107,619 more rows
That cut roughly 117,000 rows of very common words from our data frame. This can be an important step because it reduces the number of columns in the document-term matrix when we construct one. Now let’s build a new data frame that includes word counts for each word within each movie description. Call your new object movies_word_count, start with movies_tidy, group by movie_id, count the word occurrences, and then ungroup().
movies_word_count <- movies_tidy %>%
  group_by(movie_id) %>%
count(word) %>%
ungroup()
movies_word_count
## # A tibble: 100,721 x 3
## movie_id word n
## <int> <chr> <int>
## 1 1 authoritarian 1
## 2 1 fellow 1
## 3 1 head 1
## 4 1 hospital 1
## 5 1 implacable 1
## 6 1 insanity 1
## 7 1 inspires 1
## 8 1 mcmurphy 1
## 9 1 mental 1
## 10 1 mildred 1
## # ... with 100,711 more rows
Now let’s cast our tidy data frame as a Document-Term Matrix, which is a sparse matrix (meaning that it contains mostly 0’s) in which each row corresponds to a movie description and each column corresponds to a word present in our corpus of descriptions. The entries of this matrix denote the number of times the word corresponding to the column appears in the description corresponding to the row. Call your new object movies_dtm and create it using the cast_dtm() function.
movies_dtm <- movies_word_count %>%
  cast_dtm(movie_id, word, n)
movies_dtm
## <<DocumentTermMatrix (documents: 4294, terms: 20376)>>
## Non-/sparse entries: 100721/87393823
## Sparsity : 100%
## Maximal term length: NA
## Weighting : term frequency (tf)
Notice that R won’t print out the sparse matrix for us even if we ask for it by name. Instead, we are provided some information about the object. In particular, we can see that our matrix is almost 100% sparse: there are only 100,721 non-zero entries among the more than 87 million entries in the matrix. If you would like to see the terms represented by the first ten columns in our document-term matrix, you can do so by running Terms(movies_dtm)[1:10]. Try it. Change the indices to see terms represented by other columns as well.
Terms(movies_dtm)[1000:1020]
## [1] "1950s" "connecticut" "crisis" "housewife" "marital"
## [6] "mounting" "racial" "tensions" "attorney" "chelsea"
## [11] "chelsea's" "client" "corners" "crimes" "cut"
## [16] "deardon" "defence" "district" "emerge" "involved"
## [21] "kelly"
Okay, now that we’ve got our data into a format where every row represents a movie description and every column represents a feature of that description (i.e., a word it contains), we can try building a topic model using the LDA() function from R’s topicmodels package. As a reminder, we knew the primary genre of each movie in this dataset of descriptions from the start. There are 20 primary genres present in our movie descriptions dataset, which you can verify by running length(unique(movies$genre)). Use LDA() on our document-term matrix representing the movie descriptions and use 20 topics – it may take about five minutes or so to run. Be sure to store the result in movies_lda.
movies_lda <- LDA(movies_dtm, k = 20, control = list(seed = 1337))
movies_lda
## A LDA_VEM topic model with 20 topics.
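As a quick sanity check (this isn’t part of the assignment), the topicmodels package also provides a terms() method that lists the top few terms in each topic directly from the fitted model, without any tidying:

```r
# Top 5 terms per topic, straight from the fitted LDA object
terms(movies_lda, 5)
```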
Create an object called movie_genre_est by using the tidy() function with the arguments movies_lda and matrix = "beta". View the result – what do the columns in the output represent?
movie_genre_est <- tidy(movies_lda, matrix = "beta")
movie_genre_est
## # A tibble: 407,520 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 authoritarian 1.18e-198
## 2 2 authoritarian 2.44e-199
## 3 3 authoritarian 3.52e-199
## 4 4 authoritarian 5.26e-199
## 5 5 authoritarian 1.90e- 4
## 6 6 authoritarian 3.78e-199
## 7 7 authoritarian 3.72e-199
## 8 8 authoritarian 4.42e-199
## 9 9 authoritarian 1.66e-199
## 10 10 authoritarian 1.51e-199
## # ... with 407,510 more rows
movie_genre_est %>%
  pivot_wider(names_from = topic, values_from = beta)
## # A tibble: 20,376 x 21
## term `1` `2` `3` `4` `5` `6` `7`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 authorit~ 1.18e-198 2.44e-199 3.52e-199 5.26e-199 1.90e-4 3.78e-199 3.72e-199
## 2 fellow 6.62e- 4 8.65e- 28 1.87e- 4 3.11e- 4 4.13e-4 1.19e- 3 3.86e- 4
## 3 head 1.61e- 3 1.11e- 3 4.63e- 4 1.08e- 3 6.68e-4 2.18e- 55 3.87e- 4
## 4 hospital 2.24e- 4 2.31e- 68 3.93e- 4 7.88e- 4 3.76e-4 5.93e- 33 2.53e- 4
## 5 implacab~ 1.18e-198 2.44e-199 3.52e-199 5.26e-199 1.90e-4 3.78e-199 3.72e-199
## 6 insanity 2.71e-120 8.41e-158 1.11e-157 4.20e-139 1.90e-4 1.20e- 28 3.39e-139
## 7 inspires 7.09e-310 5.50e- 4 4.50e-261 2.38e-119 5.70e-4 7.05e- 67 5.79e- 23
## 8 mcmurphy 1.18e-198 2.44e-199 3.52e-199 5.26e-199 1.90e-4 3.78e-199 3.72e-199
## 9 mental 2.54e- 4 8.00e- 76 1.36e- 56 2.37e-259 5.15e-4 1.68e-100 3.09e-122
## 10 mildred 1.10e-178 2.77e-179 3.83e-179 5.43e-179 1.90e-4 1.94e- 4 4.01e-179
## # ... with 20,366 more rows, and 13 more variables: 8 <dbl>, 9 <dbl>, 10 <dbl>,
## # 11 <dbl>, 12 <dbl>, 13 <dbl>, 14 <dbl>, 15 <dbl>, 16 <dbl>, 17 <dbl>,
## # 18 <dbl>, 19 <dbl>, 20 <dbl>
Similar to what is done in the text, let’s use slice_max() to find the 10 terms most strongly associated with each topic. Adapt the code from the textbook to produce a faceted set of plots showing your results. If you are working in R Markdown, you may want to set the fig.height chunk parameter to something like 20 – you set your chunk arguments within the curly braces at the beginning of an R chunk.
movies_top_terms <- movie_genre_est %>%
  group_by(topic) %>%
slice_max(beta, n = 10) %>%
ungroup() %>%
arrange(topic, -beta)
movies_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
ggplot() +
geom_col(mapping = aes(x = beta, y = term, fill = factor(topic)), show.legend = FALSE) +
facet_wrap(~topic, scales = "free", ncol = 4) +
scale_y_reordered()
After seeing your plot panel, do you think LDA learned topics based on the film genres? Let’s see which topic each film was assigned to using tidy() again, but setting the matrix argument to "gamma". Store your result in an object called movie_assignments. View the result.
movie_assignments <- tidy(movies_lda, matrix = "gamma")
movie_assignments
## # A tibble: 85,880 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 1 1 0.00109
## 2 2 1 0.00223
## 3 3 1 0.00275
## 4 4 1 0.000849
## 5 5 1 0.00133
## 6 6 1 0.000993
## 7 7 1 0.243
## 8 8 1 0.00161
## 9 9 1 0.00114
## 10 10 1 0.000765
## # ... with 85,870 more rows
Let’s try to extract the topic each movie has been assigned to by adapting our code from 10. We still want to use slice_max() here, but let’s group by document and then take only the row corresponding to the top gamma value. After slicing, don’t forget to ungroup() the data frame. The document column will be converted to a string by default, so you’ll want to use mutate() and as.numeric() to convert this column back to a numeric column. Store your result in movie_assignment.
movie_assignment <- movie_assignments %>%
  group_by(document) %>%
slice_max(gamma, n = 1) %>%
ungroup() %>%
mutate(document = as.numeric(document)) %>%
arrange(document)
movie_assignment
## # A tibble: 4,294 x 3
## document topic gamma
## <dbl> <int> <dbl>
## 1 1 5 0.979
## 2 2 6 0.527
## 3 3 2 0.478
## 4 4 14 0.865
## 5 5 19 0.912
## 6 6 13 0.598
## 7 7 7 0.735
## 8 8 10 0.969
## 9 9 6 0.978
## 10 10 7 0.886
## # ... with 4,284 more rows
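One caveat worth knowing about the step above: slice_max() keeps ties by default, so a movie whose largest gamma value is shared by two topics would appear twice. If you want to guarantee exactly one row per movie, dplyr’s with_ties argument handles this (an optional refinement, not something the assignment requires):

```r
# with_ties = FALSE breaks ties arbitrarily, guaranteeing one row per document
movie_assignment <- movie_assignments %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(document = as.numeric(document)) %>%
  arrange(document)
```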
Did our Latent Dirichlet Allocation model create topics corresponding to the primary genres? Let’s see. We’ll start by joining the known primary genre of each movie onto our movie_assignment data frame. Then we will build and plot a confusion matrix to see our results!
The movies and movie_assignment data frames both have columns representing the movie id. In the original movies data frame, that column is called movie_id, while it is called document in the movie_assignment data frame. Start with the movie_assignment data frame and left_join() the information from movies onto it. You’ll need to supply the argument by = c("document" = "movie_id") to left_join() because the key columns don’t share the same name. Once the genre and the assigned topic are both contained in the movie_assignment data frame, follow and adapt the code from the book to create the tile plot representing the confusion matrix.
movie_assignment <- movie_assignment %>%
  left_join(movies, by = c("document" = "movie_id"))
movie_assignment
## # A tibble: 4,294 x 5
## document topic gamma genre overview
## <dbl> <int> <dbl> <chr> <chr>
## 1 1 5 0.979 Drama "While serving time for insanity at a state men~
## 2 2 6 0.527 Drama "A re-imagined account of the early life of Mar~
## 3 3 2 0.478 Drama "A falsely accused nobleman survives years of s~
## 4 4 14 0.865 Action "A detective specializing in missing children i~
## 5 5 19 0.912 Horror "An insatiable great white shark terrorizes the~
## 6 6 13 0.598 Drama "This is a story of a man in free fall. On the ~
## 7 7 7 0.735 Action "Engineer Johnny Munroe is enlisted to build a ~
## 8 8 10 0.969 Comedy "Two air traffic controllers (John Cusack, Bill~
## 9 9 6 0.978 History "\"Selma,\" as in Alabama, the place where segr~
## 10 10 7 0.886 Comedy "Top student Jane Ryan heads to Manhattan for a~
## # ... with 4,284 more rows
movie_assignment %>%
  count(across(c(topic, genre))) %>%
group_by(genre) %>%
mutate(percent = n/sum(n)) %>%
ggplot() +
geom_tile(mapping = aes(x = topic, y = genre, fill = percent)) +
scale_fill_gradient2(low = "violet", high = "darkred") +
theme_minimal() +
labs(x = "Assigned Topic", y = "Primary Genre", fill = "% of assignments")
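If you’d like a tabular companion to the tile plot, one option (again, an aside rather than a required step) is to list which learned topic claimed the most movies within each primary genre:

```r
# For each primary genre, find the learned topic it was assigned to most often
movie_assignment %>%
  count(genre, topic) %>%
  group_by(genre) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()
```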
Final Thoughts: It looks like our model didn’t choose its topics according to the primary genre. The foreign films and made-for-TV movies were each split across very few topics, but films from the other genres were spread widely across the 20 topics learned by our LDA model. There are a few reasons for this. First, as we saw in the bar plots for each topic, the model learned storylines and settings rather than genres. Second, in the original dataset most films were assigned several genre tags, so genres can overlap quite a bit, and my choice to assume that the first genre listed was the primary genre may have been a poor one. If you want to dig a bit deeper into how our topic model grouped movies, you can run code blocks similar to the one appearing below:
movie_assignment %>%
  filter(topic == 1) %>%
select(genre, overview)
## # A tibble: 155 x 2
## genre overview
## <chr> <chr>
## 1 Drama "The true story of technical troubles that scuttle the Apollo 13 luna~
## 2 Comedy "Jackass 3D is a 3-D film and the third movie of the Jackass series. ~
## 3 Drama "Harriet and Queenie Mahoney, a vaudeville act, come to Broadway, whe~
## 4 Comedy "Disgraced ex-England captain (Danny 'Mean Machine' Meehan) is thrown~
## 5 Drama "The global economy is on the brink of collapse. Brilliant creators, ~
## 6 Comedy "Born in America and raised in an Indian ashram, Pitka returns to his~
## 7 Drama "A romantic comedy centered on Dexter and Emma, who first meet during~
## 8 Drama "The true story of how businessman Oskar Schindler saved over a thous~
## 9 War "An Hungarian youth comes of age at Buchenwald during World War II. G~
## 10 Comedy "A scheming raccoon fools a mismatched family of forest creatures int~
## # ... with 145 more rows
In any case, I hope you found this assignment fun and that it showed you a bit about why untidy formats are sometimes preferred over tidy data. Additionally, I hope you’ll consider applying topic models to different contexts!