This homework assignment is an opportunity to practice converting between tidy and non-tidy formats for text data, and to use the non-tidy formats as input to modeling functions. As a reminder, the tidy format is very useful for EDA and data visualization, but it is not the shape expected by machine learning algorithms. We often want to predict genre or authorship, recommend titles, etc. – in these cases, the document-term matrix can be a more appropriate representation of our text data. This is because, in all of these scenarios, the observation is not the particular token; it is the full text which we wish to classify or recommend.

This week we will work with movie descriptions, each labeled by their primary genres. We’ll start with an exploratory analysis and do some comparison of tidy and untidy organizations of our data. Next we’ll use what was covered in Chapter 6 to create a topic model based on the movie descriptions. We will see how well these models correspond to the primary genres for our movies.


  1. Open up a new R Script or RMarkdown file and load the tidyverse, tidytext, tm, and topicmodels libraries. Remember to install any packages you’ve never used before by running install.packages("PACKAGE_NAME").
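
    The setup might look like the following (this sketch assumes all four packages are already installed):

    ```r
    # Load the packages used throughout this assignment
    library(tidyverse)    # data wrangling and ggplot2
    library(tidytext)     # unnest_tokens(), cast_dtm(), tidy()
    library(tm)           # Terms() and the DocumentTermMatrix class
    library(topicmodels)  # LDA()
    ```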

  2. You’ll read in the labeled data from our GitHub Repo using the following commands. Note that mutate() is adding a new column called movie_id and select() is just reordering the columns.

    movies <- read_csv("https://raw.githubusercontent.com/agmath/FacultyUpskilling/main/2021_NLP/data/train.csv")
    ## 
    ## -- Column specification --------------------------------------------------------
    ## cols(
    ##   genre = col_character(),
    ##   overview = col_character()
    ## )
    movies <- movies %>%
      mutate(movie_id = row_number()) %>%
      select(movie_id, genre, overview)
    movies
    ## # A tibble: 4,294 x 3
    ##    movie_id genre   overview                                                    
    ##       <int> <chr>   <chr>                                                       
    ##  1        1 Drama   "While serving time for insanity at a state mental hospital~
    ##  2        2 Drama   "A re-imagined account of the early life of Maria Anna 'Nan~
    ##  3        3 Drama   "A falsely accused nobleman survives years of slavery to ta~
    ##  4        4 Action  "A detective specializing in missing children is on a madca~
    ##  5        5 Horror  "An insatiable great white shark terrorizes the townspeople~
    ##  6        6 Drama   "This is a story of a man in free fall. On the road to rede~
    ##  7        7 Action  "Engineer Johnny Munroe is enlisted to build a railroad tun~
    ##  8        8 Comedy  "Two air traffic controllers (John Cusack, Billy Bob Thornt~
    ##  9        9 History "\"Selma,\" as in Alabama, the place where segregation in t~
    ## 10       10 Comedy  "Top student Jane Ryan heads to Manhattan for a college-sch~
    ## # ... with 4,284 more rows
  3. Now that we have our data, let’s use our familiar workflow of unnesting tokens to create a tidy version of our data frame. Use unnest_tokens() to unnest the overview column.

    movies_tidy <- movies %>%
      unnest_tokens(word, overview)
    movies_tidy
    ## # A tibble: 224,827 x 3
    ##    movie_id genre word    
    ##       <int> <chr> <chr>   
    ##  1        1 Drama while   
    ##  2        1 Drama serving 
    ##  3        1 Drama time    
    ##  4        1 Drama for     
    ##  5        1 Drama insanity
    ##  6        1 Drama at      
    ##  7        1 Drama a       
    ##  8        1 Drama state   
    ##  9        1 Drama mental  
    ## 10        1 Drama hospital
    ## # ... with 224,817 more rows
  4. You can see that the resulting data frame looks quite familiar. We have columns tracking the genre and movie_id, and a column containing each individual word from each description. This is our familiar tidy text data frame. Let’s do something we’ve done before and remove stop words by using an anti_join() against stop_words. You can overwrite your movies_tidy data frame here.

    movies_tidy <- movies_tidy %>%
      anti_join(stop_words)
    ## Joining, by = "word"
    movies_tidy
    ## # A tibble: 107,629 x 3
    ##    movie_id genre word      
    ##       <int> <chr> <chr>     
    ##  1        1 Drama serving   
    ##  2        1 Drama time      
    ##  3        1 Drama insanity  
    ##  4        1 Drama mental    
    ##  5        1 Drama hospital  
    ##  6        1 Drama implacable
    ##  7        1 Drama rabble    
    ##  8        1 Drama rouser    
    ##  9        1 Drama randle    
    ## 10        1 Drama patrick   
    ## # ... with 107,619 more rows
  5. That cut about 117,000 rows of very common words from our data frame. This can be an important step because it will reduce the number of columns present in a document-term matrix when we construct one. Now let’s build a new data frame that includes word counts for each word within each movie description. Call your new object movies_word_count, start with movies_tidy, group by movie_id, count the word occurrences, and then ungroup().

    movies_word_count <- movies_tidy %>%
      group_by(movie_id) %>%
      count(word) %>%
      ungroup()
    movies_word_count
    ## # A tibble: 100,721 x 3
    ##    movie_id word              n
    ##       <int> <chr>         <int>
    ##  1        1 authoritarian     1
    ##  2        1 fellow            1
    ##  3        1 head              1
    ##  4        1 hospital          1
    ##  5        1 implacable        1
    ##  6        1 insanity          1
    ##  7        1 inspires          1
    ##  8        1 mcmurphy          1
    ##  9        1 mental            1
    ## 10        1 mildred           1
    ## # ... with 100,711 more rows
  6. Now let’s cast our tidy data frame as a Document Term Matrix, which is a sparse matrix (meaning that it contains mostly 0s) in which each row corresponds to a movie description and each column corresponds to a word present in our corpus of descriptions. Each entry of this matrix denotes the number of times the word corresponding to the column appears in the description corresponding to the row. Call your new object movies_dtm and create it using the cast_dtm() function.

    movies_dtm <- movies_word_count %>%
      cast_dtm(movie_id, word, n)
    movies_dtm
    ## <<DocumentTermMatrix (documents: 4294, terms: 20376)>>
    ## Non-/sparse entries: 100721/87393823
    ## Sparsity           : 100%
    ## Maximal term length: NA
    ## Weighting          : term frequency (tf)
  7. Notice that R won’t print out the sparse matrix for us even if we ask for it by name. Instead, we are provided some information about the object. In particular, we can see that our matrix is almost 100% sparse: only 100,721 of its more than 87 million entries are non-zero. If you would like to see the terms represented by the first ten columns in our document-term matrix, you can do so by running Terms(movies_dtm)[1:10]. Try it. Change the indices to see the terms represented by other columns as well.

    Terms(movies_dtm)[1000:1020]
    ##  [1] "1950s"       "connecticut" "crisis"      "housewife"   "marital"    
    ##  [6] "mounting"    "racial"      "tensions"    "attorney"    "chelsea"    
    ## [11] "chelsea's"   "client"      "corners"     "crimes"      "cut"        
    ## [16] "deardon"     "defence"     "district"    "emerge"      "involved"   
    ## [21] "kelly"
  8. Okay, now that we’ve got our data into a format where every row represents a movie description and every column represents a feature of that description (i.e., a word it contains), we are in a scenario where we can try building a topic model using the LDA() function from R’s topicmodels package. As a reminder, when we began, we knew the primary genre of each movie in this dataset of descriptions. There are 20 primary genres present in our movie descriptions dataset, which you can verify by running length(unique(movies$genre)). Use LDA() on our document-term matrix representing the movie descriptions and use 20 topics – it may take five minutes or so to run. Be sure to store the result in movies_lda.
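
    As a quick sanity check before fitting, you can confirm the genre count for yourself (this assumes the movies data frame created in step 2):

    ```r
    # Count the distinct primary genres; this matches the number of topics (k)
    # we'll ask LDA() to learn
    length(unique(movies$genre))
    ## [1] 20
    ```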

    movies_lda <- LDA(movies_dtm, k = 20, control = list(seed = 1337))
    movies_lda
    ## A LDA_VEM topic model with 20 topics.
  9. Create an object called movie_genre_est by using the tidy() function with the arguments movies_lda and matrix = "beta". View the result – what do the columns in the output represent?

    movie_genre_est <- tidy(movies_lda, matrix = "beta")
    movie_genre_est
    ## # A tibble: 407,520 x 3
    ##    topic term               beta
    ##    <int> <chr>             <dbl>
    ##  1     1 authoritarian 1.18e-198
    ##  2     2 authoritarian 2.44e-199
    ##  3     3 authoritarian 3.52e-199
    ##  4     4 authoritarian 5.26e-199
    ##  5     5 authoritarian 1.90e-  4
    ##  6     6 authoritarian 3.78e-199
    ##  7     7 authoritarian 3.72e-199
    ##  8     8 authoritarian 4.42e-199
    ##  9     9 authoritarian 1.66e-199
    ## 10    10 authoritarian 1.51e-199
    ## # ... with 407,510 more rows
    movie_genre_est %>%
      pivot_wider(names_from = topic, values_from = beta)
    ## # A tibble: 20,376 x 21
    ##    term            `1`       `2`       `3`       `4`     `5`       `6`       `7`
    ##    <chr>         <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>     <dbl>
    ##  1 authorit~ 1.18e-198 2.44e-199 3.52e-199 5.26e-199 1.90e-4 3.78e-199 3.72e-199
    ##  2 fellow    6.62e-  4 8.65e- 28 1.87e-  4 3.11e-  4 4.13e-4 1.19e-  3 3.86e-  4
    ##  3 head      1.61e-  3 1.11e-  3 4.63e-  4 1.08e-  3 6.68e-4 2.18e- 55 3.87e-  4
    ##  4 hospital  2.24e-  4 2.31e- 68 3.93e-  4 7.88e-  4 3.76e-4 5.93e- 33 2.53e-  4
    ##  5 implacab~ 1.18e-198 2.44e-199 3.52e-199 5.26e-199 1.90e-4 3.78e-199 3.72e-199
    ##  6 insanity  2.71e-120 8.41e-158 1.11e-157 4.20e-139 1.90e-4 1.20e- 28 3.39e-139
    ##  7 inspires  7.09e-310 5.50e-  4 4.50e-261 2.38e-119 5.70e-4 7.05e- 67 5.79e- 23
    ##  8 mcmurphy  1.18e-198 2.44e-199 3.52e-199 5.26e-199 1.90e-4 3.78e-199 3.72e-199
    ##  9 mental    2.54e-  4 8.00e- 76 1.36e- 56 2.37e-259 5.15e-4 1.68e-100 3.09e-122
    ## 10 mildred   1.10e-178 2.77e-179 3.83e-179 5.43e-179 1.90e-4 1.94e-  4 4.01e-179
    ## # ... with 20,366 more rows, and 13 more variables: 8 <dbl>, 9 <dbl>, 10 <dbl>,
    ## #   11 <dbl>, 12 <dbl>, 13 <dbl>, 14 <dbl>, 15 <dbl>, 16 <dbl>, 17 <dbl>,
    ## #   18 <dbl>, 19 <dbl>, 20 <dbl>
  10. Similar to what is done in the text, let’s use slice_max() to find the 10 terms most strongly associated with each topic. Adapt the code from the textbook to produce a faceted set of plots showing your results. If you are working in R Markdown, you may want to set the fig.height chunk parameter to something like 20 – you set your chunk arguments within the curly braces at the beginning of an R chunk.

    movies_top_terms <- movie_genre_est %>%
      group_by(topic) %>%
      slice_max(beta, n = 10) %>%
      ungroup() %>%
      arrange(topic, -beta)
    
    movies_top_terms %>%
      mutate(term = reorder_within(term, beta, topic)) %>%
      ggplot() + 
      geom_col(mapping = aes(x = beta, y = term, fill = factor(topic)), show.legend = FALSE) +
      facet_wrap(~topic, scales = "free", ncol = 4) +
      scale_y_reordered()

  11. After seeing your plot panel, do you think LDA learned topics based on the film genres? Let’s see which topic each film was assigned to using tidy() again, but setting the matrix argument to "gamma". Store your result in an object called movie_assignments. View the result.

    movie_assignments <- tidy(movies_lda, matrix = "gamma")
    movie_assignments
    ## # A tibble: 85,880 x 3
    ##    document topic    gamma
    ##    <chr>    <int>    <dbl>
    ##  1 1            1 0.00109 
    ##  2 2            1 0.00223 
    ##  3 3            1 0.00275 
    ##  4 4            1 0.000849
    ##  5 5            1 0.00133 
    ##  6 6            1 0.000993
    ##  7 7            1 0.243   
    ##  8 8            1 0.00161 
    ##  9 9            1 0.00114 
    ## 10 10           1 0.000765
    ## # ... with 85,870 more rows
  12. Let’s try to extract the topic each movie has been assigned to by adapting our code from step 10. We still want to use slice_max() here, but let’s group by document and then keep only the row corresponding to the top gamma value. After slicing, don’t forget to ungroup() the data frame. The document column will have been converted to a string by default, so you’ll want to use mutate() and as.numeric() to convert this column back to a numeric column. Store your result in movie_assignment.

    movie_assignment <- movie_assignments %>%
      group_by(document) %>%
      slice_max(gamma, n = 1) %>%
      ungroup() %>%
      mutate(document = as.numeric(document)) %>%
      arrange(document)
    movie_assignment
    ## # A tibble: 4,294 x 3
    ##    document topic gamma
    ##       <dbl> <int> <dbl>
    ##  1        1     5 0.979
    ##  2        2     6 0.527
    ##  3        3     2 0.478
    ##  4        4    14 0.865
    ##  5        5    19 0.912
    ##  6        6    13 0.598
    ##  7        7     7 0.735
    ##  8        8    10 0.969
    ##  9        9     6 0.978
    ## 10       10     7 0.886
    ## # ... with 4,284 more rows
  13. Did our Latent Dirichlet Allocation (LDA) model create topics corresponding to the primary genres? Let’s see. We’ll start by joining the known primary genre of each movie onto our movie_assignment data frame. Then we will build and plot a confusion matrix to see our results!

    • Both the original movies and movie_assignment data frames have columns representing the movie id. In the original movies data frame, that column is called movie_id while it is called document in the movie_assignment data frame. Start with the movie_assignment data frame and left_join() the information from movies onto it. You’ll need to supply the argument by = c("document" = "movie_id") to left_join() because the key columns don’t share the same name.
    • Now that you have both the known primary genre and the assigned topic contained in the movie_assignment data frame, follow and adapt the code from the book to create the tile plot representing the confusion matrix.
    movie_assignment <- movie_assignment %>%
      left_join(movies, by = c("document" = "movie_id"))
    movie_assignment
    ## # A tibble: 4,294 x 5
    ##    document topic gamma genre   overview                                        
    ##       <dbl> <int> <dbl> <chr>   <chr>                                           
    ##  1        1     5 0.979 Drama   "While serving time for insanity at a state men~
    ##  2        2     6 0.527 Drama   "A re-imagined account of the early life of Mar~
    ##  3        3     2 0.478 Drama   "A falsely accused nobleman survives years of s~
    ##  4        4    14 0.865 Action  "A detective specializing in missing children i~
    ##  5        5    19 0.912 Horror  "An insatiable great white shark terrorizes the~
    ##  6        6    13 0.598 Drama   "This is a story of a man in free fall. On the ~
    ##  7        7     7 0.735 Action  "Engineer Johnny Munroe is enlisted to build a ~
    ##  8        8    10 0.969 Comedy  "Two air traffic controllers (John Cusack, Bill~
    ##  9        9     6 0.978 History "\"Selma,\" as in Alabama, the place where segr~
    ## 10       10     7 0.886 Comedy  "Top student Jane Ryan heads to Manhattan for a~
    ## # ... with 4,284 more rows
    movie_assignment %>%
      count(across(c(topic, genre))) %>%
      group_by(genre) %>%
      mutate(percent = n/sum(n)) %>%
      ggplot() + 
      geom_tile(mapping = aes(x = topic, y = genre, fill = percent)) + 
      scale_fill_gradient2(low = "violet", high = "darkred") +
      theme_minimal() +
      labs(x = "Assigned Topic", y = "Primary Genre", fill = "% of assignments")


Final Thoughts: It looks like our model didn’t choose its topics according to the primary genre. The foreign films and made-for-TV movies were each concentrated in very few topics, but films from the other genres were spread widely across the 20 topics learned by our LDA model. There are a few reasons for this. First, as we saw in the barplots for each topic, the model learned storylines and settings rather than genres. Second, in the original dataset most films were assigned several genre tags, so genres can overlap quite a bit, and my choice to treat the first genre listed as the primary genre may have been a poor one. If you want to dig a bit deeper into how our topic model grouped movies, you can run code blocks similar to the one appearing below:

movie_assignment %>%
  filter(topic == 1) %>%
  select(genre, overview)
## # A tibble: 155 x 2
##    genre  overview                                                              
##    <chr>  <chr>                                                                 
##  1 Drama  "The true story of technical troubles that scuttle the Apollo 13 luna~
##  2 Comedy "Jackass 3D is a 3-D film and the third movie of the Jackass series. ~
##  3 Drama  "Harriet and Queenie Mahoney, a vaudeville act, come to Broadway, whe~
##  4 Comedy "Disgraced ex-England captain (Danny 'Mean Machine' Meehan) is thrown~
##  5 Drama  "The global economy is on the brink of collapse. Brilliant creators, ~
##  6 Comedy "Born in America and raised in an Indian ashram, Pitka returns to his~
##  7 Drama  "A romantic comedy centered on Dexter and Emma, who first meet during~
##  8 Drama  "The true story of how businessman Oskar Schindler saved over a thous~
##  9 War    "An Hungarian youth comes of age at Buchenwald during World War II. G~
## 10 Comedy "A scheming raccoon fools a mismatched family of forest creatures int~
## # ... with 145 more rows
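
If you’d like to quantify how spread out each genre is, one possible approach (a sketch, assuming the joined movie_assignment data frame from step 13) is to count how many distinct topics each genre’s movies landed in:

```r
# For each primary genre, count how many different topics its movies
# were assigned to; genres confined to few topics sort to the top
movie_assignment %>%
  group_by(genre) %>%
  summarize(n_topics = n_distinct(topic)) %>%
  arrange(n_topics)
```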

In any case, I hope you found this assignment fun and that it showed you a bit about why untidy formats are sometimes preferred over tidy data. Additionally, I hope you’ll consider applying topic models to different contexts!
