Introduction to Word Embeddings (tf-idf)

Purpose: In this notebook we’ll consider a powerful technique falling under the umbrella of word embeddings. In particular, we’ll look at term frequency, inverse document frequency (tf-idf), a measure of how important a term is to one category of documents relative to the other categories.

Data Used: We’ll use excerpts from Spooky Authors (Edgar Allan Poe, Mary Shelley, and H.P. Lovecraft), which appeared as part of a Kaggle Competition in 2018. For convenience, I’ve added the training data to my GitHub repository here.

The Big Idea

Are some tokens used more frequently in one type of document than in another? Here, we’ll look at a measure of how frequent a token is within a particular class and how unique it is to that class, relative to the other classes. The term frequency, inverse document frequency of a token is defined as follows: \(\displaystyle{\text{tf\_idf}\left(\text{Token}\right) = f_{\text{Token}}\cdot\ln\left(\frac{n_{\text{Documents}}}{n_{\text{Documents Containing Token}}}\right)}\), where \(f_{\text{Token}}\) denotes the relative frequency of the \(\tt{Token}\) within the document in question (the number of times it appears divided by the total number of tokens in that document).

In the formula for \(\text{tf\_idf}\left(\text{Token}\right)\), if \(\tt{Token}\) appears across all of the documents, then the ratio \(\frac{n_{\text{Documents}}}{n_{\text{Documents Containing Token}}}\) will evaluate to \(1\) and the logarithm will be \(0\). That is, a token appearing across all documents cannot be useful in differentiating between the documents. That logarithm will be largest when the token in question appears in just one of the documents and none of the others. This means that tokens with the highest term frequency, inverse document frequency will be those unique to their document and having a high frequency of usage within that document.
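
The behavior described above can be checked by hand on a tiny example. The sketch below uses base R only and three made-up, pre-tokenized "documents". The helper name `tf_idf()` and the toy data are assumptions for illustration, and the term frequency is taken within a single document, which is how `bind_tf_idf()` from tidytext computes it.

```r
# Three tiny made-up "documents", already split into tokens (hypothetical data)
docs <- list(
  a = c("the", "raven", "the", "night"),
  b = c("the", "sea", "the", "deep", "sea"),
  c = c("the", "monster", "night")
)

# Hypothetical helper: tf-idf of one token within one named document
tf_idf <- function(token, doc_name, docs) {
  tokens <- docs[[doc_name]]
  # Term frequency: share of this document's tokens equal to `token`
  tf <- sum(tokens == token) / length(tokens)
  # Number of documents that contain the token at least once
  n_containing <- sum(vapply(docs, function(d) token %in% d, logical(1)))
  # Inverse document frequency, using the natural log as in the formula
  idf <- log(length(docs) / n_containing)
  tf * idf
}

tf_idf("the", "a", docs)    # in all three documents, so idf = ln(3/3) = 0
tf_idf("night", "a", docs)  # in two documents: (1/4) * ln(3/2)
tf_idf("sea", "b", docs)    # in one document:  (2/5) * ln(3/1)
```

The token "the" scores zero even though it is the most frequent token in every document, while "sea" scores highest because it is both frequent in document b and absent from the others.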

Let’s try this out with our spooky author excerpts. You can see the first few rows of the spooky author data below.

id text author
id26305 This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall. EAP
id17569 It never once occurred to me that the fumbling might be a mere mistake. HPL
id11008 In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction. EAP
id27763 How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair. MWS
id12958 Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk. HPL
id22965 A youth passed in solitude, my best years spent under your gentle and feminine fosterage, has so refined the groundwork of my character that I cannot overcome an intense distaste to the usual brutality exercised on board ship: I have never believed it to be necessary, and when I heard of a mariner equally noted for his kindliness of heart and the respect and obedience paid to him by his crew, I felt myself peculiarly fortunate in being able to secure his services. MWS

One thing we could do here is to group by author and then compute term frequency, inverse document frequency for all tokens across the three resulting “documents”.

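library(tidyverse)
library(tidytext)

# One row per (author, word) pair, treating each author's excerpts as a single document
spooky_tf_idf <- spooky_authors %>%
  unnest_tokens(word, text) %>%        # split each excerpt into lowercase word tokens
  group_by(author) %>%
  count(word) %>%                      # n = number of times each author uses each word
  bind_tf_idf(word, author, n) %>%     # adds the tf, idf, and tf_idf columns
  arrange(desc(tf_idf))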

spooky_tf_idf %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
author  word       n    tf         idf        tf_idf
MWS     perdita    156  0.0009415  1.0986123  0.0010343
MWS     adrian     126  0.0007604  1.0986123  0.0008354
MWS     idris      109  0.0006578  1.0986123  0.0007227
MWS     raymond    248  0.0014967  0.4054651  0.0006069
MWS     windsor    73   0.0004406  1.0986123  0.0004840
HPL     gilman     64   0.0004096  1.0986123  0.0004500
HPL     innsmouth  59   0.0003776  1.0986123  0.0004148
HPL     arkham     58   0.0003712  1.0986123  0.0004078
EAP     dupin      57   0.0002838  1.0986123  0.0003118
MWS     felix      44   0.0002655  1.0986123  0.0002917
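
With only three author-level documents, the idf column can take just three values, so it tells us directly how many authors use a word. The base R sketch below lists those values and then recomputes tf_idf for a few rows copied from the table above. The variable names are illustrative, and small differences are expected because the table values are rounded.

```r
n_authors <- 3  # each author's excerpts are treated as one "document"

# The only idf values possible with three documents
idf_by_spread <- log(n_authors / 1:3)
names(idf_by_spread) <- c("one author", "two authors", "all three")
round(idf_by_spread, 7)

# Rows copied from the table above: tf * idf should reproduce tf_idf
top_rows <- data.frame(
  word   = c("perdita", "raymond", "dupin"),
  tf     = c(0.0009415, 0.0014967, 0.0002838),
  idf    = c(1.0986123, 0.4054651, 1.0986123),
  tf_idf = c(0.0010343, 0.0006069, 0.0003118)
)
top_rows$recomputed <- top_rows$tf * top_rows$idf
top_rows
```

Read this way, an idf of 1.0986123 (the natural log of 3) marks a word used by only one author, while the 0.4054651 next to "raymond" (the natural log of 3/2) shows that the word appears in the excerpts of two authors. It still ranks highly because Shelley uses it so often.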

It is likely that many of these top words are characters or locations specific to stories written by the authors. However, if we look further down in the list arranged by tf-idf we can see more commonplace words that are specific to each author’s lexicon.

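# spooky_tf_idf is still grouped by author, so slice() returns
# rows 200 through 210 within each author (33 rows in total)
spooky_tf_idf %>%
  slice(200:210) %>%
  kable() %>%
  kable_styling(c("hover", "striped"))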
author  word         n   tf         idf        tf_idf
EAP     paces        8   0.0000398  1.0986123  4.38e-05
EAP     patients     8   0.0000398  1.0986123  4.38e-05
EAP     pe           8   0.0000398  1.0986123  4.38e-05
EAP     pitcher      8   0.0000398  1.0986123  4.38e-05
EAP     pon          8   0.0000398  1.0986123  4.38e-05
EAP     snob         8   0.0000398  1.0986123  4.38e-05
EAP     tarn         8   0.0000398  1.0986123  4.38e-05
EAP     unparticled  8   0.0000398  1.0986123  4.38e-05
EAP     vapor        8   0.0000398  1.0986123  4.38e-05
EAP     vor          8   0.0000398  1.0986123  4.38e-05
EAP     affair       21  0.0001046  0.4054651  4.24e-05
HPL     wal          9   0.0000576  1.0986123  6.33e-05
HPL     weedy        9   0.0000576  1.0986123  6.33e-05
HPL     birch        24  0.0001536  0.4054651  6.23e-05
HPL     hellish      24  0.0001536  0.4054651  6.23e-05
HPL     marsh        24  0.0001536  0.4054651  6.23e-05
HPL     fer          23  0.0001472  0.4054651  5.97e-05
HPL     gardens      23  0.0001472  0.4054651  5.97e-05
HPL     laboratory   23  0.0001472  0.4054651  5.97e-05
HPL     arthur       22  0.0001408  0.4054651  5.71e-05
HPL     curiously    22  0.0001408  0.4054651  5.71e-05
HPL     didn't       22  0.0001408  0.4054651  5.71e-05
MWS     eloquence    14  0.0000845  0.4054651  3.43e-05
MWS     exertions    14  0.0000845  0.4054651  3.43e-05
MWS     gentleness   14  0.0000845  0.4054651  3.43e-05
MWS     occasioned   14  0.0000845  0.4054651  3.43e-05
MWS     proceed      14  0.0000845  0.4054651  3.43e-05
MWS     virtues      14  0.0000845  0.4054651  3.43e-05
MWS     wondrous     14  0.0000845  0.4054651  3.43e-05
MWS     alps         5   0.0000302  1.0986123  3.32e-05
MWS     ambassador   5   0.0000302  1.0986123  3.32e-05
MWS     assisted     5   0.0000302  1.0986123  3.32e-05
MWS     blooming     5   0.0000302  1.0986123  3.32e-05

Knowing what tokens distinguish one author from another can help us take an unlabeled passage and assign it to the most likely author.
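
Before reaching for a full model, it may help to see the idea in its crudest form. The base R sketch below is a naive scoring rule, not the random forest fit in this notebook: it adds up, for each author, the tf-idf weights of any distinguishing tokens found in a passage and picks the author with the largest total. The weights are a small subset copied from the top-10 table, while the helper name `score_passage()` and the example passage are made up for illustration.

```r
# tf-idf weights copied from the top-10 table (a small subset, for illustration)
weights <- data.frame(
  author = c("MWS", "MWS", "MWS", "HPL", "HPL", "HPL", "EAP"),
  word   = c("perdita", "adrian", "windsor", "gilman", "innsmouth", "arkham", "dupin"),
  tf_idf = c(0.0010343, 0.0008354, 0.0004840, 0.0004500, 0.0004148, 0.0004078, 0.0003118)
)

# Hypothetical helper: total tf-idf weight per author for the tokens in a passage
score_passage <- function(passage, weights) {
  # Crude tokenizer: lowercase, then split on anything that is not a letter or apostrophe
  tokens <- unique(strsplit(tolower(passage), "[^a-z']+")[[1]])
  hits <- weights[weights$word %in% tokens, ]
  if (nrow(hits) == 0) return(NULL)  # no distinguishing token found
  scores <- vapply(split(hits$tf_idf, hits$author), sum, numeric(1))
  sort(scores, decreasing = TRUE)
}

# Made-up passage, not taken from the data set
scores <- score_passage("Gilman walked the streets of Arkham at night.", weights)
names(scores)[1]  # the highest-scoring author
```

A rule like this only works when a passage happens to contain one of the listed tokens, which is one reason to hand the tf-idf scores to a model that can combine many weaker signals.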

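library(tidymodels)
library(textrecipes)

# Train on a random sample of 2,500 excerpts
spooky_sample <- spooky_authors %>%
  sample_n(2500)
# Random forest classifier fit with the ranger engine
rf_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")
# Turn each excerpt into one tf-idf feature per non-stopword token
rf_rec <- recipe(author ~ text, data = spooky_sample) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tfidf(text)
# Bundle the model and the preprocessing recipe, then fit on the sample
rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(rf_rec)
rf_fit <- rf_wf %>%
  fit(spooky_sample)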

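# Predict on the full data set (which includes the 2,500 training rows)
# and cross-tabulate the true author against the predicted author
conf_matrix <- rf_fit %>%
  augment(spooky_authors) %>%
  mutate(author = as.factor(author)) %>%
  count(author, .pred_class) %>%
  pivot_wider(id_cols = author, 
              names_from = .pred_class, 
              values_from = n)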
Warning in asMethod(object): sparse->dense coercion: allocating vector of size
1.4 GiB
conf_matrix
# A tibble: 3 × 4
  author   EAP   HPL   MWS
  <fct>  <int> <int> <int>
1 EAP     6830   507   563
2 HPL     2280  2929   426
3 MWS     2294   428  3322
# Correct predictions sit on the diagonal; column 1 holds the author labels
accuracy <- (conf_matrix[[1, 2]] + conf_matrix[[2, 3]] + conf_matrix[[3, 4]]) / nrow(spooky_authors)

Note that the model above takes quite some time to fit and also to predict. This is because each distinct word in the training set (aside from stopwords) has become a feature that the model can use. Additionally, we’re fitting a random forest consisting of 500 individual trees. That being said, even though the model was trained on only 2,500 records, it performs reasonably well. It has an accuracy of 0.6681138 over the full data set (which includes the 2,500 training excerpts, so this figure is somewhat optimistic) and identifies excerpts from Edgar Allan Poe quite well, labeling about 86% of them (6,830 of 7,900) correctly. However, it leans heavily toward Poe: roughly 40% of Lovecraft’s excerpts (2,280 of 5,635) and 38% of Shelley’s (2,294 of 6,044) are attributed to him.
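
The accuracy and the per-author behavior can be read straight off the confusion matrix. The base R sketch below re-enters the counts printed above as a plain matrix (the object names are illustrative) and computes overall accuracy along with recall and precision for each author.

```r
# Counts copied from the confusion matrix above
# (rows = true author, columns = predicted author)
conf <- matrix(
  c(6830,  507,  563,
    2280, 2929,  426,
    2294,  428, 3322),
  nrow = 3, byrow = TRUE,
  dimnames = list(truth = c("EAP", "HPL", "MWS"),
                  predicted = c("EAP", "HPL", "MWS"))
)

accuracy  <- sum(diag(conf)) / sum(conf)   # overall share predicted correctly
recall    <- diag(conf) / rowSums(conf)    # share of each author's excerpts recovered
precision <- diag(conf) / colSums(conf)    # share of each predicted label that is right

accuracy
round(recall, 3)
round(precision, 3)
```

The accuracy matches the 0.6681138 reported above. Comparing recall with precision makes the model's lean toward Poe visible: his recall is the highest of the three, but his precision is the lowest because so many excerpts by the other two authors are labeled EAP.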


Summary

Over these past three topics, you’ve learned quite a bit about extracting insights from text data. We’ve engaged in basic text analysis through tokenization, looked at using regular expressions for extracting features from text using pattern matching, and now we’ve seen how to use a simple word embedding – tf-idf scores – as a component of a predictive model.

If you want to learn more about the basics of working with text data, check out Text Mining with R: A Tidy Approach by Julia Silge and David Robinson. If you want to learn more about using text data in machine learning algorithms, check out Supervised Machine Learning for Text Analysis in R by Julia Silge and Emil Hvitfeldt.