Purpose: In this notebook we’ll consider a powerful technique falling under the umbrella of word embeddings. In particular, we’ll look at a measure called term frequency–inverse document frequency (tf-idf) as a way to measure how important a term is to one category of documents versus other categories.

Data Used: We’ll use excerpts from Spooky Authors (Edgar Allan Poe, Mary Shelley, and H.P. Lovecraft), which appeared as part of a Kaggle competition in 2018. For convenience, I’ve added the training data to my GitHub repository here.

The Big Idea

Are some tokens used more frequently in one type of document than in another? Here, we’ll look at a measure of both how frequent a token is within a particular class of documents and how unique it is to that class, relative to the other classes. The term frequency–inverse document frequency of a token is defined as follows: \(\displaystyle{\text{tf\_idf}\left(\text{Token}\right) = f_{\text{Token}}\cdot\ln\left(\frac{n_{\text{Documents}}}{n_{\text{Documents Containing Token}}}\right)}\), where \(f_{\text{Token}}\) denotes the frequency of the \(\tt{Token}\) within the document under consideration.

In the formula for \(\text{tf\_idf}\left(\text{Token}\right)\), if \(\tt{Token}\) appears in all of the documents, then the ratio \(\frac{n_{\text{Documents}}}{n_{\text{Documents Containing Token}}}\) evaluates to \(1\) and the logarithm is \(0\). That is, a token appearing across all documents cannot be useful in differentiating the documents. The logarithm is largest when the token in question appears in just one of the documents and none of the others. This means that the tokens with the highest term frequency–inverse document frequency are those unique to their document and used with high frequency within that document.
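To make the formula concrete, here is a minimal sketch in R using made-up numbers (nothing here comes from the spooky data): suppose a token appears 5 times among the 100 tokens of one document, and occurs in only 1 of the corpus’s 3 documents.

# Toy illustration of the tf-idf formula with invented numbers
tf  <- 5 / 100      # term frequency of the token within its document
idf <- log(3 / 1)   # ln(n_documents / n_documents_containing_token)
tf * idf            # tf-idf, roughly 0.0549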

Let’s try this out with our spooky author excerpts. You can see the first few rows of the spooky author data below.

| id | text | author |
|----|------|--------|
| id26305 | This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall. | EAP |
| id17569 | It never once occurred to me that the fumbling might be a mere mistake. | HPL |
| id11008 | In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction. | EAP |
| id27763 | How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair. | MWS |
| id12958 | Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk. | HPL |
| id22965 | A youth passed in solitude, my best years spent under your gentle and feminine fosterage, has so refined the groundwork of my character that I cannot overcome an intense distaste to the usual brutality exercised on board ship: I have never believed it to be necessary, and when I heard of a mariner equally noted for his kindliness of heart and the respect and obedience paid to him by his crew, I felt myself peculiarly fortunate in being able to secure his services. | MWS |

One thing we could do here is to group by author and then compute term frequency–inverse document frequency for all tokens across the three resulting “documents”.

spooky_tf_idf <- spooky_authors %>%
  unnest_tokens(word, text) %>%      # one row per token
  group_by(author) %>%
  count(word) %>%                    # token counts within each author
  bind_tf_idf(word, author, n) %>%   # treat each author as a "document"
  arrange(-tf_idf)

spooky_tf_idf %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
| author | word | n | tf | idf | tf_idf |
|--------|------|---|----|-----|--------|
| MWS | perdita | 156 | 0.0009415 | 1.0986123 | 0.0010343 |
| MWS | adrian | 126 | 0.0007604 | 1.0986123 | 0.0008354 |
| MWS | idris | 109 | 0.0006578 | 1.0986123 | 0.0007227 |
| MWS | raymond | 248 | 0.0014967 | 0.4054651 | 0.0006069 |
| MWS | windsor | 73 | 0.0004406 | 1.0986123 | 0.0004840 |
| HPL | gilman | 64 | 0.0004096 | 1.0986123 | 0.0004500 |
| HPL | innsmouth | 59 | 0.0003776 | 1.0986123 | 0.0004148 |
| HPL | arkham | 58 | 0.0003712 | 1.0986123 | 0.0004078 |
| EAP | dupin | 57 | 0.0002838 | 1.0986123 | 0.0003118 |
| MWS | felix | 44 | 0.0002655 | 1.0986123 | 0.0002917 |
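As a quick sanity check on the top row, we can recover the tabled tf_idf value by hand: perdita appears in only one of the three author “documents”, so its idf is \(\ln(3/1)\), and multiplying by its term frequency reproduces the value reported by bind_tf_idf().

# Reproduce the top row by hand: tf for "perdita" times ln(3/1)
0.0009415 * log(3 / 1)
## [1] 0.001034343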

It is likely that many of these top words are the names of characters or locations specific to each author’s stories. However, if we look further down the list arranged by tf-idf, we can see more commonplace words that are particular to each author’s lexicon. Note that spooky_tf_idf is still grouped by author, so slice(200:210) below returns rows 200 through 210 within each author’s list, which is why eleven rows appear per author.

spooky_tf_idf %>%
  slice(200:210) %>%
  kable() %>%
  kable_styling(c("hover", "striped"))
| author | word | n | tf | idf | tf_idf |
|--------|------|---|----|-----|--------|
| EAP | paces | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
| EAP | patients | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
| EAP | pe | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
| EAP | pitcher | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
| EAP | pon | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
| EAP | snob | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
| EAP | tarn | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
| EAP | unparticled | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
| EAP | vapor | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
| EAP | vor | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
| EAP | affair | 21 | 0.0001046 | 0.4054651 | 4.24e-05 |
| HPL | wal | 9 | 0.0000576 | 1.0986123 | 6.33e-05 |
| HPL | weedy | 9 | 0.0000576 | 1.0986123 | 6.33e-05 |
| HPL | birch | 24 | 0.0001536 | 0.4054651 | 6.23e-05 |
| HPL | hellish | 24 | 0.0001536 | 0.4054651 | 6.23e-05 |
| HPL | marsh | 24 | 0.0001536 | 0.4054651 | 6.23e-05 |
| HPL | fer | 23 | 0.0001472 | 0.4054651 | 5.97e-05 |
| HPL | gardens | 23 | 0.0001472 | 0.4054651 | 5.97e-05 |
| HPL | laboratory | 23 | 0.0001472 | 0.4054651 | 5.97e-05 |
| HPL | arthur | 22 | 0.0001408 | 0.4054651 | 5.71e-05 |
| HPL | curiously | 22 | 0.0001408 | 0.4054651 | 5.71e-05 |
| HPL | didn’t | 22 | 0.0001408 | 0.4054651 | 5.71e-05 |
| MWS | eloquence | 14 | 0.0000845 | 0.4054651 | 3.43e-05 |
| MWS | exertions | 14 | 0.0000845 | 0.4054651 | 3.43e-05 |
| MWS | gentleness | 14 | 0.0000845 | 0.4054651 | 3.43e-05 |
| MWS | occasioned | 14 | 0.0000845 | 0.4054651 | 3.43e-05 |
| MWS | proceed | 14 | 0.0000845 | 0.4054651 | 3.43e-05 |
| MWS | virtues | 14 | 0.0000845 | 0.4054651 | 3.43e-05 |
| MWS | wondrous | 14 | 0.0000845 | 0.4054651 | 3.43e-05 |
| MWS | alps | 5 | 0.0000302 | 1.0986123 | 3.32e-05 |
| MWS | ambassador | 5 | 0.0000302 | 1.0986123 | 3.32e-05 |
| MWS | assisted | 5 | 0.0000302 | 1.0986123 | 3.32e-05 |
| MWS | blooming | 5 | 0.0000302 | 1.0986123 | 3.32e-05 |

Knowing what tokens distinguish one author from another can help us take an unlabeled passage and assign it to the most likely author.

spooky_sample <- spooky_authors %>%
  sample_n(2500)                # small training sample to keep fitting manageable

rf_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_rec <- recipe(author ~ text, data = spooky_sample) %>%
  step_tokenize(text) %>%       # split each excerpt into tokens
  step_stopwords(text) %>%      # drop common stopwords
  step_tfidf(text)              # one tf-idf feature per remaining token

rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(rf_rec)

rf_fit <- rf_wf %>%
  fit(spooky_sample)
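As a quick usage sketch (not part of the original analysis), the fitted workflow can score a brand-new, unlabeled excerpt; both the sentence and the object name new_excerpt below are invented purely for illustration.

# Hypothetical example: predict the author of an unlabeled excerpt.
new_excerpt <- tibble(text = "The old house on the hill had stood silent for a hundred years.")
predict(rf_fit, new_data = new_excerpt)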

conf_matrix <- rf_fit %>%
  augment(spooky_authors) %>%            # attach predicted classes to every excerpt
  mutate(author = as.factor(author)) %>%
  count(author, .pred_class) %>%         # tally true author against predicted author
  pivot_wider(id_cols = author, 
              names_from = .pred_class, 
              values_from = n)
## Warning in asMethod(object): sparse->dense coercion: allocating vector of size
## 1.5 GiB
conf_matrix
## # A tibble: 3 × 4
##   author   EAP   HPL   MWS
##   <fct>  <int> <int> <int>
## 1 EAP     7079   471   350
## 2 HPL     2289  3052   294
## 3 MWS     2575   475  2994
# Correct predictions (the diagonal of the confusion matrix) over total excerpts
accuracy <- (conf_matrix[[1, 2]] + conf_matrix[[2, 3]] + conf_matrix[[3, 4]]) / nrow(spooky_authors)
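As an aside, here is a tidier sketch of the same accuracy calculation, assuming the yardstick package (loaded with tidymodels) is available: accuracy() compares the true author to the predicted class directly, so there is no need to index into the confusion matrix by hand.

# Equivalent accuracy via yardstick (assumes tidymodels/yardstick is loaded)
rf_fit %>%
  augment(spooky_authors) %>%
  mutate(author = as.factor(author)) %>%
  accuracy(truth = author, estimate = .pred_class)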

Note that the model above takes quite some time both to fit and to predict. This is because every distinct word in the training set (aside from stopwords) has become a feature the model can use, and we’re fitting a random forest consisting of 500 individual trees. That said, even though the model was trained on only 2,500 records, it performs reasonably well: it has an accuracy of 0.6703611 and identifies excerpts from Edgar Allan Poe quite well. However, it has a harder time distinguishing the other two authors’ excerpts from Poe’s.
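One way to speed this up, sketched here as a suggestion rather than something done above, is to cap the vocabulary before computing tf-idf. textrecipes provides step_tokenfilter() for exactly this; the max_tokens value below is an arbitrary illustrative choice.

# Hedged sketch: keep only the 500 most frequent tokens before tf-idf,
# shrinking the feature space the random forest has to handle.
rf_rec_small <- recipe(author ~ text, data = spooky_sample) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text, max_tokens = 500) %>%
  step_tfidf(text)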


Summary

Over these past three topics, you’ve learned quite a bit about extracting insights from text data. We’ve engaged in basic text analysis through tokenization, looked at using regular expressions to extract features from text via pattern matching, and now we’ve seen how to use a simple word embedding, tf-idf scores, as a component of a predictive model.

If you want to learn more about the basics of working with text data, check out Text Mining with R: A Tidy Approach by Julia Silge and David Robinson. If you want to learn more about using text data in machine learning algorithms, check out Supervised Machine Learning for Text Analysis in R by Julia Silge and Emil Hvitfeldt.