Introduction to Word Embeddings (tf-idf)
Purpose: In this notebook we’ll consider a powerful technique falling under the umbrella of word embeddings. In particular, we’ll look at a measure called term frequency, inverse document frequency as a way to measure how important a term is to one category of documents versus other categories.
Data Used: We’ll use excerpts from Spooky Authors (Edgar Allan Poe, Mary Shelley, and H.P. Lovecraft), which appeared as part of a Kaggle competition in 2018. For convenience, I’ve added the training data to my GitHub repository here.
The Big Idea
Are some tokens used more frequently in one type of document than in another? Here, we’ll look at a measure that captures both how frequently a token appears within a particular class and how unique it is to that class, relative to the other classes. The term frequency, inverse document frequency of a token is defined as follows: \(\displaystyle{\text{tf\_idf}\left(\text{token}\right) = f_{\text{Token}}\cdot\ln\left(\frac{n_{\text{Documents}}}{n_{\text{Documents Containing Token}}}\right)}\), where \(f_{\text{Token}}\) denotes the relative frequency of \(\tt{Token}\) within the document in question.
In the formula for \(\text{tf\_idf}\left(\text{Token}\right)\), if \(\tt{Token}\) appears in every document, then the ratio \(\frac{n_{\text{Documents}}}{n_{\text{Documents Containing Token}}}\) evaluates to \(1\) and the logarithm is \(0\). That is, a token appearing across all documents cannot be useful in differentiating one document from another. The logarithm is largest when the token in question appears in just one of the documents and none of the others. This means that the tokens with the highest term frequency, inverse document frequency are those unique to their document that also have a high frequency of usage within that document.
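As a quick sanity check on the formula, here is a toy computation in R (the numbers are hypothetical, chosen only for illustration):

```r
# Toy tf-idf computation (hypothetical numbers, not drawn from the spooky data)
tf          <- 0.001  # token makes up 0.1% of the words in its document
n_docs      <- 3      # documents in the corpus
n_docs_with <- 1      # documents containing the token

tf * log(n_docs / n_docs_with)  # 0.001 * ln(3) ~ 0.0011

# If the token instead appeared in every document, the score collapses to zero:
tf * log(n_docs / n_docs)       # 0.001 * ln(1) = 0
```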
Let’s try this out with our spooky author excerpts. You can see the first few rows of the spooky author data below.

id | text | author |
---|---|---|
id26305 | This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall. | EAP |
id17569 | It never once occurred to me that the fumbling might be a mere mistake. | HPL |
id11008 | In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction. | EAP |
id27763 | How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair. | MWS |
id12958 | Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk. | HPL |
id22965 | A youth passed in solitude, my best years spent under your gentle and feminine fosterage, has so refined the groundwork of my character that I cannot overcome an intense distaste to the usual brutality exercised on board ship: I have never believed it to be necessary, and when I heard of a mariner equally noted for his kindliness of heart and the respect and obedience paid to him by his crew, I felt myself peculiarly fortunate in being able to secure his services. | MWS |
One thing we could do here is to group by author and then compute term frequency, inverse document frequency for all tokens across the three resulting “documents”.
```r
library(tidyverse)   # dplyr verbs and the pipe
library(tidytext)    # unnest_tokens(), bind_tf_idf()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

spooky_tf_idf <- spooky_authors %>%
  unnest_tokens(word, text) %>%      # one row per word per excerpt
  group_by(author) %>%
  count(word) %>%                    # word counts within each author
  bind_tf_idf(word, author, n) %>%   # adds tf, idf, and tf_idf columns
  arrange(-tf_idf)

spooky_tf_idf %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
```
author | word | n | tf | idf | tf_idf |
---|---|---|---|---|---|
MWS | perdita | 156 | 0.0009415 | 1.0986123 | 0.0010343 |
MWS | adrian | 126 | 0.0007604 | 1.0986123 | 0.0008354 |
MWS | idris | 109 | 0.0006578 | 1.0986123 | 0.0007227 |
MWS | raymond | 248 | 0.0014967 | 0.4054651 | 0.0006069 |
MWS | windsor | 73 | 0.0004406 | 1.0986123 | 0.0004840 |
HPL | gilman | 64 | 0.0004096 | 1.0986123 | 0.0004500 |
HPL | innsmouth | 59 | 0.0003776 | 1.0986123 | 0.0004148 |
HPL | arkham | 58 | 0.0003712 | 1.0986123 | 0.0004078 |
EAP | dupin | 57 | 0.0002838 | 1.0986123 | 0.0003118 |
MWS | felix | 44 | 0.0002655 | 1.0986123 | 0.0002917 |
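Notice that only two distinct idf values appear in the table. With three author-level “documents”, a token can appear in one, two, or all three of them, so the idf column can only take one of three values; the numbers above match the formula exactly:

```r
# The three possible idf values when the corpus consists of three documents
log(3 / 1)  # token used by a single author: 1.0986123
log(3 / 2)  # token used by two authors:     0.4054651
log(3 / 3)  # token used by all three:       0
```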
It is likely that many of these top words are characters or locations specific to stories written by the authors. However, if we look further down in the list arranged by tf-idf we can see more commonplace words that are specific to each author’s lexicon.
```r
spooky_tf_idf %>%
  slice(200:210) %>%   # the data is still grouped, so this slices within each author
  kable() %>%
  kable_styling(c("hover", "striped"))
```
author | word | n | tf | idf | tf_idf |
---|---|---|---|---|---|
EAP | paces | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
EAP | patients | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
EAP | pe | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
EAP | pitcher | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
EAP | pon | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
EAP | snob | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
EAP | tarn | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
EAP | unparticled | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
EAP | vapor | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
EAP | vor | 8 | 0.0000398 | 1.0986123 | 4.38e-05 |
EAP | affair | 21 | 0.0001046 | 0.4054651 | 4.24e-05 |
HPL | wal | 9 | 0.0000576 | 1.0986123 | 6.33e-05 |
HPL | weedy | 9 | 0.0000576 | 1.0986123 | 6.33e-05 |
HPL | birch | 24 | 0.0001536 | 0.4054651 | 6.23e-05 |
HPL | hellish | 24 | 0.0001536 | 0.4054651 | 6.23e-05 |
HPL | marsh | 24 | 0.0001536 | 0.4054651 | 6.23e-05 |
HPL | fer | 23 | 0.0001472 | 0.4054651 | 5.97e-05 |
HPL | gardens | 23 | 0.0001472 | 0.4054651 | 5.97e-05 |
HPL | laboratory | 23 | 0.0001472 | 0.4054651 | 5.97e-05 |
HPL | arthur | 22 | 0.0001408 | 0.4054651 | 5.71e-05 |
HPL | curiously | 22 | 0.0001408 | 0.4054651 | 5.71e-05 |
HPL | didn't | 22 | 0.0001408 | 0.4054651 | 5.71e-05 |
MWS | eloquence | 14 | 0.0000845 | 0.4054651 | 3.43e-05 |
MWS | exertions | 14 | 0.0000845 | 0.4054651 | 3.43e-05 |
MWS | gentleness | 14 | 0.0000845 | 0.4054651 | 3.43e-05 |
MWS | occasioned | 14 | 0.0000845 | 0.4054651 | 3.43e-05 |
MWS | proceed | 14 | 0.0000845 | 0.4054651 | 3.43e-05 |
MWS | virtues | 14 | 0.0000845 | 0.4054651 | 3.43e-05 |
MWS | wondrous | 14 | 0.0000845 | 0.4054651 | 3.43e-05 |
MWS | alps | 5 | 0.0000302 | 1.0986123 | 3.32e-05 |
MWS | ambassador | 5 | 0.0000302 | 1.0986123 | 3.32e-05 |
MWS | assisted | 5 | 0.0000302 | 1.0986123 | 3.32e-05 |
MWS | blooming | 5 | 0.0000302 | 1.0986123 | 3.32e-05 |
Knowing what tokens distinguish one author from another can help us take an unlabeled passage and assign it to the most likely author.
```r
library(tidymodels)   # recipes, parsnip, workflows, yardstick
library(textrecipes)  # step_tokenize(), step_stopwords(), step_tfidf()

# Train on a random sample of 2500 excerpts to keep fitting time manageable
spooky_sample <- spooky_authors %>%
  sample_n(2500)

# A classification random forest, fit via the ranger engine
rf_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Preprocessing: tokenize, remove stopwords, convert tokens to tf-idf features
rf_rec <- recipe(author ~ text, data = spooky_sample) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tfidf(text)

rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(rf_rec)

rf_fit <- rf_wf %>%
  fit(spooky_sample)

# Tabulate true author against predicted author for the full data set
conf_matrix <- rf_fit %>%
  augment(spooky_authors) %>%
  mutate(author = as.factor(author)) %>%
  count(author, .pred_class) %>%
  pivot_wider(id_cols = author,
              names_from = .pred_class,
              values_from = n)
```
Warning in asMethod(object): sparse->dense coercion: allocating vector of size
1.4 GiB
conf_matrix
# A tibble: 3 × 4
author EAP HPL MWS
<fct> <int> <int> <int>
1 EAP 6830 507 563
2 HPL 2280 2929 426
3 MWS 2294 428 3322
```r
# Accuracy: correct predictions (the diagonal of the confusion matrix) over all excerpts
accuracy <- (conf_matrix[1,2] + conf_matrix[2,3] + conf_matrix[3,4])/nrow(spooky_authors)
```
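As an aside, tidymodels can compute this metric for us. A minimal sketch using yardstick’s accuracy() (loaded with tidymodels), applied to the same augmented predictions:

```r
# Sketch: the same accuracy via yardstick rather than manual indexing;
# accuracy() compares the true class to the .pred_class column added by augment()
rf_fit %>%
  augment(spooky_authors) %>%
  mutate(author = as.factor(author)) %>%
  accuracy(truth = author, estimate = .pred_class)
```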
Note that the model above takes quite some time to fit and to predict. This is because each distinct word in the training set (aside from stopwords) becomes a feature available to the model. Additionally, we’re fitting a random forest consisting of 500 individual trees (the ranger default). That being said, even though the model was trained on only 2500 records, it performs reasonably well: it has an accuracy of 0.6681138 and identifies excerpts from Edgar Allan Poe quite well. However, it has a difficult time distinguishing the other two authors’ excerpts from Poe’s.
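If fitting time is a concern, one common remedy (not used above) is to cap the vocabulary before computing tf-idf. Here is a sketch of such a recipe using textrecipes’ step_tokenfilter(); the max_tokens value of 500 is an arbitrary choice for illustration, not a tuned parameter:

```r
# Sketch: shrink the feature space before computing tf-idf
# (max_tokens = 500 is an illustrative choice, not tuned)
rf_rec_small <- recipe(author ~ text, data = spooky_sample) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text, max_tokens = 500) %>%  # keep only the 500 most frequent tokens
  step_tfidf(text)
```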
Summary
Over these past three topics, you’ve learned quite a bit about extracting insights from text data. We’ve engaged in basic text analysis through tokenization, looked at using regular expressions for extracting features from text using pattern matching, and now we’ve seen how to use a simple word embedding – tf-idf scores – as a component of a predictive model.
If you want to learn more about the basics of working with text data, check out *Tidy Text Mining with R* by Julia Silge and David Robinson. If you want to learn more about using text data in machine learning algorithms, check out *Supervised Machine Learning for Text Analysis in R* by Julia Silge and Emil Hvitfeldt.