Purpose: In this notebook we’ll consider a powerful technique falling under the umbrella of word embeddings. In particular, we’ll look at a measure called term frequency, inverse document frequency (tf-idf), which gauges how important a term is to one category of documents relative to other categories.

Data Used: We’ll use excerpts from Spooky Authors (Edgar Allan Poe, Mary Shelley, and H.P. Lovecraft), which appeared as part of a Kaggle Competition in 2018. For convenience, I’ve added the training data to my GitHub repository here.

The Big Idea

Are some tokens used more frequently in one type of document than in another? Here, we’ll look at a measure of how frequently a token is used within a particular class of documents and how unique it is to that class, relative to the other classes. The term frequency, inverse document frequency of a token is defined as follows: \(\displaystyle{\text{tf-idf}\left(\text{Token}\right) = f_{\text{Token}}\cdot\ln\left(\frac{n_{\text{Documents}}}{n_{\text{Documents Containing Token}}}\right)}\), where \(f_{\text{Token}}\) denotes the frequency of the \(\tt{Token}\) across the entire corpus.

In the formula for \(\text{tf-idf}\left(\text{Token}\right)\), if \(\tt{Token}\) appears across all of the documents, then the ratio \(\frac{n_{\text{Documents}}}{n_{\text{Documents Containing Token}}}\) will evaluate to \(1\) and the logarithm will be \(0\). That is, a token appearing across all documents cannot be useful in differentiating between documents. The logarithm will be largest when the token in question appears in just one of the documents and none of the others. This means that the tokens with the highest term frequency, inverse document frequency scores will be those unique to their document and used frequently within that document.
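To make the formula concrete, here is a minimal sketch that computes this score by hand for a tiny, made-up corpus (the documents and tokens below are purely illustrative and are not drawn from the spooky author data).

import math

# A tiny, made-up corpus of three "documents"
docs = [
    "the raven tapped at the door",
    "the monster fled across the ice",
    "the raven spoke nevermore"
]

def tf_idf(token):
    n_docs = len(docs)
    n_docs_with_token = sum(token in doc.split() for doc in docs)
    f_token = sum(doc.split().count(token) for doc in docs)  # frequency across the entire corpus
    return f_token * math.log(n_docs / n_docs_with_token)

print(tf_idf("raven"))  # 2*ln(3/2), about 0.81 -- used often, but only in some documents
print(tf_idf("the"))    # 0.0 -- appears in every document, so the logarithm is 0

Note that scikit-learn’s TfidfVectorizer, which we’ll use below, computes a smoothed, per-document variant of this score and normalizes it, but the intuition is the same.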

Let’s try this out with our spooky author excerpts. You can see the first few rows of the spooky author data below.

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

spooky_authors = pd.read_csv("https://raw.githubusercontent.com/agmath/agmath.github.io/master/data/classification/spooky_authors.csv")
py$spooky_authors %>%
  head() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
| id | text | author |
|----|------|--------|
| id26305 | This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall. | EAP |
| id17569 | It never once occurred to me that the fumbling might be a mere mistake. | HPL |
| id11008 | In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction. | EAP |
| id27763 | How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair. | MWS |
| id12958 | Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk. | HPL |
| id22965 | A youth passed in solitude, my best years spent under your gentle and feminine fosterage, has so refined the groundwork of my character that I cannot overcome an intense distaste to the usual brutality exercised on board ship: I have never believed it to be necessary, and when I heard of a mariner equally noted for his kindliness of heart and the respect and obedience paid to him by his crew, I felt myself peculiarly fortunate in being able to secure his services. | MWS |

Let’s compute term frequency, inverse document frequency for the excerpted text from our spooky authors. We’ll start by splitting our spooky_authors data into training and test sets. We’ll then isolate the text column, create a TfidfVectorizer(), fit it to the training text, and use it to transform that text.

train, test = train_test_split(spooky_authors, train_size = 0.75, random_state = 123)

spooky_train_text = train["text"].values

tfidf_vec = TfidfVectorizer(analyzer = "word", stop_words = "english")

tfidf_vec.fit(spooky_train_text)
## TfidfVectorizer(stop_words='english')
spooky_train_tfidf = tfidf_vec.transform(spooky_train_text)
spooky_train_tfidf
## <14684x22228 sparse matrix of type '<class 'numpy.float64'>'
##  with 165706 stored elements in Compressed Sparse Row format>

The spooky_train_tfidf object is stored as a sparse matrix. This means that most of the entries in the matrix are 0’s, and so it is more efficient to store the size of the matrix along with the locations and values of the non-zero entries (a sparse matrix data structure) than it is to store the entire matrix in memory.
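As a small illustration of the idea (using a made-up matrix rather than our tf-idf matrix), SciPy’s compressed sparse row format remembers the matrix’s dimensions but only stores the non-zero values and their positions.

import numpy as np
from scipy.sparse import csr_matrix

# A small matrix that is mostly zeros
dense = np.array([[0., 0., 3.],
                  [0., 0., 0.],
                  [1., 0., 0.]])

sparse = csr_matrix(dense)
print(sparse.shape)  # (3, 3) -- the full dimensions are remembered
print(sparse.nnz)    # 2 -- but only the two non-zero values are actually stored
print(sparse.data)   # [3. 1.] -- the non-zero values themselves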

It makes sense that the resulting matrix would be sparse. Each excerpt uses only a small fraction of the full lexicon present across all of the excerpts. Still, we can see part of the matrix if we use .toarray() to convert it to a dense array and print it out.

spooky_train_tfidf.toarray()
## array([[0., 0., 0., ..., 0., 0., 0.],
##        [0., 0., 0., ..., 0., 0., 0.],
##        [0., 0., 0., ..., 0., 0., 0.],
##        ...,
##        [0., 0., 0., ..., 0., 0., 0.],
##        [0., 0., 0., ..., 0., 0., 0.],
##        [0., 0., 0., ..., 0., 0., 0.]])
spooky_train_tfidf.toarray().shape
## (14684, 22228)

Since the matrix is sparse, we aren’t seeing anything useful! There are plenty of non-zero entries in here, though – the proportion of non-zero entries is just very small.
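In fact, since the sparse matrix keeps track of how many non-zero entries it holds, we can compute that proportion directly.

# Proportion of non-zero entries in the tf-idf matrix
n_nonzero = spooky_train_tfidf.nnz
n_entries = spooky_train_tfidf.shape[0] * spooky_train_tfidf.shape[1]
print(n_nonzero / n_entries)  # about 0.0005 -- roughly 0.05% of the entries are non-zero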

Now that we’ve got our text converted into usable numeric features, let’s build a model. I’ll use a random forest since we have so many features here!

rf_clf = RandomForestClassifier()
rf_clf.fit(spooky_train_tfidf, train["author"])
## RandomForestClassifier()

Now that our model is fit, let’s use it on a spooky sentence!

predictions = rf_clf.predict_proba(tfidf_vec.transform(['I laid down on the floor and stretched my arm towards the corner of the draped comforter. I lifted it slightly and...']))
predictions
## array([[0.6187619 , 0.32704762, 0.05419048]])
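The probabilities above are reported in the order given by rf_clf.classes_, so we can pair each author with the model’s predicted probability. Here is a quick sketch.

# Pair each author label with its predicted probability for our spooky sentence
for author, prob in zip(rf_clf.classes_, predictions[0]):
    print(author, round(prob, 3))

# The author with the highest probability is what .predict() would return
print(rf_clf.classes_[predictions[0].argmax()])

Given the probabilities above, the forest leans toward Edgar Allan Poe for this particular sentence.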

Let’s see how the forest actually performs on the unseen test excerpts from our spooky_authors data. We’ll first need to transform the test text using our tfidf_vec transformer – we won’t fit it again, because fitting it was part of our training process.

test_text_tfidf = tfidf_vec.transform(test["text"].values)
test_author_preds = rf_clf.predict(test_text_tfidf)
cm = confusion_matrix(test["author"], test_author_preds)
test_accuracy = accuracy_score(test["author"], test_author_preds)
print("Order of Classes: ", rf_clf.classes_)
## Order of Classes:  ['EAP' 'HPL' 'MWS']
cm
## array([[1459,  178,  309],
##        [ 379,  889,  144],
##        [ 352,  132, 1053]], dtype=int64)
print("Model (Test) Accuracy: ", test_accuracy)
## Model (Test) Accuracy:  0.6947906026557712

Note that the model above takes quite some time to fit and also to predict. This is because each distinct word in the training set (aside from stopwords) has become a feature that the model can use. Additionally, we’re fitting a random forest consisting of 100 individual trees. That being said, it is performing reasonably well. It has an accuracy of about 0.695 and identifies excerpts from Edgar Allan Poe quite well. However, it has a more difficult time distinguishing the other two authors’ excerpts from Poe’s.
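One way to quantify that last observation is to compute each author’s recall from the confusion matrix above: the diagonal entry divided by the corresponding row total.

# Per-author recall: correctly identified excerpts divided by the true number of excerpts
recalls = cm.diagonal() / cm.sum(axis = 1)
for author, recall in zip(rf_clf.classes_, recalls):
    print(author, round(recall, 3))

Working from the confusion matrix above, Poe’s recall is roughly 0.75, while Lovecraft’s and Shelley’s are closer to 0.63 and 0.69, and the majority of their misclassified excerpts are attributed to Poe.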


Summary

Over these past three topics, you’ve learned quite a bit about extracting insights from text data. We’ve engaged in basic text analysis through tokenization, looked at using regular expressions for extracting features from text using pattern matching, and now we’ve seen how to use a simple word embedding – tf-idf scores – as a component of a predictive model.