Purpose: In this notebook we’ll expand the range of recorded data we can work with by learning how to process and gain insights from text data using the {tidytext} package.

The Big Idea

Until now we’ve limited ourselves to working with strictly numerical or categorical data. Any other columns in our datasets, particularly text columns, have been ignored. I think we can all agree that there are likely valuable insights buried in text data, though; it’s just a matter of how to extract those insights.

We’ll spend several class meetings thinking about and utilizing text data. In this first notebook, we’ll talk about tokenization. We’ll treat each text entry as a bag of words, assigning each observed word to the entry it came from. We’ll also introduce the notion of stopwords, which are words so common that they should be ignored as part of an analysis.

In order to understand how tokenization works, let’s consider an example. Below is a small data frame consisting of the opening lines of each of three classic books.

book | opening_line
Moby Dick | Call me Ishmael.
A Tale of Two Cities | It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair.
Pride and Prejudice | It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.
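For reproducibility, a data frame like this one can be built directly. A minimal sketch follows; the object name opening_lines is my own choice, not from the original notebook.

```r
library(tibble)

# Opening lines of three classic novels (object name `opening_lines` is mine)
opening_lines <- tribble(
  ~book,                  ~opening_line,
  "Moby Dick",            "Call me Ishmael.",
  "A Tale of Two Cities", paste("It was the best of times, it was the worst of times,",
                                "it was the age of wisdom, it was the age of foolishness,",
                                "it was the epoch of belief, it was the epoch of incredulity,",
                                "it was the season of Light, it was the season of Darkness,",
                                "it was the spring of hope, it was the winter of despair."),
  "Pride and Prejudice",  paste("It is a truth universally acknowledged, that a single man",
                                "in possession of a good fortune, must be in want of a wife.")
)
```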

Let’s tokenize the statements! The first six rows of the result are shown below.

book | word
Moby Dick | call
Moby Dick | me
Moby Dick | ishmael
A Tale of Two Cities | it
A Tale of Two Cities | was
A Tale of Two Cities | the

We’ve just broken the sentences up into single words. Notice that our data frame has gotten much longer. We’ve moved from 3 rows to 86 rows of data.
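Tokenization like this is done in {tidytext} with unnest_tokens(); a sketch, continuing from the hypothetical opening_lines object above:

```r
library(tidytext)
library(dplyr)

# One row per token; unnest_tokens() lowercases words and strips
# punctuation by default
tokenized_lines <- opening_lines %>%
  unnest_tokens(word, opening_line)

nrow(tokenized_lines)  # 86
```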

We can also see that some of these tokens are likely uninformative; words like it, was, and the are highly likely to appear in any sentence. These very common and uninformative words are often called stopwords. The {tidytext} package has a pre-built data frame of 1,149 stopwords, stop_words, that we can use as a starting point for removal. The first six rows of the stop_words data frame are shown below.

word | lexicon
a | SMART
a’s | SMART
able | SMART
about | SMART
above | SMART
according | SMART

You can also see that the stopwords come from different lexicons; those available are SMART, snowball, and onix. Let’s see what the data frame of tokenized opening lines looks like if we remove all stopwords.
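Stopword removal is typically done with an anti_join() against the stop_words data frame; continuing the sketch from above:

```r
# Keep only the tokens that do not appear in the stop_words data frame
tidy_lines <- tokenized_lines %>%
  anti_join(stop_words, by = "word")
```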

book | word
Moby Dick | call
Moby Dick | ishmael
A Tale of Two Cities | times
A Tale of Two Cities | worst
A Tale of Two Cities | times
A Tale of Two Cities | age
A Tale of Two Cities | wisdom
A Tale of Two Cities | age
A Tale of Two Cities | foolishness
A Tale of Two Cities | epoch
A Tale of Two Cities | belief
A Tale of Two Cities | epoch
A Tale of Two Cities | incredulity
A Tale of Two Cities | season
A Tale of Two Cities | light
A Tale of Two Cities | season
A Tale of Two Cities | darkness
A Tale of Two Cities | spring
A Tale of Two Cities | hope
A Tale of Two Cities | winter
A Tale of Two Cities | despair
Pride and Prejudice | truth
Pride and Prejudice | universally
Pride and Prejudice | acknowledged
Pride and Prejudice | single
Pride and Prejudice | possession
Pride and Prejudice | fortune
Pride and Prejudice | wife

The remaining words look much more informative, and now we’ve only got 28 tokens to deal with.
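As a quick sanity check, we can count the remaining tokens (continuing with the tidy_lines object from the sketch above):

```r
# Total tokens remaining after stopword removal
nrow(tidy_lines)  # 28

# Tokens remaining per book
tidy_lines %>%
  count(book)
```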

How Can We Use This?

Tokenization is a very basic approach, but it can be really powerful. For example, let’s consider two full texts from Jane Austen – Pride and Prejudice and Sense and Sensibility. We can get these using the {janeaustenr} package.
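The original notebook renders an interactive frequency-comparison plot at this point. Below is a minimal sketch of one way such a plot could be built; the object names and plotting choices here are my own, not necessarily those used in the original.

```r
library(janeaustenr)
library(tidytext)
library(dplyr)
library(tidyr)
library(ggplot2)
library(plotly)

# Tokenize both novels, drop stopwords, and compute within-book word frequencies
frequencies <- austen_books() %>%
  filter(book %in% c("Pride & Prejudice", "Sense & Sensibility")) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(book, word) %>%
  group_by(book) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = book, values_from = proportion)

# Plot each word's frequency in one novel against the other; the
# dashed line marks words equally represented in both texts
p <- ggplot(frequencies,
            aes(x = `Pride & Prejudice`, y = `Sense & Sensibility`,
                label = word)) +
  geom_point(alpha = 0.3) +
  geom_abline(linetype = "dashed") +
  scale_x_log10() +
  scale_y_log10()

ggplotly(p)  # interactive version, with zooming
```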

The plot above shows word frequencies across the two texts. For example, we can see that each main character’s name appears many times in its own text but very infrequently in the other. Words falling near the dashed line are about equally represented in both texts, while words falling further from the line are more strongly represented in the text whose axis they sit closer to. I’ve made the plot with {plotly} so you can zoom in and out as you like.

Implementation in {tidymodels} with {textrecipes}

The {textrecipes} library provides access to several feature engineering step_*() functions designed specifically for text features. For example, two recipe steps related to the concepts in this notebook are step_tokenize(), which splits a text column into word tokens, and step_stopwords(), which removes stopwords from a tokenized column.

These steps can be added to a recipe() along with, perhaps, step_dummy() to engineer features from text columns which can be utilized in our classification models. You can learn more from the {textrecipes} vignette.
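For instance, here is a minimal sketch of such a recipe, assuming a data frame reviews_df with an outcome column label and a text column review_text (all hypothetical names):

```r
library(tidymodels)
library(textrecipes)

# `reviews_df`, `label`, and `review_text` are hypothetical stand-ins
text_rec <- recipe(label ~ review_text, data = reviews_df) %>%
  step_tokenize(review_text) %>%                        # split the text into word tokens
  step_stopwords(review_text) %>%                       # remove stopwords
  step_tokenfilter(review_text, max_tokens = 100) %>%   # keep only the most frequent tokens
  step_tf(review_text)                                  # one term-frequency feature per token
```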


Summary

In this notebook we introduced word tokenization as a basic technique that can be useful in the analysis of text data. We also introduced the notion of stopwords, common and uninformative words that can easily be removed from a tokenized corpus (the usual name for a collection of text data). Finally, we saw that data visualization can help identify tokens common in one corpus but not in another, and that such a plot could guide the creation of text-based features (via dummy variables) for a model.

We’ll do more with tokenization in our next class meeting.