Purpose: In this notebook we’ll expand our ability to work with recorded data by seeing how to process and gain insights from text data, using the {tidytext} package in R and {pandas} functionality in Python.
Note: Exploratory analyses with text features seem quite a bit simpler in R. Perhaps my bias is showing, but I explored several Python packages, including {nltk}, {SpaCy}, and a port of {tidytext}, and none of them felt as intuitive to me as {tidytext} itself. I’d encourage you to use R functionality here for your exploratory work, but then use {reticulate} to switch back to {sklearn} and include text steps in your feature engineering Pipeline()s when it is time to build models. More on this towards the end of the notebook.
Until now we’ve limited ourselves to working with strictly numerical or categorical data. Any other columns in our datasets, particularly text columns, have been ignored. I think we can all agree that there are likely valuable insights buried in text data, though – it’s just a matter of how to extract those insights.
We’ll spend several class meetings thinking about and utilizing text data. In this first notebook, we’ll talk about tokenization. We’ll treat each text entry as a bag of words, assigning every observed word to the record it came from. We’ll also introduce the notion of stopwords, which are words so common that they should be ignored as part of an analysis.
In order to understand how tokenization works, let’s consider an example. Below is a small dataframe consisting of the opening lines of each of three classic books.
book | opening_line |
---|---|
Moby Dick | Call me Ishmael. |
A Tale of Two Cities | It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair. |
Pride and Prejudice | It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. |
Let’s tokenize the statements!
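Since this version of the notebook works in Python, here’s a minimal {pandas} sketch of the tokenization step: split each opening line on whitespace, then explode() the result so that every token gets its own row. The longer opening lines are abbreviated here, and the variable names are just illustrative.

```python
import pandas as pd

# Opening lines of the three books (the longer ones abbreviated here)
opening_lines = pd.DataFrame({
    "book": ["Moby Dick", "A Tale of Two Cities", "Pride and Prejudice"],
    "opening_line": [
        "Call me Ishmael.",
        "It was the best of times, it was the worst of times,",
        "It is a truth universally acknowledged, that a single man",
    ],
})

# Tokenize: split on whitespace, then give every token its own row
tokens = (
    opening_lines
    .assign(value=opening_lines["opening_line"].str.split())
    .explode("value")
    .loc[:, ["book", "value"]]
    .reset_index(drop=True)
)

print(tokens.head())
```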
book | value |
---|---|
A Tale of Two Cities | It |
A Tale of Two Cities | was |
A Tale of Two Cities | the |
A Tale of Two Cities | best |
A Tale of Two Cities | of |
A Tale of Two Cities | times, |
We’ve just broken the sentences up into single words. Notice that our data frame has gotten much longer. We’ve moved from 3 rows to 180 rows of data.
We can also see that some of these tokens are likely uninformative – words like `it`, `was`, and `the` are highly likely to appear in any sentence. These very common and uninformative words are often called stopwords. The {nltk} package has a pre-built list of 179 English stopwords that we can use as a starting point for removal. The first six stopwords in that list are shown below.
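In Python we can grab this list from {nltk}’s stopwords corpus. A quick sketch (it assumes the corpus has been downloaded, which nltk.download() handles):

```python
import nltk
from nltk.corpus import stopwords

# One-time download of the stopwords corpus (a no-op if already present)
nltk.download("stopwords", quiet=True)

# Pull the English stopword list and peek at the first few entries
english_stops = stopwords.words("english")
print(len(english_stops))
print(english_stops[:6])
```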
word |
---|
i |
me |
my |
myself |
we |
our |
Note that {nltk} actually provides stopword lists for a number of different languages – we’re using the English list here. Let’s see what the data frame of tokenized opening lines looks like if we remove all stopwords.
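A minimal sketch of the removal with {pandas}, continuing from the tokens data frame and english_stops list built above. Note that isin() matches exactly – the comparison is case-sensitive and punctuation is left intact – which is why tokens like the capitalized "It" and "times," survive the filter:

```python
# Drop any token that appears in the stopword list. The match is exact:
# capitalized tokens ("It") and tokens with punctuation ("times,") are
# not in the lowercase stopword list, so they survive this naive filter.
informative = tokens[~tokens["value"].isin(english_stops)]
print(informative)
```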
| | index | book | value |
|---|---|---|---|
0 | 0 | A Tale of Two Cities | It |
2 | 22 | A Tale of Two Cities | best |
3 | 35 | A Tale of Two Cities | times, |
4 | 36 | A Tale of Two Cities | times, |
5 | 46 | A Tale of Two Cities | worst |
6 | 47 | A Tale of Two Cities | age |
7 | 48 | A Tale of Two Cities | age |
8 | 49 | A Tale of Two Cities | wisdom, |
9 | 50 | A Tale of Two Cities | foolishness, |
10 | 51 | A Tale of Two Cities | epoch |
11 | 52 | A Tale of Two Cities | epoch |
12 | 53 | A Tale of Two Cities | belief, |
13 | 54 | A Tale of Two Cities | incredulity, |
14 | 55 | A Tale of Two Cities | season |
15 | 56 | A Tale of Two Cities | season |
16 | 57 | A Tale of Two Cities | Light, |
17 | 58 | A Tale of Two Cities | Darkness, |
18 | 59 | A Tale of Two Cities | spring |
19 | 60 | A Tale of Two Cities | hope, |
20 | 61 | A Tale of Two Cities | winter |
21 | 62 | A Tale of Two Cities | despair. |
22 | 63 | Moby Dick | Call |
23 | 65 | Moby Dick | Ishmael. |
1 | 1 | Pride and Prejudice | It |
118 | 165 | Pride and Prejudice | truth |
119 | 166 | Pride and Prejudice | universally |
120 | 167 | Pride and Prejudice | acknowledged, |
121 | 169 | Pride and Prejudice | single |
122 | 170 | Pride and Prejudice | man |
123 | 173 | Pride and Prejudice | possession |
124 | 174 | Pride and Prejudice | good |
125 | 175 | Pride and Prejudice | fortune, |
126 | 176 | Pride and Prejudice | must |
127 | 178 | Pride and Prejudice | want |
128 | 179 | Pride and Prejudice | wife. |
The remaining words look much more informative, and now we’ve only got 35 tokens to deal with.
Tokenization is a very basic approach, but it can be really powerful. For example, let’s consider two full texts from Jane Austen – *Pride and Prejudice* and *Sense and Sensibility*. We can get these using the {janeaustenr} package. If you look at the RMD file, you’ll see that the code below uses {tidytext} in R because of its convenience.
The plot above shows word frequencies across the two texts. For example, we can see that each main character’s name appears often in its own text but very infrequently in the other. Words falling near the dashed line are about equally represented in both texts, while words falling further from the line are more strongly represented in the text corresponding to the axis they sit closer to. I’ve made the plot with {plotly} so you can zoom in and out as you like.
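The notebook builds the plot with {tidytext} and {plotly} in R, but the underlying computation is simple enough to sketch in {pandas}. The sketch below assumes pride_text and sense_text hold the full novel texts as strings (the real notebook obtains them via {janeaustenr}); here they’re stubbed with the opening lines so the code runs:

```python
import pandas as pd

# Placeholders: in practice these would hold the full novels,
# which the notebook obtains via the {janeaustenr} package in R
pride_text = "It is a truth universally acknowledged, that a single man"
sense_text = "The family of Dashwood had long been settled in Sussex."

def word_freqs(text: str, label: str) -> pd.Series:
    """Lowercase, whitespace-tokenize, and compute relative word frequencies."""
    words = pd.Series(text.lower().split())
    return words.value_counts(normalize=True).rename(label)

# Align the two frequency tables on the word index; a word missing from
# one text gets frequency 0 there
freqs = pd.concat(
    [word_freqs(pride_text, "pride"), word_freqs(sense_text, "sense")],
    axis=1,
).fillna(0)

# Words with similar values in both columns sit near the plot's diagonal;
# words with very different values are characteristic of one novel
print(freqs.sort_values("pride", ascending=False).head(10))
```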
{sklearn}
There are several feature engineering steps from sklearn.feature_extraction.text which can be useful inclusions in your modeling pipelines. One of them is relevant to the discussion in this notebook: CountVectorizer(). You can learn more about this particular feature engineering step from the CountVectorizer() documentation.
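As a sketch of what including CountVectorizer() in a pipeline might look like – the classifier choice and parameter values here are illustrative assumptions, not the notebook’s settings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# CountVectorizer tokenizes each raw document and produces a sparse matrix
# of token counts; stop_words="english" applies sklearn's built-in English
# stopword list, and max_features caps the vocabulary size
text_pipeline = Pipeline([
    ("vectorize", CountVectorizer(stop_words="english", max_features=500)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Usage sketch: X_train would be a list or Series of raw text strings
# text_pipeline.fit(X_train, y_train)
```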
In this notebook we introduced word tokenization, a basic technique that can be useful in the analysis of text data. We also introduced the notion of stopwords – common, uninformative words that can easily be removed from a tokenized corpus (that’s what we commonly call a collection of text data). Finally, we saw that data visualization can help identify tokens common in one corpus but not in another; we could use such a plot to create text-based features (via dummy variables) for a model.
We’ll do more with tokenization in our next class meeting.