Purpose: In this notebook we’ll continue to expand our ability to extract insights from text data using a pattern-matching tool called regular expressions or regex.
Often when working with text, we know the structure of the token we are looking for, but we don’t know the exact token. For example, we may be digging through listings of items for sale on a social media marketplace and want to know the cost of each item. We know that we are looking for a dollar sign ($) followed by some digits, but we don’t know the exact price. This isn’t much of a problem for humans reading through a handful of postings, but if we want a computer to do this, we need some way to describe to it what a typical “price” looks like. This is where regular expressions come in.
Before we move forward, unless you work with regular expressions quite often, it’s likely that you’ll find them quite mysterious and that you’ll forget the syntax for matching things like digits. Here’s a link to a regex cheat sheet from Dave Child and a link to an interactive regex applet so that you can check that your regular expressions are working as you intend.
Regular expressions are also something you might ask your favorite AI assistant for help with. For example, I asked ChatGPT, “Hey ChatGPT, can you help me write a regular expression that will match dollar values in a posting? In particular, these dollar values may have commas separating thousands or millions of dollars and could contain decimals for cents. Thank you!” – ChatGPT returned with the following regular expression \$\d{1,3}(?:,\d{3})*(?:\.\d{1,2})?, which works as intended according to the regex applet from the previous paragraph.
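As a quick check that does not rely on the applet, the pattern quoted above can also be tried with Python's built-in re module. This is only a minimal sketch: the listings are made up for illustration and the name PRICE_PATTERN is hypothetical.

```python
import re

# The pattern quoted in the text: a dollar sign, 1-3 digits, any number of
# comma-separated groups of three digits, and an optional 1-2 digit decimal part.
PRICE_PATTERN = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{1,2})?")

# Hypothetical marketplace listings, invented for this example.
listings = [
    "Mountain bike, barely used, $250 OBO",
    "2015 sedan, clean title, $12,500.00 firm",
    "Free couch, you haul",
    "Camper for $1,250,000.5 or trade for $20",
]

# findall() returns every non-overlapping match in each listing.
prices = [PRICE_PATTERN.findall(text) for text in listings]
print(prices)
# [['$250'], ['$12,500.00'], [], ['$1,250,000.5', '$20']]
```

One thing to keep in mind is that the pattern assumes thousands are separated by commas. For a price written without commas, such as $1234.56, it matches only $123, so it is worth checking how prices are actually written in your data.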
There are lots of specialty text processing packages in Python, including {nltk}, {spaCy}, and others. We can accomplish quite a bit in Python using basic string functionality from {pandas} and regular expressions though. In particular, the following functions can be applied to columns of data frames.
- Counter() can be used in conjunction with the .most_common() method to identify the most common words in a text column.
- .str.contains() will search a string for a pattern match using either fixed strings or regular expressions.
- .str.extract() will collect the first component in a string which matches a pattern, while .str.extractall() will collect all of the matching components.

There’s lots we can do, and you’ll find more use-cases as you continue to explore working with text.
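To make the functions listed above concrete, here is a small sketch on a made-up data frame. The posts data frame, its text column, and the new column names are all hypothetical. The word-splitting pattern is a deliberately crude assumption (runs of lowercase letters only).

```python
from collections import Counter

import pandas as pd

# Hypothetical postings, invented for this example.
posts = pd.DataFrame({
    "text": [
        "Selling a bike for $250",
        "Bike helmet, free to a good home",
        "Road bike and rack, $1,200.50 for both",
    ]
})

# Counter() + .most_common(): tally lowercase words across the whole column.
words = posts["text"].str.lower().str.findall(r"[a-z]+").explode()
top_words = Counter(words).most_common(3)

# .str.contains() with a regular expression: does the row mention a price?
posts["has_price"] = posts["text"].str.contains(r"\$\d")
# .str.contains() with a fixed string: regex=False treats "$" literally.
posts["has_dollar_sign"] = posts["text"].str.contains("$", regex=False)

# .str.extract(): keep the first match of the capture group in each row.
# expand=False returns a Series rather than a one-column data frame.
posts["price"] = posts["text"].str.extract(
    r"(\$\d{1,3}(?:,\d{3})*(?:\.\d{1,2})?)", expand=False
)

print(top_words[0])  # ('bike', 3)
print(posts[["has_price", "price"]])
```

Rows without a match get a missing value in the price column, which makes it easy to filter them out afterwards.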
Let’s consider an example. I’ll load this dataset containing Joe Biden’s tweets from 2007 to 2020 which was posted to Kaggle by user Vopani in 2020. We can see the first few tweets below.
import pandas as pd
biden_tweets = pd.read_csv("https://raw.githubusercontent.com/agmath/agmath.github.io/master/data/classification/biden_tweets.csv")
py$biden_tweets %>%
select(-id, -url) %>%
head() %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "striped"))
timestamp | tweet | replies | retweets | quotes | likes |
---|---|---|---|---|---|
2007-10-24 22:45 | Tune in 11:30 ET tomorrow for a live webcast of Families USA Presidential Forum on health care: http://presidentialforums.health08.org/ | 19 | 5 | 17 | 11 |
2007-12-29 15:35 | Iowans, there’s a good chance there’s a Biden near you today on a cool 14 F day: http://blog.joebiden.com/?p=1625 | 13 | 16 | 6 | 22 |
2012-04-09 09:42 | We’re excited to announce that @JoeBiden is being rebooted for the 2012 campaign season to give you news of the Vice President on the trail. | 21 | 82 | 1 | 20 |
2012-04-09 09:43 | Campaign staff will run this account to keep you up to date on what the VP’s up to, but you’ll see occasional tweets from Joe himself, too. | 144 | 76 | 37 | 51 |
2012-04-09 13:11 | News for you this morning: VP Biden will speak in Exeter, NH on 4/12 on tax fairness and the President’s support for the #BuffettRule. | 10 | 54 | 0 | 5 |
2012-04-09 13:25 | In NH on 4/12, the Vice President will give his take on why millionaires shouldn’t pay a lower tax rate than middle class families do. | 16 | 52 | 0 | 6 |
Perhaps we want to extract the first appearing hashtag from each of the tweets posted by @JoeBiden. We can see the first tweets containing hashtags, along with their extracted hashtag, below.
biden_tweets["hashtags"] = biden_tweets["tweet"].str.extract(r"(#[A-Za-z\d-]+)")
biden_hashtags = biden_tweets.loc[biden_tweets["hashtags"].notnull(), ["tweet", "hashtags"]]
py$biden_hashtags %>%
head() %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "striped"))
index | tweet | hashtags |
---|---|---|
4 | News for you this morning: VP Biden will speak in Exeter, NH on 4/12 on tax fairness and the President’s support for the #BuffettRule. | #BuffettRule |
9 | The VP will be speaking in NH on Thursday about the #BuffettRule. Here’s what you need to know about it: http://t.co/sWu4EjD0 | #BuffettRule |
13 | Heads up: We’ll be livestreaming the Vice President’s New Hampshire speech on the #BuffettRule tomorrow. Watch this space for details. | #BuffettRule |
15 | RT @OFA_NH: #Exeter gets ready for @JoeBiden’s #BuffettRule speech here tomorrow: http://t.co/8YZjlEqe | #Exeter |
16 | Compare your tax rate to Mitt Romney’s: http://t.co/R0vmjGJe #BuffettRule | #BuffettRule |
17 | The Vice President is speaking about the #BuffettRule in Exeter, New Hampshire at 12:15pm ET—tune in here: http://t.co/QK0yYL8X | #BuffettRule |
Perhaps we wanted all of the hashtags appearing in each tweet. Below is one way to do that – but I am sure that there is a more efficient way.
biden_hashtags = pd.DataFrame(biden_tweets["tweet"].str.extractall(r"(#[A-Za-z\d-]+)"))
biden_hashtags["tweet_number"] = biden_hashtags.index.get_level_values(0)
biden_hashtags.columns = ["hashtags", "tweet_number"]
biden_tweets["tweet_number"] = biden_tweets.index
biden_hashtags = biden_hashtags.merge(biden_tweets[["tweet_number", "tweet"]], how = "left", left_on = "tweet_number", right_on = "tweet_number")
py$biden_hashtags %>%
head() %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "striped"))
hashtags | tweet_number | tweet |
---|---|---|
#BuffettRule | 4 | News for you this morning: VP Biden will speak in Exeter, NH on 4/12 on tax fairness and the President’s support for the #BuffettRule. |
#BuffettRule | 9 | The VP will be speaking in NH on Thursday about the #BuffettRule. Here’s what you need to know about it: http://t.co/sWu4EjD0 |
#BuffettRule | 13 | Heads up: We’ll be livestreaming the Vice President’s New Hampshire speech on the #BuffettRule tomorrow. Watch this space for details. |
#Exeter | 15 | RT @OFA_NH: #Exeter gets ready for @JoeBiden’s #BuffettRule speech here tomorrow: http://t.co/8YZjlEqe |
#BuffettRule | 15 | RT @OFA_NH: #Exeter gets ready for @JoeBiden’s #BuffettRule speech here tomorrow: http://t.co/8YZjlEqe |
#BuffettRule | 16 | Compare your tax rate to Mitt Romney’s: http://t.co/R0vmjGJe #BuffettRule |
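One possibly more compact route to the same kind of table is to collect every match with .str.findall() and then give each hashtag its own row with .explode(). The sketch below runs on a small made-up data frame (tweets) rather than the real data, and I have not compared its speed against the approach above, so treat it as an alternative to try rather than a guaranteed improvement.

```python
import pandas as pd

# Hypothetical stand-in for the tweets data frame, invented for this example.
tweets = pd.DataFrame({
    "tweet": [
        "No tags here",
        "RT: #Exeter gets ready for the #BuffettRule speech",
        "Compare your tax rate #BuffettRule",
    ]
})

hashtags = (
    tweets["tweet"]
    .str.findall(r"#[A-Za-z\d-]+")   # list of every hashtag in each tweet
    .explode()                        # one row per hashtag (NaN if none found)
    .dropna()                         # drop tweets with no hashtags
    .rename("hashtags")               # name the resulting column
    .rename_axis("tweet_number")      # name the index so it survives reset_index
    .reset_index()
    # Attach the original tweet text by matching tweet_number to the row index.
    .merge(tweets, left_on="tweet_number", right_index=True, how="left")
)

print(hashtags[["hashtags", "tweet_number"]])
```

As in the table above, a tweet with two hashtags appears twice, once per hashtag.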
Your ability to extract insights from text programmatically is strongly enhanced by command of regular expressions. This tool allows you to perform targeted inquiries about the presence or absence of particular components of text. Regular expressions can also be a way to extract numerical or categorical features from text descriptions as long as you are reasonably confident that the way those features appear across responses follows a consistent pattern.
In terms of model construction, you may continue to use some of the {sklearn} steps from the sklearn.feature_extraction.text submodule from the previous notebook to engineer new model features. You may also create custom feature engineering functions for your modeling Pipeline() but doing so is outside of my expertise. You can read more about custom feature engineering steps here, from Andrew Villazon.
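For a rough idea of what such a custom step might look like, one option in {sklearn} is to wrap an ordinary function in FunctionTransformer so that it can sit inside a Pipeline(). This is a minimal sketch and not necessarily the approach taken in the linked article. The function regex_features, the feature names, and the example texts are all hypothetical.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def regex_features(texts):
    """Turn an iterable of strings into a data frame of regex-based features."""
    s = pd.Series(list(texts), dtype="object")
    return pd.DataFrame({
        # Number of hashtags, using the same pattern as earlier in the notebook.
        "n_hashtags": s.str.count(r"#[A-Za-z\d-]+"),
        # 1 if the text contains something that looks like a URL, else 0.
        "has_url": s.str.contains(r"https?://").astype(int),
    })


# Hypothetical texts, invented for this example.
texts = [
    "Tune in tomorrow: http://example.com/forum",
    "#Exeter gets ready for the #BuffettRule speech",
    "Compare your tax rate: https://example.com/tax #BuffettRule",
]

# A one-step pipeline; a model could be appended as a later step.
feature_pipe = Pipeline([("regex", FunctionTransformer(regex_features))])
features = feature_pipe.fit_transform(texts)
print(features)
```

Because the function has nothing to learn from the training data, fitting does no real work here. A step that needs to learn something (a vocabulary, for example) would call for a full custom transformer class instead.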
We’ll do more with regular expressions in our next class meeting. For now, you’ve seen how to use regular expressions for pattern matching. We can use regular expressions to filter observations or to extract particular components from text.