Purpose: In this notebook we’ll continue to expand our ability to extract insights from text data using a pattern-matching tool called regular expressions or regex.

The Big Idea

Oftentimes when working with text, we know the structure of the token we are looking for, but we don’t know the exact token. For example, we may be digging through listings of items for sale on a social media marketplace and we want to know the cost of each of those items. We know that we are looking for a dollar sign ($) followed by some digits, but we don’t know the exact price. This isn’t so much a problem for humans reading through a handful of postings, but if we are trying to get a computer to do this, we need some way to describe what a typical “price” might look like. This is where regular expressions come in.
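As a quick sketch of that idea (the two listing strings below are made up for illustration), even a very simple pattern like \$\d+ captures “a dollar sign followed by some digits”:

import re

# A couple of made-up marketplace postings for illustration
postings = [
    "Coffee table, lightly used, asking $45",
    "Mountain bike for sale: $350 or best offer"
]

# \$ matches a literal dollar sign, \d+ matches one or more digits
price_pattern = r"\$\d+"

for post in postings:
    print(re.findall(price_pattern, post))
# ['$45']
# ['$350']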

Building regex

Before we move forward: unless you work with regular expressions quite often, it’s likely that you’re going to find them quite mysterious and that you’ll forget the syntax for matching things like digits. Here’s a link to a regex cheat sheet from Dave Child and a link to an interactive regex applet so that you can check that your regular expressions are working as you intend.

Regular expressions are also something you might ask your favorite AI assistant for help with. For example, I asked ChatGPT, “Hey ChatGPT, can you help me write a regular expression that will match dollar values in a posting? In particular, these dollar values may have commas separating thousands or millions of dollars and could contain decimals for cents. Thank you!” ChatGPT returned the following regular expression: \$\d{1,3}(?:,\d{3})*(?:\.\d{1,2})?, which works as intended according to the regex applet from the previous paragraph.
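If you’d rather confirm the pattern in Python than in the applet, a quick check with the built-in re module (on a few invented price strings) might look like this:

import re

# The pattern suggested by ChatGPT: up to three leading digits,
# optional comma-separated groups of three, optional cents
price_pattern = r"\$\d{1,3}(?:,\d{3})*(?:\.\d{1,2})?"

# A few invented test strings
tests = ["Asking $1,250,000 for the house", "Coffee: $4.75", "No price here"]

for text in tests:
    print(re.findall(price_pattern, text))
# ['$1,250,000']
# ['$4.75']
# []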

How Do We Implement and Use regex?

There are lots of specialty text processing packages in Python, including {nltk}, {SpaCy}, and others. We can accomplish quite a bit in Python using basic string functionality from {pandas} and regular expressions though. In particular, the following functions can be applied to columns of data frames.

  • Counter() (from the collections module) can be used in conjunction with the .most_common() method to identify the most common words in a text column.
  • str.contains() will search a string for a pattern match using either fixed strings or regular expressions.
  • str.extract() will pull out the first component of a string which matches a pattern (str.extractall() collects every match). A short sketch using all three appears just after this list.
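Here is a minimal sketch of those three tools on a small, made-up column of text (the posts data frame below is invented purely for illustration):

import pandas as pd
from collections import Counter

# A tiny, made-up text column for illustration
posts = pd.DataFrame({
    "text": ["Selling a desk for $40", "Free desk chair today", "Desk lamp, $15 or best offer"]
})

# Counter() + .most_common(): most frequent (lowercased) words across the column
word_counts = Counter(" ".join(posts["text"]).lower().split())
print(word_counts.most_common(3))

# str.contains(): flag rows whose text matches a pattern (here, a dollar amount)
print(posts["text"].str.contains(r"\$\d+", regex=True))

# str.extract(): pull the first dollar amount out of each row (NaN where there is none)
print(posts["text"].str.extract(r"(\$\d+)"))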

There’s lots we can do, and you’ll find more use-cases as you continue to explore working with text.

Let’s consider an example. I’ll load this dataset containing Joe Biden’s tweets from 2007 to 2020, which was posted to Kaggle by user Vopani in 2020. We can see the first few tweets below.

import pandas as pd

biden_tweets = pd.read_csv("https://raw.githubusercontent.com/agmath/agmath.github.io/master/data/classification/biden_tweets.csv")
# Peek at the first few tweets, dropping the id and url columns
biden_tweets.drop(columns=["id", "url"]).head()
| timestamp | tweet | replies | retweets | quotes | likes |
|---|---|---|---|---|---|
| 2007-10-24 22:45 | Tune in 11:30 ET tomorrow for a live webcast of Families USA Presidential Forum on health care: http://presidentialforums.health08.org/ | 19 | 5 | 17 | 11 |
| 2007-12-29 15:35 | Iowans, there’s a good chance there’s a Biden near you today on a cool 14 F day: http://blog.joebiden.com/?p=1625 | 13 | 16 | 6 | 22 |
| 2012-04-09 09:42 | We’re excited to announce that @JoeBiden is being rebooted for the 2012 campaign season to give you news of the Vice President on the trail. | 21 | 82 | 1 | 20 |
| 2012-04-09 09:43 | Campaign staff will run this account to keep you up to date on what the VP’s up to, but you’ll see occasional tweets from Joe himself, too. | 144 | 76 | 37 | 51 |
| 2012-04-09 13:11 | News for you this morning: VP Biden will speak in Exeter, NH on 4/12 on tax fairness and the President’s support for the #BuffettRule. | 10 | 54 | 0 | 5 |
| 2012-04-09 13:25 | In NH on 4/12, the Vice President will give his take on why millionaires shouldn’t pay a lower tax rate than middle class families do. | 16 | 52 | 0 | 6 |

Perhaps we want to extract the first appearing hashtag from each of the tweets posted by @JoeBiden. We can see the first tweets containing hashtags, along with their extracted hashtag, below.

biden_tweets["hashtags"] = biden_tweets["tweet"].str.extract(r"(#[A-Za-z\d-]+)")
biden_hashtags = biden_tweets.loc[biden_tweets["hashtags"].notnull(), ["tweet", "hashtags"]]
# Peek at the first few tweets containing a hashtag
biden_hashtags.head()
|  | tweet | hashtags |
|---|---|---|
| 4 | News for you this morning: VP Biden will speak in Exeter, NH on 4/12 on tax fairness and the President’s support for the #BuffettRule. | #BuffettRule |
| 9 | The VP will be speaking in NH on Thursday about the #BuffettRule. Here’s what you need to know about it: http://t.co/sWu4EjD0 | #BuffettRule |
| 13 | Heads up: We’ll be livestreaming the Vice President’s New Hampshire speech on the #BuffettRule tomorrow. Watch this space for details. | #BuffettRule |
| 15 | RT @OFA_NH: #Exeter gets ready for @JoeBiden’s #BuffettRule speech here tomorrow: http://t.co/8YZjlEqe | #Exeter |
| 16 | Compare your tax rate to Mitt Romney’s: http://t.co/R0vmjGJe #BuffettRule | #BuffettRule |
| 17 | The Vice President is speaking about the #BuffettRule in Exeter, New Hampshire at 12:15pm ET—tune in here: http://t.co/QK0yYL8X | #BuffettRule |

Perhaps we want all of the hashtags appearing in each tweet, not just the first. Below is one way to do that, though I am sure there is a more efficient way.

# Find every hashtag in every tweet; the result has one row per match,
# indexed by the original row number and the match number within that tweet
biden_hashtags = biden_tweets["tweet"].str.extractall(r"(#[A-Za-z\d-]+)")

# Recover each hashtag's originating tweet from the first index level
biden_hashtags["tweet_number"] = biden_hashtags.index.get_level_values(0)
biden_hashtags.columns = ["hashtags", "tweet_number"]

# Attach the full tweet text back onto each extracted hashtag
biden_tweets["tweet_number"] = biden_tweets.index
biden_hashtags = biden_hashtags.merge(biden_tweets[["tweet_number", "tweet"]], how = "left", on = "tweet_number")
# Peek at the first few extracted hashtags alongside their source tweets
biden_hashtags.head()
| hashtags | tweet_number | tweet |
|---|---|---|
| #BuffettRule | 4 | News for you this morning: VP Biden will speak in Exeter, NH on 4/12 on tax fairness and the President’s support for the #BuffettRule. |
| #BuffettRule | 9 | The VP will be speaking in NH on Thursday about the #BuffettRule. Here’s what you need to know about it: http://t.co/sWu4EjD0 |
| #BuffettRule | 13 | Heads up: We’ll be livestreaming the Vice President’s New Hampshire speech on the #BuffettRule tomorrow. Watch this space for details. |
| #Exeter | 15 | RT @OFA_NH: #Exeter gets ready for @JoeBiden’s #BuffettRule speech here tomorrow: http://t.co/8YZjlEqe |
| #BuffettRule | 15 | RT @OFA_NH: #Exeter gets ready for @JoeBiden’s #BuffettRule speech here tomorrow: http://t.co/8YZjlEqe |
| #BuffettRule | 16 | Compare your tax rate to Mitt Romney’s: http://t.co/R0vmjGJe #BuffettRule |

Your ability to extract insights from text programmatically is strongly enhanced by command of regular expressions. This tool allows you to perform targeted inquiries about the presence or absence of particular components of text. Regular expressions can also be a way to extract numerical or categorical features from text descriptions as long as you are reasonably confident that the way those features appear across responses follows a consistent pattern.
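For instance, here is a small sketch of that idea on a made-up column of listing descriptions (the listings data frame is invented, and the pattern reuses the price regex from earlier); the regular expression turns a consistently formatted price mention into a numeric column:

import pandas as pd

# Made-up listing descriptions with a consistently formatted price
listings = pd.DataFrame({
    "description": ["Bookshelf, solid oak, $120.50", "Standing desk for $1,350", "Curb alert: free couch"]
})

# Capture the digits (with optional commas and cents) after the dollar sign,
# strip the commas, and convert the result to a numeric feature
listings["price"] = (
    listings["description"]
    .str.extract(r"\$(\d{1,3}(?:,\d{3})*(?:\.\d{1,2})?)")[0]
    .str.replace(",", "", regex=False)
    .astype(float)
)

print(listings)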

In terms of model construction, you may continue to use the {sklearn} steps from the sklearn.feature_extraction.text submodule that appeared in the previous notebook to engineer new model features. You may also create custom feature engineering functions for your modeling Pipelines(), but doing so is outside of my expertise. You can read more about custom feature engineering steps here, from Andrew Villazon.
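As a minimal sketch of that first option (the tiny posts series and its binary labels below are invented, and this is just one possible setup rather than code from the previous notebook), a CountVectorizer step can slot into a Pipeline() ahead of a model; note that its token_pattern argument is itself a regular expression:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy data: short posts and a made-up binary label
posts = pd.Series(["Desk for sale $40", "Lost cat near Main St", "Selling lamp $15", "Found keys downtown"])
labels = [1, 0, 1, 0]

# CountVectorizer turns raw text into token counts; token_pattern is a regex
# (the value shown here is the default, written out to make the regex visible)
text_model = Pipeline([
    ("vectorize", CountVectorizer(token_pattern=r"(?u)\b\w\w+\b")),
    ("classify", LogisticRegression())
])

text_model.fit(posts, labels)
print(text_model.predict(pd.Series(["Selling a desk, $25"])))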


Summary

We’ll do more with regular expressions in our next class meeting. For now, you’ve seen how to use regular expressions for pattern matching. We can use regular expressions to filter observations or to extract particular components from text.