Purpose: In this notebook we’ll continue to expand our ability to extract insights from text data using a pattern-matching tool called regular expressions, or regex.

The Big Idea

Often, when working with text, we know the structure of the token we are looking for but not the exact token. For example, we may be digging through listings of items for sale on a social media marketplace and want to know the cost of each item. We know that we are looking for a dollar sign ($) followed by some digits, but we don’t know the exact price. This isn’t much of a problem for humans reading through a handful of postings, but to get a computer to do it, we need some way to describe what a typical “price” looks like. This is where regular expressions come in.

Building regex

Before we move forward: unless you work with regular expressions quite often, it’s likely that you’ll find them mysterious and that you’ll forget the syntax for matching things like digits. Here’s a link to a regex cheat sheet from Dave Child and a link to an interactive regex applet so that you can check that your regular expressions are working as you intend.

Regular expressions are also something you might ask your favorite AI assistant for help with. For example, I asked ChatGPT, “Hey chatGPT, can you help me write a regular expression that will match dollar values in a posting? In particular, these dollar values may have commas separating thousands or millions of dollars and could contain decimals for cents. Thank you!” ChatGPT returned the regular expression \$\d{1,3}(?:,\d{3})*(?:\.\d{1,2})?, which works as intended according to the regex applet from the previous paragraph.
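As a quick check, here is a minimal sketch of that pattern applied with {stringr}. The postings are invented for illustration and are not from any real marketplace. Note that inside an R string each backslash in the pattern has to be doubled.

```r
library(stringr)

# The pattern from the text, written as an R string (each backslash doubled)
price_pattern <- "\\$\\d{1,3}(?:,\\d{3})*(?:\\.\\d{1,2})?"

# Hypothetical marketplace postings, invented for illustration
postings <- c(
  "Mountain bike, lightly used, $1,250.50 or best offer",
  "Couch for $45, pickup only",
  "Free moving boxes"
)

# One result per posting; NA where no price is found
prices <- str_extract(postings, price_pattern)
# "$1,250.50" "$45" NA

# Caveat: a price written without commas is cut off after three digits
truncated <- str_extract("Laptop $1500", price_pattern)
# "$150"
```

The last line shows one behavior to be aware of: the pattern expects thousands to be separated by commas, so it is worth testing against the way prices are actually written in your data.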

How Do We Implement and Use regex?

In R, we can apply regular expressions in several ways. We can

  • tokenize words (or bigrams) and then filter the results using a regular expression and the str_detect() function.
  • tokenize sentences (or paragraphs, or chapters, etc.) and filter results using a regular expression and the str_detect() function.
  • use str_extract() on raw text to extract subsections matching a pattern.
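The three approaches in the list can be sketched on a small example. This assumes {dplyr}, {stringr}, and {tidytext} are installed. The `posts` data frame and its contents are hypothetical stand-ins for a real corpus.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical toy data standing in for a real corpus
posts <- tibble(
  id = 1:3,
  text = c(
    "Selling a bike for 150 dollars",
    "Free couch, pick up today",
    "Two chairs, 40 dollars each. Desk is 85 dollars."
  )
)

# 1. Tokenize into words, then keep tokens made up entirely of digits
number_tokens <- posts %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, "^\\d+$"))

# 2. Tokenize into sentences, then keep sentences that contain a digit
number_sentences <- posts %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  filter(str_detect(sentence, "\\d"))

# 3. Extract the first run of digits directly from the raw text
first_numbers <- posts %>%
  mutate(first_number = str_extract(text, "\\d+"))
```

In the first two approaches the regular expression acts as a filter on tokens. In the third it pulls the matching piece out of the untokenized text, returning NA where nothing matches.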

There’s a lot we can do, and you’ll find more use cases as you continue to explore working with text.

Let’s consider an example. I’ll load this dataset containing Joe Biden’s tweets from 2007 to 2020, which was posted to Kaggle by user Vopani in 2020. We can see the first few tweets below.

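# Assumes dplyr, knitr, and kableExtra are attached and that biden_tweets
# has already been read in from the Kaggle download.
biden_tweets %>%
  select(-id, -url) %>%   # drop the identifier and link columns
  head() %>%              # first six tweets
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))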

| timestamp | tweet | replies | retweets | quotes | likes |
|---|---|---|---|---|---|
| 2007-10-24 22:45:00 | Tune in 11:30 ET tomorrow for a live webcast of Families USA Presidential Forum on health care: http://presidentialforums.health08.org/ | 19 | 5 | 17 | 11 |
| 2007-12-29 15:35:00 | Iowans, there’s a good chance there’s a Biden near you today on a cool 14 F day: http://blog.joebiden.com/?p=1625 | 13 | 16 | 6 | 22 |
| 2012-04-09 09:42:00 | We’re excited to announce that @JoeBiden is being rebooted for the 2012 campaign season to give you news of the Vice President on the trail. | 21 | 82 | 1 | 20 |
| 2012-04-09 09:43:00 | Campaign staff will run this account to keep you up to date on what the VP’s up to, but you’ll see occasional tweets from Joe himself, too. | 144 | 76 | 37 | 51 |
| 2012-04-09 13:11:00 | News for you this morning: VP Biden will speak in Exeter, NH on 4/12 on tax fairness and the President’s support for the #BuffettRule. | 10 | 54 | 0 | 5 |
| 2012-04-09 13:25:00 | In NH on 4/12, the Vice President will give his take on why millionaires shouldn’t pay a lower tax rate than middle class families do. | 16 | 52 | 0 | 6 |

Perhaps we want to extract the first hashtag appearing in each of the tweets posted by @JoeBiden. The first ten tweets containing hashtags are shown below, along with the hashtag extracted from each.

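biden_tweets %>%
  # "#" followed by one or more letters, digits, underscores, or hyphens.
  # (The range [A-z] would also match the characters [ \ ] ^ and `.)
  mutate(hashtags = str_extract(tweet, "#[A-Za-z0-9_-]+")) %>%
  select(tweet, hashtags) %>%
  filter(!is.na(hashtags)) %>%   # keep only tweets with at least one hashtag
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))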

| tweet | hashtags |
|---|---|
| News for you this morning: VP Biden will speak in Exeter, NH on 4/12 on tax fairness and the President’s support for the #BuffettRule. | #BuffettRule |
| The VP will be speaking in NH on Thursday about the #BuffettRule. Here’s what you need to know about it: http://t.co/sWu4EjD0 | #BuffettRule |
| Heads up: We’ll be livestreaming the Vice President’s New Hampshire speech on the #BuffettRule tomorrow. Watch this space for details. | #BuffettRule |
| RT @OFA_NH: #Exeter gets ready for @JoeBiden’s #BuffettRule speech here tomorrow: http://t.co/8YZjlEqe | #Exeter |
| Compare your tax rate to Mitt Romney’s: http://t.co/R0vmjGJe #BuffettRule | #BuffettRule |
| The Vice President is speaking about the #BuffettRule in Exeter, New Hampshire at 12:15pm ET—tune in here: http://t.co/QK0yYL8X | #BuffettRule |
| Watch Vice President Biden’s speech live: http://t.co/uYcZa0vK #BuffettRule | #BuffettRule |
| RT @Obama2012: “The President and I believe in a fair shot and a fair shake.”—VP @JoeBiden #BuffettRule | #BuffettRule |
| Here’s the full video of VP Biden’s speech on the #BuffettRule in New Hampshire this morning: http://t.co/RFhqv9o8 | #BuffettRule |
| Join Vice President Biden and special guests for a #Gen44 event in D.C. on April 17th: http://t.co/8SUwSQoW | #Gen44 |
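A note on character classes for patterns like this one: the range `A-z` spans more than the letters, because a handful of punctuation characters sit between `Z` and `a` in ASCII. The sketch below shows the difference. The sample string is invented for illustration.

```r
library(stringr)

# In ASCII, the range A-z also covers [ \ ] ^ _ and ` (codes 91 to 96)
punct_matches <- str_detect(c("[", "^", "_", "`"), "[A-z]")
# TRUE TRUE TRUE TRUE

# Hypothetical string, invented to show the difference
sample_text <- "Read more: #fair^shot"

loose  <- str_extract(sample_text, "#[A-z]+")     # "#fair^shot"
strict <- str_extract(sample_text, "#[A-Za-z]+")  # "#fair"
```

Writing the two letter ranges separately, as in `[A-Za-z]`, avoids picking up that punctuation by accident.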

Perhaps we want all of the hashtags appearing in each tweet. Below is one way to do that, though I am sure there is a more efficient way.

biden_tweets %>%
  # First hashtag per tweet, used only to drop tweets without any hashtag
  mutate(hashtags = str_extract(tweet, "#[A-Za-z0-9_-]+")) %>%
  select(tweet, hashtags) %>%
  filter(!is.na(hashtags)) %>%
  # str_extract_all() returns every match, stored as a list column
  mutate(hashtags = str_extract_all(tweet, "#[A-Za-z0-9_-]+")) %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))

| tweet | hashtags |
|---|---|
| News for you this morning: VP Biden will speak in Exeter, NH on 4/12 on tax fairness and the President’s support for the #BuffettRule. | #BuffettRule |
| The VP will be speaking in NH on Thursday about the #BuffettRule. Here’s what you need to know about it: http://t.co/sWu4EjD0 | #BuffettRule |
| Heads up: We’ll be livestreaming the Vice President’s New Hampshire speech on the #BuffettRule tomorrow. Watch this space for details. | #BuffettRule |
| RT @OFA_NH: #Exeter gets ready for @JoeBiden’s #BuffettRule speech here tomorrow: http://t.co/8YZjlEqe | #Exeter, #BuffettRule |
| Compare your tax rate to Mitt Romney’s: http://t.co/R0vmjGJe #BuffettRule | #BuffettRule |
| The Vice President is speaking about the #BuffettRule in Exeter, New Hampshire at 12:15pm ET—tune in here: http://t.co/QK0yYL8X | #BuffettRule |
| Watch Vice President Biden’s speech live: http://t.co/uYcZa0vK #BuffettRule | #BuffettRule |
| RT @Obama2012: “The President and I believe in a fair shot and a fair shake.”—VP @JoeBiden #BuffettRule | #BuffettRule |
| Here’s the full video of VP Biden’s speech on the #BuffettRule in New Hampshire this morning: http://t.co/RFhqv9o8 | #BuffettRule |
| Join Vice President Biden and special guests for a #Gen44 event in D.C. on April 17th: http://t.co/8SUwSQoW | #Gen44 |
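One possible shortcut, offered as a sketch rather than the definitive approach, is to filter with str_detect() and then call str_extract_all() once. The toy data below reuses two tweets from the table and adds one invented tweet with no hashtag. The name `hashtag_pattern` and its exact character class are assumptions made for this example.

```r
library(dplyr)
library(stringr)

# Two tweets from the table above, plus one hypothetical tweet with no hashtag
toy_tweets <- tibble(
  tweet = c(
    "RT @OFA_NH: #Exeter gets ready for @JoeBiden’s #BuffettRule speech here tomorrow: http://t.co/8YZjlEqe",
    "Join Vice President Biden and special guests for a #Gen44 event in D.C. on April 17th: http://t.co/8SUwSQoW",
    "No hashtags in this one."
  )
)

# Assumed pattern: "#" followed by letters, digits, underscores, or hyphens
hashtag_pattern <- "#[A-Za-z0-9_-]+"

all_hashtags <- toy_tweets %>%
  filter(str_detect(tweet, hashtag_pattern)) %>%        # drop tweets with no hashtag
  mutate(
    hashtags   = str_extract_all(tweet, hashtag_pattern),  # list column of all matches
    n_hashtags = lengths(hashtags)                          # number of hashtags per tweet
  )
```

Storing the pattern in a variable keeps the filter and the extraction in sync if the pattern changes later.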

Your ability to extract insights from text programmatically is strongly enhanced by command of regular expressions. This tool allows you to perform targeted inquiries about the presence or absence of particular components of text. Regular expressions can also be a way to extract numerical or categorical features from text descriptions as long as you are reasonably confident that the way those features appear across responses follows a consistent pattern.

In terms of model construction, you may use some of the {textrecipes} steps from the previous notebook in conjunction with a custom step_mutate() which utilizes regular expressions to engineer new model features.
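As a rough sketch of that idea, the recipe below uses step_mutate() from {recipes} to build two regex-based features. The `listings` data, the feature names, and the patterns are all hypothetical. In a real workflow these steps would sit alongside whichever {textrecipes} steps you are already using.

```r
library(dplyr)
library(recipes)

# Hypothetical listings, invented for illustration
listings <- data.frame(
  description = c(
    "Road bike for $300 #cycling #forsale",
    "Free couch, must pick up today"
  ),
  stringsAsFactors = FALSE
)

feature_recipe <- recipe(~ description, data = listings) %>%
  step_mutate(
    # as.character() guards against the column being stored as a factor
    n_hashtags = stringr::str_count(as.character(description), "#\\w+"),
    has_price  = as.integer(stringr::str_detect(as.character(description), "\\$\\d"))
  )

# Estimate the recipe, then return the processed training data
engineered <- bake(prep(feature_recipe), new_data = NULL)
```

Here `n_hashtags` counts hashtag-like tokens and `has_price` flags descriptions that contain a dollar sign followed by a digit. Either feature is only as reliable as the consistency of the pattern across your text.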


Summary

We’ll do more with regular expressions in our next class meeting. For now, you’ve seen how to use regular expressions for pattern matching. We can use regular expressions to filter observations or to extract particular components from text.