At our Week 1 meeting, participants expressed an interest in analyzing real-time data from Twitter. In this assignment you will obtain credentials to access the Twitter live stream through Twitter’s API, and you’ll pull recent tweets involving Alexandria Ocasio-Cortez (@AOC, @repAOC, #AOC) and Marjorie Taylor Greene (@mtgreene, @repMTG, #MTG).
Alright, now we are ready for the fun part. Open up a new R Markdown document (or R Script, if you prefer) and we’ll get started.
We will need the following libraries for this tutorial: tidyverse, tidytext, devtools, and rtweet. Run the following and, if prompted, type 1 and hit Enter/Return to update all necessary packages.
install.packages("devtools")
And then,
devtools::install_github('mkearney/rtweet')
Now load the tidyverse, tidytext, reshape2, wordcloud, and rtweet libraries with the library() command. You don’t need to load devtools.
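Assuming all five packages installed successfully, the loading step looks like this:

```r
# Load the packages used throughout this tutorial
library(tidyverse)  # data wrangling and ggplot2
library(tidytext)   # unnest_tokens(), stop_words, sentiment lexicons
library(reshape2)   # acast(), used later for the comparison cloud
library(wordcloud)  # comparison.cloud()
library(rtweet)     # access to Twitter's API
```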
Let’s pass the authentication credentials that R will need to access the Twitter API and then create our initial connection. You might want to set the echo parameter for this code chunk to FALSE – this will prevent the code from being displayed in your markdown document. We are all pretty trustworthy people here, but it’s best not to put your access credentials on display.
api_key <- "COPY_PASTE_CONSUMER_KEY"
api_secret <- "COPY_PASTE_CONSUMER_SECRET"
access_token <- "COPY_PASTE_AUTHORIZATION_KEY"
access_secret <- "COPY_PASTE_AUTHORIZATION_SECRET"

token <- create_token(app = "PASTE_YOUR_APP_NAME",
                      consumer_key = api_key,
                      consumer_secret = api_secret,
                      access_token = access_token,
                      access_secret = access_secret)
Now that you’ve done this, R has stored your credentials for future use. The rtweet package should be able to find this token if you want to connect to the Twitter API in another R session.
Let’s run a search for the most recent 1,000 tweets involving AOC. We’ll exclude retweets.
#Allow the markdown document to access the token you
#created with create_token()
auth_as("create_token")
## Reading auth from 'C:\Users\agilb\AppData\Roaming/R/rtweet/create_token.rds'
#Search and store tweets including @AOC OR @repAOC OR #AOC
AOC_tweets <- search_tweets("@AOC OR @repAOC OR #AOC", n = 1000, include_rts = FALSE)

#View the head of the resulting data frame
AOC_tweets %>% head()
## # A tibble: 6 x 73
## status_id created_at user_id screen_name text source
## <chr> <dttm> <chr> <chr> <chr> <chr>
## 1 1402058280~ 2021-06-08 00:21:57 40353407 ElieNYC "Well, @mayawil~ Twitter~
## 2 1401924717~ 2021-06-07 15:31:13 17642330 YALiberty "Socialists lik~ Twitter~
## 3 1401780566~ 2021-06-07 05:58:25 4540710~ JillWineBa~ "Good news for ~ Twitter~
## 4 1402421591~ 2021-06-09 00:25:37 2394394~ flacademtb "@ratemyskypero~ Twitter~
## 5 1402421589~ 2021-06-09 00:25:37 7307793~ NDePuy "@CleenMister @~ Twitter~
## 6 1402421577~ 2021-06-09 00:25:34 2491577~ mgon920 "@OffBeat_eimma~ Twitter~
## # ... with 67 more variables: display_text_width <dbl>,
## # reply_to_status_id <chr>, reply_to_user_id <chr>,
## # reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## # favorite_count <int>, retweet_count <int>, quote_count <int>,
## # reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## # urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## # media_t.co <list>, media_expanded_url <list>, media_type <list>,
## # ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## # ext_media_type <chr>, ext_alt_text <list>, mentions_user_id <list>,
## # mentions_screen_name <list>, lang <chr>, quoted_status_id <chr>,
## # quoted_text <chr>, quoted_created_at <dttm>, quoted_source <chr>,
## # quoted_favorite_count <int>, quoted_retweet_count <int>,
## # quoted_user_id <chr>, quoted_screen_name <chr>, quoted_name <chr>,
## # quoted_followers_count <int>, quoted_friends_count <int>,
## # quoted_statuses_count <int>, quoted_location <chr>,
## # quoted_description <chr>, quoted_verified <lgl>, retweet_status_id <chr>,
## # retweet_text <chr>, retweet_created_at <dttm>, retweet_source <chr>,
## # retweet_favorite_count <int>, retweet_retweet_count <int>,
## # retweet_user_id <chr>, retweet_screen_name <chr>, retweet_name <chr>,
## # retweet_followers_count <int>, retweet_friends_count <int>,
## # retweet_statuses_count <int>, retweet_location <chr>,
## # retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## # place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## # country_code <chr>, geo_coords <list>, coords_coords <list>,
## # bbox_coords <list>, status_url <chr>
Your result may have returned fewer than the 1,000 tweets we requested. This is because Twitter’s standard search API goes back only about 6 to 9 days. There are some premium search APIs for Twitter that go back 30 days or longer, but they are paid services. That being said, you can see that we get lots of information on each of the tweets we were able to retrieve – 73 columns’ worth! We can see the text of the most recent 10 tweets by running:
AOC_tweets %>%
  pull(text) %>%
  head(n = 10)
## [1] "Well, @mayawiley has now been endorsed by @AOC, @ewarren, and me. So... I'm now bringing down the curve on the entire endorsement class, something Professor Warren will probably recognize as my typical move. :)"
## [2] "Socialists like @AOC don’t actually want solutions to problems. They want the system to fail entirely and seize absolute power.\n\nRead a history book. https://t.co/zQq6Cy0K93"
## [3] "Good news for former #SisterInLaw @mayawiley. @AOC endorsed her for New York City mayor! That should give a boost to her campaign. This is Maya and me pre-covid at NY's Strand Bookstore where she did a fantastic job interviewing me about my memoir #TheWatergateGirl. https://t.co/46KtqNV3Ae"
## [4] "@ratemyskyperoom @AOC And lower the volume..."
## [5] "@CleenMister @JayLouis @lavern_spicer @AOC Oh, boy. This calls for a boat parade!"
## [6] "@OffBeat_eimmaJ @erehm @NestoPb @ThePoliterate @AOC And I'm guessing you don't just type like that on Tweeter."
## [7] "@tomwatson @chrislhayes @AOC @chrislhayes is a terrorists he accused Biden of rape."
## [8] "@AOC just called out @JoeManchinWV reasoning for being against the For The People Act….he wouldn’t be able to line his pockets with #darkmoney from filthy rich #KochBrothers. \n#NailedIt #DarkMoneyJoe #NoIntegrity"
## [9] ".@AOC is speaking straight facts on @allinwithchris right now <U+0001F44F><U+0001F3FD><U+0001F44F><U+0001F3FD><U+0001F44F><U+0001F3FD>"
## [10] "@CleenMister @TheGrimRaptor @RelaxedMomma @lavern_spicer @AOC LOL, remember this cover? Oh wait, it's a fake: https://t.co/vMacdm7Od1"
Now that we have our two sets of tweets, we can prepare the data for sentiment analysis. There’s lots of great information in these data frames, but we will just keep to the text column. We’ll keep each individual tweet as its own document; this means we won’t need to group_by() or mutate() anything before proceeding to unnest_tokens(). We should, however, reduce the number of columns we are looking at.
Create tidy_AOC_tweets and tidy_MTG_tweets by first select()ing the status_id, created_at, screen_name, and text columns and then passing the result to unnest_tokens() to generate a new word column from the existing text column.
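As a sketch of one of the two pipelines (assuming MTG_tweets was created with a search_tweets() call analogous to the one for AOC_tweets):

```r
# Keep only the columns we need, then split each tweet into one row per word;
# unnest_tokens(word, text) names the new column "word" and consumes "text"
tidy_AOC_tweets <- AOC_tweets %>%
  select(status_id, created_at, screen_name, text) %>%
  unnest_tokens(word, text)
```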
Now that you’ve tokenized your tweets, generate frequency count()s for the words in each data frame. What are the most common words used in tweets involving AOC? What are the most common words used in tweets involving MTG? You can choose to display your results as a table or to use a visual, like a bar graph.
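The counting step might look like the sketch below; sort = TRUE returns the counts in descending order.

```r
# Most common words in tweets involving AOC
tidy_AOC_tweets %>%
  count(word, sort = TRUE) %>%
  head(n = 10)
```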
The majority of these words are pretty innocuous. This is why we often eliminate stop_words from a corpus before working with it. Reproduce your earlier code, but add an anti_join(stop_words) to your pipeline prior to computing the word counts.
That’s a little better. We see some artifacts that could still be cleaned up. For example, t.co is the prefix to Twitter’s shortened links, and http/https often show up in links as well. We can remove those by either filtering them out or adding rows to our list of stop_words. Doing this additional filtering looks something like what appears below, where the third line says we would like to filter to only include rows where the word is not (!) in (%in%) the list containing mtg, repmtg, t.co, http, and https.
tidy_MTG_tweets %>%
  anti_join(stop_words) %>%
  filter(!(word %in% c("mtg", "repmtg", "t.co", "http", "https"))) %>%
  count(word) %>%
  arrange(desc(n)) %>%
  slice(1:30) %>%
  ggplot() +
  geom_col(aes(x = n, y = reorder(word, n))) +
  labs(title = "Most Common Words", subtitle = "Tweets involving Marj", x = "Count", y = NULL)
Now let’s think about a sentiment analysis on the two sets of tweets. Compute a net_sentiment score for each tweet. You won’t need to pass an index to the count() function; instead, try grouping by status_id before creating the counts. How do the two distributions compare? Let’s look at a wider array of sentiments with the nrc dictionary.
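A sketch of that per-tweet score, assuming the bing lexicon that ships with tidytext (older tutorials use spread(); pivot_wider() is the current equivalent):

```r
# Label each word positive/negative, count per tweet, then take the difference
AOC_sentiment <- tidy_AOC_tweets %>%
  inner_join(get_sentiments("bing")) %>%
  count(status_id, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)
```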
Use an inner_join() with the nrc sentiment dictionary, group the resulting data frames by sentiment, and produce counts for each sentiment. What are the most prevalent emotions in each set of tweets? Well, that was insightful! Let’s end this assignment by creating a comparison word cloud.
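The emotion counts described above might be sketched as follows; the nrc lexicon tags words with emotions like anger, joy, and trust in addition to positive/negative.

```r
# Count how many words fall under each nrc sentiment category
tidy_AOC_tweets %>%
  inner_join(get_sentiments("nrc")) %>%
  count(sentiment, sort = TRUE)
```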
Use the bing library first and then switch to nrc to see if you can compare a different pair of emotions. Given the positions of these two Representatives, it is likely that the word “trump” snuck in there as a positive word. It’s pretty likely that when these tweets reference “trump” they are indicating the Former Guy rather than the common noun/verb. Try reproducing your word clouds, but filter out “trump”, since the sentiment dictionary is incorrectly interpreting that word.
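A sketch of the comparison cloud with the “trump” filter applied, assuming the bing lexicon and the reshape2 and wordcloud packages loaded earlier:

```r
# Build a word-by-sentiment count matrix and plot a comparison cloud
tidy_AOC_tweets %>%
  inner_join(get_sentiments("bing")) %>%
  filter(word != "trump") %>%   # drop the misclassified "trump"
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("firebrick", "steelblue"), max.words = 100)
```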
There it is! You’ve accomplished a lot here. You’ve gained the ability to pull tweets from Twitter’s search API, and you’ve performed a sentiment analysis on tweets from, at, and mentioning two United States Representatives. You can now take these newfound superpowers and apply them to different Twitter topics, users, hashtags, and more. There’s also a lot more we can do with data from Twitter, including topic modeling and analyzing the social network that lies beneath the surface (looking at the web of tweets, retweets, replies, and mentions to identify popular, influential, or important players within particular conversations). We’ll probably revisit data from Twitter later on in the workshop.