At our Week 1 meeting, participants expressed an interest in analyzing real-time data from Twitter. In this assignment you will obtain credentials for Twitter’s API and then use them to pull recent tweets involving Alexandria Ocasio-Cortez (@AOC, @repAOC, #AOC) and Marjorie Taylor-Greene (@mtgreene, @repMTG, #MTG).

  1. In order to access Twitter’s API, you’ll need credentials. Head over to Developer.Twitter.com – once there, do the following (or follow along with Karolina).
    • Click the button in the top-right to Apply for credentials.
    • Click the button to Apply for a Developer Account.
    • Log in with your regular Twitter account info (unfortunately you’ll need a Twitter account to do this).
    • Fill in the required fields as honestly as possible. Essentially we are requesting this account/building this app to gain experience with NLP. We’ll be applying sentiment analysis to tweets.
    • Once you’ve filled in all of the required fields, View and Accept the Terms and Conditions.
    • You’ll get an email once your account has been authorized – click the link in the message to confirm and then log back into Developer.Twitter.com
    • Click the Projects & Apps link and then click on the button to Create an App – again, fill in the required fields as honestly as possible. We are using this app to connect to the Twitter API and to practice sentiment analysis and other NLP tools with tweets.
    • Click the Create button and then click on the link to keys and tokens. You’ll need the consumer keys and authentication tokens; if you don’t see them displayed, just click the Generate/Regenerate buttons. Keep these handy, since you’ll need them shortly.

Alright, now we are ready for the fun part. Open up a new R Markdown document (or R Script, if you prefer) and we’ll get started.

  1. We will need the following libraries for this tutorial: tidyverse, tidytext, devtools, reshape2, wordcloud, and rtweet. Run the following and, if prompted, type 1 and hit Enter/Return to update all necessary packages.

    install.packages("devtools")

    And then,

    devtools::install_github('mkearney/rtweet')

    Now load the tidyverse, tidytext, reshape2, wordcloud, and rtweet libraries with the library() command. You don’t need to load devtools.
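For reference, the loading step looks like the sketch below. The comments note what each package is used for later in the assignment.

```r
# Load everything we'll use in this tutorial; devtools was only
# needed for the GitHub install and doesn't need to be loaded
library(tidyverse)  # data wrangling and ggplot2
library(tidytext)   # unnest_tokens(), stop_words, get_sentiments()
library(reshape2)   # acast(), used later for the comparison cloud
library(wordcloud)  # comparison.cloud()
library(rtweet)     # search_tweets() and friends
```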

  2. Let’s pass the authentication credentials that R will need to access the Twitter API and then create our initial connection. You might want to set the echo parameter for this code chunk to FALSE – this will prevent the code from being displayed in your markdown document. We are all pretty trustworthy people here, but it’s best not to put your access credentials on display.

    api_key <- "COPY_PASTE_CONSUMER_KEY"
    api_secret <- "COPY_PASTE_CONSUMER_SECRET"
    access_token <- "COPY_PASTE_AUTHORIZATION_KEY"
    access_secret <- "COPY_PASTE_AUTHORIZATION_SECRET"
    
    token <- create_token(app = "PASTE_YOUR_APP_NAME",
                          consumer_key = api_key,
                          consumer_secret = api_secret,
                          access_token = access_token,
                          access_secret = access_secret)

Now that you’ve done this, R has stored your credentials for future use. The rtweet package should be able to find this token if you want to connect to the Twitter API in another R session.

  1. Let’s run a search for the most recent 1,000 tweets involving AOC. We’ll exclude retweets.

    #Allow the markdown document to access the token you 
    #created with create_token()
    auth_as("create_token")
    ## Reading auth from 'C:\Users\agilb\AppData\Roaming/R/rtweet/create_token.rds'
    #Search and store tweets including @AOC OR @repAOC OR #AOC
    AOC_tweets <- search_tweets("@AOC OR @repAOC OR #AOC", n = 1000, include_rts = FALSE)
    
    #View the head of the resulting data frame
    AOC_tweets %>% head()
    ## # A tibble: 6 x 73
    ##   status_id   created_at          user_id  screen_name text             source  
    ##   <chr>       <dttm>              <chr>    <chr>       <chr>            <chr>   
    ## 1 1402058280~ 2021-06-08 00:21:57 40353407 ElieNYC     "Well, @mayawil~ Twitter~
    ## 2 1401924717~ 2021-06-07 15:31:13 17642330 YALiberty   "Socialists lik~ Twitter~
    ## 3 1401780566~ 2021-06-07 05:58:25 4540710~ JillWineBa~ "Good news for ~ Twitter~
    ## 4 1402421591~ 2021-06-09 00:25:37 2394394~ flacademtb  "@ratemyskypero~ Twitter~
    ## 5 1402421589~ 2021-06-09 00:25:37 7307793~ NDePuy      "@CleenMister @~ Twitter~
    ## 6 1402421577~ 2021-06-09 00:25:34 2491577~ mgon920     "@OffBeat_eimma~ Twitter~
    ## # ... with 67 more variables: display_text_width <dbl>,
    ## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
    ## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
    ## #   favorite_count <int>, retweet_count <int>, quote_count <int>,
    ## #   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
    ## #   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
    ## #   media_t.co <list>, media_expanded_url <list>, media_type <list>,
    ## #   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
    ## #   ext_media_type <chr>, ext_alt_text <list>, mentions_user_id <list>,
    ## #   mentions_screen_name <list>, lang <chr>, quoted_status_id <chr>,
    ## #   quoted_text <chr>, quoted_created_at <dttm>, quoted_source <chr>,
    ## #   quoted_favorite_count <int>, quoted_retweet_count <int>,
    ## #   quoted_user_id <chr>, quoted_screen_name <chr>, quoted_name <chr>,
    ## #   quoted_followers_count <int>, quoted_friends_count <int>,
    ## #   quoted_statuses_count <int>, quoted_location <chr>,
    ## #   quoted_description <chr>, quoted_verified <lgl>, retweet_status_id <chr>,
    ## #   retweet_text <chr>, retweet_created_at <dttm>, retweet_source <chr>,
    ## #   retweet_favorite_count <int>, retweet_retweet_count <int>,
    ## #   retweet_user_id <chr>, retweet_screen_name <chr>, retweet_name <chr>,
    ## #   retweet_followers_count <int>, retweet_friends_count <int>,
    ## #   retweet_statuses_count <int>, retweet_location <chr>,
    ## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
    ## #   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
    ## #   country_code <chr>, geo_coords <list>, coords_coords <list>,
    ## #   bbox_coords <list>, status_url <chr>

Your result may have returned fewer than the 1,000 tweets we requested. This is because Twitter’s standard search API only goes back about 6 to 9 days. There are premium search APIs for Twitter that go back 30 days or longer, but they are paid services. That being said, you can see that we get lots of information on each of the tweets we were able to retrieve – 73 columns’ worth! We can see the text of the most recent 10 tweets by running:

AOC_tweets %>%
  pull(text) %>%
  head(n = 10)
##  [1] "Well, @mayawiley has now been endorsed by @AOC, @ewarren, and me. So... I'm now bringing down the curve on the entire endorsement class, something Professor Warren will probably recognize as my typical move. :)"                                                                                  
##  [2] "Socialists like @AOC don’t actually want solutions to problems. They want the system to fail entirely and seize absolute power.\n\nRead a history book. https://t.co/zQq6Cy0K93"                                                                                                                     
##  [3] "Good news for former #SisterInLaw @mayawiley. @AOC   endorsed her for New York City mayor! That should give a boost to her campaign. This is Maya and me pre-covid at NY's Strand Bookstore where she did a fantastic job interviewing me about my memoir #TheWatergateGirl. https://t.co/46KtqNV3Ae"
##  [4] "@ratemyskyperoom @AOC And lower the volume..."                                                                                                                                                                                                                                                       
##  [5] "@CleenMister @JayLouis @lavern_spicer @AOC Oh, boy. This calls for a boat parade!"                                                                                                                                                                                                                   
##  [6] "@OffBeat_eimmaJ @erehm @NestoPb @ThePoliterate @AOC And I'm guessing you don't just type like that on Tweeter."                                                                                                                                                                                      
##  [7] "@tomwatson @chrislhayes @AOC @chrislhayes is a terrorists he accused Biden of rape."                                                                                                                                                                                                                 
##  [8] "@AOC just called out @JoeManchinWV reasoning for being against the For The People Act….he wouldn’t be able to line his pockets with #darkmoney from filthy rich #KochBrothers.  \n#NailedIt #DarkMoneyJoe #NoIntegrity"                                                                              
##  [9] ".@AOC is speaking straight facts on @allinwithchris right now <U+0001F44F><U+0001F3FD><U+0001F44F><U+0001F3FD><U+0001F44F><U+0001F3FD>"                                                                                                                                                              
## [10] "@CleenMister @TheGrimRaptor @RelaxedMomma @lavern_spicer @AOC LOL, remember this cover?  Oh wait, it's a fake: https://t.co/vMacdm7Od1"

  1. Use similar code to extract the most recent 1,000 tweets involving Marjorie Taylor-Greene.
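In case you get stuck, this is a one-line change from the AOC search; MTG_tweets is just a suggested name for the result.

```r
# Search and store tweets including @mtgreene OR @repMTG OR #MTG,
# again excluding retweets
MTG_tweets <- search_tweets("@mtgreene OR @repMTG OR #MTG",
                            n = 1000, include_rts = FALSE)
```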

Now that we have our two sets of tweets, we can prepare the data for sentiment analysis. There’s lots of great information in these data frames, but we will stick to the text column. We’ll keep each individual tweet as its own document, which means we won’t need to group_by() or mutate() anything before proceeding to unnest_tokens(). We should, however, reduce the number of columns we are looking at.

  1. Create tidy_AOC_tweets and tidy_MTG_tweets by first select()ing the status_id, created_at, screen_name, and text columns and then passing the result to unnest_tokens() to generate a new word column from the existing text column.

  2. Now that you’ve tokenized your tweets, generate frequency count()s for the words in each data frame. What are the most common words used in tweets involving AOC? What are the most common words used in tweets involving MTG? You can choose to display your results as a table or to use a visual, like a bar graph.

  3. The majority of these words are pretty innocuous. This is why we often eliminate stop_words from a corpus before working with it. Reproduce your earlier code, but add an anti_join(stop_words) to your pipeline before computing the word counts.
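If you’d like to check your work on the three steps above, a sketch for the AOC tweets appears below (the MTG version is identical with the names swapped). The top-20 cutoff and the bar graph are just one choice of display.

```r
# Step 1: keep four columns, then tokenize the text into one word per row
tidy_AOC_tweets <- AOC_tweets %>%
  select(status_id, created_at, screen_name, text) %>%
  unnest_tokens(word, text)

# Steps 2 and 3: remove stop words, count word frequencies,
# and display the 20 most common words as a bar graph
tidy_AOC_tweets %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  slice(1:20) %>%
  ggplot() +
  geom_col(aes(x = n, y = reorder(word, n))) +
  labs(title = "Most Common Words",
       subtitle = "Tweets involving AOC",
       x = "Count", y = NULL)
```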

That’s a little better, though some artifacts could still be cleaned up. For example, t.co is the domain of Twitter’s own link shortener, and http/https often show up in links as well. We can remove those either by filtering them out or by adding rows to our list of stop_words. The additional filtering looks something like what appears below, where the third line says we would like to keep only rows where the word is not (!) in (%in%) the list containing mtg, repmtg, t.co, http, and https.

tidy_MTG_tweets %>%
  anti_join(stop_words) %>%
  filter(!(word %in% c("mtg", "repmtg", "t.co", "http", "https"))) %>%
  count(word) %>%
  arrange(desc(n)) %>%
  slice(1:30) %>%
  ggplot() + 
  geom_col(aes(x = n, y = reorder(word, n))) +
  labs(title = "Most Common Words", subtitle = "Tweets involving Marj", x = "Count", y = NULL)

Now let’s think about a sentiment analysis on the two sets of tweets.

  1. Let’s try to mimic what is being done on page 18 of our textbook to get a net_sentiment score for each tweet. You won’t need to pass an index to the count() function. Instead, try grouping by status_id before creating the counts.
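A sketch of that approach for the AOC tweets, using the bing lexicon as the textbook does; note that pivot_wider() is the current tidyr replacement for the spread() call on that page, and AOC_net_sentiment is just a suggested name.

```r
# Net sentiment per tweet: count bing-scored words by tweet and sentiment,
# spread to one row per tweet, then take positive minus negative
AOC_net_sentiment <- tidy_AOC_tweets %>%
  inner_join(get_sentiments("bing")) %>%
  count(status_id, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)

# A histogram makes the two distributions easy to compare
AOC_net_sentiment %>%
  ggplot() +
  geom_histogram(aes(x = net_sentiment), binwidth = 1)
```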

How do the two distributions compare? Let’s look at a wider array of sentiments with the nrc dictionary.

  1. For each individual set of tidy tweets, use an inner_join() with the nrc sentiment dictionary, group the resulting data frames by sentiment and produce counts for each sentiment. What are the most prevalent emotions in each set of tweets?
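One way to do this for the AOC tweets (the first time you call get_sentiments("nrc") you may be prompted to download the lexicon):

```r
# Attach NRC emotions to each word, then tally words per sentiment;
# count() groups and counts in a single step
tidy_AOC_tweets %>%
  inner_join(get_sentiments("nrc")) %>%
  count(sentiment, sort = TRUE)
```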

Well, that was insightful! Let’s end this assignment by creating a comparison word cloud.

  1. Mimic the code our textbook uses to build a comparison cloud of positive and negative words used within each set of tidy tweets. Try doing this with the bing lexicon first and then switch to nrc to see if you can compare a different pair of emotions.
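The textbook’s pattern, adapted to one set of tidy tweets, looks roughly like this; the colors and max.words values are just choices.

```r
# Build a word-by-sentiment matrix with acast(), then draw a
# comparison cloud of positive vs. negative words (bing lexicon)
tidy_AOC_tweets %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"), max.words = 100)
```

For the nrc version, filter the joined data frame down to the pair of emotions you want before counting, e.g. filter(sentiment %in% c("anger", "joy")).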

Given the positions of these two Representatives, it is likely that the word “trump” snuck in there as a positive word. It’s pretty likely that when these tweets reference “trump” they are indicating the Former Guy rather than the noun/verb. Try reproducing your word clouds, but filter out “trump”, since the sentiment dictionary is incorrectly interpreting that word.
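Dropping the word before the sentiment join is enough. For example:

```r
# Remove "trump" from the tokens before joining to the lexicon so the
# dictionary can't score it as a positive word
tidy_AOC_tweets %>%
  filter(word != "trump") %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)
```

From here, pipe the result into acast() and comparison.cloud() just as before.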

Final Thoughts

There it is! You’ve accomplished a lot here. You’ve gained the ability to pull tweets from Twitter’s search API, and you’ve performed a sentiment analysis on tweets from, at, and mentioning two United States Representatives. You can now take these newfound superpowers and apply them to different Twitter topics, users, hashtags, and more. There’s also a lot more we can do with data from Twitter, including topic modeling and analyzing the social network that lies beneath the surface (looking at the web of tweets, retweets, replies, and mentions to identify popular, influential, or important players within particular conversations). We’ll probably revisit data from Twitter later on in the workshop.
