This homework assignment is intended as an opportunity to practice working with n-grams, computing word-pair correlations, and visualizing co-occurrences via a graph structure.
In order to complete this assignment you’ll need to load the following libraries into an R Markdown document or an R Script: tidyverse, tidytext, widyr, igraph, ggraph.
This week let’s work with text from presidential speeches, which you can find available here. I’ve chosen several speeches given by Donald Trump between January and February 2020, including the following:
You can feel free to explore others. When you click on the data source from the main page, make sure you collect the link to the raw text file (you can get there using the icon/button on the right of the page which looks like a document with the symbols >_ on it).
Open a new R script or R Markdown file (your choice), and load the following libraries: tidyverse, tidytext, widyr, igraph, and ggraph.
We’ll read in the data from the Trump speeches:
soleimani <- read_delim("https://code.librehq.com/kfogel/presidential-speeches/-/raw/main/data/2020-01-03-remarks-killing-qasem-soleimani.txt", 
                        delim = "\n", 
                        col_names = "text", 
                        skip = 1)
marchLife <- read_delim("https://code.librehq.com/kfogel/presidential-speeches/-/raw/main/data/2020-01-24-speech-march-life.txt", 
                        delim = "\n", 
                        col_names = "text", 
                        skip = 1)
SoU_20 <- read_delim("https://code.librehq.com/kfogel/presidential-speeches/-/raw/main/data/2020-02-04-state-union-address.txt", 
                     delim = "\n", 
                     col_names = "text", 
                     skip = 1)
acquittal <- read_delim("https://code.librehq.com/kfogel/presidential-speeches/-/raw/main/data/2020-02-06-remarks-after-his-acquittal.txt", 
                        delim = "\n", 
                        col_names = "text", 
                        skip = 1)Note that the delim = "\n" argument says that delimeters here are “newlines”, col_names = "text" says that we want our sole column to be called text, and skip = 1 denotes that we would like to skip the first line, which reads “President: Donald Trump” in all cases.
Let’s add some information to each of our speech data frames. For each speech, we will add the context in which the speech was presented and line numbers. I’ll do this for the soleimani speech and you’ll mimic the code to do the same for the other three speeches (or the speeches you’ve decided to work with).
soleimani <- soleimani %>%
  mutate(speech = "Soleimani",
         linenumber = row_number())
marchLife <- marchLife %>%
  mutate(speech = "March for Life",
         linenumber = row_number())
SoU_20 <- SoU_20 %>%
  mutate(speech = "State of the Union 2020",
         linenumber = row_number())
acquittal <- acquittal %>%
  mutate(speech = "First Impeachment Acquittal",
         linenumber = row_number())Let’s now stack the separate data frames into a single data frame. We can do this because we’ve added the speech column, so that we are still able to track which lines are from which speeches. You can use the bind_rows() function and pass the four individual speech data frames as arguments to it. Store your result in a new data frame called speeches.
speeches <- bind_rows(acquittal, marchLife, soleimani, SoU_20)Now that we have a data frame containing the text from all four speeches, let’s tokenize into bigrams within each speech and then filter out any stopwords we encounter.
speeches data frame.group_by the speech columns.unnest_tokens() to extract bigrams from the text column.stop_words$word) – you may also want to add your own stop words such as audience, applause, laughter, etc.speeches_bigram_counts.speeches_bigram_counts <- speeches %>%
  group_by(speech) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!(word1 %in% c(stop_words$word, "audience", "applause", "laughter"))) %>%
  filter(!(word2 %in% c(stop_words$word, "audience", "applause", "laughter"))) %>%
  count(word1, word2, sort = TRUE) %>%
  ungroup()Alright, now that we have these bigrams, let’s see if we can visualize each speech via a graph. I’ll show how we can do this with the speech from the March for Life rally. See if you can adapt the code to visualize the other three speeches.
marchForLife_graph <- speeches_bigram_counts %>%
  filter(str_detect(speech, "March")) %>%
  select(word1, word2, n) %>%
  filter(n > 1) %>%
  graph_from_data_frame()
ggraph(marchForLife_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) + 
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) + 
  theme_void()Let’s now look at the entire corpus of these four Trump speeches (as if it were one document). We’ll look for correlations and then build a plot. This will happen in three steps, first we will build an object called speeches_by_sentence which tokenizes each sentence of each speech, second we will build an object called word_cors which computes the within-sentence correlation of word pairs, and third we will construct a plot of correlated words.
speeches_by_sentence
speeches data frame.speech variable.unnest_tokens() to tokenize by sentence – you’ll need to pass the parameter token = "sentence" for this.mutate() function to create a sentence_number variable, using row_number().speech and sentence_number.unnest_tokens() once more to tokenize the words in each sentence.speeches_by_sentence.word_cors
speeches_by_sentence data frame.word variable.n() >= 10 – the n() here tells R to compute the counts.pairwise_cor() function, passing it the arguments word, sentence_number, and sort = TRUE to compute the within-sentence word correlations.word_cors.word_cors object you just created.correlation at least 0.15.graph_from_data_fram() function to create a graph object.ggraph(layout = "fr").geom_edge_link() layer with the aesthetic edges_alpha = correlation, and set the parameter show.legend to FALSE.geom_node_point() layer and set the node color to "lightblue", and size to 5.geom_node_text() layer with the aesthetic label = name, and set repel = TRUE to avoid label overlaps.theme_void() with no arguments to produce the graph on top of a plain white background.speeches_by_sentence <- speeches %>%
  group_by(speech) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_number = row_number()) %>%
  group_by(speech, sentence_number) %>%
  unnest_tokens(word, sentence, token = "words") %>%
  ungroup() %>%
  filter(!(word %in% c(stop_words$word, "applause", "laughter")))
word_cors <- speeches_by_sentence %>%
  group_by(word) %>%
  filter(n() >= 10) %>%
  pairwise_cor(word, sentence_number, sort = TRUE)
word_cors %>%
  filter(correlation > 0.15) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()Recap: In this assignment we looked at word co-occurrences and correlations. This gave us a bit more insight into the context in which words appeared. Additionally, we saw how we can use graph structures to model topics which are discussed within a corpus. In that last plot of correlated words within sentences in Trump’s speeches, we can see some components of the graph which are specific to each individual speech. We can also identify a connected mass, however, indicating themes which span (or at least serve to connect) the four separate Trump speeches. We could engage in a similar analysis using all of the speeches Trump gave while in office. The large connected mass in the resulting topic graph would give us insight into Trump’s core policy interests, as they would pervade through most of his speaking engagements. I hope you found this assignment interesting!