This homework set is intended to provide you with practice wrangling and visualizing data.

Feel free to ask questions or request help with errors in the Slack channel or during our weekly synchronous meeting.

For these problems, we will use the nycflights13 datasets, which can be installed and loaded using the commands

install.packages("nycflights13")
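library(nycflights13)
# Assumes the tidyverse is already installed; it provides %>%, dplyr, and ggplot2 for the code below.
library(tidyverse)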

You now have access to five data frames: airlines, airports, flights, weather, and planes. A portion of the flights data frame is shown below:

flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

You can read more about what each of the variables represents by running ?flights in the console.

  1. Visualize the departure delays in the flights data set using a histogram. You should relabel the axes and give your plot a title. What information have you learned from the plot?
flights %>%
  ggplot() +
  geom_histogram(aes(dep_delay), bins = 30) +
  labs(x = "Departure Delay (Minutes)", 
       y = "Count",
       title = "Histogram of Departure Delays from NYC Airports, 2013")

The vast majority of departure delays are close to 0 minutes, and the number of flights drops off quickly as the delay gets longer.
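A numeric summary can back up what the histogram suggests. The following is a minimal sketch, assuming dplyr and nycflights13 are installed; the 15-minute cutoff and the names delay_summary and share_within_15 are illustrative choices, not part of the assignment.

```r
library(dplyr)
library(nycflights13)

# Summarize departure delays, dropping flights with a missing dep_delay.
delay_summary <- flights %>%
  filter(!is.na(dep_delay)) %>%
  summarize(
    median_delay = median(dep_delay),
    # Share of flights that left within 15 minutes of schedule, early or late.
    # (15 is an arbitrary threshold chosen for illustration.)
    share_within_15 = mean(abs(dep_delay) <= 15)
  )

delay_summary
```

A share close to 1 would support the reading that most departures cluster near the scheduled time.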

  2. Create a new column in the data set called time_gained which is equal to the departure delay subtracted from the arrival delay. Visualize the time_gained variable using a box plot. Describe what you see.
flights %>%
  mutate(time_gained = arr_delay - dep_delay) %>%
  ggplot() +
  geom_boxplot(aes(time_gained)) +
  labs(x = "Time Gained During Flight (Minutes)", y = NULL)

The typical flight gains time in the air, and the vast majority of flights gain or lose only a small amount of time. (Because time_gained is computed as arr_delay - dep_delay, negative values mean the flight made up time in the air.) There are many outliers, mostly flights that lose a significant amount of time in the air. The data are skewed to the right: a flight can only gain so much time in the air, but it can lose significantly more.
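The box plot reading can be cross-checked with quartiles. In a right-skewed distribution the mean usually sits above the median, so comparing the two gives a rough numeric check. This is a sketch only, assuming dplyr and nycflights13 are installed; gain_summary and its column names are made up for illustration.

```r
library(dplyr)
library(nycflights13)

# Quartiles, median, and mean of time_gained (arr_delay - dep_delay).
# Negative values mean the arrival delay was smaller than the departure delay.
gain_summary <- flights %>%
  mutate(time_gained = arr_delay - dep_delay) %>%
  filter(!is.na(time_gained)) %>%
  summarize(
    q1_gain     = quantile(time_gained, 0.25, names = FALSE),
    median_gain = median(time_gained),
    q3_gain     = quantile(time_gained, 0.75, names = FALSE),
    mean_gain   = mean(time_gained)
  )

gain_summary
```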

  3. Do some airlines tend to have more arrival delays than others? Use side-by-side box plots to visualize the arrival delays of each airline.
flights %>%
  ggplot() + 
  geom_boxplot(aes(x = carrier, y = arr_delay)) +
  labs(x = "Airline Carrier", y = "Arrival Delay (Minutes)")

  4. Let’s answer the question above in a slightly different way. Calculate the median arrival delay of each airline using the median() function. (You’ll want to add the argument na.rm = TRUE inside the median() function to ignore the NA values in the data frame.) Sort the airline carriers by median arrival delay. Which airlines are consistently ahead of schedule? You can consult the airlines data frame to find the airline names that correspond to the abbreviations.
flights %>%
  group_by(carrier) %>%
  summarize(median_arr_delay = median(arr_delay, na.rm = TRUE)) %>%
  arrange(median_arr_delay) 
## # A tibble: 16 x 2
##    carrier median_arr_delay
##    <chr>              <dbl>
##  1 AS                   -17
##  2 HA                   -13
##  3 AA                    -9
##  4 VX                    -9
##  5 DL                    -8
##  6 9E                    -7
##  7 OO                    -7
##  8 UA                    -6
##  9 US                    -6
## 10 B6                    -3
## 11 WN                    -3
## 12 YV                    -2
## 13 EV                    -1
## 14 MQ                    -1
## 15 FL                     5
## 16 F9                     6

You can look up the airline names manually in the airlines data frame, or you can use the left_join() function to add the names:

flights %>%
  group_by(carrier) %>%
  summarize(median_arr_delay = median(arr_delay, na.rm = TRUE)) %>%
  arrange(median_arr_delay) %>%
  left_join(airlines, by = "carrier") %>%
  select(name, median_arr_delay)
## # A tibble: 16 x 2
##    name                        median_arr_delay
##    <chr>                                  <dbl>
##  1 Alaska Airlines Inc.                     -17
##  2 Hawaiian Airlines Inc.                   -13
##  3 American Airlines Inc.                    -9
##  4 Virgin America                            -9
##  5 Delta Air Lines Inc.                      -8
##  6 Endeavor Air Inc.                         -7
##  7 SkyWest Airlines Inc.                     -7
##  8 United Air Lines Inc.                     -6
##  9 US Airways Inc.                           -6
## 10 JetBlue Airways                           -3
## 11 Southwest Airlines Co.                    -3
## 12 Mesa Airlines Inc.                        -2
## 13 ExpressJet Airlines Inc.                  -1
## 14 Envoy Air                                 -1
## 15 AirTran Airways Corporation                5
## 16 Frontier Airlines Inc.                     6

Alaska and Hawaiian Airlines are consistently ahead of schedule. (Must be something about not being in the contiguous United States!) It would be interesting to see how the air_time of a flight correlates with the arr_delay. One thing that these airlines likely have in common is having long flight durations!
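The relationship between air_time and arr_delay raised above could be explored along these lines. This is only a sketch, assuming dplyr and nycflights13 are installed; carrier_summary and overall_cor are hypothetical names, and a per-carrier median is just one of several reasonable ways to look at the question.

```r
library(dplyr)
library(nycflights13)

# Median air time and median arrival delay per carrier,
# sorted so the carriers with the longest flights come first.
carrier_summary <- flights %>%
  group_by(carrier) %>%
  summarize(
    median_air_time  = median(air_time, na.rm = TRUE),
    median_arr_delay = median(arr_delay, na.rm = TRUE)
  ) %>%
  arrange(desc(median_air_time))

# Flight-level Pearson correlation, using only rows where both values are present.
overall_cor <- cor(flights$air_time, flights$arr_delay, use = "complete.obs")

carrier_summary
overall_cor
```

If the carriers at the top of carrier_summary also have the most negative median arrival delays, that would be consistent with the long-flight explanation, although it would not establish it.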

  5. From the weather data frame, visualize the temperature at the EWR airport during the entire month of January in 2013. Hint: Use the origin, year, and month variables to filter() the data frame, and use time_hour as the x coordinate in the plot.
weather %>%
  filter(origin == "EWR",
         year == 2013,
         month == 1) %>%
  ggplot() +
  geom_line(aes(x = time_hour, y = temp)) + 
  labs(x = NULL, y = "Temperature (°F)") +
  ggtitle("EWR Airport Temperatures January 2013")

  6. Make a bar plot which visualizes the top 30 destinations for planes which depart from JFK airport.
flights %>%
  filter(origin == "JFK") %>%
  count(dest) %>%
  arrange(desc(n)) %>%
  head(n = 30) %>%
  ggplot() +
  geom_col(aes(x = n, y = reorder(dest, n))) +
  labs(x = "Count", y = "Destination Airport")

  7. Pick some other variables that you are interested in from these data frames and make appropriate visualizations.
