This homework set is intended to provide you with practice wrangling and visualizing data.

Feel free to ask questions or for help with errors in the Slack channel or during our weekly synchronous meeting.

For these problems, we will use the nycflights13 datasets, which can be downloaded using the commands

install.packages("nycflights13")
library(nycflights13)

You now have access to five data frames: airlines, airports, flights, weather, and planes. A portion of the flights dataframe is shown below:

flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

You can read more about what each of the variables represents by running ?flights in the console.

  1. Visualize the departure delays in the flights data set using a histogram. You should relabel the axes and give your plot a title. What information have you learned from the plot?

  2. Create a new column in the data set called time_gained which is equal to the departure delay subtracted from the arrival delay. Visualize the time_gained variable using a box plot. Describe what you see.

  3. Do some airlines tend to have more arrival delays than others? Use side-by-side box plots to visualize the arrival delays of each airline.

  4. Let’s answer the question above in a slightly different way. Calculate the median arrival delays of each airline using the median() function. (You’ll want to add the argument na.rm = TRUE inside the median() function to ignore the NA values in the data frame.) Sort the airline carriers from the airlines by median arrival delay. Which airlines are consistently ahead of schedule? You can consult the airlines data frame to find the airline names that correspond to the abbreviations.

  5. From the weather data frame, visualize the temperature at the EWR airport during the entire month of January in 2013. Hint: Use the origin, year, and month variables to filter() the data frame, and use time_hour as the x coordinate in the plot.

  6. Make a bar plot which visualizes the top 30 destinations for planes which depart from JFK airport.

  7. Pick some other variables that you are interested in from these data frames and make appropriate visualizations.


Previous, R Markdown Template Next, Week Two