This homework set is intended to provide you with practice wrangling and visualizing data.
Feel free to ask questions or for help with errors in the Slack channel or during our weekly synchronous meeting.
For these problems, we will use the nycflights13 datasets, which can be downloaded by installing and loading the nycflights13 package.
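A minimal sketch of such commands, assuming the package is installed from CRAN in the usual way (the guard around install.packages() is an addition for convenience, so the package is not reinstalled every time):

```r
# Install the package once (assumes CRAN is reachable), then load it each session
if (!requireNamespace("nycflights13", quietly = TRUE)) {
  install.packages("nycflights13")
}
library(nycflights13)
```

The plotting and wrangling problems below will most likely also need packages such as ggplot2 and dplyr; that is an assumption about your setup rather than something this assignment states.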
You now have access to five data frames: airlines, airports, flights, weather, and planes. A portion of the flights data frame is shown below:
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
You can read more about what each of the variables represents by running ?flights in the console.
Visualize the departure delays in the flights data set using a histogram. You should relabel the axes and give your plot a title. What information have you learned from the plot?
Create a new column in the data set called time_gained which is equal to the departure delay subtracted from the arrival delay (that is, arr_delay - dep_delay). Visualize the time_gained variable using a box plot. Describe what you see.
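To make the direction of the subtraction concrete, here is a small sketch on made-up vectors. The values are illustrative only and are not meant as an answer to the problem:

```r
# Illustrative delays in minutes; these are not the assignment's results
dep_delay <- c(2, 4, -1)
arr_delay <- c(11, 20, -18)

# "Departure delay subtracted from arrival delay" means arr_delay - dep_delay
time_gained <- arr_delay - dep_delay
print(time_gained)
```

With this definition, a negative value means the flight made up time in the air, and a positive value means it lost time. In the full data frame, the same expression could be used when adding the column, for example inside dplyr's mutate(), assuming you are working with dplyr.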
Do some airlines tend to have more arrival delays than others? Use side-by-side box plots to visualize the arrival delays of each airline.
Let’s answer the question above in a slightly different way. Calculate the median arrival delay of each airline using the median() function. (You’ll want to add the argument na.rm = TRUE inside the median() function to ignore the NA values in the data frame.) Sort the airline carriers by median arrival delay. Which airlines are consistently ahead of schedule? You can consult the airlines data frame to find the airline names that correspond to the abbreviations.
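As a quick illustration of why na.rm = TRUE matters, here is a sketch on a small made-up vector (the name delays and its values are hypothetical):

```r
# A made-up vector of arrival delays with one missing value
delays <- c(-5, 12, NA, 3)

# Without na.rm, a single NA makes the result NA
median_with_na <- median(delays)

# With na.rm = TRUE, the NA is dropped before the median is computed
median_without_na <- median(delays, na.rm = TRUE)

print(median_with_na)
print(median_without_na)
```

The same argument applies when median() is called on a column inside a grouped summary.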
From the weather data frame, visualize the temperature at EWR airport during the entire month of January in 2013. Hint: Use the origin, year, and month variables to filter() the data frame, and use time_hour as the x coordinate in the plot.
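The filtering pattern in the hint can be sketched on a small stand-in data frame. This assumes that filter() refers to the dplyr function; the data frame toy_weather and its values are hypothetical:

```r
library(dplyr)  # assumption: filter() in the hint is dplyr::filter()

# Hypothetical toy data standing in for the real weather data frame
toy_weather <- data.frame(
  origin = c("EWR", "EWR", "JFK", "EWR"),
  year   = c(2013, 2013, 2013, 2013),
  month  = c(1, 2, 1, 1),
  temp   = c(39.0, 30.2, 41.0, 37.9)
)

# Conditions separated by commas are combined with a logical AND
ewr_january <- toy_weather %>%
  filter(origin == "EWR", year == 2013, month == 1)

print(ewr_january)
```

Only rows that satisfy all three conditions are kept. The filtered result can then be passed on to your plotting code.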
Make a bar plot that visualizes the top 30 destinations for planes departing from JFK airport.
Pick some other variables that you are interested in from these data frames and make appropriate visualizations.