This homework set is intended to provide you with practice wrangling and visualizing data.
Feel free to ask questions or request help with errors in the Slack channel or during our weekly synchronous meeting.
For these problems, we will use the nycflights13 datasets, which can be downloaded using the commands
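A minimal setup sketch, assuming the datasets come from the nycflights13 package on CRAN and that the tidyverse supplies the %>%, dplyr, and ggplot2 functions used in the solutions (the CRAN mirror URL is also an assumption; any mirror works):

```r
# Install nycflights13 only if it is not already available.
# Assumption: the data sets ship with this CRAN package.
if (!requireNamespace("nycflights13", quietly = TRUE)) {
  install.packages("nycflights13", repos = "https://cloud.r-project.org")
}

library(nycflights13)  # provides airlines, airports, flights, planes, weather
library(tidyverse)     # assumed source of %>%, the dplyr verbs, and ggplot2
```

Loading the package makes the data frames available by name, so no separate file download is needed.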
You now have access to five data frames: airlines, airports, flights, weather, and planes. A portion of the flights data frame is shown below:
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
You can read more about what each of the variables represents by running ?flights in the console.
Visualize the departure delays in the flights data set using a histogram. You should relabel the axes and give your plot a title. What information have you learned from the plot?
flights %>%
  ggplot() +
  geom_histogram(aes(dep_delay), bins = 30) +
  labs(x = "Departure Delay (Minutes)",
       y = "Count",
       title = "Histogram of Departure Delays from NYC Airports, 2013")
The vast majority of departure delays are close to 0 minutes, and the number of flights falls off steadily as the departure delay grows longer.
Create a new variable called time_gained, which is equal to the departure delay subtracted from the arrival delay. Visualize the time_gained variable using a box plot. Describe what you see.
flights %>%
  mutate(time_gained = arr_delay - dep_delay) %>%
  ggplot() +
  geom_boxplot(aes(time_gained)) +
  labs(x = "Time Gained During Flight (Minutes)", y = NULL)
A typical flight gains time in the air, and the vast majority of flights gain or lose only a small amount of time. (Because time_gained is arr_delay minus dep_delay, a negative value means the flight made up time and a positive value means it lost time.) There are many outliers in the data, mostly flights that lose a significant amount of time in the air. The distribution is skewed to the right: a flight can only gain so much time in the air, but it can lose significantly more.
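The shape described above can also be checked numerically. The following is a minimal sketch, assuming the nycflights13 and dplyr packages are installed; the name time_gained_summary and the choice of the 5th and 95th percentiles are illustrative, not part of the assignment:

```r
library(dplyr)
library(nycflights13)

# Summarize the spread of time_gained (arr_delay - dep_delay).
# Negative values mean the flight made up time in the air.
time_gained_summary <- flights %>%
  mutate(time_gained = arr_delay - dep_delay) %>%
  summarize(
    min    = min(time_gained, na.rm = TRUE),
    q05    = quantile(time_gained, 0.05, na.rm = TRUE),
    median = median(time_gained, na.rm = TRUE),
    q95    = quantile(time_gained, 0.95, na.rm = TRUE),
    max    = max(time_gained, na.rm = TRUE)
  )

time_gained_summary
```

If the maximum lies much farther from the median than the minimum does, that supports the right-skew reading of the box plot.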
flights %>%
  ggplot() +
  geom_boxplot(aes(x = carrier, y = arr_delay)) +
  labs(x = "Airline Carrier", y = "Arrival Delay (Minutes)")
Compute the median arrival delay for each airline carrier using the median() function. (You’ll want to add the argument na.rm = TRUE inside the median() function to ignore the NA values in the data frame.) Sort the airline carriers by median arrival delay. Which airlines are consistently ahead of schedule? You can consult the airlines data frame to find the airline names that correspond to the abbreviations.
flights %>%
  group_by(carrier) %>%
  summarize(median_arr_delay = median(arr_delay, na.rm = TRUE)) %>%
  arrange(median_arr_delay)
## # A tibble: 16 x 2
##    carrier median_arr_delay
##    <chr>              <dbl>
##  1 AS                   -17
##  2 HA                   -13
##  3 AA                    -9
##  4 VX                    -9
##  5 DL                    -8
##  6 9E                    -7
##  7 OO                    -7
##  8 UA                    -6
##  9 US                    -6
## 10 B6                    -3
## 11 WN                    -3
## 12 YV                    -2
## 13 EV                    -1
## 14 MQ                    -1
## 15 FL                     5
## 16 F9                     6
You can manually look up the airline names in the airlines data frame, or you can use the left_join() function to add the names:
flights %>%
  group_by(carrier) %>%
  summarize(median_arr_delay = median(arr_delay, na.rm = TRUE)) %>%
  arrange(median_arr_delay) %>%
  left_join(airlines, by = "carrier") %>%
  select(name, median_arr_delay)
## # A tibble: 16 x 2
##    name                        median_arr_delay
##    <chr>                                  <dbl>
##  1 Alaska Airlines Inc.                     -17
##  2 Hawaiian Airlines Inc.                   -13
##  3 American Airlines Inc.                    -9
##  4 Virgin America                            -9
##  5 Delta Air Lines Inc.                      -8
##  6 Endeavor Air Inc.                         -7
##  7 SkyWest Airlines Inc.                     -7
##  8 United Air Lines Inc.                     -6
##  9 US Airways Inc.                           -6
## 10 JetBlue Airways                           -3
## 11 Southwest Airlines Co.                    -3
## 12 Mesa Airlines Inc.                        -2
## 13 ExpressJet Airlines Inc.                  -1
## 14 Envoy Air                                 -1
## 15 AirTran Airways Corporation                5
## 16 Frontier Airlines Inc.                     6
Alaska and Hawaiian Airlines are consistently ahead of schedule. (Must be something about not being in the contiguous United States!) It would be interesting to see how the air_time of a flight correlates with the arr_delay. One thing that these airlines likely have in common is having long flight durations!
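One way to follow up on that question is sketched below, assuming the nycflights13 and dplyr packages are installed. The names air_time_cor and carrier_summary are illustrative, and Pearson correlation is just one reasonable choice of measure:

```r
library(dplyr)
library(nycflights13)

# Overall Pearson correlation between air time and arrival delay,
# dropping flights where either value is missing.
air_time_cor <- flights %>%
  summarize(r = cor(air_time, arr_delay, use = "complete.obs"))

# Median air time next to median arrival delay for each carrier,
# to see whether the carriers that arrive early also fly long routes.
carrier_summary <- flights %>%
  group_by(carrier) %>%
  summarize(
    median_air_time  = median(air_time, na.rm = TRUE),
    median_arr_delay = median(arr_delay, na.rm = TRUE)
  ) %>%
  arrange(desc(median_air_time))

air_time_cor
carrier_summary
```

The per-carrier table is sorted by median air time, so if the long-flight explanation holds, the carriers with the most negative median arrival delays should appear near the top.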
Using the weather data frame, visualize the temperature at the EWR airport during the entire month of January in 2013. Hint: Use the origin, year, and month variables to filter() the data frame, and use time_hour as the x coordinate in the plot.
weather %>%
  filter(origin == "EWR",
         year == 2013,
         month == 1) %>%
  ggplot() +
  geom_line(aes(x = time_hour, y = temp)) +
  labs(x = NULL, y = "Temperature (°F)") +
  ggtitle("EWR Airport Temperatures January 2013")
JFK airport.
flights %>%
  filter(origin == "JFK") %>%
  count(dest) %>%
  arrange(desc(n)) %>%
  head(n = 30) %>%
  ggplot() +
  geom_col(aes(x = n, y = reorder(dest, n))) +
  labs(x = "Count", y = "Destination Airport")