library(tidyverse)Working with multiple data frames
STA 199
Bulletin
- this
aeis due for grade (48 hours from class). To turn in, simply push to GitHub - homework 01
- lab 01 due Monday at 5:00pm
Getting started
Clone your ae5-username repo from the GitHub organization.
Today
By the end of today you will…
- join data frames
Load packages
Joins
There are six join functions in the dplyr package. Join functions take two data frames as arguments and return a data frame.
The six joins can be broken down into two categories:
- Mutating joins: joining data frames results in mutating new columns
- Filtering joins: joining data frames results in filtering observations in one data frame based on another.
In all of the following examples, x and y are two data frames:
x = tibble(value = c(100, 200, 300),
xcol = c("x1", "x2", "x3"))
y = tibble(value = c(100, 200, 400),
ycol = c("y1", "y2", "y4"))
x# A tibble: 3 × 2
value xcol
<dbl> <chr>
1 100 x1
2 200 x2
3 300 x3
y# A tibble: 3 × 2
value ycol
<dbl> <chr>
1 100 y1
2 200 y2
3 400 y4
Mutating joins
The most popular 2 joins:
left_join(x, y): keep all rows fromxand adds columns fromyright_join(x, y): keeps all rows fromyand adds columns fromx
Two more helpful joins:
inner_join(x, y): join all rows fromxwhere there are matching values iny.Returns all combinations in case of multiple matchesfull_join(x, y): include all rows inxory
Toy examples:
x %>%
left_join(y)Joining, by = "value"
# A tibble: 3 × 3
value xcol ycol
<dbl> <chr> <chr>
1 100 x1 y1
2 200 x2 y2
3 300 x3 <NA>
- What do you think
Joining, by = "value"means?
x %>%
right_join(y, by = "value")# A tibble: 3 × 3
value xcol ycol
<dbl> <chr> <chr>
1 100 x1 y1
2 200 x2 y2
3 400 <NA> y4
x %>%
inner_join(y)Joining, by = "value"
# A tibble: 2 × 3
value xcol ycol
<dbl> <chr> <chr>
1 100 x1 y1
2 200 x2 y2
full_join(x, y)Joining, by = "value"
# A tibble: 4 × 3
value xcol ycol
<dbl> <chr> <chr>
1 100 x1 y1
2 200 x2 y2
3 300 x3 <NA>
4 400 <NA> y4
Filtering joins
semi_join(x, y): return all rows fromxwith match inyanti_join(x, y): return all rows fromxwithout a match iny
In both of these “filtering” cases we do not add any new columns to our first argument (the data frame x).
Toy examples:
x %>%
semi_join(y)Joining, by = "value"
# A tibble: 2 × 2
value xcol
<dbl> <chr>
1 100 x1
2 200 x2
x %>%
anti_join(y)Joining, by = "value"
# A tibble: 1 × 2
value xcol
<dbl> <chr>
1 300 x3
We can manually specify which columns to join by and the columns do not have to have the same name! See the example below.
Check out the new data frame x2:
x2 = x %>%
mutate(new_value = value) %>%
select(new_value, xcol)
x2# A tibble: 3 × 2
new_value xcol
<dbl> <chr>
1 100 x1
2 200 x2
3 300 x3
We can still join x2 with y but left_join(x2, y) won’t work. We have to manually specify which columns to join by:
x2 %>%
left_join(y, by = c("new_value" = "value"))# A tibble: 3 × 3
new_value xcol ycol
<dbl> <chr> <chr>
1 100 x1 y1
2 200 x2 y2
3 300 x3 <NA>
Practice
We’ll take a look at some New York flights data.
This data set contains on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.
This data is a subset of the data set provided by the nycflights13 package.
Codebook
Check out codebook for flights, airlines and airports with ?flights, ?airlines and ?airports
library(nycflights13)
data(flights)
data(airlines)
data(airports)Exercise 1
flights contains the two letter carrier abbreviations and airlines contains a dictionary. It would be nice if this information was in one data frame so we could read off easily e.g. which airlines are most likely to have arrival delays.
Use an appropriate join to add the column of airlines to flights and save the resulting data frame as flights2.
# code hereExercise 2
Using flights2, report the average arrival delay for each carrier. Print only the 5 carrier airline with the worst arrival delays on average. No joins required here.
# code hereExercise 3
Create a new data set called dest_delays that reports the median arrival delay at each destination airport.
# code hereExercise 4
Which destination has the worst arrival delay? (Note: you will need to join dest_delays with airports to answer this question)
# code hereOptional bonus
- Is there anything else you might want to learn from the data before declaring one airport is most likely to have delayed arrival or one carrier is most likely to result in a delayed flight?