library(tidyverse)
Working with multiple data frames
STA 199
Bulletin
- this
ae
is due for grade (48 hours from class). To turn in, simply push to GitHub - homework 01
- lab 01 due Monday at 5:00pm
Getting started
Clone your ae5-username
repo from the GitHub organization.
Today
By the end of today you will…
- join data frames
Load packages
Joins
There are six join functions in the dplyr
package. Join functions take two data frames as arguments and return a data frame.
The six joins can be broken down into two categories:
- Mutating joins: joining data frames results in mutating new columns
- Filtering joins: joining data frames results in filtering observations in one data frame based on another.
In all of the following examples, x
and y
are two data frames:
= tibble(value = c(100, 200, 300),
x xcol = c("x1", "x2", "x3"))
= tibble(value = c(100, 200, 400),
y ycol = c("y1", "y2", "y4"))
x
# A tibble: 3 × 2
value xcol
<dbl> <chr>
1 100 x1
2 200 x2
3 300 x3
y
# A tibble: 3 × 2
value ycol
<dbl> <chr>
1 100 y1
2 200 y2
3 400 y4
Mutating joins
The most popular 2 joins:
left_join(x, y)
: keep all rows fromx
and adds columns fromy
right_join(x, y)
: keeps all rows fromy
and adds columns fromx
Two more helpful joins:
inner_join(x, y)
: join all rows fromx
where there are matching values iny
.Returns all combinations in case of multiple matchesfull_join(x, y)
: include all rows inx
ory
Toy examples:
%>%
x left_join(y)
Joining, by = "value"
# A tibble: 3 × 3
value xcol ycol
<dbl> <chr> <chr>
1 100 x1 y1
2 200 x2 y2
3 300 x3 <NA>
- What do you think
Joining, by = "value"
means?
%>%
x right_join(y, by = "value")
# A tibble: 3 × 3
value xcol ycol
<dbl> <chr> <chr>
1 100 x1 y1
2 200 x2 y2
3 400 <NA> y4
%>%
x inner_join(y)
Joining, by = "value"
# A tibble: 2 × 3
value xcol ycol
<dbl> <chr> <chr>
1 100 x1 y1
2 200 x2 y2
full_join(x, y)
Joining, by = "value"
# A tibble: 4 × 3
value xcol ycol
<dbl> <chr> <chr>
1 100 x1 y1
2 200 x2 y2
3 300 x3 <NA>
4 400 <NA> y4
Filtering joins
semi_join(x, y)
: return all rows fromx
with match iny
anti_join(x, y)
: return all rows fromx
without a match iny
In both of these “filtering” cases we do not add any new columns to our first argument (the data frame x
).
Toy examples:
%>%
x semi_join(y)
Joining, by = "value"
# A tibble: 2 × 2
value xcol
<dbl> <chr>
1 100 x1
2 200 x2
%>%
x anti_join(y)
Joining, by = "value"
# A tibble: 1 × 2
value xcol
<dbl> <chr>
1 300 x3
Check out the new data frame x2
:
= x %>%
x2 mutate(new_value = value) %>%
select(new_value, xcol)
x2
# A tibble: 3 × 2
new_value xcol
<dbl> <chr>
1 100 x1
2 200 x2
3 300 x3
We can still join x2
with y
but left_join(x2, y)
won’t work. We have to manually specify which columns to join by:
%>%
x2 left_join(y, by = c("new_value" = "value"))
# A tibble: 3 × 3
new_value xcol ycol
<dbl> <chr> <chr>
1 100 x1 y1
2 200 x2 y2
3 300 x3 <NA>
Practice
We’ll take a look at some New York flights data.
This data set contains on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.
This data is a subset of the data set provided by the nycflights13
package.
Codebook
Check out codebook for flights
, airlines
and airports
with ?flights
, ?airlines
and ?airports
library(nycflights13)
data(flights)
data(airlines)
data(airports)
Exercise 1
flights
contains the two letter carrier abbreviations and airlines
contains a dictionary. It would be nice if this information was in one data frame so we could read off easily e.g. which airlines are most likely to have arrival delays.
Use an appropriate join to add the column of airlines
to flights
and save the resulting data frame as flights2
.
# code here
Exercise 2
Using flights2
, report the average arrival delay for each carrier. Print only the 5 carrier airline with the worst arrival delays on average. No joins required here.
# code here
Exercise 3
Create a new data set called dest_delays
that reports the median arrival delay at each destination airport.
# code here
Exercise 4
Which destination has the worst arrival delay? (Note: you will need to join dest_delays
with airports
to answer this question)
# code here
Optional bonus
- Is there anything else you might want to learn from the data before declaring one airport is most likely to have delayed arrival or one carrier is most likely to result in a delayed flight?