Grammar of data wrangling

STA 199

Bulletin

  • this ae is due for grade (48 hours from class). To turn in, simply push to GitHub
  • homework 01

Getting started

Clone your ae4-username repo from the GitHub organization.

Today

By the end of today you will…

  • understand how data is organized according to a consistent set of “tidy” principles
  • use seven key verbs to wrangle data and extract meaning

Load packages

library(tidyverse)
library(nycflights13)

The data frame has over 336,000 observations (rows), 336776 observations to be exact, so we will not view the entire data frame. Instead we’ll use the commands below to help us explore the data.

glimpse(flights)
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
names(flights)
 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"     
head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
  <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
1  2013     1     1      517         515       2     830     819      11 UA     
2  2013     1     1      533         529       4     850     830      20 UA     
3  2013     1     1      542         540       2     923     850      33 AA     
4  2013     1     1      544         545      -1    1004    1022     -18 B6     
5  2013     1     1      554         600      -6     812     837     -25 DL     
6  2013     1     1      554         558      -4     740     728      12 UA     
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
#   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

The head() function returns “A tibble: 6 x 19” and then the first six rows of the flights data.

Tibble vs. data frame

A tibble is a slightly specialized version of the R data frame. All tibbles are data frames, but not all data frames are tibbles.

There are two main differences between a tibble and a data frame:

  1. When you print a tibble, the first ten rows and all of the columns that fit on the screen will display, along with the type of each column.

Let’s look at the differences in the output when we type flights (tibble) in the console versus typing cars (data frame) in the console.

  1. Second, tibbles are somewhat more strict than data frames when it comes to subsetting data. You will get NULL if you try to access a variable that doesn’t exist in a data frame. You’ll get NULL and a warning if you try to do the same for a tibble.
flights$apple
Warning: Unknown or uninitialised column: `apple`.
NULL
cars$apple
NULL

Data wrangling with dplyr

dplyr is the primary package in the tidyverse for data wrangling. Click here for the dplyr reference page. Click here for the dplyr cheatsheet.

Quick summary of key dplyr functions1:

Rows:

  • filter():chooses rows based on column values.
  • slice(): chooses rows based on location.
  • arrange(): changes the order of the rows
  • sample_n(): take a random subset of the rows

Columns:

  • select(): changes whether or not a column is included.
  • rename(): changes the name of columns.
  • mutate(): changes the values of columns and creates new columns.

Groups of rows:

  • summarise(): collapses a group into a single row.
  • count(): count unique values of one or more variables.
  • group_by(): perform calculations separately for each value of a variable

The pipe

We saw the pipe in a previous lecture. Here it is again, with a demo of why it’s useful.

Remember, a pipe |> or %>% takes what comes before it and makes it the input to the following function.

When reading code in English, say “and then” whenever you see a pipe.

Exercise 1

Re-write the code below with pipes.

head(select(flights, dep_delay, arr_delay))
# A tibble: 6 × 2
  dep_delay arr_delay
      <dbl>     <dbl>
1         2        11
2         4        20
3         2        33
4        -1       -18
5        -6       -25
6        -4        12

select()

  • Demo: Make a data frame that only contains the variables dep_delay and arr_delay.
flights |>
  select(dep_delay, arr_delay)
# A tibble: 336,776 × 2
   dep_delay arr_delay
       <dbl>     <dbl>
 1         2        11
 2         4        20
 3         2        33
 4        -1       -18
 5        -6       -25
 6        -4        12
 7        -5        19
 8        -3       -14
 9        -3        -8
10        -2         8
# … with 336,766 more rows
  • Demo: Make a data frame that keeps every variable except dep_delay.
flights |>
  select(-dep_delay)
# A tibble: 336,776 × 18
    year month   day dep_time sched_dep…¹ arr_t…² sched…³ arr_d…⁴ carrier flight
   <int> <int> <int>    <int>       <int>   <int>   <int>   <dbl> <chr>    <int>
 1  2013     1     1      517         515     830     819      11 UA        1545
 2  2013     1     1      533         529     850     830      20 UA        1714
 3  2013     1     1      542         540     923     850      33 AA        1141
 4  2013     1     1      544         545    1004    1022     -18 B6         725
 5  2013     1     1      554         600     812     837     -25 DL         461
 6  2013     1     1      554         558     740     728      12 UA        1696
 7  2013     1     1      555         600     913     854      19 B6         507
 8  2013     1     1      557         600     709     723     -14 EV        5708
 9  2013     1     1      557         600     838     846      -8 B6          79
10  2013     1     1      558         600     753     745       8 AA         301
# … with 336,766 more rows, 8 more variables: tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
#   ²​arr_time, ³​sched_arr_time, ⁴​arr_delay
  • Demo: Make a data frame that includes all variables between year through dep_delay (inclusive). These are all variables that provide information about the departure of each flight.
flights |>
  select(year:dep_delay)
# A tibble: 336,776 × 6
    year month   day dep_time sched_dep_time dep_delay
   <int> <int> <int>    <int>          <int>     <dbl>
 1  2013     1     1      517            515         2
 2  2013     1     1      533            529         4
 3  2013     1     1      542            540         2
 4  2013     1     1      544            545        -1
 5  2013     1     1      554            600        -6
 6  2013     1     1      554            558        -4
 7  2013     1     1      555            600        -5
 8  2013     1     1      557            600        -3
 9  2013     1     1      557            600        -3
10  2013     1     1      558            600        -2
# … with 336,766 more rows
  • Demo: Use the select helper contains() to make a data frame that includes the variables associated with the arrival, i.e., contains the string "arr\_" in the name.
flights |>
  select(contains("arr_"))
# A tibble: 336,776 × 3
   arr_time sched_arr_time arr_delay
      <int>          <int>     <dbl>
 1      830            819        11
 2      850            830        20
 3      923            850        33
 4     1004           1022       -18
 5      812            837       -25
 6      740            728        12
 7      913            854        19
 8      709            723       -14
 9      838            846        -8
10      753            745         8
# … with 336,766 more rows

slice()

  • Demo: Display the first five rows of the flights data frame.
flights |>
  slice(1:5)
# A tibble: 5 × 19
   year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
  <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
1  2013     1     1      517         515       2     830     819      11 UA     
2  2013     1     1      533         529       4     850     830      20 UA     
3  2013     1     1      542         540       2     923     850      33 AA     
4  2013     1     1      544         545      -1    1004    1022     -18 B6     
5  2013     1     1      554         600      -6     812     837     -25 DL     
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
#   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
  • Demo: Display the last two rows of the flights data frame.
flights |>
  slice((n()-1):n())
# A tibble: 2 × 19
   year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
  <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
1  2013     9    30       NA        1159      NA      NA    1344      NA MQ     
2  2013     9    30       NA         840      NA      NA    1020      NA MQ     
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
#   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

arrange()

  • Demo: Let’s arrange the data by departure delay, so the flights with the shortest departure delays will be at the top of the data frame.

Question: What does it mean for the dep_delay to have a negative value?

flights |>
  arrange(dep_delay)
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013    12     7     2040       2123     -43      40    2352      48 B6     
 2  2013     2     3     2022       2055     -33    2240    2338     -58 DL     
 3  2013    11    10     1408       1440     -32    1549    1559     -10 EV     
 4  2013     1    11     1900       1930     -30    2233    2243     -10 DL     
 5  2013     1    29     1703       1730     -27    1947    1957     -10 F9     
 6  2013     8     9      729        755     -26    1002     955       7 MQ     
 7  2013    10    23     1907       1932     -25    2143    2143       0 EV     
 8  2013     3    30     2030       2055     -25    2213    2250     -37 MQ     
 9  2013     3     2     1431       1455     -24    1601    1631     -30 9E     
10  2013     5     5      934        958     -24    1225    1309     -44 B6     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
  • Demo: Now let’s arrange the data by descending departure delay, so the flights with the longest departure delays will be at the top.
flights |>
  arrange(desc(dep_delay))
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     9      641        900    1301    1242    1530    1272 HA     
 2  2013     6    15     1432       1935    1137    1607    2120    1127 MQ     
 3  2013     1    10     1121       1635    1126    1239    1810    1109 MQ     
 4  2013     9    20     1139       1845    1014    1457    2210    1007 AA     
 5  2013     7    22      845       1600    1005    1044    1815     989 MQ     
 6  2013     4    10     1100       1900     960    1342    2211     931 DL     
 7  2013     3    17     2321        810     911     135    1020     915 DL     
 8  2013     6    27      959       1900     899    1236    2226     850 DL     
 9  2013     7    22     2257        759     898     121    1026     895 DL     
10  2013    12     5      756       1700     896    1058    2020     878 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Exercise 2

Create a data frame that only includes the plane tail number (tailnum), carrier (carrier), and departure delay for the flight with the longest departure delay. What is the plane tail number (tailnum) for this flight?

# code here

filter()

  • Demo: Filter the data frame by selecting the rows where the destination airport is RDU.
flights |>
  filter(dest == "RDU")
# A tibble: 8,163 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      800        810     -10     949     955      -6 MQ     
 2  2013     1     1      832        840      -8    1006    1030     -24 MQ     
 3  2013     1     1      851        851       0    1032    1036      -4 EV     
 4  2013     1     1      917        920      -3    1052    1108     -16 B6     
 5  2013     1     1     1024       1030      -6    1204    1215     -11 MQ     
 6  2013     1     1     1127       1129      -2    1303    1309      -6 EV     
 7  2013     1     1     1157       1205      -8    1342    1345      -3 MQ     
 8  2013     1     1     1240       1235       5    1415    1415       0 MQ     
 9  2013     1     1     1317       1325      -8    1454    1505     -11 MQ     
10  2013     1     1     1449       1450      -1    1651    1640      11 MQ     
# … with 8,153 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
  • Demo: We can also filter using more than one condition. Here we select all rows where the destination airport is RDU and the arrival delay is less than 0.
flights |>
  filter(dest == "RDU", arr_delay < 0)
# A tibble: 4,232 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      800        810     -10     949     955      -6 MQ     
 2  2013     1     1      832        840      -8    1006    1030     -24 MQ     
 3  2013     1     1      851        851       0    1032    1036      -4 EV     
 4  2013     1     1      917        920      -3    1052    1108     -16 B6     
 5  2013     1     1     1024       1030      -6    1204    1215     -11 MQ     
 6  2013     1     1     1127       1129      -2    1303    1309      -6 EV     
 7  2013     1     1     1157       1205      -8    1342    1345      -3 MQ     
 8  2013     1     1     1317       1325      -8    1454    1505     -11 MQ     
 9  2013     1     1     1505       1510      -5    1654    1655      -1 MQ     
10  2013     1     1     1800       1800       0    1945    1951      -6 B6     
# … with 4,222 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

We can do more complex tasks using logical operators:

operator definition
< is less than?
<= is less than or equal to?
> is greater than?
>= is greater than or equal to?
== is exactly equal to?
!= is not equal to?
x & y is x AND y?
x \| y is x OR y?
is.na(x) is x NA?
!is.na(x) is x not NA?
x %in% y is x in y?
!(x %in% y) is x not in y?
!x is not x?

The final operator only makes sense if x is logical (TRUE / FALSE).

Exercise 3

Describe what the code is doing in words.

flights |>
  filter(
    dest %in% c("RDU", "GSO"),
    arr_delay < 0 | dep_delay < 0
    )
# A tibble: 6,203 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      800        810     -10     949     955      -6 MQ     
 2  2013     1     1      832        840      -8    1006    1030     -24 MQ     
 3  2013     1     1      851        851       0    1032    1036      -4 EV     
 4  2013     1     1      917        920      -3    1052    1108     -16 B6     
 5  2013     1     1     1024       1030      -6    1204    1215     -11 MQ     
 6  2013     1     1     1127       1129      -2    1303    1309      -6 EV     
 7  2013     1     1     1157       1205      -8    1342    1345      -3 MQ     
 8  2013     1     1     1317       1325      -8    1454    1505     -11 MQ     
 9  2013     1     1     1449       1450      -1    1651    1640      11 MQ     
10  2013     1     1     1505       1510      -5    1654    1655      -1 MQ     
# … with 6,193 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

count()

  • Demo: Create a frequency table of the destination locations for flights from New York.
flights |>
  count(dest)
# A tibble: 105 × 2
   dest      n
   <chr> <int>
 1 ABQ     254
 2 ACK     265
 3 ALB     439
 4 ANC       8
 5 ATL   17215
 6 AUS    2439
 7 AVL     275
 8 BDL     443
 9 BGR     375
10 BHM     297
# … with 95 more rows
  • Demo: In which month was there the fewest number of flights? How many flights were there in that month?
flights |>
  count(month) |>
  filter(n == min(n))
# A tibble: 1 × 2
  month     n
  <int> <int>
1     2 24951

Exercise 4

On which date (month + day) was there the largest number of flights? How many flights were there on that day?

# code here

mutate()

Use mutate() to create a new variable.

  • Demo: In the code chunk below, air_time (minutes in the air) is converted to hours, and then new variable mph is created, corresponding to the miles per hour of the flight.
flights |>
  mutate(
    hours = air_time / 60,
    mph = distance / hours
    ) |>
  select(air_time, distance, hours, mph)
# A tibble: 336,776 × 4
   air_time distance hours   mph
      <dbl>    <dbl> <dbl> <dbl>
 1      227     1400 3.78   370.
 2      227     1416 3.78   374.
 3      160     1089 2.67   408.
 4      183     1576 3.05   517.
 5      116      762 1.93   394.
 6      150      719 2.5    288.
 7      158     1065 2.63   404.
 8       53      229 0.883  259.
 9      140      944 2.33   405.
10      138      733 2.3    319.
# … with 336,766 more rows
  • Your turn (4 minutes): Create a new variable to calculate the percentage of flights in each month. What percentage of flights take place in July?
flights |>
  count(month) |>
  mutate(perc = n / sum(n) * 100)
# A tibble: 12 × 3
   month     n  perc
   <int> <int> <dbl>
 1     1 27004  8.02
 2     2 24951  7.41
 3     3 28834  8.56
 4     4 28330  8.41
 5     5 28796  8.55
 6     6 28243  8.39
 7     7 29425  8.74
 8     8 29327  8.71
 9     9 27574  8.19
10    10 28889  8.58
11    11 27268  8.10
12    12 28135  8.35

summarize()

summarize() collapses the rows into summary statistics and removes columns irrelevant to the calculation.

Be sure to name your columns!

flights |>
  summarize(mean_dep_delay = mean(dep_delay))
# A tibble: 1 × 1
  mean_dep_delay
           <dbl>
1             NA

Why did this code return NA?

Let’s fix it!

flights |>
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 × 1
  mean_dep_delay
           <dbl>
1           12.6

group_by()

group_by() is used for grouped operations. It’s very powerful when paired with summarise() to calculate summary statistics by group.

Here we find the mean and standard deviation of departure delay for each month.

flights |>
  group_by(month) |>
  summarize(
    mean_dep_delay = mean(dep_delay, na.rm = TRUE), 
    sd_dep_delay = sd(dep_delay, na.rm = TRUE)
    )
# A tibble: 12 × 3
   month mean_dep_delay sd_dep_delay
   <int>          <dbl>        <dbl>
 1     1          10.0          36.4
 2     2          10.8          36.3
 3     3          13.2          40.1
 4     4          13.9          43.0
 5     5          13.0          39.4
 6     6          20.8          51.5
 7     7          21.7          51.6
 8     8          12.6          37.7
 9     9           6.72         35.6
10    10           6.24         29.7
11    11           5.44         27.6
12    12          16.6          41.9

Exercise 5

What is the median departure delay for each airports around NYC (origin)?

flights |>
  group_by(origin) |>
  summarize(
    med_dep_delay = median(dep_delay, na.rm = TRUE)
    )
# A tibble: 3 × 2
  origin med_dep_delay
  <chr>          <dbl>
1 EWR               -1
2 JFK               -1
3 LGA               -3

Additional Practice

  1. Create a new data set that only contains flights that do not have a missing departure time. Include the columns year, month, day, dep_time, dep_delay, and dep_delay_hours (the departure delay in hours). Hint: Note you may need to use mutate() to make one or more of these variables.
# add code here
  1. For each airplane (uniquely identified by tailnum), use a group_by() paired with summarize() to find the sample size, mean, and standard deviation of flight distances. Then include only the top 5 and bottom 5 airplanes in terms of mean distance traveled per flight in the final data frame.
# add code here

Footnotes

  1. From dplyr vignette↩︎