library(tidyverse)
library(tidymodels)
Bootstrap
STA 199
Bulletin
- this
ae
is due for grade. Push your completed ae to GitHub within 48 hours to receive credit - project proposal feedback
- homework 4 released
Getting started
Clone your ae18-username
repo from the GitHub organization.
Today
By the end of today you will…
- be able to draw a bootstrap sample and calculate a bootstrap statistic
- use
infer
to obtain a bootstrap distribution - calculate a confidence interval from the bootstrap distribution
- interpret a confidence interval in context of the data
Load packages
Load data
= read_csv(
manhattan "https://sta101-fa22.netlify.app/static/appex/data/manhattan.csv"
)
Notes (for reference)
Bootstrapping is a re-sampling technique. The key idea is you have already collected a sample of size \(N\) from the population. To create a bootstrap sample, you sample with replacement from your original sample \(N\) times.
Let’s say you measure the height of five Duke students in meters:
= c(1.51, 1.62, 1.89, 2.01, 1.78)
heights
= data.frame(heights) students
There are many ways to create a bootstrap sample in R. We will focus on the tidy
way below. which uses the infer
package that loads with tidymodels
.
Example
set.seed(2)
%>%
students specify(response = heights) %>%
generate(reps = 1, type = "bootstrap")
Response: heights (numeric)
# A tibble: 5 × 2
# Groups: replicate [1]
replicate heights
<int> <dbl>
1 1 1.78
2 1 1.51
3 1 1.78
4 1 1.51
5 1 2.01
From here, we can compute a bootstrap statistic. E.g.
set.seed(2)
%>%
students specify(response = heights) %>%
generate(reps = 1, type = "bootstrap") %>%
calculate(stat = "median")
Response: heights (numeric)
# A tibble: 1 × 1
stat
<dbl>
1 1.78
Example: rent in Manhattan
On a given day in 2018, twenty one-bedroom apartments were randomly selected on Craigslist Manhattan from apartments listed as “by owner”. The data are in the manhattan
data frame. We will use this sample to conduct inference on the typical rent of 1 bedroom apartments in Manhattan.
Part 1: Drawing a bootstrap sample
Let’s start by using bootstrapping to estimate the mean rent of one-bedroom apartments in Manhattan.
Exercise 1
What is a point estimate (i.e. single number summary) of the typical rent?
Exercise 2
Let’s bootstrap!
- To bootstrap we will sample with replacement by drawing a value from the box.
- How many draws do we need for our bootstrap sample?
Fill in the values from the bootstrap sample conducted in class. Once the values are filled in, un-comment the code.
# class_bootstrap = c()
Exercise 3
- About what value do you expect the bootstrap statistic to take?
- Calculate the statistic from the bootstrap sample.
# add code
Part 2: Bootstrap confidence interval
We will calculate a 95% confidence interval for the mean rent of one-bedroom apartments in Manhattan.
We start by setting a seed to ensure our analysis is reproducible.
Generating the bootstrap distribution
We can use R to take many bootstrap samples, compute a statistic and then view the bootstrap distribution of that statistic.
Un-comment the lines and fill in the blanks to create the bootstrap distribution of sample means and save the results in the data frame boot_dist
.
Use 1000 reps for the in-class activity. (You will use about 10,000 reps for assignments outside of class.)
set.seed(7182022)
= manhattan #%>%
boot_dist #specify(______) %>%
#generate(______) %>%
#calculate(______)
- How many rows are in
boot_dist
? - What does each row represent?
- What are the variables in
boot_dist
? What do they mean?
Visualize the bootstrap distribution
A sample statistic is a random variable, we can look at its distribution.
Visualize the bootstrap distribution using a histogram. Describe the shape and center of the distribution.
# add code
Calculate the confidence interval
Uncomment the lines and fill in the blanks to construct the 95% bootstrap confidence interval for the mean rent of one-bedroom apartments in Manhattan.
#___ %>%
# summarize(lower = quantile(______),
# upper = quantile(______))
Interpret the interval
Write the interpretation for the interval calculated above.
Question: Does a confidence interval have to be symmetric?
What is one advantage to using a 90% confidence interval instead of a 95% confidence interval to estimate a parameter? - What is one advantage to using a 99% confidence interval instead of a 95% confidence interval to estimate a parameter?