library(tidyverse)
library(tidymodels)
library(dsbox)
Lab 6 - Logistic regression
Packages
You’ll need the following packages for today’s lab.
Data
The data can be found in the dsbox package, and it’s called gss16
. Since the data set is distributed with the package, we don’t need to load it separately; it becomes available to us when we load the package.
If you would like to explicitly load the data into your environment so you can view it, you can do so by running this code.
<- gss16 gss16
You can find out more about the data set by inspecting its documentation, which you can access by running ?gss16
in the Console or using the Help menu in RStudio to search for gss16
. You can also find this information here.
Exercises
Exercise 1 - Data wrangling
Create a new data frame called
gss16_advfront
that includes the variablesadvfront
,educ
,polviews
, andwrkstat
. Then, use thedrop_na()
function to remove rows that containNA
s from this new data frame.Transform the
advfront
variable such that it has two levels:"Strongly agree"
and"Agree"
should both be mapped to"Agree"
and the remaining levels should all be relabeled"Not agree"
. Make sure the resulting levels are in the following order:"Agree"
and"Not agree"
.
Hint: use the factor()
function inside a mutate()
statement to relabel the original levels. Be sure to list the levels in order so that they are correctly ordered after relabeling.
- Similarly to part b, combine the levels of the
polviews
variable such that levels that have the word “liberal” in them are lumped into a level called"Liberal"
and those that have the word “conservative” in them are lumped into a level called"Conservative"
. Make sure the levels are in the following order:"Conservative"
,"Moderate"
, and"Liberal"
. Finally,count()
how many times each new level appears in thepolviews
variable.
Hint: be careful if you manually type out the levels in the original polviews
variable to note that there are typos in two of the original levels “slightly conservative” and “extremely conservative” are both misspelled, and so you will need to match those misspellings in your call to factor()
.
Exercise 2 - Train and test sets
Now, let’s split the data into training and test sets so that we can evaluate the models we’re going to fit by how well they predict outcomes on data that wasn’t used to fit the models.
Specify a random seed of 1234 (i.e., include set.seed(1234)
at the beginning of your code chunk), and then split gss16_advfront
randomly into a training set train_data
and a test set test_data
. Do this so that the training set contains 80% of the rows of the original data.
Exercise 3 - Logistic Regression
Using the training data, fit a logistic regression model that predicts
advfront
usingeduc
. In particular, the model should predict the probability thatadvfront
has value"Not agree"
. Name this modelmodel1
. Report the tidy model output.Write out the fitted model equation in proper notation. State the meaning of any variables in the context of the data.
Using your fitted model, report the estimated probability of agreeing with the following statement: Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government (
Agree
in advfront) if you have an education of 7 years.
Exercise 4 - Another model
Again using the training data, fit a new logistic regression model that adds the additional explanatory variable of
polviews
. Name this modelmodel2
. Report the tidy output.Now, report the estimated probability of agreeing with the following statement: Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government (
Agree
in advfront) if you have an education of 7 years and are Conservative.
Exercise 5 - Evaluating models with AIC
Report the AIC values for each of
model1
andmodel2
.Based on your results in part a, does it appear that including political views in addition to years of education is useful for modeling whether employees agree with the statement “Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government”? Explain.
Exercise 6 - Evaluating models using test data
For each of
model1
andmodel2
, report the number of false positive and false negatives when making predictions on thetest_data
with a decision boundary of 0.5.Do these results provide much information about which model you would prefer for a prediction task? If so, which model would you choose?
Do you think a decision boundary of 0.5 makes sense here or would you adjust it?
Submission
To submit your assignment:
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials \(\rightarrow\) Duke NetID and log in using your NetID credentials.
- Click on your STA 199 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”). If you do not do this, you will be subject to lose points on the assignment.
- Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.
Grading
Component | Points |
---|---|
Ex 1 | 8 |
Ex 2 | 3 |
Ex 3 | 10 |
Ex 4 | 5 |
Ex 5 | 5 |
Ex 6 | 14 |
Workflow & formatting | 5 |
Total | 50 |