library(tidyverse)
library(knitr)
Intro to probability
STA 199
Bulletin
- This ae is not due for a grade.
- Team announcement in Slack.
Getting started
Clone your ae10-username repo from the GitHub organization.
Today
By the end of today you will
- have a working understanding of the terms probability, sample space, event, population and sample.
- compute probabilities of events from data
- create a contingency table using pivot_wider() and kable()
- use a contingency table to explore the relationship between two categorical variables.
Definitions
- The probability of an event tells us how likely an event is to occur, and it can take values from 0 to 1, inclusive. It can be viewed as
- the proportion of times the event would occur if it could be observed an infinite number of times.
- our degree of belief an event will happen.
- An event is the basic element to which probability is applied, e.g. the result of an observation or experiment.
- Example: \(A\) is the event a student in STA 199 is a sophomore.
- We use capital letters, e.g. \(A\), to denote events.
- For any event \(A\) and its complement, \(A^C\), \(\Pr(A) + \Pr(A^C) = 1\).
- A sample space is the set of all possible outcomes. The outcomes in the sample space are disjoint (mutually exclusive), meaning no two of them can occur simultaneously.
- Example: The sample space for year is {First-year, Sophomore, Junior, Senior}. Each item in the brackets is a distinct outcome from the questionnaire.
- The probability of the entire sample space is 1.
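The definitions above can be illustrated with a minimal sketch in R. The toy data frame and its values are invented for illustration and are not the class survey data; the column name year is only assumed to match the survey variable described later.

```r
library(dplyr)

# Hypothetical responses, invented for illustration only (not the STA 199 data)
toy <- data.frame(
  year = c("First-year", "Sophomore", "Sophomore", "Junior",
           "Senior", "Sophomore", "First-year", "Junior")
)

# Sample space: the set of distinct outcomes observed for year
sample_space <- toy |> distinct(year) |> pull(year)

# Probability of each outcome, estimated as a proportion of responses
year_probs <- toy |>
  count(year) |>
  mutate(prob = n / sum(n))

# Event A: the student is a sophomore; its complement: not a sophomore
p_A  <- mean(toy$year == "Sophomore")
p_Ac <- mean(toy$year != "Sophomore")

p_A + p_Ac            # complement rule: the two probabilities sum to 1
sum(year_probs$prob)  # the probabilities over the sample space sum to 1
```

Here mean() of a logical vector gives the proportion of TRUE values, which is one simple way to estimate the probability of an event from data.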
Introduction
#sta199 = read_csv()
sta199 = sta199 |>
  rename(year = `What year are you in school?`,
         animal = `Choose your favorite`,
         tv = `Favorite TV`,
         major = `Probable Major?`)
For this Application Exercise, we will look at our newly collected data.
The data include:
- year: Year in school
- animal: Whether you prefer cats or dogs
- tv: Favorite TV genre
- major: Probable major (statistical science or not)
Exercise 1
Give two examples of an event from the data set.
Exercise 2
Let’s take a look at favorite TV genre. Note that we have categorized genres so that each person can only have one favorite genre.
- What is the sample space for favorite TV genre? You can use code to identify the sample space.
# code here
Exercise 3
- Let’s make a table that includes the TV genre, the number of people who prefer each, and the associated probabilities.
# code here
Exercise 4
How large is the sample space of any individual’s response? Can we check this in R?
# code here
Exercise 5
- What is the probability a randomly selected STA 199 student favors cats?
# code here
- What is the probability a randomly selected STA 199 student is not a senior and prefers dogs?
# code here
- What is the probability a randomly selected STA 199 student is a first year and a statistics major?
# code here
Exercise 6
Now let’s make a table looking at the relationship between year and favorite TV genre.
sta199 %>%
  count(year, tv)
We’ll reformat the data into a contingency table, a table frequently used to study the association between two categorical variables. In this contingency table, each row will represent a year, each column will represent a TV genre, and each cell will hold the number of students who have that particular combination of year and TV genre.
To make the contingency table, we will use a new function called pivot_wider(), from the tidyr package (loaded with the tidyverse). It will take the data frame produced by count(), which is currently in a “long” format, and reshape it into a “wide” format.
We will also use the kable() function in the knitr package to neatly format our new table.
sta199 %>%
  count(year, tv) %>%
  pivot_wider(names_from = tv,      # how we will name the columns
              values_from = n,      # values used for each cell
              values_fill = 0) %>%  # how to fill cells with 0 observations
  kable()                           # neatly display the results
- How many students in STA 199 are juniors and like dramas?
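To try the reshaping step without the class data, here is a small self-contained sketch. The toy data frame is hypothetical and its values are made up; only count(), pivot_wider(), and kable() come from the text above.

```r
library(dplyr)
library(tidyr)
library(knitr)

# Hypothetical responses, invented for illustration only (not the STA 199 data)
toy <- data.frame(
  year = c("Junior", "Junior", "Senior", "Senior", "Senior", "Junior"),
  tv   = c("Drama", "Comedy", "Drama", "Drama", "Anime", "Drama")
)

# Long format: one row per observed (year, tv) combination
long <- toy %>% count(year, tv)

# Wide format: one row per year, one column per genre
wide <- long %>%
  pivot_wider(names_from = tv,   # column names come from tv
              values_from = n,   # cell values come from the counts
              values_fill = 0)   # combinations never observed become 0

kable(wide)  # neatly display the contingency table

# Read a single cell: juniors who like drama in the toy data
junior_drama <- wide %>% filter(year == "Junior") %>% pull(Drama)
```

Without values_fill = 0, combinations that never occur in the toy data (for example, juniors who like anime) would show up as NA rather than 0.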
Exercise 7
For each of the following exercises:
Calculate the probability using the contingency table above.
Then write code to check your answer using the sta199 data frame and dplyr functions.
- What is the probability a randomly selected STA 199 student is a sophomore?
# code here
- What is the probability that a randomly selected STA 199 student likes anime or action/adventure?
# code here
- What is the probability that a randomly selected STA 199 student is a sophomore or likes comedy?
# code here
- What is the probability that a randomly selected STA 199 student is a sophomore and likes drama?
# code here
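One way to check a hand-computed "or" probability is to compare a direct count against the general addition rule, \(\Pr(A \text{ or } B) = \Pr(A) + \Pr(B) - \Pr(A \text{ and } B)\). The sketch below uses a hypothetical toy data frame with made-up values, not the class data, and is only one possible approach.

```r
# Hypothetical responses, invented for illustration only (not the STA 199 data)
toy <- data.frame(
  year = c("Sophomore", "Sophomore", "Junior", "Senior",
           "Junior", "Sophomore", "Senior", "Junior"),
  tv   = c("Comedy", "Drama", "Comedy", "Drama",
           "Anime", "Comedy", "Comedy", "Drama")
)

A <- toy$year == "Sophomore"  # event A: student is a sophomore
B <- toy$tv == "Comedy"       # event B: student likes comedy

# Direct calculation: proportion of students in A or B (or both)
p_or_direct <- mean(A | B)

# Addition rule: subtract the overlap so it is not counted twice
p_or_rule <- mean(A) + mean(B) - mean(A & B)

all.equal(p_or_direct, p_or_rule)  # the two approaches agree
```

When the two events are disjoint, as with two different favorite genres, the overlap term is 0 and the rule reduces to adding the two probabilities.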
More definitions (next time)
Population: the entire group you want to learn about. Often, it’s useful to think of the population as the “truth”.
Sample: the subset of the population that you observe and from which you draw inferences about the population.