Intro to probability

STA 199


Bulletin

This AE is not due for a grade.
Team announcement in Slack.

Getting started

Clone your ae10-username repo from the GitHub organization.

Today

By the end of today you will

have a working understanding of the terms probability, sample space, event, population and sample.
compute probabilities of events from data.
create a contingency table using pivot_wider() and kable().
use a contingency table to explore the relationship between two categorical variables.

Definitions

The probability of an event tells us how likely an event is to occur, and it can take values from 0 to 1, inclusive. It can be viewed as
- the proportion of times the event would occur if it could be observed an infinite number of times.
- our degree of belief that an event will happen.
An event is the basic element to which probability is applied, e.g. the result of an observation or experiment.
- Example: \(A\) is the event that a student in STA 199 is a sophomore.
- We use capital letters, e.g. \(A\), to denote events.
- For any event \(A\) and its complement \(A^C\), \(\Pr(A) + \Pr(A^C) = 1\).
A sample space is the set of all possible outcomes. The outcomes in a sample space are disjoint, or mutually exclusive, meaning no two of them can occur simultaneously.
- Example: The sample space for year is {First-year, Sophomore, Junior, Senior}; each item in the brackets is a distinct outcome from the questionnaire.
- The probability of the entire sample space is 1.
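
To make these definitions concrete, here is a minimal sketch in base R. The `year` vector is made-up data for illustration only, not the actual STA 199 responses, and the variable names are hypothetical.

```r
# Hypothetical responses; not the actual STA 199 survey data.
year <- c("First-year", "Sophomore", "Sophomore", "Junior", "Senior",
          "Sophomore", "First-year", "Junior")

# Sample space: the distinct outcomes that appear in the responses.
sample_space <- unique(year)

# Pr(A), where A is the event "the student is a sophomore".
pr_A <- mean(year == "Sophomore")

# Pr(A^C): the student is not a sophomore.
pr_A_comp <- mean(year != "Sophomore")

# Probability of each outcome in the sample space.
pr_outcomes <- prop.table(table(year))

pr_A + pr_A_comp   # complement rule: the two probabilities add to 1
sum(pr_outcomes)   # the entire sample space has probability 1
```

Taking the `mean()` of a logical vector gives the proportion of `TRUE` values, which is why it works as a probability computed from data.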

Introduction

library(tidyverse)
library(knitr)

#sta199 = read_csv()

sta199 = sta199 |>
  rename(year = `What year are you in school?`,
         animal = `Choose your favorite`,
         tv = `Favorite TV`,
         major = `Probable Major?`)

For this Application Exercise, we will look at our newly collected data.

The data include:

year: Year in school
animal: Whether you prefer cats or dogs
tv: Favorite TV genre
major: Probable major (statistical science or not)

Exercise 1

Give two examples of an event from the data set.

Exercise 2

Let’s take a look at favorite TV genre. Note that we have categorized genres so that each person can only have one favorite genre.

What is the sample space for favorite TV genre? You can use code to identify the sample space.

# code here

Exercise 3

Let’s make a table that includes the TV genre, the number of people who prefer each, and the associated probabilities.

# code here

Exercise 4

How large is the sample space of any individual’s response? Can we check this in R?

# code here

Exercise 5

What is the probability a randomly selected STA 199 student favors cats?

# code here

What is the probability a randomly selected STA 199 student is not a senior and prefers dogs?

# code here

What is the probability a randomly selected STA 199 student is a first-year and a statistics major?

# code here

Exercise 6

Now let’s make a table looking at the relationship between year and favorite TV genre.

sta199 %>%
  count(year, tv)

We’ll reformat the data into a contingency table, a table frequently used to study the association between two categorical variables. In this contingency table, each row will represent a year, each column will represent a TV genre, and each cell will be the number of students who have a particular combination of year and TV genre.

To make the contingency table, we will use a new function in tidyr called pivot_wider(). It will take the data frame produced by count(), which is currently in a “long” format, and reshape it to be in a “wide” format.

We will also use the kable() function in the knitr package to neatly format our new table.

sta199 %>%
  count(year, tv) %>%
  pivot_wider(names_from = tv, #how we will name the columns
              values_from = n, #values used for each cell
              values_fill = 0) %>% #how to fill cells with 0 observations
  kable() # neatly display the results
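
To see what the reshaping does, it can help to run the same steps on a tiny data frame where the counts are easy to verify by eye. The sketch below uses a hypothetical `toy` tibble (made-up responses, not the class data) and keeps the long and wide versions as separate objects so they can be compared.

```r
library(tidyverse)
library(knitr)

# Hypothetical responses; not the actual STA 199 survey data.
toy <- tibble(
  year = c("Sophomore", "Sophomore", "Junior", "Junior", "Junior", "Senior"),
  tv   = c("Comedy", "Drama", "Drama", "Drama", "Comedy", "Comedy")
)

# Long format: one row per (year, tv) combination that actually occurs.
toy_long <- toy |>
  count(year, tv)

# Wide format: one row per year, one column per genre.
toy_wide <- toy_long |>
  pivot_wider(names_from = tv,
              values_from = n,
              values_fill = 0)   # combinations with no students become 0

kable(toy_wide)
```

No senior in the toy data chose drama, so that combination is missing from `toy_long`; `values_fill = 0` is what turns the missing cell into a 0 rather than an `NA` in `toy_wide`.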

How many students in STA 199 are juniors and like dramas?

Exercise 7

For each of the following exercises:

Calculate the probability using the contingency table above.
Then write code to check your answer using the sta199 data frame and dplyr functions.
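
As a rough illustration of this two-step workflow, the sketch below works through "and" and "or" probabilities on a hypothetical `toy` tibble (made-up responses, not the class data). The hand calculation in the comments uses the addition rule, Pr(A or B) = Pr(A) + Pr(B) - Pr(A and B), and the dplyr code checks it against the raw rows.

```r
library(tidyverse)

# Hypothetical responses; not the actual STA 199 survey data.
toy <- tibble(
  year = c("Sophomore", "Sophomore", "Junior", "Junior", "Junior", "Senior"),
  tv   = c("Comedy", "Drama", "Drama", "Drama", "Comedy", "Comedy")
)

# By hand from the counts (6 students in total):
#   Pr(junior)           = 3 / 6
#   Pr(drama)            = 3 / 6
#   Pr(junior and drama) = 2 / 6
#   Pr(junior or drama)  = 3/6 + 3/6 - 2/6 = 4/6

# The same quantities computed directly from the rows:
probs <- toy |>
  summarize(
    pr_junior     = mean(year == "Junior"),
    pr_drama      = mean(tv == "Drama"),
    pr_jun_and_dr = mean(year == "Junior" & tv == "Drama"),
    pr_jun_or_dr  = mean(year == "Junior" | tv == "Drama")
  )

probs
```

Subtracting the "and" probability matters because students who are both juniors and drama fans would otherwise be counted twice.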

What is the probability a randomly selected STA 199 student is a sophomore?

# code here

What is the probability that a randomly selected STA 199 student likes anime or action/adventure?

# code here

What is the probability that a randomly selected STA 199 student is a sophomore or likes comedy?

# code here

What is the probability that a randomly selected STA 199 student is a sophomore and likes drama?

# code here

More definitions (next time)

Population: the entire group you want to learn about. Often, it’s useful to think of the population as the “truth”.

Sample: the subset of the population that you observe and from which you draw inferences.