Intro to probability

STA 199

Bulletin

  • This AE is not due for a grade.
  • Team announcement in Slack.

Getting started

Clone your ae10-username repo from the GitHub organization.

Today

By the end of today you will

  • have a working understanding of the terms probability, sample space, event, population, and sample
  • compute probabilities of events from data
  • create a contingency table using pivot_wider() and kable()
  • use a contingency table to explore the relationship between two categorical variables

Definitions

  • The probability of an event tells us how likely an event is to occur, and it can take values from 0 to 1, inclusive. It can be viewed as
    • the proportion of times the event would occur if it could be observed an infinite number of times.
    • our degree of belief an event will happen.
  • An event is the basic element to which probability is applied, e.g. the result of an observation or experiment.
    • Example: \(A\) is the event that a student in STA 199 is a sophomore.
    • We use capital letters, e.g. \(A\), to denote events.
    • For any event \(A\) and its complement \(A^C\), \(\Pr(A) + \Pr(A^C) = 1\).
  • A sample space is the set of all possible outcomes. The outcomes in a sample space are disjoint (mutually exclusive), meaning no two of them can occur simultaneously.
    • Example: The sample space for year is {First-year, Sophomore, Junior, Senior}; each item in the brackets is a distinct outcome from the questionnaire.
    • The probability of the entire sample space is 1.
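
These definitions can be checked numerically on a small made-up example. The sketch below assumes a hypothetical data frame `year_toy` with ten invented responses (it is not the class survey). It computes the probability of each outcome, the total over the sample space, and the complement rule for the event \(A\) = "the student is a sophomore".

```r
library(tidyverse)

# Hypothetical responses, invented for illustration (not the STA 199 survey)
year_toy <- tibble(
  year = c("First-year", "Sophomore", "Sophomore", "Junior", "Senior",
           "First-year", "Sophomore", "Junior", "First-year", "Sophomore")
)

# Probability of each outcome in the sample space
year_probs <- year_toy |>
  count(year) |>
  mutate(prob = n / sum(n))

# The probabilities over the whole sample space sum to 1
total_prob <- sum(year_probs$prob)

# Complement rule with A = "the student is a sophomore"
pr_A <- mean(year_toy$year == "Sophomore")
pr_A_complement <- mean(year_toy$year != "Sophomore")

year_probs
total_prob
pr_A + pr_A_complement
```

Here `mean()` of a logical vector gives the proportion of `TRUE` values, which is one convenient way to turn "how often did the event occur in the data" into a probability.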

Introduction

library(tidyverse)
library(knitr)
#sta199 = read_csv()
sta199 = sta199 |>
  rename(year = `What year are you in school?`,
         animal = `Choose your favorite`,
         tv = `Favorite TV`,
         major = `Probable Major?`)

For this Application Exercise, we will look at our newly collected data.

Data includes

  • year: Year in school
  • animal: Whether you prefer cats or dogs
  • tv: Favorite TV genre
  • major: Probable major (statistical science or not)

Exercise 1

Give two examples of an event from the data set.

Exercise 2

Let’s take a look at favorite TV genre. Note that we have categorized genres so that each person can only have one favorite genre.

  • What is the sample space for favorite TV genre? You can use code to identify the sample space.
# code here

Exercise 3

  • Let’s make a table that includes the TV genre, the number of people who prefer each, and the associated probabilities.
# code here

Exercise 4

How large is the sample space of any individual’s response? Can we check this in R?

# code here

Exercise 5

  • What is the probability a randomly selected STA 199 student favors cats?
# code here
  • What is the probability a randomly selected STA 199 student is not a senior and prefers dogs?
# code here
  • What is the probability a randomly selected STA 199 student is a first year and a statistics major?
# code here

Exercise 6

Now let’s make a table looking at the relationship between year and favorite TV genre.

sta199 %>%
  count(year, tv)

We’ll reformat the data into a contingency table, a table frequently used to study the association between two categorical variables. In this contingency table, each row will represent a year, each column will represent a TV genre, and each cell will hold the number of students who have a particular combination of year and favorite TV genre.

To make the contingency table, we will use a new function called pivot_wider(), which comes from the tidyr package (loaded as part of the tidyverse). It will take the data frame produced by count(), which is currently in a “long” format, and reshape it into a “wide” format.

We will also use the kable() function in the knitr package to neatly format our new table.
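
To see the reshaping in isolation, here is a minimal sketch on a tiny made-up table. The data frame `long_toy` and its values are hypothetical; it is only shaped like the output of `count(year, tv)`. One year/genre combination is deliberately missing so that the effect of `values_fill = 0` is visible.

```r
library(tidyverse)
library(knitr)

# Hypothetical long-format counts, shaped like the output of count(year, tv)
long_toy <- tribble(
  ~year,        ~tv,      ~n,
  "First-year", "Comedy", 3,
  "First-year", "Drama",  1,
  "Sophomore",  "Comedy", 2
  # no row for Sophomore/Drama: that combination was never observed
)

# One row per year, one column per genre; unobserved combinations become 0
wide_toy <- long_toy |>
  pivot_wider(names_from = tv, values_from = n, values_fill = 0)

wide_toy |>
  kable()
```

Without `values_fill = 0`, the missing Sophomore/Drama cell would appear as `NA` rather than as a count of zero.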

sta199 %>%
  count(year, tv) %>%
  pivot_wider(names_from = tv,         # how we will name the columns
              values_from = n,         # values used for each cell
              values_fill = 0) %>%     # fill cells that have no observations with 0
  kable()                              # neatly display the results
  • How many students in STA 199 are juniors and like dramas?

Exercise 7

For each of the following exercises:

  1. Calculate the probability using the contingency table above.

  2. Then write code to check your answer using the sta199 data frame and dplyr functions.

  • What is the probability a randomly selected STA 199 student is a sophomore?
# code here
  • What is the probability that a randomly selected STA 199 student likes anime or action/adventure?
# code here
  • What is the probability that a randomly selected STA 199 student is a sophomore or likes comedy?
# code here
  • What is the probability that a randomly selected STA 199 student is a sophomore and likes drama?
# code here
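
As a pattern for checking "and" and "or" probabilities with dplyr, the sketch below uses a small hypothetical data frame `toy` (five invented responses, not the class data). It also compares the "or" probability with the general addition rule, \(\Pr(A \text{ or } B) = \Pr(A) + \Pr(B) - \Pr(A \text{ and } B)\), which is a standard result rather than something taken from this exercise.

```r
library(tidyverse)

# Hypothetical responses, invented for illustration (not the STA 199 survey)
toy <- tribble(
  ~year,        ~tv,
  "Sophomore",  "Comedy",
  "Sophomore",  "Drama",
  "Junior",     "Comedy",
  "First-year", "Drama",
  "Senior",     "Anime"
)

# A = "is a sophomore", B = "likes comedy"
probs <- toy |>
  summarize(
    pr_A       = mean(year == "Sophomore"),
    pr_B       = mean(tv == "Comedy"),
    pr_A_and_B = mean(year == "Sophomore" & tv == "Comedy"),
    pr_A_or_B  = mean(year == "Sophomore" | tv == "Comedy")
  ) |>
  # General addition rule: subtract the overlap so it is not counted twice
  mutate(addition_rule = pr_A + pr_B - pr_A_and_B)

probs
```

The `pr_A_or_B` and `addition_rule` columns should agree; if the two events were disjoint, `pr_A_and_B` would be 0 and the rule would reduce to a simple sum.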

More definitions (next time)

Population: the entire group you want to learn about. Often, it’s useful to think of the population as the “truth.”

Sample: the subset of the population that you actually observe and from which you draw inferences about the population.