Forensic genetic analysis

STA 199

Bulletin

  • this ae is not due for grade.
  • Project reports due today to Gradescope
  • Team evaluation survey
  • Statistics experience due today
  • course evaluations open. \(>80\%\) response \(\rightarrow\) +1pt final project
  • my final office hours today
  • public final project portfolio

Getting started

Clone your ae27-username repo from the GitHub organization.

Load packages

library(tidyverse)

Background

DNA evidence is sometimes used in court. In this AE, you will learn about the case of Dr. Schmidt from Lafayette, Louisiana, who was accused of infecting his former lover with HIV through a contaminated blood sample of one of his patients. Read more about this court case here and here.

Today

By the end of today you will

  • be able to critical think about the use of DNA evidence and statistics in court

  • analyze non-numerical data rigorously

Explore the data

df_HIV = read_csv("https://sta101-fa22.netlify.app/static/appex/data/HIV.csv")

df_HIV contains several observations of HIV genomes. Within this data set are two samples with special ids: sample1 and sample2.

For the purpose of this exercise, you might imagine sample1 is associated with the HIV sampled from the plaintiff while sample2 belongs to that of the defendant’s patient.

Fundamentally, we are interested in whether or not sample1 and sample2 are closely related.

Exercise 1

  • how many observations are present in the data set?

  • what are the observational units?

  • how many bases does the first DNA sequence contain? Hint: use the R function nchar.

Extracting a sub-sequence of DNA

For computational speed, we will have to work with shorter sub-sequences of DNA.

The function str_sub from the package stringr in the tidyverse can be used to extract a sub-string from a character vector.

Exercise 2

  • how many arguments does the function str_sub take? What does each argument do?

  • use str_sub to extract (i) the first four letters of the words statistics, (ii) the sub-string between the third and the seventh letters, and (iii) the last four letters

term = "statistics"

Let’s use str_sub to extract the first 500 bases from the DNA sequences using the template below.

HIV = df_HIV %>% 
  mutate(dna_short = ___) %>%
  select(-dna)
  • use nchar to verify that each sub-sequence dna_short contains 500 bases. Hint: nchar is vectorized, meaning if given a vector input, it will return a vector output.

Computing the pairwise distances between the DNA sequence

Now that the data have been prepared, we will establish how similar/different each DNA sequence is to the others. To accomplish this, given two DNA sequences we will count the number of bases for which they differ. The rationale for this step is the following. If two DNA differ on many bases, it means that they have evolved separately for a while and had the time to undergo numerous mutations. On the other hand, if they only differ on a few bases, it means that the two sequences have only recently began to evolve separately.

The following code creates a data frame where each row corresponds to a pair of DNA sequences.

d_pairs <- combn(HIV$person_id, 2) %>%
  t() %>% # go from wide matrix (2 rows) to long matrix (2 columns)
  as_tibble() %>%
  rename(id1 = V1, id2 = V2) %>%
  left_join(HIV, by = c("id1" = "person_id")) %>% # add dna for person 1
  rename(dna1 = dna_short) %>%
  left_join(HIV, by = c("id2" = "person_id")) %>% # add dna for person 1
  rename(dna2 = dna_short)

Check the dimensions of d_pairs

Hamming distance

To measure the distance between two sequences, we first consider the Hamming distance

\[ d(i,j) = \sum_{k=1}^n 1\{\text{dna}_{ik} \neq \text{dna}_{jk}\} \]

which counts the number of elements that are different between two sequences. Here \(d(i,j)\) denotes the Hamming distance between sequences \(i\) and \(j\), and \(1\{\text{dna}_{ik} \neq \text{dna}_{jk}\}\) is equal to \(1\) is the \(k\)-th element of sequence \(i\) is different from that of sequence \(j\).

The following code computes the Hamming distance between each pair of sequences. We first construct function compute_hamming which computes the Hamming distance between two DNA sequences. We then apply this function to each row of the d_pairs data frame.

compute_hamming <- function(dna1, dna2) {
  
  dna1_split <- str_split(dna1, pattern = "", simplify = TRUE)
  dna2_split <- str_split(dna2, pattern = "", simplify = TRUE)
  
  hamming_distance <- sum(dna1_split != dna2_split)
  return(hamming_distance)
}

d_hamming <- d_pairs %>%
  mutate(
    distance_ham = list(dna1, dna2) %>% pmap_dbl(compute_hamming)
  )

Exercise 3

  • Make a histogram of the Hamming distances and describe the distribution.
  • Find the 10 pairs of DNA sequences that are the closest to the plaintiff’s sequence in terms of Hamming distance.
  • Which sequence is most closely related to the plaintiff’s sequence?

  • How many differences are there between the DNA sequence of the plaintiff and that of the defendant’s patient?

  • Can you identify a shortcoming of the Hamming distance? Does a large Hamming distance necessarily imply that the two DNA sequences are very different?

An alternative measure of distance for DNA sequences

Let us now consider an alternative measure of distance for DNA sequences.

d_biology <- d_pairs %>%
  mutate(
    distance_bio = list(dna1, dna2) %>% pmap_dbl(adist)
    )

Exercise 4

  • Again, make a histogram of the new distance and find the 10 DNA sequences closest to the plaintiff’s.
  • Which of the two measures do you find more adequate for these data?

Exercise 5

  • Imagine that you are a juror in this court case, would you find this piece of evidence convincing? Would you like to have more information? If so, what additional information would you need? Now, imagine that you are a judge; do you think that this piece of evidence should be admitted to court? Why?