Cryptanalysis

STA 199

Bulletin

  • this ae is due for grade. Push your completed ae to GitHub within 48 hours to receive credit
  • Project presentations and repo due Monday lab (by pushing to GitHub)
  • Extended deadline: project reports now due Wednesday
  • Don’t forget about the statistics experience
  • course evaluations open. \(>80\%\) response \(\rightarrow\) +1pt final project

Getting started

Clone your ae26-username repo from the GitHub organization.

Today

By the end of today, you will…

  • break a substitution cipher
  • understand the intuition behind an algorithm to do so, Markov chain Monte Carlo (MCMC)

Load packages

library(tidyverse)
library(tidymodels)
library(reshape2)

Our data includes

  1. Complete text of “War and Peace” by Leo Tolstoy
  2. Freq. analysis data frame of “War and Peace”
  3. A secret message

Background

This application exercise builds on the R code found here and The Markov Chain Monte Carlo Revolution by Persi Diaconis.

Let’s load the data

warandpeace = readLines("https://sta101.github.io/static/appex/data/warandpeace.txt")
frequency = read.table("https://sta101.github.io/static/appex/data/frequencies.txt")
colnames(frequency) = c(toupper(letters), "") # edit column names
secret_message = readLines("https://sta101.github.io/static/appex/data/secret-message.txt")

Exercise 1

Take a look at secret-message. Each letter is a stand in for exactly one other letter. This sort of cipher is known as a “substitution cipher”.

‘A’ could be encoded as one of 26 characters (A, B, C, …). Once the encoding for ‘A’ is chosen, ‘B’ has 25 possibilities and so on so there are, in total \(26 \times 25 \times 24 \times \ldots \times 3 \times 2 \times 1\) possibilities.

n = 26
input = n + 1
keys = gamma(input)
keys
[1] 4.032915e+26

That’s over \(4 \times 10^{26}\) possible keys! If you could check 10M keys per second, it would take approximately \(1 \times 10^{12}\) (trillion) years to check every possible key. Trying every possible key is known as a “brute force” approach.

  • Chat with your neighbor and develop a strategy better than the brute force approach. Detail your strategy below.

Exercise 2

Here we determine how often one character follows another using the text from the very long book, War and Peace. We include a whitespace character as a 27th character in our alphabet.

To reduce computational demand, we will load the object created by this analysis but leave the code below for reference.

The result of the below analysis is in the object frequency.

The letter row denotes the first character in a 2 character sequence while columns determine the second character in the character sequence.

  • What should be the sum of a row? Check this for the first row of the data frame.
# code here

Create a heatmap of character frequency, where the y-axis is the first letter and the x-axis is the second letter in a two letter chain.

  • Uncomment and complete the code below.

Hint: We’ll use melt from the reshape2 package. Click here for an example.

melt_freq = melt(as.matrix(frequency))

# melt_freq %>%
#   ggplot(aes(x = __, y = __, fill = __))

MCMC

We will use a famous statistical algorithm, Markov chain Monte Carlo (MCMC) to break the substitution cipher.

MCMC is composed of three essential components:

  1. Ability. A way to propose any possible key.
  2. Feedback. A way to evaluate how good a given key is.
  3. Curiosity. A way to leave a good key for a worse key.

The analogy of the blind monkey.

There’s a monkey on an island with many ponds. The monkey has been tasked with finding the largest body of water on the island. The only trouble is, he is blind. In order to find the largest body of water, he throws rocks randomly and listens for a splash. If he hears a splash, he knows there’s a body of water where he last threw. He continues to throw more rocks in that direction to find out how big the pool is. Occasionally he gets bored and wanders off to look for another pond. In this way, he uses the MCMC algorithm to find the largest body of water.

Making the connection.

  1. Ability (throw a rock). The Monkey can walk around and throw the rock anywhere on the island, ensuring that given enough time, he will cover every inch of the island.

  2. Feedback (test for waters). Every time the Monkey throws the rock, he receives feedback by listening for a splash. A large splash means a deep pond and encourages him to continue throwing in that direction to figure out the perimeter of the pond.

  3. Curiosity (boredom). If the monkey finds the second largest pond on the island, he might get stuck throwing rocks in it for a long time. By occasionally walking away from a large pond, he will reach the largest pond quicker.

Exercise 3

Let’s write out together what MCMC looks like when decrpyting a secret message.

  1. Ability

  2. Feedback

  3. Curiosity

Exercise 4

Here we load some functions that will help us decode the message.

Run the code below to break the secret message. If the message is unintelligible after several iterations, you may try re-starting with a new seed. What is this equivalent to in the monkey analogy above?

Exercise 5

Try your own message!

Create your own message in the code below and call mcmcAttack(coded) on your message to decode it!

  • You might think about what makes the message easy or difficult to attack, e.g. does length of the message affect its susceptibility to attack? What else might?