[Math] Calculating Probabilities for Substitution Ciphers using Frequency Analysis

bayesiancryptographyprobability

I have been trying to put together a tool that can take in cipher text encrypted via a simple substitution cipher and calculate the most likely "key" (that is, how the plain text letters were mapped to the cipher text letters).

I think I've had some success with an approach I've been using, but I suspect it could be a lot better.

What I've been doing to this point:

For a given mystery letter (that is, that which appeared in the cipher text), I calculated that it actually decrypts to 'a', given the frequency that this mystery letter appears in the cipher text. I did that by using Bayes formula to render it as the likelihood that 'a' appears in the cipher text the number of times that the mystery letter does (knowing usual frequency of 'a' in English). In other words, if we denote $X$ as the mystery letter and $F_X$ denotes the frequency that $X$ appears, then:

$$P(X = a \mid F_{X} ) = \frac{ P(F_X \mid X = a) P(X = a) }{ P(F_X) }$$

$P(F_X \mid X = a)$ can be easily calculated by using the binomial distribution mass function knowing the general frequency that 'a' appears in the English language. Without any other information, I assume that $P(X = a)$ equals $1 \over 26$. I neglected to divide by the bottom for reasons which I'll explain in a moment.

Then I did the same for every other letter; that is $P(X = b \mid F_X), \ldots, P(X = z \mid F_X)$. I summed up all of what I had calculated for this mystery letter, and then divided each one by the sum, to complete the calculation for each possibility.

Then I did the same for every other mystery letter. So, what I ended up with was a 26-by-26 matrix of probabilities that a mystery letter corresponded to a plaintext letter. It then becomes an assignment problem of selecting 26 elements, exactly one from each column and one from each row and trying to maximize probability for a particular key.

What I don't like about this (beyond that I'm generally inexperienced and uneasy on this topic), is that I feel that I could definitely do better. Each probability calculated for a mystery letter only takes into account the frequency of that particular mystery letter, and not the frequency of other mystery letters. I've thought about trying to do this as a Bayesian Network, but I feel unsure if that is valid, and I'm having a hard time getting my mind around it.

Thanks for any help and insight!

Best Answer

You might be interested in the paper "The Markov Chain Monte Carlo Revolution" by Diaconis, which you can find online. In this paper, Diaconis shows how Stanford students were able to break a simple substitution cipher by using the Metropolis algorithm, using digraph frequencies. Diaconis says an attempt based on the frequencies of single letters failed.

Related Question