Help solving Bigram Model with the following probabilities

probabilitystatistics

I came across the following problem involving bigram models which I am struggling to solve. Following this tutorial I have a basic understanding of how bigram possibilities are calculated.

Problem:

Let's consider sequences of length 6 made out of characters ['o', 'p', 'e', 'n', 'a', 'i']. There are 6^6 such sequences.

We consider bigram model with the following probabilities:


For the first character in the sequence:

$p( 'o' ) = 0.05;$ $p( 'p' ) = 0.00;$ $p( 'e' ) = 0.03;$ $p( 'n' ) = 0.76;$ $p( 'a' ) = 0.07;$ $p( 'i' ) = 0.09;$

in short: $[0.05, 0, 0.03, 0.76, 0.07, 0.09]$

For the transitions:

$p( 'o' | 'o' ) = 0.73;$ $p( 'p' | 'o' ) = 0.02;$ $p( 'e' | 'o' ) = 0.04;$ $p( 'n' | 'o' ) = 0.07;$ $p( 'a' | 'o' ) = 0.06;$ $p( 'i' | 'o' ) = 0.08;$

in short: $[0.73, 0.02, 0.04, 0.07, 0.06, 0.08]$

$p( 'o' | 'p' ) = 0.01;$ $p( 'p' | 'p' ) = 0.06;$ $p( 'e' | 'p' ) = 0.07;$ $p( 'n' | 'p' ) = 0.07;$ $p( 'a' | 'p' ) = 0.08;$ $p( 'i' | 'p' ) = 0.71;$

in short: $[0.01, 0.06, 0.07, 0.07, 0.08, 0.71]$

$( 'o' | 'e' ) = 0.09;$ $p( 'p' | 'e' ) = 0.08;$ $p( 'e' | 'e' ) = 0.09;$ $p( 'n' | 'e' ) = 0.71;$ $p( 'a' | 'e' ) = 0.03;$ $p( 'i' | 'e' ) = 0.00;$

in short: $[0.09, 0.08, 0.09, 0.71, 0.03, 0]$

$p( 'o' | 'n' ) = 0.05;$ $p( 'p' | 'n' ) = 0.00;$ $p( 'e' | 'n' ) = 0.02;$ $p( 'n' | 'n' ) = 0.84;$ $p( 'a' | 'n' ) = 0.08;$ $p( 'i' | 'n' ) = 0.01;$

in short: $[0.05, 0, 0.02, 0.84, 0.08, 0.01]$

$p( 'o' | 'a' ) = 0.03;$ $p( 'p' | 'a' ) = 0.80;$ $p( 'e' | 'a' ) = 0.07;$ $p( 'n' | 'a' ) = 0.00;$ $p( 'a' | 'a' ) = 0.01;$ $p( 'i' | 'a' ) = 0.09;$

in short: $[0.03, 0.8, 0.07, 0, 0.01, 0.09]$

$p( 'o' | 'i' ) = 0.00;$ $p( 'p' | 'i' ) = 0.04;$ $p( 'e' | 'i' ) = 0.07;$ $p( 'n' | 'i' ) = 0.03;$ $p( 'a' | 'i' ) = 0.79;$ $p( 'i' | 'i' ) = 0.07;$

in short: $[0, 0.04, 0.07, 0.03, 0.79, 0.07]$


Find the most probable sequence in this model and write the answer here:

I am a complete beginner in this field so please bear with me.

So far:

  1. The first character is $'n'$ with the highest probability of $0.76$.
  2. Next I need to find the probability of which letter follows $'n'$. This is the 4th transition.
  3. $p( 'n' | 'n' ) = 0.84$ seems to have the highest probability, so $'n'$ is followed by 'n' and so on.
  4. $'n', 'n', 'n', 'n', 'n', 'n'$

I think that I am failing to understand some core concept or misunderstanding the question.

Best Answer

You're following a greedy approach, which might not necessarily work. What you've to do is to compute the probabilities of all possible sequences and pick the one with the highest probability. And, there is a clever way to do it, known as the Viterbi decoding. I refer you to the video here.

So, the first character probabilities will be the initial state probabilities and the probabilities of the form p(x|y) will be the transition probabilities from y to x.

The trick of Viterbi algorithm is that, at each time step k, for every symbol in [o,p,e,n,a,i] it is sufficient to remember the most likely path to that symbol from timestep k-1 instead of every path into that symbol.

PS: I haven't answered it in detail since I was only going to post a comment but I was unable to because of my low 'credits'

Related Question