Solved – Log-likelihood distance measure validity for clustering

Tags: distance-functions, k-medoids, likelihood, markov-chain, r

I have calculated log-likelihood distances between 50 sequences using formula (1):

$$
D(X_i,X_j) = \frac{1}{2}\left(\log p(X_i|Mod_j) + \log p(X_j|Mod_i)\right), \tag{1}
$$
where $p(X_i|Mod_j)$ is the likelihood of sequence $X_i$ being produced by model $Mod_j$, and $Mod_j$ is the Markov model fitted to sequence $X_j$, defined by its transition probability matrix and start probability vector. The measure is symmetric by definition. To make the measure more "legible" and closer to traditional distance measures, I compute distance $= 1 - D$ from formula (1). Thus, $D(X_i,X_i) = 0$ and the distance increases as the likelihood decreases.
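For concreteness, formula (1) could be computed roughly as in the sketch below. It is written in Python rather than R and is only an illustration, not the asker's actual code. It assumes first-order Markov models estimated from each sequence, and it adds Laplace smoothing (the `alpha` parameter, an assumption not mentioned in the question) so that unseen transitions do not produce log(0). The names `fit_markov`, `log_likelihood`, and `D`, and the toy sequences, are hypothetical.

```python
import math
from collections import Counter


def fit_markov(seq, states, alpha=1.0):
    """Estimate a start vector and transition matrix from one sequence.

    alpha is an additive (Laplace) smoothing constant. It is an assumption
    of this sketch, used so that no probability is exactly zero.
    """
    k = len(states)
    # Start probabilities: smoothed indicator of the first observed symbol.
    start = {
        s: ((1.0 if s == seq[0] else 0.0) + alpha) / (1.0 + alpha * k)
        for s in states
    }
    # Transition counts and the number of transitions leaving each state.
    counts = Counter(zip(seq, seq[1:]))
    row_totals = Counter(seq[:-1])
    trans = {
        a: {
            b: (counts[(a, b)] + alpha) / (row_totals[a] + alpha * k)
            for b in states
        }
        for a in states
    }
    return start, trans


def log_likelihood(seq, model):
    """log p(seq | model) for a first-order Markov chain."""
    start, trans = model
    ll = math.log(start[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        ll += math.log(trans[prev][cur])
    return ll


def D(x_i, x_j, states):
    """Symmetric measure from formula (1)."""
    mod_i = fit_markov(x_i, states)
    mod_j = fit_markov(x_j, states)
    return 0.5 * (log_likelihood(x_i, mod_j) + log_likelihood(x_j, mod_i))


# Toy sequences over a two-symbol alphabet (hypothetical data).
states = "ab"
x1 = "aababbab"
x2 = "abababab"
x3 = "bbbbbbba"

d12 = D(x1, x2, states)
d21 = D(x2, x1, states)
d13 = D(x1, x3, states)
print(d12, d21, d13)
```

In this sketch every smoothed probability is below 1, so each log term is negative and $D$ itself is negative for any pair of sequences.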

Now I have a 50×50 distance matrix. I ran a "meaningfulness" check and it seemed OK to me: more similar sequences had smaller distances, and very different ones had very large distances. The distances seemed to satisfy the triangle inequality. However, I have noticed that:

1) The shorter sequences seem to be "closer" to all other sequences than the longer ones are. It seems that this distance measure is biased in favor of short sequences.

2) I tried PAM clustering on the distance matrix, converting it to a dist object in R with as.dist(), and the results were very bad whether I asked for 2 clusters or 49: the maximum average silhouette width reported by the R function pam was 0.28. For some numbers of clusters the average silhouette width was even negative.
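Observation 1) is consistent with how a Markov-chain log-likelihood is built: it is a sum of one non-positive term per transition, so under a fixed model it can only decrease as a sequence gets longer. The sketch below, in Python rather than R, illustrates this with a hypothetical two-state model and a hypothetical sequence. Neither comes from the question.

```python
import math

# Hypothetical two-state Markov model, chosen only for illustration.
start = {"a": 0.5, "b": 0.5}
trans = {
    "a": {"a": 0.7, "b": 0.3},
    "b": {"a": 0.4, "b": 0.6},
}


def log_likelihood(seq):
    """log p(seq | model): start term plus one term per transition."""
    ll = math.log(start[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        ll += math.log(trans[prev][cur])
    return ll


# Score growing prefixes of the same sequence under the same model.
full = "aabbabaabbbaabab"
prefix_lengths = [2, 4, 8, 16]
prefix_ll = [log_likelihood(full[:n]) for n in prefix_lengths]

for n, ll in zip(prefix_lengths, prefix_ll):
    print(n, round(ll, 3))
```

Each extra symbol adds another negative term, so the log-likelihood of a longer prefix is always lower. If the sequences in the question differ a lot in length, a distance of the form $1 - D$ would tend to be smaller for short sequences regardless of their content.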

I am coming to the conclusion that my way of computing medoids is invalid or conceptually wrong. What could be the problem? Can a log-likelihood distance matrix be used with medoid clustering at all?

Edit: I am including a heatmap of the distance matrix, where the x- and y-axes represent the sequences (1st through 50th). It looks strange to me, but I cannot pinpoint what exactly doesn't feel right.

[Figure: heatmap of the 50×50 distance matrix]

Best Answer

What definition of log-likelihood distance is that? I've seen the log-likelihood ratio $$r(a,b) = \log \frac{P(a|Mod)}{P(b|Mod)} = \log(P(a|Mod)) - \log(P(b|Mod)) ,$$ which subtracts the two log-likelihoods, but your formula (1) adds them.