Solved – How to compare the following mutual information values

clustering, mutual information

How can I compare the following mutual information values? I'm mainly wondering what the most appropriate way is to present them in a table in my report.

I'm computing them with this formula: http://d.pr/chkK

where e and c are clusters and the intersection term is the number of elements they have in common.

For each pair of e and c I get a value I (the mutual information). I then average over all e belonging to the same category (not shown in the formula) and end up with a table like:

cat1 0.0123
cat2 0.0012
cat3 0.0009
cat4 0.0100
...

The mutual information values are usually very low (around 0.01), because n (the total number of documents in the collection) is very large.

Should I use another measure? What would you suggest?

thanks

Best Answer

Are you after the mutual information between two clusterings? Marina Meila introduced the 'variation of information' metric, which is based on mutual information (see e.g. http://www.stat.washington.edu/mmp/Papers/icml05-compare-axioms.pdf). It would be quite appropriate here. She also discusses alternative metric distances between clusterings. One of these, the split/join distance, is somewhat easier to interpret: it is the number of nodes that need to be rearranged to turn one clustering into the other.
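
As a rough illustration of the variation of information, here is a minimal sketch that assumes each clustering is given as a list of cluster labels, one per document, in the same document order. It uses the standard identity VI = H(A) + H(B) - 2 I(A, B) with natural logarithms. The function name and the input layout are my own assumptions for the example, not something taken from the question or the paper.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """VI between two clusterings given as per-item label lists (assumed layout)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)

    # Cluster sizes in each clustering and sizes of all pairwise intersections.
    sizes_a = Counter(labels_a)
    sizes_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # Entropy of each clustering.
    h_a = -sum((c / n) * log(c / n) for c in sizes_a.values())
    h_b = -sum((c / n) * log(c / n) for c in sizes_b.values())

    # Mutual information summed over all non-empty intersections.
    mi = sum(
        (c / n) * log(c * n / (sizes_a[a] * sizes_b[b]))
        for (a, b), c in joint.items()
    )
    return h_a + h_b - 2.0 * mi
```

Identical clusterings (even with different label names) give a value of 0, and the value grows as the clusterings disagree more, so a single number per comparison can go straight into a table.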

Alternatively, if you are not after a clustering-to-clustering comparison and are more interested in individual pairs of sets, you could use the hypergeometric P-value to assess the significance of the intersection sizes between sets.
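
A minimal sketch of such a test, assuming the usual one-sided form: the probability of seeing an overlap at least as large as the observed one when a set of size |e| is drawn at random, without replacement, from n documents of which |c| belong to the other set. The function and parameter names are hypothetical, and only the standard library is used.

```python
from math import comb


def hypergeom_pvalue(n_total, size_e, size_c, overlap):
    """P(X >= overlap) for a random set of size_e drawn from n_total items,
    of which size_c are 'marked' (one-sided tail, assumed test form)."""
    largest = min(size_e, size_c)  # the overlap cannot exceed either set size
    denominator = comb(n_total, size_e)
    tail = sum(
        comb(size_c, k) * comb(n_total - size_c, size_e - k)
        for k in range(overlap, largest + 1)
    )
    return tail / denominator
```

For example, `hypergeom_pvalue(10, 5, 5, 5)` is 1/252, since only one of the 252 possible draws reproduces the full overlap. If many (e, c) pairs are tested, the P-values would normally need a multiple-testing correction before being reported.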
