Jensen-Shannon Divergence – Should You Use It or Its Square in Clustering?

clustering, distance-functions, entropy, machine-learning

I am clustering probability distributions using the Affinity Propagation algorithm, and I plan to use Jensen-Shannon Divergence as my distance metric.

Is it correct to use JSD itself as the distance, or JSD squared? Why? What differences would result from choosing one or the other?

Best Answer

I think it depends on how it is to be used.

For reference, for other readers: if $P$ and $Q$ are probability measures, then the Jensen-Shannon Divergence is $$ J(P,Q) = \frac{1}{2} \big( D(P \mid\mid R) + D(Q\mid\mid R) \big) $$ where $R = \frac{1}{2} (P + Q)$ is the mid-point measure and $D(\cdot\mid\mid\cdot)$ is the Kullback-Leibler divergence.
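
For concreteness, here is a minimal NumPy sketch of that definition for discrete distributions (the function names are purely illustrative, not taken from any library):

```python
import numpy as np

def kl_divergence(p, q):
    # D(P || Q) for discrete distributions; terms with p_i = 0 contribute 0.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jensen_shannon_divergence(p, q):
    # J(P, Q) = (1/2) * ( D(P || R) + D(Q || R) ),  with R = (P + Q) / 2.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    r = 0.5 * (p + q)
    return 0.5 * (kl_divergence(p, r) + kl_divergence(q, r))

# Example: jensen_shannon_divergence([0.5, 0.5], [0.9, 0.1]) is roughly 0.102
# (using natural logarithms).
```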

Now, I would be tempted to use the square root of the Jensen-Shannon Divergence, since that is a metric, i.e. it satisfies all the "intuitive" properties of a distance measure, most notably the triangle inequality, which the JSD itself does not satisfy in general.

For more details on this, see

D. M. Endres and J. E. Schindelin, "A new metric for probability distributions," IEEE Trans. on Info. Thy., vol. 49, no. 7, pp. 1858–1860, Jul. 2003.
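
As a quick numerical illustration of the metric property (a spot check, not a proof), the triangle inequality can be verified on randomly drawn distributions. This sketch assumes SciPy's `scipy.spatial.distance.jensenshannon`, which returns the square root of the JSD:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon  # returns sqrt(JSD), natural-log base

rng = np.random.default_rng(1)
for _ in range(1000):
    p, q, r = rng.dirichlet(np.ones(4), size=3)
    # d(P, R) <= d(P, Q) + d(Q, R) should hold for the metric sqrt(JSD);
    # the tiny tolerance only absorbs floating-point noise.
    assert jensenshannon(p, r) <= jensenshannon(p, q) + jensenshannon(q, r) + 1e-12
```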

Of course, in some sense, it depends on what you need it for. If all you are using it for is to evaluate some pairwise measure, then any monotonic transformation of JSD would work. If you're looking for something that's closest to a "squared-distance", then the JSD itself is the analogous quantity.
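
To make that concrete for Affinity Propagation, here is a sketch assuming scikit-learn's implementation (whose conventional default similarity is the negative squared Euclidean distance) and some toy Dirichlet-sampled distributions; the negative JSD then plays the analogous role of a negative squared distance in a precomputed similarity matrix:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon  # sqrt of the JSD
from sklearn.cluster import AffinityPropagation

# Toy data: each row is a discrete probability distribution over 5 outcomes.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(5), size=30)

# Precomputed similarities: negative JSD, i.e. minus the squared JS metric.
n = X.shape[0]
S = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        S[i, j] = -jensenshannon(X[i], X[j]) ** 2

ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
print(ap.labels_)
```

Using `-jensenshannon(X[i], X[j])` without squaring would instead plug in the metric itself; since the algorithm only consumes pairwise similarities, that is exactly the kind of monotonic choice described above.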

Incidentally, you might also be interested in this previous question and the associated answers and discussions.