How to measure similarity of two lists of continuous data with different lengths

density function, histogram, similarities

I have two lists of continuous data with different lengths.
a) How should I measure the (dis)similarity of these two lists? Or, since each list can be turned into a histogram, how can I quantify the (dis)similarity of the two histograms? I want to take into account both the shape and the location of the histograms, but with the frequencies normalised.
b) Is there a Python implementation for this purpose?
To put this into context, the lists contain likelihood values from an HMM score function.

Best Answer

Regarding your question a), I would suggest you look into the two-sample Kolmogorov-Smirnov (KS) test. The KS test compares two distributions through their empirical cumulative distribution functions (CDFs). Because each empirical CDF is normalised to run from 0 to 1, the test can be used when the two samples have different numbers of values, and it is sensitive to both the shape and the location of the distributions. The null hypothesis is that the two samples are drawn from the same distribution. If the $p$-value is less than your predefined significance level, you reject the null and conclude that the samples come from different distributions. A measure of dissimilarity, then, is the $D$ statistic, which is the maximum absolute difference between the two empirical CDFs.
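To make the $D$ statistic concrete, here is a minimal sketch that computes $D = \sup_x |F_a(x) - F_b(x)|$ directly from two lists, where $F_a$ and $F_b$ are the empirical CDFs. The function name `ks_statistic` and the example values are made up for illustration, and only the standard library is used.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs.

    Hypothetical helper for illustration; the lists may differ in length.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must contain at least one value")

    d = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so it is enough to evaluate them at every observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Made-up example values, not real HMM likelihoods.
scores_a = [0.1, 0.4, 0.7]
scores_b = [0.2, 0.5, 0.6, 0.9, 1.3]
print(ks_statistic(scores_a, scores_b))
```

$D$ is 0 when the two empirical CDFs coincide and 1 when the samples do not overlap at all, so it can be read as a normalised dissimilarity regardless of the list lengths.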

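This is a nonparametric test, so it makes no assumption about the shape of the underlying distributions. The main assumptions you need to be confident of are that the data are continuous, that the observations are independent, and that each sample is representative of its population of interest.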

Regarding question b), I have little experience with Python, so I will simply drop a link.
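For question b), one widely used Python implementation is `scipy.stats.ks_2samp`, which returns both the $D$ statistic and the $p$-value. The sketch below assumes NumPy and SciPy are installed; the two arrays are synthetic stand-ins for the HMM likelihood lists, and the 0.05 threshold is only an example of a predefined limit.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins for the two lists of likelihood values (assumption:
# the real data would come from the HMM score function instead).
rng = np.random.default_rng(0)
list_a = rng.normal(loc=0.0, scale=1.0, size=200)
list_b = rng.normal(loc=1.0, scale=1.0, size=350)  # different length, shifted

result = ks_2samp(list_a, list_b)
d_statistic = result.statistic  # maximum gap between the empirical CDFs
p_value = result.pvalue

alpha = 0.05  # example significance level, chosen before looking at the data
print(f"D = {d_statistic:.3f}, p = {p_value:.3g}")
if p_value < alpha:
    print("Reject the null: the samples appear to come from different distributions")
else:
    print("No evidence that the samples come from different distributions")
```

If you only need a dissimilarity score rather than a hypothesis test, `d_statistic` on its own can be used and the $p$-value ignored.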
