Distance Measurement – How to Calculate Distance Between Empirically Generated Distributions in R

I'm not a statistician, but I sometimes need to play around with data. I have two data sets, lists of values in the unit interval. I've plotted them as histograms, so I have an intuitive idea of how "far apart" they are. But I want to do something a little more formal.

My first thought was to just sum the differences between the bin counts, but that isn't very satisfactory. Then I thought of taking a three-bin average and summing the differences over those. (Apologies if I'm mangling statistics terminology.)

But I was thinking I'm probably reinventing the wheel, so I came here. Similar questions seem to point to "Kolmogorov Smirnov tests" or something like that.

So my question is: is this the right method to calculate how far these data sets are apart? And is there an easy way to do this in R? Ideally just KStest(data1,data2) or something?

Edit: To emphasize, I'm particularly interested in ways to measure how far apart the data are directly, rather than fitting a distribution to each and then measuring the distance between the distributions. [Does that even make sense? I guess numerical calculations in R will be done by sampling from a distribution anyway.]

Best Answer

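You can run a two-sample Kolmogorov-Smirnov test with the ks.test function, e.g. ks.test(data1, data2). Its statistic D is the largest vertical distance between the two empirical cumulative distribution functions, so it works directly on the data without fitting a distribution to either set. See ?ks.test.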
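
A minimal sketch of such a call, using made-up data: data1 and data2 below are simulated with rbeta purely for illustration and stand in for the two data sets from the question. The last lines recompute the distance by hand from the empirical CDFs (via ecdf) as a sanity check.

```r
# Hypothetical data: two samples of values in the unit interval.
set.seed(42)
data1 <- rbeta(200, shape1 = 2, shape2 = 5)
data2 <- rbeta(200, shape1 = 5, shape2 = 2)

# Two-sample Kolmogorov-Smirnov test (stats package, loaded by default).
result <- ks.test(data1, data2)

# D lies between 0 (identical empirical CDFs) and 1 (no overlap at all).
d_stat <- unname(result$statistic)
p_val  <- result$p.value

# The same distance computed directly: evaluate both empirical CDFs at
# every observed value and take the largest absolute gap.
grid     <- sort(c(data1, data2))
d_manual <- max(abs(ecdf(data1)(grid) - ecdf(data2)(grid)))

print(result)
```

If you only want a number for "how far apart", d_stat is that number. The p-value additionally tells you whether a gap that large is plausible if both samples came from the same distribution.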

In general, when you are looking for a function in R and you don't know its name, try using ??. For instance, ??"Kolmogorov Smirnov". If nothing comes up, RSiteSearch("whatever you're looking for") should help :)

