Two-sample Kolmogorov-Smirnov test for stochastic dominance

Tags: hypothesis testing, kolmogorov-smirnov test, stata

I'm trying to use the KS test to determine whether one group of data stochastically dominates another. I'm studying a dataset on the performance of companies, which are divided into two groups. Instead of comparing the mean values of these two groups, I follow [1] and want to compare the distributions using the KS test (Table 3). They run two tests: one-sided (A less than B) and two-sided (equality). For that I use Stata's ksmirnov command; the problem is how to interpret the output. It returns D and p, but what one can conclude from these values is not clear to me. For instance, for my groups it returns:

. ksmirnov performance, by(myGroup)
Two-sample Kolmogorov-Smirnov test for equality of distribution functions

 Smaller group       D       P-value  Corrected
 ----------------------------------------------
 0:                  0.0047    0.972
 1:                 -0.1635    0.000
 Combined K-S:       0.1635    0.000      0.000

The 0 line tests the hypothesis that group0 has smaller values than group1. The 1 line tests the hypothesis that group0 has larger values than group1. But I do not understand how to interpret D and p. What is the unit of D, and is it big enough to accept the hypothesis (for instance, at the 0.05 significance level)?

[1] http://www.etsg.org/ETSG2012/Programme/Papers/329.pdf

Best Answer

The Ds are the test statistics; they are derived from the differences between the empirical cumulative distribution functions of the two groups. They are therefore differences of probabilities: unitless numbers between -1 and 1. The p-values have their usual interpretation: if $pval \leq \alpha$, reject the null hypothesis, where $\alpha$ is a predetermined significance level.
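To make the definitions concrete, here is a sketch in notation that is assumed for illustration (it is not taken from the question): $\hat F_1$ and $\hat F_2$ are the empirical CDFs of the first and second group, with sample sizes $n$ and $m$. The p-value expressions are the standard large-sample approximations; the exact formulas Stata applies are in the Methods and formulas section of [R] ksmirnov.

$$
D^{+} = \max_x \left\{ \hat F_1(x) - \hat F_2(x) \right\}, \qquad
D^{-} = \min_x \left\{ \hat F_1(x) - \hat F_2(x) \right\}, \qquad
D = \max\left( |D^{+}|, |D^{-}| \right)
$$

$$
z = |D^{\pm}| \sqrt{\frac{nm}{n+m}}, \qquad
p_{\text{one-sided}} \approx \exp\left(-2 z^{2}\right), \qquad
p_{\text{two-sided}} \approx 2 \sum_{k=1}^{\infty} (-1)^{k-1} \exp\left(-2 k^{2} z^{2}\right) \ \text{with } z = D \sqrt{\frac{nm}{n+m}}
$$

A large positive $D^{+}$ means the first group's CDF lies above the second's somewhere, that is, the first group tends to have smaller values; a large negative $D^{-}$ means the reverse.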

Stata also gives an additional p-value for the non-directional hypothesis (Combined K-S), corrected for small samples.

Examples and details of what Stata does are in [R] ksmirnov, including the math in the Methods and formulas section.

An example of a "manual" computation of the Ds is:

clear
set more off

*------ example data -----

use http://www.stata-press.com/data/r12/ksxmpl

*----- manual computation -----

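* empirical CDF of x within each group
bysort group: cumul x, gen(cumd)

sort x

* group 1 ECDF on every row: carry the last group 1 value forward
gen cumd1 = cumd
replace cumd1 = cumd1[_n-1] if group != 1

* group 2 ECDF on every row: carry the last group 2 value forward
gen cumd2 = cumd
replace cumd2 = cumd2[_n-1] if group != 2
* patch the first row, which has no predecessor to carry forward from
replace cumd2 = cumd2[2] in 1

* pointwise difference between the two ECDFs
gen diff = cumd1 - cumd2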

summarize diff, meanonly
display  "" _n ///
         "Results are:" _n ///
         "This is D+ : `r(max)'" _n ///
         "This is D- : `r(min)'"

line cumd1 cumd2 x, sort // graph the cdfs

*----- direct computation -----

ksmirnov x, by(group)

The "manual" approach is from [1], which Stata cites in its manual.

[1] Riffenburgh, R. H. 2005. Statistics in Medicine. 2nd ed. New York: Elsevier.
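For readers without Stata, the same idea can be sketched in plain Python. This is only an illustration of the ECDF-difference computation, not Stata's implementation; the function names `ecdf` and `ks_directional` and the two small samples are made up for the example.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return a function x -> share of `sample` that is <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n

def ks_directional(a, b):
    """Return (D+, D-, D) for samples a and b, where
    D+ = max(F_a - F_b), D- = min(F_a - F_b), D = max(|D+|, |D-|)."""
    f_a, f_b = ecdf(a), ecdf(b)
    # Both ECDFs only change at observed values, so checking those is enough.
    diffs = [f_a(x) - f_b(x) for x in sorted(set(a) | set(b))]
    d_plus, d_minus = max(diffs), min(diffs)
    return d_plus, d_minus, max(d_plus, -d_minus)

a = [1, 2, 3, 4]   # hypothetical group with smaller values
b = [3, 4, 5, 6]   # hypothetical group with larger values
d_plus, d_minus, d = ks_directional(a, b)
print(d_plus, d_minus, d)
```

Here the first group's ECDF is never below the second's, so D- is zero and the whole distance comes from D+, which is the pattern one would expect when the first group has the smaller values.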

Related Solutions

Two-sample Kolmogorov-Smirnov test and p-value interpretation

If you are using the traditional 0.05 alpha cutoff, then all but group 3 are significantly different from your full group. This is a little easier to see if the p-values are not in scientific notation (you can use options(scipen=5) in R to make this less likely). Also, group 1 becomes non-significant under some adjustments for multiple tests; you should consider whether such an adjustment applies in your case. Also note that the groups that are not significant could still be different; the test may simply have low power.

But significance just means that the differences, however small, are not easily explained by chance. It could be that your groups are close enough for practical purposes. It is usually more meaningful to plot the data to see how different the distributions are. You could use the qqplot function as one approach. The vis.test function in the TeachingDemos package for R gives another approach.

One possible hitch: if your groups are also part of the "Full" data set, then you don't have the independence the test assumes (though given the sample sizes, I am not sure how much this would affect things). You could address this by taking random samples from the full data set and computing the KS distance for each (ignore the p-value), then comparing where your actual data falls relative to the random samples.
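The resampling idea can be sketched as follows. This is a minimal illustration in plain Python rather than R, under stated assumptions: the names `ks_distance` and `resampling_rank`, the number of draws, and the simulated data are all invented for the example, and subsamples are drawn without replacement at the same size as the group.

```python
import random
from bisect import bisect_right

def ks_distance(a, b):
    """Two-sided KS distance: largest absolute gap between the two ECDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in set(sa) | set(sb)
    )

def resampling_rank(group, full, n_draws=500, seed=0):
    """Share of random subsamples of `full` (same size as `group`) whose
    KS distance to `full` is at least as large as the group's own."""
    rng = random.Random(seed)
    observed = ks_distance(group, full)
    hits = 0
    for _ in range(n_draws):
        draw = rng.sample(full, len(group))  # sampled without replacement
        if ks_distance(draw, full) >= observed:
            hits += 1
    return observed, hits / n_draws

data_rng = random.Random(1)
full = [data_rng.gauss(0, 1) for _ in range(400)]  # made-up "full" data
group = sorted(full)[-40:]                          # made-up group: the 40 largest values
observed, share = resampling_rank(group, full)
print(observed, share)
```

A small share means the group sits further from the full data than random subsets of the same size typically do; because each random subset is also part of the full data, the comparison carries the same dependence as the real group.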

Most of this comes down to what question you really want answered; many of the exact distributional tests answer a different question than the one the researcher is really interested in.

Kolmogorov-Smirnov Test – How to Perform a Kolmogorov-Smirnov Two-Sample Test

I am assuming you are asking because the Suanshu help page reports, in reference to the K-S distribution, "This is not done yet." Luckily, it is very easy to do in R. If x and y are your two samples, ks.test(x, y) returns the test statistic and p-value. For example,

> x <- rnorm(50)
> y <- runif(30)
> ks.test(x, y)    
        Two-sample Kolmogorov-Smirnov test    
data:  x and y 
D = 0.5, p-value = 9.065e-05
alternative hypothesis: two-sided

By default, it will compute exact or asymptotic p-values based on the product of the sample sizes (exact p-values for n.x*n.y < 10000 in the two-sample case), or you can specify this with a third argument, exact=F or exact=T. Exact p-values are calculated using the methods of Marsaglia, et al. (2003), which the Suanshu documentation also cites. Some large sample approximations are given here, although I don't have a proper citation. Lastly, if you don't want to install R, there are web calculators for the two-sample K-S test, although I don't know if they use the same algorithm as R because the one I found only reported three decimal places for the p-value.
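As a rough illustration of what a large-sample approximation looks like, here is a sketch in plain Python of the classical Kolmogorov series for the two-sided p-value. It is not the code R or Suanshu runs; the function name `ks_asymptotic_pvalue`, the number of series terms, and the small-z cutoff are choices made for this example.

```python
import math

def ks_asymptotic_pvalue(d, n_x, n_y, terms=100):
    """Large-sample two-sided p-value for a two-sample KS statistic d."""
    # Scale the statistic by the effective sample size.
    z = d * math.sqrt(n_x * n_y / (n_x + n_y))
    if z < 0.2:
        return 1.0  # the series converges slowly near 0; p is effectively 1 there
    # Alternating series 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 z^2), truncated.
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

# D and sample sizes taken from the R example above.
p = ks_asymptotic_pvalue(0.5, 50, 30)
print(p)
```

With samples this small the approximation need not agree with the exact p-value that R reports, which is one reason R switches to the exact computation when the product of the sample sizes is small.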
