Two-sample Kolmogorov-Smirnov test for stochastic dominance

hypothesis testing, kolmogorov-smirnov test, stata

I'm trying to use the KS test to determine whether one group of data stochastically dominates another. I'm studying a dataset on the performance of companies, which are divided into two groups. Instead of comparing the mean values of these two groups, I follow [1] and want to compare their distributions using the KS test (Table 3). They run two tests: one-sided (A less than B) and two-sided (equality). For that I use Stata's ksmirnov command; the problem is how to interpret the output. It returns D and p, but it is not clear to me what one can conclude from these values. For instance, for my groups it returns:

. ksmirnov performance, by(myGroup)
Two-sample Kolmogorov-Smirnov test for equality of distribution functions

 Smaller group       D       P-value  Corrected
 ----------------------------------------------
 0:                  0.0047    0.972
 1:                 -0.1635    0.000
 Combined K-S:       0.1635    0.000      0.000

The 0 line tests the hypothesis that group 0 has smaller values than group 1. The 1 line tests the hypothesis that group 0 has larger values than group 1. But I do not understand how to interpret D and p. What is the unit of D, and is it large enough to accept the hypothesis (for instance, at the 0.05 significance level)?

[1] http://www.etsg.org/ETSG2012/Programme/Papers/329.pdf

Best Answer

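The Ds are the test statistics; they are derived from the differences between the empirical cumulative distribution functions (ECDFs) of the two groups. They are therefore differences of probabilities: unitless numbers between -1 and 1. The p-values have their usual interpretation: if $pval \leq \alpha$, reject the null hypothesis, where $\alpha$ is a predetermined significance level. In your output, with $\alpha = 0.05$, only the second line (D = -0.1635, p = 0.000) and the combined test are significant, which is evidence that group 0 contains larger values than group 1.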
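
As a sketch of the definitions behind these statistics, write $F$ and $G$ for the ECDFs of the first and second group. The symbols $F$, $G$, $D^+$ and $D^-$ are notation chosen here for illustration; the Methods and formulas section of [R] ksmirnov is the authoritative statement.

$$
D^+ = \max_x \{F(x) - G(x)\}, \qquad
D^- = \min_x \{F(x) - G(x)\}, \qquad
D = \max\left(|D^+|, |D^-|\right)
$$

The first output line reports $D^+$, the second reports $D^-$, and the combined line reports $D$. In the question's output, $D = \max(0.0047, 0.1635) = 0.1635$.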

Stata also gives an additional p-value for the non-directional hypothesis (Combined K-S), corrected for small samples.

Examples and details of what Stata does are in [R] ksmirnov, including the math in the Methods and formulas section.

An example of a "manual" computation of the Ds is:

clear
set more off

*------ example data -----

use http://www.stata-press.com/data/r12/ksxmpl

*----- manual computation -----

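* empirical CDF of x, computed separately within each group
bysort group: cumul x, gen(cumd)

sort x

* ECDF of group 1 evaluated at every x: carry the last value forward,
* and use 0 before the first group-1 observation
gen cumd1 = cumd 
replace cumd1 = cumd1[_n-1] if group != 1
replace cumd1 = 0 if missing(cumd1)

* same for group 2
gen cumd2 = cumd 
replace cumd2 = cumd2[_n-1] if group != 2
replace cumd2 = 0 if missing(cumd2)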

gen diff = cumd1 - cumd2

summarize diff, meanonly
display  "" _n ///
         "Results are:" _n ///
         "This is D+ : `r(max)'" _n ///
         "This is D- : `r(min)'"

line cumd1 cumd2 x, sort // graph the cdfs

*----- direct computation -----

ksmirnov x, by(group)

The "manual" approach is from [1], which Stata cites in its manual.
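
The directional p-values can also be rebuilt from the Ds. The following is only a sketch, under two assumptions that should be checked against [R] ksmirnov: that ksmirnov leaves the two directional statistics in r(D_1) and r(D_2), and that the one-sided p-value is the large-sample approximation $\exp\{-2 \frac{n_1 n_2}{n_1 + n_2} D^2\}$, where $n_1$ and $n_2$ are the group sizes. The scalar names d_line1, d_line2, n1 and n2 are made up for this example.

```stata
* Sketch only: rebuild the one-sided p-values from the reported D values.
* Assumption: ksmirnov stores the directional statistics in r(D_1) and r(D_2).
use http://www.stata-press.com/data/r12/ksxmpl, clear

quietly ksmirnov x, by(group)
scalar d_line1 = r(D_1)      // D from the first output line
scalar d_line2 = r(D_2)      // D from the second output line

* group sizes (count overwrites r(), hence the scalars saved above)
quietly count if group == 1
scalar n1 = r(N)
quietly count if group == 2
scalar n2 = r(N)

* assumed asymptotic one-sided p-value: exp(-2 * n1*n2/(n1+n2) * D^2)
display "p, line 1: " exp(-2 * (n1*n2/(n1+n2)) * d_line1^2)
display "p, line 2: " exp(-2 * (n1*n2/(n1+n2)) * d_line2^2)
```

If both assumptions hold, the two displayed values should agree with the P-value column of the two directional lines printed by ksmirnov x, by(group).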

[1] Riffenburgh, R. H. 2005. Statistics in Medicine. 2nd ed. New York: Elsevier.