I'm comparing two data sets from nearby locations to see if they differ significantly in air quality. I cannot perform any paired tests, because the data at one location do not necessarily correspond in time to the data at the other (e.g. one sample at location A was taken from Monday to Friday, while a sample at location B was taken from Tuesday to Saturday). In addition, I have some duplicate samples (two samples taken over the same few days at the same location; separate physical samples, not duplicate analyses of one sample). I have been advised to compare the CDFs of the two locations. Is this good advice? If so, I can use the K-S test here. Are there any other tests I could use?
Solved – Kolmogorov-Smirnov Test Alternatives
cumulative distribution function, distributions, hypothesis testing, spatial, time series
Related Solutions
That is OK, and quite reasonable. It is referred to as the two-sample Kolmogorov-Smirnov test. Measuring the difference between two distribution functions by the sup norm is always sensible, but to do a formal test you want to know the distribution of the statistic under the hypothesis that the two samples are independent and each i.i.d. from the same underlying distribution. To rely on the usual asymptotic theory you will need continuity of the underlying common distribution (not of the empirical distributions). See the Wikipedia page on the Kolmogorov-Smirnov test for more details.
In R, you can use ks.test, which computes exact $p$-values for small sample sizes.
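For readers working in Python rather than R, scipy.stats.ks_2samp is the equivalent two-sample test. A minimal sketch, where the readings are simulated stand-ins rather than real air-quality data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(17)

# Hypothetical readings at the two locations (invented for illustration)
loc_a = rng.normal(35.0, 8.0, size=60)
loc_b = rng.normal(38.0, 8.0, size=55)

# Two-sample Kolmogorov-Smirnov test: the statistic is the supremum of
# |F_A(x) - F_B(x)|, the largest gap between the two empirical CDFs
result = stats.ks_2samp(loc_a, loc_b)
print(result.statistic, result.pvalue)
```

Like ks.test, ks_2samp uses an exact method for small samples by default and falls back to the asymptotic distribution otherwise.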
The test used will determine how to assess how much data are needed. However, standard tests, such as the $\chi^2$, would seem to be inferior or inappropriate, for two reasons:
The alternative hypothesis is more specific than mere lack of independence: it focuses on a high count during one particular day.
More importantly, the hypothesis was inspired by the data itself.
Let's examine these in turn and then draw conclusions.
Standard tests may lack power
For reference, here is a standard test of independence:
x <- c(3,2,1,2,1,2,6) # The data
chisq.test(x, simulate.p.value=TRUE, B=9999)
X-squared = 7.2941, df = NA, p-value = 0.3263
(The p-value of $0.33$ is computed via simulation because the $\chi^2$ approximation to the distribution of the test statistic begins breaking down with such small counts.)
If--before seeing the data--it had been hypothesized that weekends might provoke more errors, then it would be more powerful to compare the Saturday+Sunday total to the Monday-Friday total, rather than using the $\chi^2$ statistic. Although we could analyze this special test fully (and obtain analytical results), it's simpler and more flexible just to perform a quick simulation. (The following is R code for $100,000$ iterations; it takes under a second to execute.)
n.iter <- 1e5 # Number of iterations
set.seed(17) # Start a reproducible simulation
n <- sum(x) # Sum of all data
sim <- rmultinom(n.iter, n, rep(1, length(x))) # The simulated data, in columns
x.satsun <- sum(x[6:7]) # The test statistic
sim.satsun <- colSums(sim[6:7, ]) # The simulation distribution
cat(mean(c(sim.satsun >= x.satsun, 1))) # Estimated p-value
0.08357916
The output, shown on the last line, is the p-value of this test. It is much smaller than the $\chi^2$ p-value previously computed. This result would be considered significant by anyone needing 90% confidence, whereas few people would consider the $\chi^2$ p-value significant. That's evidence of the greater power to detect a difference.
Greater power is important: it leads to much smaller sample sizes. But I won't develop this idea, due to the conclusions in the next section.
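The same weekend-vs-weekday simulation can be sketched in Python with NumPy. The counts are the ones from the answer, and the final line mirrors the mean(c(..., 1)) trick above, which counts the observed data as one extra draw:

```python
import numpy as np

rng = np.random.default_rng(17)

x = np.array([3, 2, 1, 2, 1, 2, 6])   # event counts, Monday through Sunday
n = x.sum()                           # total number of events
n_iter = 100_000                      # number of iterations

# Simulate counts under the null: each week is Multinomial(n, 1/7, ..., 1/7)
sim = rng.multinomial(n, np.full(7, 1 / 7), size=n_iter)

x_satsun = x[5:7].sum()               # observed weekend total
sim_satsun = sim[:, 5:7].sum(axis=1)  # weekend totals in the simulation

# One-sided p-value, counting the observed data as one more draw
p_value = (np.sum(sim_satsun >= x_satsun) + 1) / (n_iter + 1)
print(p_value)
```

With this many iterations the estimate should land close to the 0.084 reported in the R run (the exact binomial tail probability is about 0.083).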
A data-generated hypothesis gives false confidence
It is a much more serious issue that the hypothesis was inspired by the data. What we really need to test is this:
If there were no association between events and day of the week, what are the chances that the analyst would nevertheless have observed an unusual pattern "at face value"?
Although this is not definitively answerable, because we have no way to model the analyst's thought process, we can still make progress by considering some realistic possibilities. To be honest about it, we must contemplate patterns other than the one that actually appeared. For instance, if there had been 8 events on Wednesday and no more than 3 on any other day, it's a good bet that such a pattern would have been noted (leading to a hypothesis that Wednesdays are somehow error-inducing).
Other patterns I believe likely to be noted by any observant, interested analyst would include all apparent clusters of data, including:
Any single day with a high count.
Any two adjacent days with a high count.
Any three adjacent days with a high count.
"Adjacent" of course means in a circular sense: Sunday is adjacent to Monday even though those days are far apart in the data listing. Other patterns are possible, such as two separate days with high counts. I will not attempt an exhaustive list; these three patterns will suffice to make the point.
It is useful to evaluate the chance that a perfectly random dataset would have evoked notice in this sense. We can evaluate that chance by simulating many random datasets and counting any that look at least as unusual as the actual data on any of these criteria. Since we already have our simulation, the analysis is a matter of a few seconds' more work:
stat <- function(y) {
  y.2 <- c(y[-1], y[1]) + y           # Totals of adjacent days (circular)
  y.3 <- y.2 + c(y[-(1:2)], y[1:2])   # Totals of 3-day groups
  c(max(y), max(y.2), max(y.3))       # Largest values for 1, 2, 3 days
}
sim.stat <- apply(sim, 2, stat)
x.stat <- stat(x)
extreme <- colSums(sim.stat >= x.stat) >= 1
cat(p.value <- mean(c(extreme, 1)))
0.3889561
This result is a much more realistic assessment of the situation than we have seen before. It suggests there is almost no objective evidence that events are related to day of week.
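For comparison, here is a NumPy sketch of the same multiple-pattern simulation; np.roll plays the role of the circular shifts c(y[-1], y[1]) in the R code:

```python
import numpy as np

rng = np.random.default_rng(17)

x = np.array([3, 2, 1, 2, 1, 2, 6])   # event counts, Monday through Sunday
n_iter = 100_000
sim = rng.multinomial(x.sum(), np.full(7, 1 / 7), size=n_iter)

def window_maxima(y):
    """Largest circular 1-, 2-, and 3-day totals, computed along the last axis."""
    y2 = y + np.roll(y, -1, axis=-1)   # totals of adjacent day pairs
    y3 = y2 + np.roll(y, -2, axis=-1)  # totals of 3-day windows
    return np.stack([y.max(axis=-1), y2.max(axis=-1), y3.max(axis=-1)], axis=-1)

# A simulated week "evokes notice" if ANY of its three window maxima
# is at least as large as the corresponding observed maximum
extreme = (window_maxima(sim) >= window_maxima(x)).any(axis=1)
p_value = (extreme.sum() + 1) / (n_iter + 1)
print(p_value)
```

The estimate should land near the 0.389 reported in the R run; the per-pattern p-values are much smaller, which is exactly the multiple-comparisons effect the section describes.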
Conclusions
The best solution, then, might be to conclude there likely is not anything unusual going on. Keep monitoring the events, but do not worry about how much time will be needed to produce "significant" results.
Best Answer
Bill Huber is an expert with spatial data and I think he has given you good advice. Comparing CDFs may be too simplistic when spatial and temporal effects are present and possibly different at the two locations. But there is also a certain amount of aggregation.
Having worked in industry for many years, I know that if the bosses want things a certain way, sometimes you have no choice but to give it to them that way. Just be careful to provide all the important caveats so that they don't misinterpret the results. Now, your basic questions can be answered without getting into the nitty-gritty details of the data.
If you have two data sets, there are a number of tests, called empirical CDF tests, that compare the two sample CDFs and look for specific differences. The Kolmogorov-Smirnov test is perhaps the best known. It looks at the maximum absolute difference between the two CDFs over the entire range of the data. You can also create histograms of the data, constructing the same bins for both data sets. There is a form of the chi-square test that can be used to see whether the frequencies in the bins for one group are similar to the frequencies in the other. This can be done using contingency tables. For the contingency table approach there are also exact permutation tests (e.g. Fisher's exact test) that can be used.
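As a sketch of the binned chi-square approach (in Python with SciPy; the readings are invented for illustration), the key point is that both samples must be binned with the same edges:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Invented readings for the two locations (for illustration only)
loc_a = rng.gamma(4.0, 9.0, size=80)
loc_b = rng.gamma(4.5, 9.0, size=75)

# Bin both samples with the SAME edges, built from the pooled data
edges = np.histogram_bin_edges(np.concatenate([loc_a, loc_b]), bins=6)
count_a, _ = np.histogram(loc_a, bins=edges)
count_b, _ = np.histogram(loc_b, bins=edges)

# 2 x 6 contingency table: rows are locations, columns are bins
table = np.vstack([count_a, count_b])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p, dof)
```

A permutation version of the same comparison can be obtained by repeatedly shuffling the location labels over the pooled readings and recomputing the statistic.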