Solved – Kolmogorov-Smirnov test

kolmogorov-smirnov test

I am studying disturbance caused by ship traffic to a small seabird. I observed focal animals for a set amount of time and record whether or not they fly from the water during the observation. This particular bird does not fly at high probabilities when not disturbed (about 10% of the time). Post hoc, I have appended the distance to the nearest ship to every observation (ships of interest had GPS locators recording a point every 5 sec).

I have plotted the cumulative distribution function for ALL observations and for observations where the bird flew from the water as a function of distance to nearest ship. As expected, the majority of the observations in which the bird flew were observed when the ship was close.

ecdf plot, fly, did not fly, all obs

Can I use the Kolmogorov-Smirnov test to test if there is a statistical difference in the distribution of the flight observations and the total observations? My thought is if these two distributions are different it would suggest that ship distance has an influence on flight. I worry as these distribution functions are not independent as the flight observations are a subset of the total observations.

Thoughts?

Having read a bit further on this site I think I can test the distribution of the observations in which flight occurred (F) against the distribution of the observations in which it did not (NF) as these are independent. If these distributions are the same F=NF, then we can assume that the distribution of (F) and (TOT = all observations) are the same as we know that the distribution of (F) is equal to itself and (F)+(T) = (TOT). Right?

UPDATE: 2/12/14

Following the suggestions from @Scortchi I investigated the relationship of the incidence of flight vs distance to the nearest ship in a logistic regression framework. There was slight relationship present (negative slope) but the p-value was non-significant suggesting that the true slope could be zero. Based on the desciptive statistics (including the ecdf plots) I suspected that the effect of close ships was being drowned out by the many observations when the ship was not affecting behavior. I then used the R package segmented (http://cran.r-project.org/web/packages/segmented/segmented.pdf) to try and find a break-point in the model. The program found that breaking the data at 2.6 km from the ship and fitting two separate coefficients was better than the single coefficient model. The coefficient for the slope of the close ship approaches was negative and suggests that ships affect the flight response until about 2.6 km (p-value < 0.001). The coefficient for the second slope was slightly positive but the p-value was non-significant at the 0.05 alpha level (p-value = 0.11). So in summary, the segmented regression line was able to detect a threshold difference at which flight probability increases. The estimate for the probability of flight when the ship is further than 2.6 km is 0.11. Fittingly, I observed 79 birds in when no ships were even in the study bay (> 100 km away) and 9 of these individuals flew from the water during the observation (0.113 probability).

Thanks for all the suggestions. I hope this question along with the suggestions and answers helps others.

Best Answer

Interesting problem. I have two thoughts, one general and one about how to characterize your data...

First, with respect to comparing distributions I agree with @Glen_b and @Scortchi that you do not want to compare Fly vs All as shown in your chart (but nice idea to overlay the plot of the D statistic). Because you have a strong belief about where the distributions are likely to be different, and not just that they are different, you might want to consider comparing quantiles of the two distributions. There is a nice blog post on the subject which works through R code to develop the testing method. And there is an R package, WRS, which implements quantile-based testing methods.

Second, I'd consider dropping the use of a formal comparison test altogether and instead use Weight of Evidence (WOE). This approach is commonly used in industries that need decision frameworks dealing with different levels of risk across various predictors. Examples include insurance underwriting, credit evaluation, and clinical trials.

In your setting there is a baseline "risk" of flight (you said 10%), but the odds of flight seem to increase greatly in the presence of ships at certain distances. Using the WOE approach you can convey the change in odds of flight as a function of a ships distance, which is easy to understand for lay audiences (well, at least easier than understanding p-values associated with test statistics). Note that this is closely related to @Scortchi's suggestion to use logistic regression, but with WOE you are not trying to fit a regression model.

There is nice documentation on Statistica's website for applying the method, but the best introduction I have found is in a book Credit Scoring, Response Modeling, and Insurance Rating: A Practical Guide to Forecasting Consumer Behavior. If you search on the term "WOE" you'll find multiple sections discussing the idea, and section 5.1 walks through a complete example of calculating WOE (it's pretty easy) and evaluating the results for decision-making. Finally, note that there is a stackoverflow post on this topic, which is not very developed, but there is a link to PDF walking through another example in the context of SAS coding.