Solved – Is it valid to use Hartigans' dip test to reject a unimodal null hypothesis with large N?

Tags: distributions, mode, statistical-significance

First-time questioner, so please be gentle 🙂

I have two distributions of data from a simulation. By eye, one looks like it might be bimodal, the other not. I show them below.


Distribution A (parameter value 1): by eye, possibly bi- or multi-modal


Distribution B (parameter value 2): by eye, unimodal

I thought I would use Hartigans' dip test to test H1: not unimodal vs. H0: unimodal. The problem is that the test statistic suggests I should reject the null hypothesis of unimodality for both distributions, with values well under the suggested 0.05 threshold. In fact, the test statistic for the distribution that looks more unimodal is lower than that for the one that looks potentially multi-modal (dist A: D = 0.00814; dist B: D = 0.00340).

I think what I'm seeing is the effect of a fairly large N (N = 5000): the sample size lends statistical power to the test. But inspection of the histograms suggests this is not valid. Is there some way to formalise a discussion of whether it is valid to reject the null hypothesis based on this test?
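(For concreteness, here is a minimal sketch of how such a test can be run. I'm assuming the third-party Python diptest package here, and a placeholder array rather than my actual simulation output.)

```python
import numpy as np
import diptest  # third-party package implementing Hartigans' dip test

# Placeholder data standing in for one simulated output distribution (N = 5000)
a = np.random.default_rng(0).normal(size=5000)

dip, pval = diptest.diptest(a)  # dip statistic D and its p-value
print(f"D = {dip:.5f}, p = {pval:.4f}")
```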

I've read a few posts on here (Test for bimodal distribution and @whuber's suggested search). I also found this elsewhere, but the answer is a bit generic – basically saying that with large N you're likely to find lots of tests significant, which I already suspect is the case here.

I realise that some consideration of causal mechanisms for unimodal / non-unimodal outcomes could aid the discussion, but I want to understand the statistical tests, too.

I'd like some advice on

1) Am I interpreting the test statistic correctly (i.e. is D < 0.05 evidence to reject unimodality)?

2) Is there a way to determine whether large N is giving the test undue statistical power?

Best Answer

It isn't so much that a hypothesis test has "too much power" with large N; it's that hypothesis tests don't seem to answer the question you're interested in.

Given the large sample size, the second plot looks to me like there is some suggestion of at least 2 modes.

"Qualitatively different" is basically "is this different enough to matter?" which is more a question of effect size than significance (hypothesis tests will identify even trivial differences, so they're simply not answering a question like "how different are they?"). In fact, it's often the case that we already know the null to be false, so we certainly shouldn't be using one in that case. Some diagnostic measure of whatever you regard as important in terms of difference in distribution might be useful.

I think there are indeed two questions: 1) am I seeing an effect (multi-modal output resulting from unimodal input)? and 2) how big is that effect (and, relatedly, is it big enough to matter)?

The first question might be cast as a hypothesis-testing question (though that's not the only way to take it). On the second question, the paper by Hartigan & Hartigan (1985) says:

"The dip test measures multimodality in a sample by the maximum difference, over all sample points, between the empirical distribution function, and the unimodal distribution function that minimizes that maximum difference."

-- that statistic appears to be a perfectly reasonable measure of the extent of deviation from unimodality. As for whether the size of the difference matters, that's harder to say without knowing the application – you might be better placed to judge it (or to identify some other measure for which you can say whether a given difference matters).
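In symbols (my paraphrase of that definition, writing $F_n$ for the empirical distribution function of the sample and $\mathcal{U}$ for the class of unimodal distribution functions):

$$ D_n \;=\; \min_{G \in \mathcal{U}} \; \sup_x \bigl| F_n(x) - G(x) \bigr| . $$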

You say: "What I want to know is whether my simulation fed with the same input distribution but with one parameter changed (what I've plotted) leads to qualitatively different output distributions, particularly if the system has a different number of modes of operation."

Since you're interested in a two-sample comparison of number of modes, it might even be possible to modify a statistic for the one-sample case to be used for the two-sample case.

One approach that occurs to me is a form of resampling.

Compute the difference in dip-test statistics for the two samples. That's your measure of the effect.

You could use bootstrapping, for example, to estimate the standard error of, or build a confidence interval for, the difference in dip statistics. Given the difference in location, you might try resampling within each sample. A test might be based on whether the confidence interval covers 0. You might want to do some simulation to see whether the coverage of such an interval is reasonable, and whether techniques for estimating/reducing bias might help.
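A minimal sketch of that within-sample bootstrap, assuming the same third-party Python diptest package as above, with a and b as placeholders for your two simulated samples:

```python
import numpy as np
import diptest  # third-party package implementing Hartigans' dip test


def dip_stat(x):
    """Hartigans' dip statistic for a 1-D sample."""
    dip, _pval = diptest.diptest(np.asarray(x))
    return dip


def boot_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for dip(a) - dip(b), resampling within each sample."""
    rng = np.random.default_rng(seed)
    observed = dip_stat(a) - dip_stat(b)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        a_star = rng.choice(a, size=a.size, replace=True)
        b_star = rng.choice(b, size=b.size, replace=True)
        diffs[i] = dip_stat(a_star) - dip_stat(b_star)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return observed, (lo, hi)


# a, b = your two simulation outputs (1-D numpy arrays)
# obs, (lo, hi) = boot_diff_ci(a, b)   # does the interval cover 0?
```

Whether a simple percentile interval behaves well here is exactly what the simulation check above would tell you; bias-reducing variants are the obvious refinement to try if it doesn't.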

An alternative way to bootstrap might be to obtain estimates of location and scale for each sample and resample standardized residuals, if you regard the shapes as being the same under the null.
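A rough sketch of that variant, reusing dip_stat from the previous block: the shared-shape null is imposed by pooling standardized values from both samples and then restoring each sample's own location and scale.

```python
def null_diff_dist(a, b, n_boot=2000, seed=1):
    """Null reference distribution for dip(a*) - dip(b*), shared shape under H0."""
    rng = np.random.default_rng(seed)
    za = (a - a.mean()) / a.std(ddof=1)
    zb = (b - b.mean()) / b.std(ddof=1)
    pool = np.concatenate([za, zb])  # standardized values, pooled under the null
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        a_star = a.mean() + a.std(ddof=1) * rng.choice(pool, size=a.size, replace=True)
        b_star = b.mean() + b.std(ddof=1) * rng.choice(pool, size=b.size, replace=True)
        diffs[i] = dip_stat(a_star) - dip_stat(b_star)
    return diffs  # compare the observed difference in dips against this
```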


Yet another alternative would be to consider a model that would be suitable for the distributional shape if it were unimodal, and then treat the two samples as finite mixtures of such distributions (location or scale mixtures, if either is indicated by your knowledge of the problem); the difference in the estimated number of components would then serve as a measure.
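As a hedged sketch of that idea, supposing a Gaussian location/scale mixture is a reasonable model for your output, scikit-learn's GaussianMixture with a BIC-based choice of the number of components would give one such measure:

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def n_components_bic(x, k_max=5, seed=0):
    """Estimate the number of mixture components for a 1-D sample by BIC."""
    X = np.asarray(x).reshape(-1, 1)
    bics = [
        GaussianMixture(n_components=k, n_init=5, random_state=seed).fit(X).bic(X)
        for k in range(1, k_max + 1)
    ]
    return int(np.argmin(bics)) + 1


# n_components_bic(a) vs n_components_bic(b): difference in estimated component counts
```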

That's a little less specific than I'd like to be, but it's at least a starting point.