Solved – One sample test of uniformity in R

distributions, hypothesis testing, r, uniform distribution, variance

I have a dataset of two columns: one with IDs and one with a column of single digits (0-9) (see below). I would like a statistical significance test for whether the data is uniform. Ideally, I would actually like to test whether certain groups of the data are uniform (i.e. is ID=1 uniform or is ID=2 uniform).

There are a number of possible ways to do this, though given my actual data's structure, I think a test for the variance of the uniform distribution would be best (the equation for the variance of a uniform distribution is $\frac{(b-a)^2}{12}$).

Does anyone have any ideas of where I could start or whether there are any existing R packages for something like this?

ID     Digits
1         9            
1         4            
1         4            
2         6            
2         5            
2         3
2         3   

Best Answer

Let's address the underlying statistical question, and then briefly mention doing that in R.

You want to do a test for a discrete uniform distribution applied to each subgroup. So let's think of a specific subgroup. You have a sequence of digits (which we can tabulate as a set of counts). For example:

  9 4 3 1 1 9 9 3 6 3 7 9 5 8 5 6 0 4 1 4 8 6 3 7 3 9 1 9 1 1 1 6 7 6 2 5 5 5
   8 8 1 4 2 5 7 8 6 0 2 6

> table(x)
x
0 1 2 3 4 5 6 7 8 9 
2 8 3 5 4 6 7 4 5 6 

i.e. '0' occurred twice, '1' occurred 8 times and so on.

How do we test for uniformity?

You suggested testing variance. I initially assumed you meant the variance of the distribution of digit-values (which would be viable as a test statistic for particular kinds of deviation), but I now wonder if maybe you meant the variance of the observed counts.

Let's discuss both, second first:

a) Variance of counts of digits. That is, if the observed count for digit $i$ is $O_i$, in this case I guess you mean to take a constant times $\sum_{i=0}^9(O_i-\bar{O})^2$ (a sample variance of the counts would make that constant $\frac{1}{(10-1)}$, but let's leave that to one side for a moment).

That's actually a pretty good idea, and I want to tweak it just the tiniest bit, for reasons that will become clearer in a moment.

Note that $\bar O = \frac{1}{10}\sum O_i=\frac{n}{10}$, which is just the expected count for each digit -- let's call that $E_i$ (that may seem unnecessary, but you'll have to indulge me a moment).

Then the sum of squared deviations from expected is $\sum_{i=0}^9(O_i-E_i)^2$. Since every $E_i$ is the same constant $\frac{n}{10}$, this is proportional to $\sum_{i=0}^9\frac{(O_i-E_i)^2}{E_i}$ ... which is just the usual chi-square goodness of fit statistic. So the variance of the counts of digits is (apart from a scaling constant) a well-known test for goodness of fit to the hypothesized uniform distribution of counts.
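As a quick check of that proportionality, here is a minimal sketch using the counts tabulated above. The variable names are my own; the scaling constant is $(10-1)/E$ because `var()` divides by $10-1$.

```r
# Observed counts of the digits 0..9 from the tabulated example
O <- c(2, 8, 3, 5, 4, 6, 7, 4, 5, 6)
n <- sum(O)          # total number of digits
E <- n / length(O)   # expected count per digit under uniformity

# Chi-square goodness-of-fit statistic
chisq_stat <- sum((O - E)^2) / E

# Sample variance of the counts, rescaled by (k - 1) / E with k = 10 digits
scaled_var <- var(O) * (length(O) - 1) / E

# TRUE if the two quantities agree up to floating-point error
all.equal(chisq_stat, scaled_var)
```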

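[This could be done in R by calling tapply on your second column of data with the ID as the index and a function f, where f=function(x) chisq.test(table(factor(x, levels=0:9)))$p.value. The factor call keeps digits that never occur in a subgroup as zero counts; a bare table(x) would drop them and test uniformity over the observed digits only.]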
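A minimal sketch of that per-ID test might look like the following. The data frame `dat` is simulated stand-in data in the question's layout, and `uniform_p` is a hypothetical helper name; neither comes from the question.

```r
# Hypothetical data in the layout from the question: an ID column and a digit column
set.seed(1)
dat <- data.frame(
  ID     = rep(1:2, each = 100),
  Digits = sample(0:9, 200, replace = TRUE)
)

# p-value of the chi-square goodness-of-fit test against the discrete uniform on 0..9.
# factor(..., levels = 0:9) keeps digits that never occur as zero counts;
# a plain table(x) would silently drop them.
uniform_p <- function(x) {
  counts <- table(factor(x, levels = 0:9))
  chisq.test(counts)$p.value   # default null hypothesis: equal probabilities
}

# One p-value per ID
p_by_id <- tapply(dat$Digits, dat$ID, uniform_p)
p_by_id
```

For small subgroups, where the expected count per digit falls below about 5, the chi-square approximation gets rough; `chisq.test(counts, simulate.p.value = TRUE)` is one option there.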

b) Now if you mean the variance of the distribution of digit-values, that will have very good power against particular kinds of non-uniformity, such as:

[image not reproduced: an example of a non-uniform distribution whose variance differs from that of the discrete uniform]

... specifically, nonuniform distributions with larger or smaller variance than the discrete uniform. But it will have very poor power against non-uniform distributions with very similar variance to the discrete uniform, such as either of these:

[two images not reproduced: examples of non-uniform distributions with variance close to that of the discrete uniform]

If you're only interested in non-uniforms with larger or smaller variance than the discrete uniform and don't care about the latter possible alternatives, this is fine. There are a couple of ways to go about testing this, which I'll go into if you definitely want this option.

But note carefully: the variance of the discrete uniform on 0,1,...,9 is not $\frac{(9-0)^2}{12}$. That's the variance of the continuous uniform on $[0,9]$. The discrete uniform on 0,1,...,9 has variance $\frac{(10^2-1)}{12}=\frac{(9\times 11)}{12}=8.25$.
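The two variances are easy to compare numerically. This is just a small sketch with variable names of my own choosing:

```r
digits <- 0:9
mu <- mean(digits)                      # mean of the discrete uniform on 0..9

# Population variance of the discrete uniform, computed directly
var_discrete <- mean((digits - mu)^2)

# Closed form (k^2 - 1) / 12 with k = 10 equally likely values
var_formula <- (10^2 - 1) / 12

# Continuous uniform on [0, 9], for comparison
var_continuous <- (9 - 0)^2 / 12

c(discrete = var_discrete, formula = var_formula, continuous = var_continuous)
```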


As requested, here's one test of variance. This one is easy to do.

Consider the variance statistic about the expected value for the uniform rather than the sample mean:

$T_n = \frac{1}{n}\sum_i (X_i-\frac{9}{2})^2$

This should have better power than a test using the ordinary sample variance would have against mean shifts. Its asymptotic distribution is also quite easy to work out:

$\sqrt{n}\,(T_n-8.25) \xrightarrow{d} N(0,52.8)\,,$

and convergence is so rapid that it's reasonable to use this at fairly small sample sizes.

In the left tail it looks to be quite good above about n=50 (personally, I'd happily use it down to about n=10, but I'm not fussy about exactness of type I error rates). In the right tail it looks fine down below n=10.

[Even at n=5, it's not so bad -- the left hand tail was giving a true significance level of about 2% for a nominal 2.5% left hand tail normal critical value.]
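Here is a rough sketch of that test under the normal approximation above. The function name `var_about_mean_test` and the two-sided p-value are my choices for illustration; a one-sided version would use a single tail of `pnorm` instead.

```r
# Check the asymptotic variance: E[(X - 4.5)^4] - 8.25^2 for X uniform on 0..9
asym_var <- mean(((0:9) - 4.5)^4) - 8.25^2

# Two-sided test of the variance about 4.5, using
# sqrt(n) * (T_n - 8.25) ~ N(0, 52.8) approximately under the null
var_about_mean_test <- function(x) {
  n <- length(x)
  t_stat <- mean((x - 4.5)^2)                  # variance about the null mean 4.5
  z <- sqrt(n) * (t_stat - 8.25) / sqrt(52.8)  # standardized statistic
  list(statistic = t_stat, z = z, p.value = 2 * pnorm(-abs(z)))
}

# The 50 example digits listed earlier
x <- c(9, 4, 3, 1, 1, 9, 9, 3, 6, 3, 7, 9, 5, 8, 5, 6, 0, 4, 1, 4, 8, 6, 3, 7, 3,
       9, 1, 9, 1, 1, 1, 6, 7, 6, 2, 5, 5, 5, 8, 8, 1, 4, 2, 5, 7, 8, 6, 0, 2, 6)

res <- var_about_mean_test(x)
res
```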

One can make a test with the ordinary sample variance, but it approaches its asymptotic distribution considerably more slowly.

To do that, we can actually compute the variance of the sample variance (it involves fourth moments), and we could then use an asymptotic approximation to the distribution of the variance (but as I mentioned it comes in relatively more slowly than the variance about 4.5). Or we could simulate from the null distribution at any given sample size to get an approximate p-value (if I was going to use the variance, this is what I'd do).
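The simulation option could be sketched roughly as follows. The function name `sim_var_test`, the default number of simulations, and the "twice the smaller tail" rule for a two-sided p-value are all assumptions of mine, not something fixed by the discussion above.

```r
# Monte Carlo p-value for the ordinary sample variance under the
# discrete uniform on 0..9, at the sample size of the observed data
sim_var_test <- function(x, n_sim = 10000) {
  n <- length(x)
  obs <- var(x)

  # Null distribution of the sample variance at this sample size
  null_vars <- replicate(n_sim, var(sample(0:9, n, replace = TRUE)))

  # Tail proportions, with +1 so the estimated p-value is never exactly zero
  lower <- (sum(null_vars <= obs) + 1) / (n_sim + 1)
  upper <- (sum(null_vars >= obs) + 1) / (n_sim + 1)

  # Two-sided p-value: twice the smaller tail, capped at 1
  min(1, 2 * min(lower, upper))
}

set.seed(42)
x_sim <- sample(0:9, 30, replace = TRUE)  # hypothetical subgroup of 30 digits
p_sim <- sim_var_test(x_sim)
p_sim
```

Applied per ID, the same function can be passed to `tapply` in place of a chi-square p-value function.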
