Testing for uniformity is common, but what methods are there for a multidimensional cloud of points?
Hypothesis Testing – How to Test Uniformity in Multiple Dimensions?
hypothesis-testing, uniform-distribution
Related Solutions
If you have not only the frequencies but the actual counts, you can use a $\chi^2$ goodness-of-fit test for each data series. In particular, you wish to use the test for a discrete uniform distribution. This gives you a good test, which allows you to find out which data series are likely not to have been generated by a uniform distribution, but does not provide a measure of uniformity.
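For instance, in R the test is essentially a one-liner (a sketch; the digit series here is invented for illustration):

```r
# Hypothetical digit series, invented for illustration
x <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4)
counts <- table(factor(x, levels = 0:9))  # keep zero-count digits in the table
chisq.test(counts)  # default null: each of the ten digits is equally likely
```

With so few observations the expected counts are small, so R will warn that the chi-square approximation may be inaccurate; with realistic sample sizes the warning goes away.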
There are other possible approaches, such as computing the entropy of each series - the uniform distribution maximizes the entropy, so if the entropy is suspiciously low you would conclude that you probably don't have a uniform distribution. That works as a measure of uniformity in some sense.
Another suggestion would be to use a measure like the Kullback-Leibler divergence, which quantifies the dissimilarity between two distributions (it is zero exactly when they coincide).
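Both quantities are a few lines of R on a vector of counts (a sketch with invented counts; note that for frequencies relative to the discrete uniform, the KL divergence is just $\log 10$ minus the entropy):

```r
# Hypothetical counts over the digits 0-9, invented for illustration
counts <- c(2, 8, 3, 5, 4, 6, 7, 4, 5, 6)
p <- counts / sum(counts)            # observed frequencies
entropy <- -sum(p * log(p))          # maximized (= log(10)) by the uniform
kl <- sum(p * log(p / (1/10)))       # KL(p || uniform); zero iff p is uniform
c(entropy = entropy, max.entropy = log(10), kl = kl)
```

If any count is zero, drop it before taking logs (the `p * log(p)` term tends to zero as `p` does).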
Let's address the underlying statistical question, and then briefly mention doing that in R.
You want to do a test for a discrete uniform distribution applied to each subgroup. So let's think of a specific subgroup. You have a sequence of digits (which we can tabulate as a set of counts). For example:
9 4 3 1 1 9 9 3 6 3 7 9 5 8 5 6 0 4 1 4 8 6 3 7 3 9 1 9 1 1 1 6 7 6 2 5 5 5
8 8 1 4 2 5 7 8 6 0 2 6
> table(x)
x
0 1 2 3 4 5 6 7 8 9
2 8 3 5 4 6 7 4 5 6
i.e. '0' occurred twice, '1' occurred 8 times and so on.
How do we test for uniformity?
You suggested testing variance. I initially assumed you mean the variance of the distribution of digit-values (which would be viable as a test statistic for particular kinds of deviation), but I now wonder if maybe you mean the variance of the observed counts.
Let's discuss both, second first:
a) Variance of counts of digits. That is, if the observed count for digit $i$ is $O_i$, in this case I guess you mean to take a constant times $\sum_{i=0}^9(O_i-\bar{O})^2$ (a sample variance of the counts would make that constant $\frac{1}{(10-1)}$, but let's leave that to one side for a moment).
That's actually a pretty good idea, and I want to tweak it just the tiniest bit, for reasons that will become clearer in a moment.
Note that $\bar O = \frac{1}{10}\sum O_i=\frac{n}{10}$, which is just the expected count for each digit -- let's call that $E_i$ (that may seem unnecessary, but you'll have to indulge me a moment).
Then the sum of squared deviations from expected is $\sum_{i=0}^9(O_i-E_i)^2$. Since every $E_i$ equals the same value $E=n/10$, this is proportional to $\frac{1}{E}\sum_{i=0}^9(O_i-E_i)^2=\sum_{i=0}^9\frac{(O_i-E_i)^2}{E_i}$ ... which is just the usual chi-square goodness-of-fit statistic. So the variance of the counts of the digits is (apart from a scaling constant) a well-known test statistic for goodness of fit to the hypothesized uniform distribution.
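To make the equivalence concrete, here is a quick check in R using the counts tabulated above:

```r
O <- c(2, 8, 3, 5, 4, 6, 7, 4, 5, 6)  # observed digit counts from the table above
n <- sum(O)                           # 50 observations
E <- n / 10                           # expected count per digit under uniformity: 5
sum((O - E)^2) / E                    # 6: the chi-square goodness-of-fit statistic
chisq.test(O)$statistic               # the same value, 6, from the built-in test
```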
[This could be done in R by calling tapply on your second column of data with the function f = function(x) chisq.test(table(x))$p.value, using the ID as the index.]
b) Now if you mean the variance of the distribution of digit-values, that will have very good power against particular kinds of non-uniformity: specifically, against nonuniform distributions with larger or smaller variance than the discrete uniform. But it will have very poor power against non-uniform distributions whose variance is very close to that of the discrete uniform.
If you're only interested in non-uniform alternatives with larger or smaller variance than the discrete uniform and don't care about the latter possibilities, this is fine. There are a couple of ways to go about testing this, which I'll go into if you definitely want this option.
But note carefully: the variance of the discrete uniform on $0,1,\ldots,9$ is not $\frac{(9-0)^2}{12}$; that's the variance of the continuous uniform on $(0,9)$. The discrete uniform on $0,1,\ldots,9$ has variance $\frac{10^2-1}{12}=\frac{9\times 11}{12}=8.25$.
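A quick numeric check of both values (using the population variance, i.e. dividing by $n$ rather than $n-1$):

```r
x <- 0:9
mean((x - mean(x))^2)  # 8.25, the discrete-uniform variance
(9 - 0)^2 / 12         # 6.75, the continuous-uniform variance, which differs
```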
As requested, here's one test of variance. This one is easy to do.
Consider the variance statistic about the expected value for the uniform rather than the sample mean:
$T = \frac{1}{n}\sum_i (X_i-\frac{9}{2})^2$
This should have better power against mean shifts than a test based on the ordinary sample variance. The asymptotic distribution of $T$ is also quite easy to work out:
$\sqrt{n}\,(T_n-8.25) \stackrel{d}{\longrightarrow} N(0,\,52.8)\,,$
and convergence is so rapid that it's reasonable to use this at fairly small sample sizes.
In the left tail it looks to be quite good above about n=50 (personally, I'd happily use it down to about n=10, but I'm not fussy about exactness of type I error rates). In the right tail it looks fine even below n=10.
[Even at n=5, it's not so bad -- the left hand tail was giving a true significance level of about 2% for a nominal 2.5% left hand tail normal critical value.]
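Here is one way to package that test as an R function (the function name is mine; it assumes digits in 0..9 and uses the asymptotic normal approximation above):

```r
# Test of discrete uniformity on 0..9 via the variance about the null mean 9/2
unif_var_test <- function(x) {
  n  <- length(x)
  Tn <- mean((x - 4.5)^2)                 # T, the variance about 4.5
  z  <- sqrt(n) * (Tn - 8.25) / sqrt(52.8) # standardize by the asymptotic sd
  p  <- 2 * pnorm(-abs(z))                 # two-sided p-value
  c(T = Tn, z = z, p.value = p)
}
# e.g. unif_var_test(sample(0:9, 100, replace = TRUE))
```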
One can make a test with the ordinary sample variance, but it approaches its asymptotic distribution considerably more slowly.
To do that, we can compute the variance of the sample variance (it involves fourth moments) and then use an asymptotic approximation to its distribution (but, as I mentioned, it approaches that distribution more slowly than the variance about 4.5 does). Or we could simulate from the null distribution at any given sample size to get an approximate p-value (if I were going to use the sample variance, this is what I'd do).
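A sketch of that simulation approach (again, the function name is mine):

```r
# Approximate p-value for the ordinary sample variance under a
# discrete uniform null on 0..9, by simulating its null distribution
var_test_sim <- function(x, nsim = 10000) {
  n     <- length(x)
  v.obs <- var(x)
  v.sim <- replicate(nsim, var(sample(0:9, n, replace = TRUE)))
  # two-sided p-value: twice the smaller tail probability, capped at 1
  min(1, 2 * min(mean(v.sim <= v.obs), mean(v.sim >= v.obs)))
}
```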
Best Answer
It turns out that the question is harder than I thought. Still, I did my homework, and after looking around I found two methods, in addition to Ripley's functions, to test uniformity in several dimensions.
I made an R package called unf that implements both tests. You can download it from GitHub at https://github.com/gui11aume/unf. A large part of it is written in C, so you will need to compile it on your machine with R CMD INSTALL unf. The articles on which the implementation is based are included in the package as PDFs.

The first method comes from a reference mentioned by @Procrastinator (Testing multivariate uniformity and its applications, Liang et al., 2000) and tests uniformity on the unit hypercube only. The idea is to design discrepancy statistics that are asymptotically Gaussian by the central limit theorem, which makes it possible to compute a $\chi^2$ statistic, the basis of the test.
The second approach is less conventional and uses minimum spanning trees. The initial work was performed by Friedman & Rafsky in 1979 (reference in the package) to test whether two multivariate samples come from the same distribution. The image below illustrates the principle.
Points from the two bivariate samples are plotted in red or blue according to their sample of origin (left panel). The minimum spanning tree of the pooled sample, i.e. the tree with the minimum total edge length, is computed in two dimensions (middle panel). The tree is then decomposed into subtrees in which all the points carry the same label (right panel).
In the figure below, I show a case where the blue dots are aggregated, which reduces the number of subtrees left at the end of the process, as you can see in the right panel. Friedman and Rafsky derived the asymptotic distribution of the number of subtrees obtained this way, which makes a test possible.
This idea was developed into a general test for the uniformity of a multivariate sample by Smith and Jain in 1984, and implemented in C by Ben Pfaff (reference in the package). A second sample is generated uniformly in the approximate convex hull of the first sample, and the Friedman and Rafsky test is performed on the pooled pair of samples.
The advantage of the method is that it tests uniformity on any convex multivariate shape, not only on the hypercube. The major disadvantage is that the test has a random component, because the second sample is generated at random. Of course, one can repeat the test and average the results to obtain a reproducible answer, but this is not handy.
Continuing the previous R session, here is how it goes.
Feel free to copy/fork the code from github.