I'm using R and have two vectors of discrete values. They are not, strictly speaking, categorical, because the values are counts of dots on the image of a cell (the whole vector covers all the cells on one image). There are two vectors: a reference vector and a vector with dot counts after some perturbation.
I believe such data should follow a negative binomial distribution, and some goodness-of-fit test should give a p-value and a statistic describing whether the two distributions differ significantly.
People advised me that a chi-square test would do the trick, but in my understanding the chi-square test treats the values only as categories and ignores the fact that they are numbers: if, say, the number of cells with 5 dots decreased a bit while the number of cells with 4 dots increased, that is not the same as the identical shift happening between the 0-dot and 6-dot categories.
However, I didn't find a test that can deal with negative binomial distributions. I hope I described the problem clearly. If anybody knows a test that handles this kind of data, or thinks my assumptions are wrong, you are welcome to share your ideas.
Example 1
library(ggplot2)
c.dots = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 3, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 2, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2,
0, 0, 0, 1, 0, 0, 1, 0, 0, 3, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 2, 0,
0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 0, 1, 0, 2, 0, 1, 0, 2, 0, 0, 1, 0, 0, 1, 1, 0, 0, 3, 0, 0,
0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0)
w.dots = c(0, 0, 0, 1, 3, 1, 1, 1, 1, 0, 0, 2, 0, 0, 2, 1, 0, 1, 3, 0, 1, 0, 0, 0, 2,
0, 2, 2, 0, 3, 1, 2, 1, 0, 2, 1, 0, 2, 0, 1, 2, 1, 0, 0, 1, 0, 1, 1, 0, 0,
0, 1, 1, 0, 2, 0, 0, 1, 3, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 1, 1, 1, 2, 1, 1,
2, 4, 1, 0, 0, 2, 2, 0, 1, 0, 1, 3, 0, 2, 1, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0,
1, 1, 0, 1, 0, 0, 2, 0, 1, 0, 2, 1, 0, 1, 2, 0, 4, 2, 0, 1, 0, 2, 0, 1, 2,
1, 1, 2, 1, 1, 3, 1, 0, 1, 0, 1, 2, 0, 1, 2, 0, 1, 1, 2, 2, 0, 3, 0, 1, 1,
0, 0, 2, 0, 1, 1, 0, 1, 2, 0, 0, 1, 0, 1, 2, 0, 0, 4, 3, 0, 1, 0, 0, 1, 0,
4, 0, 1, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 1, 0, 3, 1, 1, 0, 4, 1, 1, 3)
# Align the count categories explicitly: the raw tables have different
# levels here (c.dots lacks 4, w.dots lacks 6), so rbind() would
# silently misalign the columns.
lv = 0:max(c(c.dots, w.dots))
chisq.test(rbind(table(factor(w.dots, levels = lv)),
                 table(factor(c.dots, levels = lv))))
nbrand = rnbinom(length(c.dots), mu = 1, size = 1)
ggplot() +
geom_density(aes(x=x), data=data.frame(x=c.dots), fill="red", alpha=0.5) +
geom_density(aes(x=x), data=data.frame(x=w.dots), fill="blue", alpha=0.5) +
geom_density(aes(x=x), data=data.frame(x=nbrand), colour="green", alpha=0, linetype=3)
Example 2
library(ggplot2)
c.dots = c(1, 0, 0, 1, 0, 0, 3, 0, 1, 0, 3, 0, 2, 0, 0, 2, 2, 0, 0, 1, 1, 0, 0, 1, 0,
0, 0, 0, 1, 0, 1, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
1, 1, 1, 2, 0, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 3, 4,
0, 1, 1, 0, 1, 0, 2, 1, 2, 2, 3, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1,
1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 2, 0, 0, 3, 2, 2, 1, 0, 2, 0, 2, 2, 0, 0, 2,
1, 0, 2, 0, 0, 2, 2, 1, 0, 0, 0, 0, 0, 1, 0, 3, 0, 1, 0, 0, 1, 0, 0, 0, 0,
2, 1, 1, 0, 1, 0, 1, 1, 0, 1, 3, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 2, 1, 0,
1, 2, 0, 0, 3, 3, 0, 1, 2, 0, 0, 1, 1, 0, 1, 1, 3, 1, 3, 0, 2, 0, 0, 0, 0)
w.dots = c(1, 3, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 3, 0, 0, 0, 1, 2, 0,
1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 5, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 2,
0, 1, 0, 3, 0, 0, 1, 2, 3, 1, 0, 0, 0, 2, 1, 1, 2, 0, 2, 0, 3, 0, 2, 0, 0,
0, 0, 2, 0, 1, 0, 2, 0, 0, 1, 1, 2, 3, 0, 2, 2, 1, 0, 1, 0, 0, 1, 0, 1, 0,
0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 2, 0,
0, 1, 2, 1, 1, 1, 2, 1, 2, 3, 2, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0, 1, 1,
0, 0, 1, 2, 1, 0, 1, 2, 1, 1, 1, 0, 1, 0, 0, 0, 2, 1, 1, 1, 0, 0, 0, 0, 0,
1, 2, 1, 2, 0, 1, 2, 1, 0, 1, 3, 2, 1, 0, 0, 0, 0, 0, 2, 1, 1, 2, 2, 1, 2)
# As in example 1, align the count categories explicitly: the raw tables
# need not share the same levels, so rbind() can silently misalign columns.
lv = 0:max(c(c.dots, w.dots))
chisq.test(rbind(table(factor(w.dots, levels = lv)),
                 table(factor(c.dots, levels = lv))))
nbrand = rnbinom(length(c.dots), mu = 1, size = 1)
ggplot() +
geom_density(aes(x=x), data=data.frame(x=c.dots), fill="red", alpha=0.5) +
geom_density(aes(x=x), data=data.frame(x=w.dots), fill="blue", alpha=0.5) +
geom_density(aes(x=x), data=data.frame(x=nbrand), colour="green", alpha=0, linetype=3)
Best Answer
Your dependent variable is a count ("number of dots counted on the image of a cell"). Asking whether the distribution of counts is similar in two groups is conceptually the same as asking whether group membership matters for the distribution of counts.
I suggest a Poisson regression as a first step where you model the dot count with group membership. In a second step, one might then try to decide whether the Poisson assumption of "conditional variance = conditional mean" is violated, suggesting a move to a quasi-Poisson model, to a Poisson-model with heteroscedasticity-consistent (HC) standard error estimates, or to a negative binomial model.
Given data c.dots and w.dots as in the OP's example 1: we first create a data frame with predicted variable Y = number of dots and predictor X = factor with group membership. Then we run a standard Poisson regression.
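A sketch of what this setup and fit might look like (the names dat and m.pois are my own; the original code and output were shortened out of the answer):

dat = data.frame(Y = c(c.dots, w.dots),
                 X = factor(rep(c("c", "w"),
                                c(length(c.dots), length(w.dots)))))
# standard Poisson regression of dot count on group membership
m.pois = glm(Y ~ X, data = dat, family = poisson(link = "log"))
summary(m.pois)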
This indicates a significant predictor "group membership = w", resulting from dummy coding the grouping factor (2 groups => 1 dummy predictor; c is the reference level). For comparison, we can run the quasi-Poisson model, which has an extra dispersion parameter for the conditional variance. The parameter estimates are the same, but the standard errors of these estimates are slightly larger. The estimated dispersion parameter is slightly larger than 1 (the value in the Poisson model), indicating some overdispersion. An alternative approach is to use a Poisson model with HC standard error estimates:
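Hedged sketches of both checks, assuming dat and m.pois as above; the sandwich and lmtest packages supply the HC covariance estimate and the corresponding coefficient tests:

# quasi-Poisson: same mean structure, free dispersion parameter
m.qpois = glm(Y ~ X, data = dat, family = quasipoisson(link = "log"))
summary(m.qpois)

# Poisson point estimates, but with HC standard errors
library(sandwich)
library(lmtest)
coeftest(m.pois, vcov = sandwich)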
Again, somewhat larger standard errors. Now the negative binomial model:
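Assuming glm.nb() from the MASS package and the data frame dat from above:

library(MASS)
# negative binomial regression; also estimates the dispersion parameter theta
m.nb = glm.nb(Y ~ X, data = dat)
summary(m.nb)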
You can test the negative binomial model against the Poisson model with a likelihood ratio test:
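One way to carry this out, assuming m.pois and m.nb as above (the NB model has one extra parameter, theta, hence df = 1):

ll.ratio = 2 * (as.numeric(logLik(m.nb)) - as.numeric(logLik(m.pois)))
# strictly, theta sits on the boundary of its parameter space under the
# null, so this p-value is conservative
pchisq(ll.ratio, df = 1, lower.tail = FALSE)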
The result here indicates that the data are unlikely to come from a Poisson model.
For the OP's example 2, all these tests are non-significant.
Note that I slightly shortened the output from glm() and glm.nb().