Like Karl Broman said in his answer, a Bayesian approach would likely be a lot better than using confidence intervals.
The Problem With Confidence Intervals
Why might using confidence intervals not work too well? One reason is that if you don't have many ratings for an item, then your confidence interval is going to be very wide, so the lower bound of the confidence interval will be small. Thus, items without many ratings will end up at the bottom of your list.
Intuitively, however, you probably want items without many ratings to be near the average item, so you want to shrink your estimated rating of the item toward the mean rating over all items (i.e., you want to push your estimated rating toward a prior). This is exactly what a Bayesian approach does.
Bayesian Approach I: Normal Distribution over Ratings
One way of moving the estimated rating toward a prior is, as in Karl's answer, to use an estimate of the form $w*R + (1-w)*C$ (a small code sketch follows the list):
- $R$ is the mean over the ratings for the item.
- $C$ is the mean over all items (or whatever prior you want to shrink your rating to).
- Note that the formula is just a weighted combination of $R$ and $C$.
- $w = \frac{v}{v+m}$ is the weight assigned to $R$, where $v$ is the number of ratings for the item and $m$ is some kind of constant "threshold" parameter.
- Note that when $v$ is very large, i.e., when we have a lot of ratings for the current item, then $w$ is very close to 1, so our estimated rating is very close to $R$ and we pay little attention to the prior $C$. When $v$ is small, however, $w$ is very close to 0, so the estimated rating places a lot of weight on the prior $C$.
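As a concrete illustration, here is a minimal sketch of this estimate in R (the ratings vector, the global mean `C`, and the threshold `m` are placeholders, not values from the answer):

```r
# Shrunken estimate w*R + (1-w)*C for one item's rating.
# ratings: this item's ratings; C: mean rating over all items
# (the prior); m: tuning constant controlling prior strength
shrunken_rating <- function(ratings, C, m) {
  v <- length(ratings)   # number of ratings for this item
  R <- mean(ratings)     # raw mean rating of the item
  w <- v / (v + m)       # weight on the item's own mean
  w * R + (1 - w) * C
}

# Example: only two ratings, so the estimate stays near the prior
shrunken_rating(c(5, 4), C = 3.4, m = 5)
```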
This estimate can, in fact, be given a Bayesian interpretation as the posterior estimate of the item's mean rating when individual ratings come from a normal distribution centered around that mean.
However, assuming that ratings come from a normal distribution has two problems:
- A normal distribution is continuous, but ratings are discrete.
- Ratings for an item don't necessarily follow a unimodal Gaussian shape. For example, maybe your item is very polarizing, so people tend to either give it a very high rating or give it a very low rating.
Bayesian Approach II: Multinomial Distribution over Ratings
So instead of assuming a normal distribution for ratings, let's assume a multinomial distribution. That is, given some specific item, there's a probability $p_1$ that a random user will give it 1 star, a probability $p_2$ that a random user will give it 2 stars, and so on.
Of course, we have no idea what these probabilities are. As we get more and more ratings for this item, we can guess that $p_1$ is close to $\frac{n_1}{n}$, where $n_1$ is the number of users who gave it 1 star and $n$ is the total number of users who rated the item, but when we first start out, we have nothing. So we place a Dirichlet prior $Dir(\alpha_1, \ldots, \alpha_k)$ on these probabilities.
What is this Dirichlet prior? We can think of each $\alpha_i$ parameter as being a "virtual count" of the number of times some virtual person gave the item $i$ stars. For example, if $\alpha_1 = 2$, $\alpha_2 = 1$, and all the other $\alpha_i$ are equal to 0, then we can think of this as saying that two virtual people gave the item 1 star and one virtual person gave the item 2 stars. So before we even get any actual users, we can use this virtual distribution to provide an estimate of the item's rating.
[One way of choosing the $\alpha_i$ parameters would be to set $\alpha_i$ equal to the overall proportion of votes of $i$ stars. (Note that the $\alpha_i$ parameters aren't necessarily integers.)]
Then, once actual ratings come in, simply add their counts to the virtual counts of your Dirichlet prior. Whenever you want to estimate the rating of your item, simply take the mean over all of the item's ratings (both its virtual ratings and its actual ratings).
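A minimal sketch of this update and estimate (the prior counts and observed counts below are made-up numbers for a 5-star scale):

```r
# Posterior mean rating under a multinomial likelihood with a
# Dirichlet prior. alpha: virtual counts for 1..k stars;
# counts: observed number of 1..k star ratings for this item.
posterior_mean_rating <- function(counts, alpha) {
  post <- alpha + counts                    # add actual to virtual counts
  sum(seq_along(post) * post) / sum(post)   # mean over virtual + actual ratings
}

alpha  <- c(0.5, 1.0, 1.5, 1.25, 0.75)  # e.g., overall star proportions, scaled
counts <- c(0, 1, 0, 2, 5)              # this item's actual ratings
posterior_mean_rating(counts, alpha)
```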
As the comments outline, you can't simply look for overlapping CIs, because that can be misleading. The better way, as you will soon learn in your classes, is to conduct a statistical hypothesis test:
You make a null hypothesis, $H_0$, which in this case would be that the mean birth weights of males and females are the same, and you calculate the probability, if $H_0$ were true, of the means of your two samples being at least as far apart as they actually are. That is the mythical p-value, which, if small enough, allows you to more or less confidently reject the null hypothesis (and get your paper published).
Notice that you can never prove $H_0$, only fail to disprove it, which is not the same. In your case, you cannot say that males and females have the same mean birth weight, only that there is not enough evidence to say they are different...
For your case, you would probably use a Student's t-test for two independent samples.
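In R, that test is one call (the data frame `births` and the column names `weight` and `sex` are hypothetical, not from your data):

```r
# Two-sample t-test: does mean birth weight differ by sex?
t.test(weight ~ sex, data = births)
```

Note that by default `t.test()` does not assume equal variances (the Welch variant); set `var.equal = TRUE` for the classical Student's test.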
Well, the gee package includes facilities for fitting GEEs, and `gee()` returns both asymptotic and robust SEs. I never used the geepack package, but from what I saw in the online example, its output seems to resemble more or less that of `gee()`. To compute $100(1-\alpha)\%$ CIs for your main effects (e.g., gender), why not use the robust SE? In the following I will assume it is extracted from, say, `summary(gee.fit)` and stored in a variable `rob.se`. I suppose that exponentiating a Wald interval built from `rob.se`, as in the sketch below, should yield 95% CIs expressed on the odds scale.
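The original snippet was lost, so this is my reconstruction; `gee.fit` and the coefficient name `"gender"` are assumptions about your fitted model:

```r
# Wald-type 95% CI on the log-odds scale, exponentiated to the
# odds scale; rob.se holds the robust SE of the gender effect
beta <- coef(gee.fit)["gender"]
exp(beta + c(-1, 1) * qnorm(0.975) * rob.se)
```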
Now, in fact, I rarely use GEE except when I am working with binary endpoints in longitudinal studies, because it's easy to pass or estimate a given working correlation matrix. In the case you summarize here, I would rather rely on an IRT model for dichotomous items (see the psychometrics task view), or (it is quite the same in fact) a mixed-effects GLM such as the one provided in Doug Bates's lme4 package. For a study like yours, as you said, subjects will be considered as random effects, your other covariates enter the model as fixed effects, and the response is the 0/1 rating on each item (item enters the model as well). Then you will get 95% CIs for the fixed effects, either from the SEs, computed as `sqrt(diag(vcov(glmm.fit)))` or read off `summary(glmm.fit)`, or by using `confint()` together with an `lmList` object. Doug Bates gave nice illustrations of these models in two of his papers/handouts.

There is also a discussion about profiling `lmer` fits (based on the profile deviance) to investigate variability in fixed effects, but I didn't investigate that point; I think it is still in section 1.5 of Doug's draft on mixed models. There is a lot of discussion about computing SEs and CIs for GLMMs as implemented in the lme4 package (whose interface differs from the earlier nlme package), so you will easily find other interesting threads by googling about that.

It's not clear to me why GEE would have to be preferred in this particular case. Maybe look at the R translation of Agresti's book by Laura Thompson, R (and S-PLUS) Manual to Accompany Agresti's Categorical Data Analysis.
Update:
I just realized that the above solution would only work if you're interested in getting a confidence interval for the gender effect alone. If it is the item×gender interaction that is of concern, you have to model it explicitly in the GLMM (my second Bates reference has an example of how to do that with `lmer`).

Another solution is to use an explanatory IRT model, where you explicitly acknowledge the potential effect of person covariates, like gender or age, and consider fitting them within a Rasch model, for example. This is called a latent regression Rasch model, and it is fully described in De Boeck and Wilson's book, Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach (Springer, 2004), which you can read online on Google Books (section 2.4). There are also some facilities for fitting this kind of model in Stata. In R, we can mimic such a model with a mixed-effects approach; a toy example would look something like the sketch below, if I remember correctly.
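The original call was lost, so the formula, variable names, and data frame here are my reconstruction, not the answer's code:

```r
# Latent regression Rasch model approximated as a mixed-effects GLM:
# item difficulties as fixed effects (no intercept; item is a factor),
# gender acting on the latent trait, random intercept per subject
library(lme4)
lrm.fit <- glmer(response ~ 0 + item + gender + (1 | subject),
                 data = df, family = binomial)
```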
I'm not sure whether the eRm package allows one to easily incorporate person covariates (because we need to construct a specific design matrix), but it may be worth checking out since it provides 95% CIs too.