Computing confidence intervals for prevalence for several types of infection

confidence interval, epidemiology, estimation

I have a dataset on a hospital population that records which types of infection each patient has.

Let's say the number of patients is 100: 10 of them have pneumonia (group A) and 20 of them have a urinary tract infection (group B). Keep in mind that groups A and B can overlap, that is, a patient suffering from pneumonia can also have a urinary tract infection.

I need to estimate the prevalence of each infection type in this population (i.e., prevalence of pneumonia, prevalence of urinary tract infection). I'm not sure whether assuming a binomial distribution, as in the standard-error formula below, is appropriate:

$$\text{SE}=\sqrt{\frac{p\times (1-p)}{n}}\times\sqrt{1-f}$$

Using this formula, I would compute multiple "binomial" estimates (i.e., one for each type of infection). I would feel comfortable using it if I only needed to describe the prevalence of one type of infection, but in this case I need to describe several from the same population. I am not sure whether using the formula is appropriate here. Can anyone enlighten me? Thanks!

Best Answer

So you have a population in which each patient can have zero or more conditions. To answer the question "How many hospital patients have A?", it seems to me that the best you can do is take your favourite proportion estimator and offer it up with your favourite confidence interval. There are lots of choices, and the choice makes a difference for very high or very low proportions. If you are in such a situation, the estimator above may not be optimal.
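To make the "lots of choices" point concrete, here is a minimal sketch comparing two common intervals on the counts from the question: the Wald interval (estimate plus or minus z times the binomial standard error, ignoring the $\sqrt{1-f}$ factor) and the Wilson score interval. The function names are my own, and the choice of these two intervals is only an illustration, not a recommendation from the answer.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(x, n, conf=0.95):
    """Normal-approximation (Wald) interval for a proportion x/n."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    p = x / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(x, n, conf=0.95):
    """Wilson score interval for a proportion x/n."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    p = x / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Counts taken from the question: 100 patients, 10 with A, 20 with B.
for name, x in [("pneumonia", 10), ("urinary tract infection", 20)]:
    wald = wald_interval(x, 100)
    wilson = wilson_interval(x, 100)
    print(f"{name}: Wald {wald[0]:.3f} to {wald[1]:.3f}, "
          f"Wilson {wilson[0]:.3f} to {wilson[1]:.3f}")
```

The two intervals differ more for the 10% proportion than for the 20% one, and the gap grows as the proportion approaches 0 or 1.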

If you are interested in the population of just your hospital then you can, as SheldonCooper points out, dispense with the statistics altogether. I suspect however that you are interested in hospital patients more generally, so your standard errors and intervals might be interpreted relative to this population. In your suggested estimator the identity of the population determines the finite population correction $1-f$, where $f$ is the fraction of that population you have sampled. Certainly hospital patients don't look like non-hospital patients with respect to the conditions you're counting, but that need not matter.
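A small sketch of how the choice of population feeds into the formula from the question, assuming $f = n/N$ is the sampling fraction with $n$ sampled patients out of a target population of size $N$. The function name and the values of `N` below are hypothetical and chosen only for illustration.

```python
from math import sqrt


def se_with_fpc(p, n, N):
    """Binomial standard error times the finite population correction.

    Assumes f = n / N, the fraction of the target population sampled.
    """
    f = n / N
    return sqrt(p * (1 - p) / n) * sqrt(1 - f)


p, n = 0.10, 100  # pneumonia prevalence and sample size from the question
for N in (100, 200, 1_000, 1_000_000):  # hypothetical population sizes
    print(f"N = {N:>9}: SE = {se_with_fpc(p, n, N):.4f}")
```

With `N = 100` the sample is the whole population, so the standard error is zero, which is the "dispense with the statistics" case. As `N` grows the correction approaches 1 and the plain binomial standard error is recovered.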

Following Sheldon's second observation, it is probable that the conditions correlate. But as far as I can see this is only useful information if you are asking conditional questions, e.g. the prevalence of A among B sufferers. In probabilistic terms your question is about estimating marginals, and correlation information only tells you about estimating conditionals.
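As a rough illustration of the marginal versus conditional distinction, the sketch below varies a hypothetical overlap count (the question does not say how many patients have both conditions). The function names are mine.

```python
def marginal_prevalence(n_condition, n_total):
    """Share of all patients with the condition, whatever else they have."""
    return n_condition / n_total


def conditional_prevalence(n_both, n_given):
    """Share of patients with the given condition who also have the other."""
    return n_both / n_given


n_total, n_a, n_b = 100, 10, 20  # counts from the question

# n_both is NOT given in the question; these values are made up.
for n_both in (0, 2, 5, 10):
    print(f"overlap = {n_both:>2}: "
          f"P(A) = {marginal_prevalence(n_a, n_total):.2f}, "
          f"P(B) = {marginal_prevalence(n_b, n_total):.2f}, "
          f"P(A | B) = {conditional_prevalence(n_both, n_b):.2f}")
```

The marginal estimates stay at 0.10 and 0.20 whatever the overlap is; only the conditional prevalence moves.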

If you were interested in these sorts of subgroups, you'd certainly want to model this information. You'd also want it if there were differential measurement errors or sample selection issues, e.g. patients only getting tested for A if they already have a B diagnosis. That might also make certain sample marginals problematic as estimates of population marginals. Thankfully, I don't know much about hospital populations, but I'd be willing to bet that there are some of these issues around.

Finally, about reporting: If you in fact want to report confidence regions rather than condition-wise intervals, then again the correlation structure matters, and things get considerably trickier. I seem to remember that Agresti had a paper on simultaneous confidence intervals for multivariate Binomial proportions, which might be helpful for this approach.
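One simple, conservative way to get simultaneous coverage without modelling the correlation at all is a Bonferroni adjustment: build each of the k intervals at level 1 - alpha/k. To be clear, this is a substitute I am swapping in for illustration, not the method from the Agresti paper mentioned above, and the function names are my own. A minimal sketch using Wilson intervals:

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(x, n, conf):
    """Wilson score interval for a proportion x/n at confidence level conf."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    p = x / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def bonferroni_intervals(counts, n, family_conf=0.95):
    """Per-condition intervals whose joint coverage is at least family_conf.

    Bonferroni holds under any dependence between conditions, at the
    price of wider intervals than a method that uses the correlation.
    """
    alpha = 1 - family_conf
    per_conf = 1 - alpha / len(counts)
    return {name: wilson_interval(x, n, per_conf)
            for name, x in counts.items()}


counts = {"pneumonia": 10, "urinary tract infection": 20}  # from the question
for name, (lo, hi) in bonferroni_intervals(counts, 100).items():
    print(f"{name}: {lo:.3f} to {hi:.3f}")
```

Because the adjustment ignores the correlation structure, the resulting rectangle is generally larger than a confidence region built from a joint model would be.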
