Solved – Confidence intervals when the sample size is very large

Tags: confidence-interval, large-data, reporting

My question could be rephrased as "how to assess sampling error using big data", especially for a journal publication. Here is an example to illustrate the challenge.

From a very large dataset (>100,000 unique patients and their prescribed drugs from 100 hospitals), I am interested in estimating the proportion of patients taking a specific drug. It's straightforward to get this proportion. Its confidence interval (e.g., parametric or bootstrap) is incredibly tight/narrow, because n is very large. While it's fortunate to have a large sample size, I'm still searching for a way to assess, present, and/or visualize some form of error probability. While it seems unhelpful (if not misleading) to report or visualize a confidence interval (e.g., 95% CI: .65878 – .65881), it also seems impossible to avoid making some statement about uncertainty.
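To make the "incredibly narrow" part concrete, here is a minimal sketch (the proportion and sample sizes are made up for illustration, not taken from the dataset above) showing how a simple Wald 95% interval for a proportion shrinks as n grows:

```python
# A minimal sketch (hypothetical numbers, not the original data): how a
# Wald 95% CI for a proportion narrows as n grows.
import math

p_hat = 0.6588   # hypothetical observed proportion of patients on the drug
z = 1.96         # normal quantile for a 95% interval

for n in (100, 10_000, 100_000):
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # Wald standard error
    lo, hi = p_hat - z * se, p_hat + z * se
    print(f"n={n:>7,}: 95% CI = ({lo:.5f}, {hi:.5f}), width = {hi - lo:.5f}")
```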

Please let me know what you think. I would appreciate any literature on this topic, or ways to avoid over-confidence in the data even with a large sample size.

Best Answer

This problem has come up in some of my research as well (as an epidemic modeler, I have the luxury of making my own data sets, and with large enough computers, they can be essentially arbitrarily sized). A few thoughts:

  • In terms of reporting, I think you can report more precise confidence intervals, though the utility of this is legitimately a little questionable. But it's not wrong, and with data sets of this size, I don't think there's much call to demand that confidence intervals be reported and then complain that we'd really rather they were rounded to two digits, etc.
  • In terms of avoiding overconfidence, I think the key is to remember that precision and accuracy are different things, and to avoid conflating the two. It is very tempting, when you have a large sample, to get sucked into how very precise the estimated effect is and not consider that it might also be wrong. That, I think, is the key: a biased data set will carry that bias at N = 10, or 100, or 1,000, or 100,000 (see the sketch after this list).
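As a toy illustration of precision without accuracy (all numbers here are assumed, not from any real study): suppose the true population proportion is 0.50, but the sampling process over-represents patients on the drug, so samples center near 0.60. The interval narrows as N grows, but it narrows around the wrong value:

```python
# A toy simulation (assumed numbers): the true proportion is 0.50, but the
# sampling process over-represents patients on the drug, so every sample
# centers near 0.60. The CI gets narrower with N, yet at large N it
# confidently excludes the truth -- the bias does not shrink with N.
import math
import random

random.seed(1)
true_p = 0.50        # proportion in the target population
biased_p = 0.60      # proportion among the patients we actually sample
z = 1.96

for n in (10, 100, 1_000, 100_000):
    x = sum(random.random() < biased_p for _ in range(n))  # biased sample
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    lo, hi = p_hat - z * se, p_hat + z * se
    covers = lo <= true_p <= hi
    print(f"N={n:>7,}: p_hat={p_hat:.3f}, 95% CI=({lo:.3f}, {hi:.3f}), "
          f"covers true 0.50? {covers}")
```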

The whole purpose of large data sets is to provide precise estimates, so I don't think you need to shy away from that precision. But you do have to remember that you can't make bad data better simply by collecting larger volumes of bad data.