Say we have the following data:
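library(tidyverse)  # provides %>%, pivot_longer(), mutate() and ggplot()

set.seed(45)
df <- data.frame(A = rnorm(2000, mean = 15, sd = 18),
                 B = rnorm(2000, mean = 25, sd = 17)) %>%
  pivot_longer(cols = c(A, B), names_to = "group", values_to = "time") %>%
  # rnorm(1, 15, 7) draws a single value that is recycled for every row below 2
  mutate(time = ifelse(time < 2, abs(time) + rnorm(1, 15, 7), time))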
I would think that by doing:
df %>%
  ggplot(aes(x = group, y = time)) +
  geom_jitter(width = .1, color = "pink", alpha = .2) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .15)
ggplot would plot the 95% C.I. for each group. However, this is clearly not the case:
What I would have expected is something like this:
my_cis <- df %>%
  group_by(group) %>%
  summarise(mean = mean(time),
            lwr = quantile(time, probs = 0.05),
            upr = quantile(time, probs = 0.95))

df %>%
  ggplot(aes(x = group)) +
  geom_jitter(aes(y = time), width = .1, alpha = .2, color = "pink") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), data = my_cis, width = .13, color = "gray25") +
  geom_point(aes(y = mean), data = my_cis, shape = 18, size = 2)
So, the question is: what is stat_summary() really doing? And, for better understanding, how can I manually replicate the error bars from stat_summary()?
Best Answer
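The first plot shows a 95% confidence interval for the unknown population mean based on your sample. In other words, it is "a range for estimating an unknown parameter". With 2,000 observations per group, the standard error of the mean is small, so this interval is very narrow compared with the spread of the data.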
The second plot is a summary of the sample (and not a confidence interval). This interval describes where 90% of the data points are located. If you wanted the range where 95% of the data are, you would have to adjust your probs = arguments to 0.025 and 0.975.

To reproduce the interval in the first plot, try this:
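A minimal sketch of one way to do this, assuming the error bars from mean_cl_normal are the usual t-based interval, mean ± qt(0.975, n - 1) * sd / sqrt(n). The names manual_cis, n_obs, avg, se and p are just illustrative, and the data is rebuilt from the question so the block runs on its own:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Same data as in the question
set.seed(45)
df <- data.frame(A = rnorm(2000, mean = 15, sd = 18),
                 B = rnorm(2000, mean = 25, sd = 17)) %>%
  pivot_longer(cols = c(A, B), names_to = "group", values_to = "time") %>%
  mutate(time = ifelse(time < 2, abs(time) + rnorm(1, 15, 7), time))

# t-based 95% confidence interval for the mean of each group
manual_cis <- df %>%
  group_by(group) %>%
  summarise(n_obs = n(),
            avg   = mean(time),
            se    = sd(time) / sqrt(n_obs),          # standard error of the mean
            lwr   = avg - qt(0.975, n_obs - 1) * se,
            upr   = avg + qt(0.975, n_obs - 1) * se)

# Same layout as the first plot, but with the hand-computed error bars
p <- ggplot(df, aes(x = group)) +
  geom_jitter(aes(y = time), width = .1, alpha = .2, color = "pink") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), data = manual_cis, width = .15) +
  geom_point(aes(y = avg), data = manual_cis)
```

Calling print(p) draws the plot. The bounds in manual_cis should also agree with t.test(time)$conf.int computed separately for each group, which is a quick way to check the arithmetic.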