Data Visualization – Understanding ggplot’s stat_summary Errorbars in R

Tags: data visualization, distributions, ggplot2, r

Say we have the following data:

library(dplyr)    # %>%, mutate(), group_by(), summarise()
library(tidyr)    # pivot_longer()
library(ggplot2)

set.seed(45)

df <- data.frame(A = rnorm(2000, mean = 15, sd = 18),
                 B = rnorm(2000, mean = 25, sd = 17)) %>% 
  pivot_longer(cols = c(A, B), names_to = "group", values_to = "time") %>% 
  mutate(time = ifelse(time < 2, abs(time) + rnorm(1, 15, 7), time))

I would think that by doing:

# fun.data = "mean_cl_normal" requires the Hmisc package to be installed
df %>% ggplot(aes(x = group, y = time)) +
  geom_jitter(width = .1, color = "pink", alpha = .2) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .15)

ggplot would plot the 95% C.I. for each group. However, this is clearly not the case:
stat_summary output

What I would have expected is something like this:

my_cis <- df %>% 
  group_by(group) %>% 
  summarise(mean = mean(time), 
            lwr = quantile(time, probs = 0.05),
            upr = quantile(time, probs = 0.95))

df %>%
  ggplot(aes(x = group)) +
  geom_jitter(aes(y = time), width = .1, alpha = .2, color = "pink") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), data = my_cis, width = .13, color = "gray25") +
  geom_point(aes(y = mean), data = my_cis, shape = 18, size = 2) 

manual CIs

So, the question is: what is stat_summary() really doing? And, for better understanding, how can I manually replicate the error bars from stat_summary()?

Best Answer

The first plot shows a 95% confidence interval for the unknown population mean, based on your sample. In other words, it is "a range for estimating an unknown parameter". Its width depends on the standard error, sd / sqrt(n), so with 2000 observations per group the error bars are very narrow.

The second plot is a summary of the sample, not a confidence interval. That interval describes where the central 90% of the data points are located. If you want the range that contains the central 95% of the data, set the probs = arguments to 0.025 and 0.975.
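To make the difference between the 90% and 95% data ranges concrete, here is a minimal sketch in base R. The vector `x` is a hypothetical sample, not the `df` from the question; the point is only that `probs = c(0.05, 0.95)` covers about 90% of the observations and `probs = c(0.025, 0.975)` about 95%.

```r
# Hypothetical sample; any numeric vector works here.
set.seed(1)
x <- rnorm(2000, mean = 25, sd = 17)

# Central 90% of the data (what probs = 0.05 and 0.95 give).
range_90 <- quantile(x, probs = c(0.05, 0.95))

# Central 95% of the data.
range_95 <- quantile(x, probs = c(0.025, 0.975))

# Share of observations that fall inside each range.
cover_90 <- mean(x >= range_90[1] & x <= range_90[2])
cover_95 <- mean(x >= range_95[1] & x <= range_95[2])

print(c(cover_90 = cover_90, cover_95 = cover_95))
```

Neither range shrinks as the sample grows, because both describe the spread of the data rather than the uncertainty of the mean.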

To reproduce the interval in the first plot, try this:

# 95% CI for the mean: mean +/- 1.96 * standard error
# (1.96 is the normal approximation, which is very close for n = 2000)
my_cis <- df %>% 
  group_by(group) %>% 
  summarize(M = mean(time),
            lwr = M - sd(time) / sqrt(length(time)) * 1.96,
            upr = M + sd(time) / sqrt(length(time)) * 1.96)
my_cis 
# A tibble: 2 x 4
#   group     M   lwr   upr
#   <chr> <dbl> <dbl> <dbl>
# 1 A      20.9  20.3  21.5
# 2 B      26.5  25.9  27.1

df %>%
  ggplot(aes(x = group)) +
  geom_jitter(aes(y = time), width = .1, alpha = .2, color = "pink") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), data = my_cis, width = .13, color = "gray25") +
  geom_point(aes(y = M), data = my_cis, shape = 18, size = 2)

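A note on the 1.96 multiplier: as far as I know, mean_cl_normal wraps Hmisc::smean.cl.normal, which uses a t quantile with n - 1 degrees of freedom rather than the normal value 1.96. The sketch below, on a hypothetical vector `x` rather than the `df` above, computes both versions in base R and cross-checks the t-based one against `t.test()`. The names `ci_t`, `ci_z` and `ci_ref` are only illustrative.

```r
# Hypothetical sample standing in for one group's values.
set.seed(1)
x <- rnorm(2000, mean = 25, sd = 17)

n  <- length(x)
m  <- mean(x)
se <- sd(x) / sqrt(n)          # standard error of the mean

# t-based multiplier for a 95% interval (n - 1 degrees of freedom).
t_crit <- qt(0.975, df = n - 1)

ci_t <- c(lwr = m - t_crit * se, upr = m + t_crit * se)  # t-based
ci_z <- c(lwr = m - 1.96 * se,   upr = m + 1.96 * se)    # normal approximation

# Cross-check: t.test() reports the same t-based interval for the mean.
ci_ref <- as.numeric(t.test(x)$conf.int)

print(rbind(t_based = ci_t, z_based = ci_z))
```

For samples this large the two intervals are practically identical; the t-based version only becomes noticeably wider for small groups.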
