Solved – How to visualize independent two sample t-test

data visualization, t-test

What are the most accepted ways to visualize the results of an independent two-sample t-test? Is a numeric table used more often, or some sort of plot? The goal is for a casual observer to look at a figure of the two samples and immediately see that they are probably from two different populations.

Best Answer

It is worth being clear on the purpose of your plot. In general, there are two different kinds of goals: you can make plots for yourself to assess the assumptions you are making and guide the data analysis process, or you can make plots to communicate a result to others. These are not the same; for example, many viewers / readers of your plot / analysis may be statistically unsophisticated, and may not be familiar with the idea of, say, equal variance and its role in a t-test. You want your plot to convey the important information about your data even to consumers like them. They are implicitly trusting that you have done things correctly. From your question setup, I gather you are after the latter type.

Realistically, the most common and accepted plot for communicating the results of a t-test¹ to others (set aside whether it is actually the most appropriate) is a bar chart of means with standard error bars. This does match the t-test very well in that a t-test compares two means using their standard errors. When you have two independent groups, this will yield a picture that is intuitive, even for the statistically unsophisticated, and (data willing) people can "immediately see that they are probably from two different populations". Here is a simple example using @Tim's data:

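nonsmokers <- c(18,22,21,17,20,17,23,20,22,21)
smokers <- c(16,20,14,21,20,18,13,15,17,21)
# group means and their standard errors
m        = c(mean(nonsmokers), mean(smokers))
names(m) = c("nonsmokers", "smokers")
se       = c(sd(nonsmokers)/sqrt(length(nonsmokers)), 
             sd(smokers)/sqrt(length(smokers)))
dev.new()  # open a new graphics device (works on any platform)
# xpd=FALSE clips the bars to the truncated y-axis range
bp = barplot(m, ylim=c(16, 21), xpd=FALSE)
box()
# code=3 with angle=90 draws flat caps at both ends: mean +/- 1 SE
arrows(x0=bp, y0=m-se, y1=m+se, code=3, angle=90)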
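
To make the link between the error bars and the test concrete, here is a minimal sketch that runs the test on the same data and rebuilds the statistic from the means and standard errors. It assumes Welch's unequal-variance version, which is what R's t.test() does by default; the answer does not say which variant was intended, and the name t_by_hand is only illustrative.

```r
nonsmokers <- c(18,22,21,17,20,17,23,20,22,21)
smokers    <- c(16,20,14,21,20,18,13,15,17,21)

# Welch two-sample t-test (assumption: R's default, var.equal = FALSE)
res <- t.test(nonsmokers, smokers)

# The same quantities the bar chart displays
m  <- c(mean(nonsmokers), mean(smokers))
se <- c(sd(nonsmokers)/sqrt(length(nonsmokers)),
        sd(smokers)/sqrt(length(smokers)))

# Difference in means divided by the standard error of that difference
t_by_hand <- (m[1] - m[2]) / sqrt(sum(se^2))

print(res)
# Compare the hand-built statistic with the one t.test() reports
print(all.equal(unname(res$statistic), t_by_hand))
```

The standard error of the difference combines both groups' standard errors, so it is larger than either bar's whisker. This is why judging significance by whether the SE bars overlap is only a rough guide.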

That said, data visualization specialists typically disdain these plots. They are often derided as "dynamite plots" (cf., Why dynamite plots are bad). In particular, if you have only a few data, it is often recommended that you simply show the data themselves. If the points overlap, you can jitter them horizontally (add a small amount of random noise) so that they no longer overlap. Because a t-test is fundamentally about means and standard errors, it is best to overlay the means and standard errors onto such a plot. Here is a different version:

set.seed(4643)
plot(jitter(rep(c(0,1), each=10)), c(nonsmokers, smokers), axes=FALSE, 
     xlim=c(-.5, 1.5), xlab="", ylab="")
box()
axis(side=1, at=0:1, labels=c("nonsmokers", "smokers"))
axis(side=2, at=seq(14,22,2))
points(c(0,1), m, pch=15, col="red")
arrows(x0=c(0,1), y0=m-se, y1=m+se, code=3, angle=90, length=.15)

If you have a lot of data, boxplots may be a better choice to get a quick overview of the distributions, and you can overlay the means and SEs there too.

data(randu)
x1 = qnorm(randu[,1])
x2 = qnorm(randu[,2])
m  = c(mean(x1), mean(x2))
se = c(sd(x1)/sqrt(length(x1)), sd(x2)/sqrt(length(x2)))
boxplot(x1, x2)
points(c(1,2), m, pch=15, col="red")
arrows(x0=1:2, y0=m-(1.96*se), y1=m+(1.96*se), code=3, angle=90, length=.1)
# note that I plotted 95% CIs so that they will be easier to see

Simple plots of the data, and boxplots, are sufficiently simple that most people will be able to understand them even if they aren't very statistically savvy. Bear in mind, though, that none of these make it easy to assess the validity of having used a t-test to compare your groups. That goal is best served by different kinds of plots.
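
As one hedged illustration of a plot made for yourself rather than for an audience, the sketch below draws a normal Q-Q plot for each group and computes the ratio of the sample standard deviations. The answer does not name specific diagnostic plots, so the choice of Q-Q plots, and the name sd_ratio, are assumptions made here.

```r
nonsmokers <- c(18,22,21,17,20,17,23,20,22,21)
smokers    <- c(16,20,14,21,20,18,13,15,17,21)

# Side-by-side normal Q-Q plots; points close to the line suggest
# that approximate normality is plausible for that group
op <- par(mfrow = c(1, 2))
qqnorm(nonsmokers, main = "nonsmokers"); qqline(nonsmokers)
qqnorm(smokers,    main = "smokers");    qqline(smokers)
par(op)  # restore the previous plotting layout

# Rough check of the equal-variance assumption (illustrative only)
sd_ratio <- sd(nonsmokers) / sd(smokers)
print(sd_ratio)
```

With only ten points per group, these diagnostics are noisy, so they are better read as a check for gross violations than as a formal test.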

¹ Note that this discussion assumes an independent samples t-test. These plots could be used with a dependent samples t-test, but could also be misleading in that context (cf., Is using error bars for means in a within-subjects study wrong?).
