Why continue to teach and use hypothesis testing (with all its difficult concepts, whose misuse ranks among the most common statistical sins) for problems where an interval estimator is available (confidence, bootstrap, credible, or whatever)? What is the best explanation (if any) to give to students? Is it only tradition? Views will be very welcome.
Solved – Why continue to teach and use hypothesis testing (when confidence intervals are available)
Tags: confidence-interval, hypothesis-testing, teaching
Related Solutions
This is the bootstrap analogy principle. The (unknown) underlying true distribution $F$ produced a sample at hand $x_1, \ldots, x_n$ with cdf $F_n$, which in turn produced the statistic $\hat\theta=T(F_n)$ for some functional $T(\cdot)$. Your idea of using the bootstrap is to make statements about the sampling distribution based on a known distribution $\tilde F$, where you try to use an identical sampling protocol (which is exactly possible only for i.i.d. data; dependent data always lead to limitations in how accurately one can reproduce the sampling process), and apply the same functional $T(\cdot)$. I demonstrated it in another post with (what I think is) a neat diagram. So the bootstrap analogue of the (sampling + systematic) deviation $\hat\theta - \theta_0$, the quantity of your central interest, is the deviation of the bootstrap replicate $\hat\theta^*$ from what is known to be true for the distribution $\tilde F$, the sampling process you applied, and the functional $T(\cdot)$, i.e. your measure of central tendency is $T(\tilde F)$. If you used the standard nonparametric bootstrap with replacement from the original data, your $\tilde F=F_n$, so your measure of the central tendency has to be $T(F_n) \equiv \hat \theta$ based on the original data.
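To make the analogy concrete, here is a minimal sketch (NumPy assumed; the exponential sample, the median as the functional $T$, the sample size, and the seed are all illustrative choices, not part of the answer above): resampling from $F_n$ yields replicates $\hat\theta^*$ whose deviations are measured from $T(F_n)=\hat\theta$, not from the unknown $\theta$.

```python
import numpy as np

rng = np.random.default_rng(42)

# The "unknown" true distribution F; the functional T here is the median
x = rng.exponential(1.0, size=200)   # the sample at hand, with edf F_n
theta_hat = np.median(x)             # T(F_n)

# Bootstrap: resample from tilde F = F_n and apply the same functional T
B = 5000
reps = np.array([np.median(rng.choice(x, size=x.size, replace=True))
                 for _ in range(B)])

# The bootstrap analogue of theta_hat - theta is reps - theta_hat:
# deviations are centred at T(F_n), not at the unknown theta
dev_boot = reps - theta_hat
lo, hi = np.quantile(dev_boot, [0.025, 0.975])

# Basic bootstrap interval for theta
print(theta_hat - hi, theta_hat - lo)
```

The point of the last two lines is exactly the analogy principle: the quantiles of $\hat\theta^*-\hat\theta$ stand in for the quantiles of $\hat\theta-\theta$.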
Besides the translation, there are subtler issues with bootstrap tests which are sometimes difficult to overcome. The distribution of a test statistic under the null may be drastically different from its distribution under the alternative (e.g., in tests on the boundary of the parameter space, which fail with the bootstrap). The simple tests you learn in undergraduate classes, like the $t$-test, are invariant under shift, but thinking, "Heck, I'll just shift everything" fails once you move to the next level of conceptual complexity, the asymptotic $\chi^2$ tests. Think about this: you are testing $\mu=0$, and your observed $\bar x=0.78$. When you construct a $\chi^2$ test $(\bar x-\mu)^2/(s^2/n) \equiv \bar x^2/(s^2/n)$ with the bootstrap analogue $\bar x_*^2/(s_*^2/n)$, this test has a built-in non-centrality of $n \bar x^2/s^2$ from the outset, instead of being the central test we would expect it to be. To make the bootstrap test central, you really have to subtract the original estimate.
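A small numerical sketch of the centring issue, using the $\bar x \approx 0.78$ example (NumPy assumed; the normal data, sample size, and seed are arbitrary): the naive bootstrap statistic is built around $\bar x$ rather than $\mu_0=0$, so its p-value hovers near $0.5$ no matter how far the data are from the null, while subtracting the original estimate restores a central reference distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(0.78, 1.0, size=n)   # data whose sample mean is near 0.78
xbar, s2 = x.mean(), x.var(ddof=1)
t_obs = n * xbar**2 / s2            # chi^2(1)-type statistic for H0: mu = 0

B = 5000
naive, centred = np.empty(B), np.empty(B)
for b in range(B):
    xs = rng.choice(x, size=n, replace=True)
    m, v = xs.mean(), xs.var(ddof=1)
    naive[b] = n * m**2 / v               # non-central: built around mu = xbar
    centred[b] = n * (m - xbar)**2 / v    # central: deviations from T(F_n)

p_naive = (naive >= t_obs).mean()     # badly calibrated, near 0.5
p_centred = (centred >= t_obs).mean() # proper bootstrap p-value, near 0
print(p_naive, p_centred)
```

The naive reference distribution carries the non-centrality $n\bar x^2/s^2$ along with it, which is exactly why it cannot reject anything.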
The $\chi^2$ tests are unavoidable in multivariate contexts, ranging from the Pearson $\chi^2$ for contingency tables to the Bollen–Stine bootstrap of the test statistic in structural equation models. The concept of shifting the distribution is extremely difficult to define well in these situations, although in the case of tests on multivariate covariance matrices it is doable via an appropriate rotation.
Here are a few examples which worked well for me when I was teaching statistics.
- I like to begin the class with the martingale, because somehow everybody finds a winning strategy at roulette interesting, and it is fairly easy to grasp. Then later you can have people try it out for themselves, if you are doing computer labs and can find an online roulette simulator. [Warning: I once had a lab of students do this, and one of them ended up with a $60,000 profit. After that, it was not easy to convince them that the martingale is bad.]
- A good way to illustrate faulty reasoning about independence is Munchausen syndrome by proxy. Allegedly several people went to prison because the doctor who invented this syndrome claimed in court that the deaths of children within the same family were independent events.
- Everybody finds bad graphics like this one entertaining, and students often enjoy collecting them for themselves and bringing them to class.
- When talking about expected value, the St. Petersburg paradox is a good one. Most people can understand it fairly quickly and it shows that the definition of expected value is tricky.
- When teaching the central limit theorem, it's useful to have a wacky bimodal distribution on hand. A good one is the distribution of the last two digits of the years on the one-cent coins which the students happen to have in their pockets. I got this one from a professor at Oberlin College.
- Identifying a fake series of coin flips is a good one because the students can try it out on their friends.
- The British magician Derren Brown has quite a few videos which relate to probability and statistics and are also entertaining. I used to show clips of these in class sometimes.
- Finally, and most importantly, use data sets from the students' fields whenever you can. It doesn't matter exactly what, but it's really important to show them data of the type that they might plausibly collect in the future. Most students don't choose to take a statistics course. Showing students how it applies to them can make a huge difference to their enjoyment. There are statistics papers on virtually everything, even poetry. Or, if you are teaching life tables, instead of using boring data, how about making one for tyrannosaurs, as in these notes?
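If you run the roulette lab above, a minimal martingale simulator is easy to put together (a sketch in Python's standard library; the bankroll, base bet, spin count, and number of simulated players are arbitrary choices). Despite the frequent small wins, the house edge plus the finite bankroll make the strategy a loser on average, which is the lesson the lucky student refused to learn.

```python
import random

def martingale(bankroll=1000, base_bet=1, spins=200, seed=None):
    """Double the bet after every loss on an even-money roulette bet.

    European roulette: 18 of 37 pockets win an even-money bet.
    Returns the final bankroll; play stops if we cannot cover the bet.
    """
    rng = random.Random(seed)
    bet = base_bet
    for _ in range(spins):
        if bet > bankroll:
            break                      # ruined: cannot cover the next bet
        if rng.random() < 18 / 37:     # win
            bankroll += bet
            bet = base_bet             # each completed cycle nets +base_bet
        else:                          # lose
            bankroll -= bet
            bet *= 2                   # double up and hope

    return bankroll

# Average over many players: small frequent wins, rare catastrophic losses
results = [martingale(seed=i) for i in range(2000)]
print(sum(results) / len(results))    # typically below the starting 1000
```

Most simulated players walk away slightly ahead; the average is dragged below the starting bankroll by the few who hit a long losing streak, just as in the classroom story.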
Related Question
- Solved – Bootstrap confidence intervals interpretation; too large when testing sample mean
- Solved – A psychology journal banned p-values and confidence intervals; is it indeed wise to stop using them
- Solved – What are good examples to show to undergraduate students
- Solved – What should a graduate course in experimental design cover
- Hypothesis Testing – Insights from the Article ‘Ditch p-values. Use Bootstrap Confidence Intervals Instead’
Best Answer
This is my personal opinion, so I'm not sure it properly qualifies as an answer.
Why should we teach hypothesis testing?
One very big reason, in short, is that, in all likelihood, in the time it takes you to read this sentence, hundreds, if not thousands (or millions) of hypothesis tests have been conducted within a 10ft radius of where you sit.
Your cell phone is definitely using a likelihood ratio test to decide whether or not it is within range of a base station. Your laptop's WiFi hardware is doing the same in communicating with your router.
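To give a flavour of what such a detector looks like, here is a toy likelihood-ratio test for "known signal present vs. noise only" (NumPy assumed; the pilot sequence, noise model, and function name are illustrative inventions, not the actual algorithm in any particular phone or router). Under i.i.d. Gaussian noise, the log-likelihood ratio reduces to correlating the received samples with the known pilot and thresholding.

```python
import numpy as np

def lrt_detect(y, pilot, sigma2, threshold=0.0):
    """Likelihood-ratio test for H1: y = pilot + noise vs H0: y = noise only.

    With i.i.d. N(0, sigma2) noise, the log-likelihood ratio reduces to
    <y, pilot>/sigma2 - |pilot|^2/(2*sigma2); reject H0 when it is large.
    """
    llr = (y @ pilot) / sigma2 - (pilot @ pilot) / (2 * sigma2)
    return llr > threshold

rng = np.random.default_rng(1)
pilot = np.sign(rng.standard_normal(256))   # known +/-1 pilot sequence
sigma2 = 1.0

in_range = pilot + rng.standard_normal(256)   # base station transmitting
out_of_range = rng.standard_normal(256)       # noise only

print(lrt_detect(in_range, pilot, sigma2))    # True: signal detected
print(lrt_detect(out_of_range, pilot, sigma2))  # False: no signal
```

The threshold trades off false alarms against missed detections, which is exactly the size/power trade-off of the classroom hypothesis test.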
The microwave you used to auto-reheat that two-day-old piece of pizza used a hypothesis test to decide when your pizza was hot enough.
Your car's traction control system kicked in when you gave it too much gas on an icy road, or the tire-pressure warning system let you know that your rear passenger-side tire was abnormally low, and your headlights came on automatically at around 5:19pm as dusk was setting in.
Your iPad is rendering this page in landscape format based on (noisy) accelerometer readings.
Your credit card company shut off your card when "you" purchased a flat-screen TV at a Best Buy in Texas and a $2000 diamond ring at Zales in a Washington-state mall within a couple hours of buying lunch, gas, and a movie near your home in the Pittsburgh suburbs.
The hundreds of thousands of bits that were sent to render this webpage in your browser each individually underwent a hypothesis test to determine whether they were most likely a 0 or a 1 (in addition to some amazing error-correction).
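A stripped-down version of that per-bit test (NumPy assumed; the BPSK mapping, noise level, and seed are illustrative, and real modems add coding on top): with equal priors and equal-variance Gaussian noise, the likelihood-ratio decision for each bit reduces to a sign threshold, and the resulting error rate matches the textbook Gaussian tail probability.

```python
import numpy as np

rng = np.random.default_rng(7)

# Each received sample triggers a hypothesis test: was a 0 or a 1 sent?
bits = rng.integers(0, 2, size=100_000)
symbols = 2.0 * bits - 1.0                    # 0 -> -1, 1 -> +1 (BPSK)
noise_sd = 0.5
received = symbols + rng.normal(0, noise_sd, size=bits.size)

# Equal priors, equal-variance Gaussian noise: the likelihood-ratio test
# reduces to thresholding at zero
decided = (received > 0).astype(int)

ber = np.mean(decided != bits)
print(ber)   # empirical bit-error rate, close to the theoretical Q(2) ~ 2.3%
```

Each of the hundred thousand decisions above is a tiny Neyman–Pearson test, run in a fraction of a second.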
Look to your right just a little bit at those "related" topics.
All of these things "happened" due to hypothesis tests. For many of these things some interval estimate of some parameter could be calculated. But, especially for automated industrial processes, the use and understanding of hypothesis testing is crucial.
On a more theoretical statistical level, the important concept of statistical power arises rather naturally from a decision-theoretic / hypothesis-testing framework. Plus, I believe "even" a pure mathematician can appreciate the beauty and simplicity of the Neyman–Pearson lemma and its proof.
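For completeness, the lemma in its simplest form (simple null versus simple alternative; this informal statement ignores randomization on the boundary): the test that rejects $H_0: X \sim f_0$ in favour of $H_1: X \sim f_1$ when

$$\Lambda(x) = \frac{f_1(x)}{f_0(x)} > k, \qquad k \text{ chosen so that } \Pr_{f_0}\big(\Lambda(X) > k\big) = \alpha,$$

is the most powerful test of size $\alpha$: any other test $\phi$ with $\mathbb{E}_{f_0}[\phi(X)] \le \alpha$ has power $\mathbb{E}_{f_1}[\phi(X)]$ no greater than that of this likelihood-ratio test.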
This is not to say that hypothesis testing is taught, or understood, well. By and large, it's not. And, while I would agree that—particularly in the medical sciences—reporting of interval estimates along with effect sizes and notions of practical vs. statistical significance are almost universally preferable to any formal hypothesis test, this does not mean that hypothesis testing and the related concepts are not important and interesting in their own right.