Solved – Two-tailed tests… I’m just not convinced. What’s the point?

hypothesis-testing, inference, statistical-significance, two-tailed-test

The following excerpt is from the entry, What are the differences between one-tailed and two-tailed tests?, on UCLA's statistics help site.

… consider the consequences of missing an effect in the other direction. Imagine you have developed a new drug that you believe is an improvement over an existing drug. You wish to maximize your ability to detect the improvement, so you opt for a one-tailed test. In doing so, you fail to test for the possibility that the new drug is less effective than the existing drug.

After learning the absolute basics of hypothesis testing and getting to the part about one- vs two-tailed tests… I understand the basic math and the increased detection ability of one-tailed tests, etc. But I just can't wrap my head around one thing… What's the point? I'm really failing to understand why you should split your alpha between the two extremes when your sample result can only land in one of them, or neither.

Take the example scenario from the quoted text above. How could you possibly "fail to test" for a result in the opposite direction? You have your sample mean. You have your population mean. Simple arithmetic tells you which is higher. What is there to test, or fail to test, in the opposite direction? What's stopping you from simply starting over with the opposite hypothesis if you clearly see that the sample mean is way off in the other direction?

Another quote from the same page:

Choosing a one-tailed test after running a two-tailed test that failed to reject the null hypothesis is not appropriate, no matter how "close" to significant the two-tailed test was.

I assume this also applies to switching the polarity of your one-tailed test. But how is this "doctored" result any less valid than if you had simply chosen the correct one-tailed test in the first place?
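To see concretely what's at stake, here is a small simulation (my own illustrative sketch, not from the UCLA page): the population truly has mean 0, so the null hypothesis is true and every rejection is a false positive. It compares an honest two-tailed test at alpha = 0.05 against a one-tailed test whose direction is picked *after* looking at which way the sample mean points.

```python
import random
import statistics

# Illustrative setup (assumed, not from the original post): samples of
# size 30 from N(0, 1), known sigma = 1, so the z-test applies and the
# null "mu = 0" is TRUE -- any rejection is a false positive.
random.seed(42)

def false_positive_rates(n_trials=20000, n=30):
    z_one = 1.645   # one-tailed critical value for alpha = 0.05
    z_two = 1.960   # two-tailed critical value for alpha = 0.05
    post_hoc = honest = 0
    for _ in range(n_trials):
        sample = [random.gauss(0, 1) for _ in range(n)]
        z = statistics.fmean(sample) * n ** 0.5  # z = mean / (sigma/sqrt(n))
        # Post-hoc one-tailed test: the chosen tail always matches the
        # sign of z, so it rejects whenever |z| > 1.645.
        post_hoc += abs(z) > z_one
        # Honest two-tailed test at the same nominal alpha.
        honest += abs(z) > z_two
    return post_hoc / n_trials, honest / n_trials

post_hoc_rate, honest_rate = false_positive_rates()
print(post_hoc_rate, honest_rate)  # roughly 0.10 vs 0.05
```

The post-hoc strategy effectively gets 5% of false positives from *each* tail, for a true error rate near 10% while still claiming alpha = 0.05. That is why switching the polarity after seeing the data is "doctored": the stated significance level no longer describes the procedure actually followed.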

Clearly I am missing a big part of the picture here. It all just seems too arbitrary. Which it is, I guess, in the sense that the threshold for "statistically significant" (95%, 99%, 99.9%…) is arbitrary to begin with.

Best Answer

Think of the data as the tip of the iceberg – all you can see above the water is the tip of the iceberg but in reality you are interested in learning something about the entire iceberg.

Statisticians, data scientists and others working with data are careful to not let what they see above the water line influence and bias their assessment of what's hidden below the water line. For this reason, in a hypothesis testing situation, they tend to formulate their null and alternative hypotheses before they see the tip of the iceberg, based on their expectations (or lack thereof) of what might happen if they could view the iceberg in its entirety.

Looking at the data to formulate your hypotheses is a poor practice and should be avoided – it's like putting the cart before the horse. Recall that the data come from a single sample selected (hopefully using a random selection mechanism) from the target population/universe of interest. The sample has its own idiosyncrasies, which may or may not be reflective of the underlying population. Why would you want your hypotheses to reflect a narrow slice of the population instead of the entire population?

Another way to think about this is that, every time you select a sample from your target population (using a random selection mechanism), the sample will yield different data. If you use the data (which you shouldn't!!!) to guide your specification of the null and alternative hypotheses, your hypotheses will be all over the map, essentially driven by the idiosyncratic features of each sample. Of course, in practice we only draw one sample, but it would be a very disquieting thought to know that if someone else performed the same study with a different sample of the same size, they would have to change their hypotheses to reflect the realities of their sample.
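The "all over the map" point can be made concrete with a quick simulation (an illustrative sketch of my own, not part of the original answer): draw many samples from a population whose true mean is 0 and record which direction each sample mean happens to point. A data-driven analyst would flip the alternative hypothesis roughly half the time.

```python
import random
import statistics

# Assumed setup for illustration: 10,000 samples of size 30 from N(0, 1).
# The true mean is 0, yet each sample mean falls on one side or the
# other purely by chance.
random.seed(7)

n_samples, n = 10000, 30
positive = sum(
    statistics.fmean(random.gauss(0, 1) for _ in range(n)) > 0
    for _ in range(n_samples)
)
prop_positive = positive / n_samples
print(prop_positive)  # close to 0.5: the "preferred" tail flips from sample to sample
```

If your hypotheses were dictated by the sample, about half of the repeated studies would test "greater than" and the other half "less than" — for the exact same population question.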

One of my graduate school professors used to have a very wise saying: "We don't care about the sample, except that it tells us something about the population". We want to formulate our hypotheses to learn something about the target population, not about the one sample we happened to select from that population.