Hypothesis Testing – Using Central Limit Theorem

central limit theorem, hypothesis testing, normal distribution, normality-assumption

I wanted to clarify my understanding of the central limit theorem.
Let's say that I have the following experiment…

I am using Python. I have a list of products with some properties (for example, color and revenue). Let's say that I want to compare the mean revenue of products having certain colors. I know that some hypothesis tests assume that the data are normally distributed. In my case, however, the data are heavily right-skewed: most of the products have poor revenue.

What is the correct approach?
Can I randomly sample products (grouped by color), rely on the CLT to compare the sample means, and then use a hypothesis test, or would I be breaking some assumption, making the result basically useless? Is it better to use this approach or to use a test that doesn't need normally distributed data?
What about the CLT in general: is it something that statisticians use for (for example) distribution normalization, or should I treat the CLT as a foundation that explains why some methods, like the t-test, work for non-normally distributed data given that my sample is large enough?

Thank you,
And please excuse my silly question

Best Answer

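  • First, you should probably refresh your understanding of the central limit theorem. It does not say anything about the data themselves becoming normally distributed when the sample is large enough. It concerns the sampling distribution of the sample mean (or sum), which approaches a normal distribution as the sample size grows, provided the variance is finite.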
  • You are also incorrect in saying that hypothesis tests assume that the data are normally distributed. There are tests that do, but the statement is not true in general. Some tests are robust against non-normal data, such as the $t$-test, and there are also non-parametric tests. Non-parametric tests such as permutation tests make almost no assumptions about the distribution.
  • Hypothesis tests give you their guarantees only if the assumptions they make are met. If you just blindly throw them at your data, you have no guarantee that they will give any useful results.
  • Finally, I'd bet that this is more complicated than you're describing. You're saying "can I sample randomly (based on color) and compare data using clt and then use hypothesis test", but random sampling may not be enough. You're probably not running randomized experiments, so sampling randomly from the data does nothing about self-selection or otherwise biased data. You didn't give us any details, but to give one example: cars can have different prices for different colors, there could also be differences in the availability of items of different colors, etc.
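
To make the first two points concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the revenues are simulated from a log-normal distribution (they are not the asker's data), and the names `skewness`, `permutation_test`, `red` and `blue` are hypothetical. The first part shows that the raw data stay skewed no matter how many products there are, while the means of repeated samples are much closer to symmetric. The second part runs a simple two-sided permutation test for a difference in means, which does not rely on normality.

```python
import random
import statistics


def skewness(values):
    """Sample skewness (third standardized moment); near 0 for symmetric data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and an approximate p-value.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_permutations + 1)


rng = random.Random(42)

# Hypothetical revenues: log-normal, so most products earn little and a few earn a lot.
revenues = [rng.lognormvariate(0.0, 1.0) for _ in range(20_000)]

# Means of many random samples of size 200 drawn from the same skewed data.
sample_means = [statistics.fmean(rng.sample(revenues, 200)) for _ in range(2_000)]

print(f"skewness of raw revenues: {skewness(revenues):.2f}")
print(f"skewness of sample means: {skewness(sample_means):.2f}")

# Two hypothetical colour groups drawn from the same distribution (no true difference).
red = [rng.lognormvariate(0.0, 1.0) for _ in range(300)]
blue = [rng.lognormvariate(0.0, 1.0) for _ in range(300)]
diff, p_value = permutation_test(red, blue)
print(f"observed mean difference: {diff:.3f}, permutation p-value: {p_value:.3f}")
```

Note that neither part addresses the last point above: a permutation test on observational data still says nothing about why the colour groups differ, so self-selection and other sources of bias have to be handled separately.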