Solved – Big sample size (n>50.000), but still highly skewed. Is the central limit theorem still valid?

central limit theorem, normal distribution, t-test

I have 12 samples with approx. 50.000 data points each. I got them from a content analysis of Reddit comments. The data are generated by analyzing the comments with the LIWC tool, which gives me information such as the percentage of pronouns in a comment. The comments cover one year and consist of every publicly available comment on Reddit for five particular subreddits.

The central limit theorem says that with a certain amount of samples, the data will be normally distributed. This is also the answer I got from friends who are quite good at statistics (Master's in econometrics etc.) when I asked whether I can compare groups with the t-test. But when I plot the data, it looks quite skewed and not normal at all, and I wonder at what point you can no longer rely on the central limit theorem?

How can I be sure that the attributes of the population are not naturally non-normally distributed?

Best Answer

Confirming Whuber's comment, this is not what the central limit theorem says. The distribution does not get less skewed as the sample size increases. All you get is a more and more accurate picture of the shape of the true distribution in the population (just as you get a more accurate estimate of the mean, the SD, etc).
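To illustrate that point, here is a minimal simulation sketch. It assumes exponentially distributed data as a stand-in for a right-skewed variable (the actual LIWC percentages will have a different shape), and `sample_skewness` is a hypothetical helper written for this example, not part of any library.

```python
import random
import statistics

def sample_skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

rng = random.Random(0)  # fixed seed so the run is reproducible

# Assumption: exponential data, whose theoretical skewness is 2.
skew_by_n = {}
for n in (100, 5_000, 50_000):
    data = [rng.expovariate(1.0) for _ in range(n)]
    skew_by_n[n] = sample_skewness(data)
    print(f"n={n:>6}: sample skewness = {skew_by_n[n]:.2f}")
```

The estimates should settle near the population skewness (2 for this distribution) as n grows, rather than shrinking toward 0.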

What the central limit theorem says (amongst other things) is that the sampling distribution of the mean gets closer to normal as the sample size gets bigger. This sampling distribution is the distribution of the means of the samples; in other words, if you took lots of samples of 50,000 items and plotted the means of those samples as a new distribution in their own right, that histogram would tend to normality, regardless of the distribution of the original data (provided it has finite variance). It is this that allows you to carry out a t-test regardless of the normality of the original distribution, when the sample size is large enough, and there can surely be no doubt that 50,000 is going to be 'large enough' in this context. [Note: I clarified "of the mean" in the first sentence and added "surely" in the final sentence after reading comments on my answer.]
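The sampling distribution of the mean can be sketched with a small simulation. The exponential distribution, the sample size of 500 (much smaller than 50,000, to keep the run fast), and the helper `sample_skewness` are all assumptions made for illustration; none of them come from the question's data.

```python
import random
import statistics

def sample_skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

rng = random.Random(1)   # fixed seed so the run is reproducible
SAMPLE_SIZE = 500        # items per sample (hypothetical)
N_SAMPLES = 2_000        # number of repeated samples (hypothetical)

# One large draw of raw, right-skewed data for comparison.
raw = [rng.expovariate(1.0) for _ in range(20_000)]

# Draw many samples and keep only the mean of each one.
means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

raw_skew = sample_skewness(raw)
mean_skew = sample_skewness(means)
print(f"skewness of raw data:     {raw_skew:.2f}")
print(f"skewness of sample means: {mean_skew:.2f}")
```

The raw data should stay strongly skewed, while the skewness of the sample means should be close to 0. With 50,000 items per sample the means would be closer to normal still.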
