Solved – Normal approximation for large data set

Tags: central-limit-theorem, distributions, normality-assumption, r

I have a dataset that is highly skewed. See image below:
[Figure: histogram of untransformed hydrogen gas data]

When I transform the data I get the following histogram that makes it look normal:

[Figure: histogram of log-transformed hydrogen data]

However, this data is not normal: the Shapiro-Wilk normality test gives a p-value < 2.2e-16.
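For reference, a minimal R sketch of that test, assuming the raw values sit in a numeric vector `h2` (a hypothetical name); note that shapiro.test() only accepts between 3 and 5000 values, so a subsample is needed for a larger data set.

```r
# Minimal sketch; `h2` is a hypothetical vector holding the raw values.
log_h2 <- log(h2)

# shapiro.test() is limited to 5000 values, so subsample if n > 5000.
set.seed(1)
shapiro.test(sample(log_h2, min(length(log_h2), 5000)))
```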

Using the fitdistrplus package in R suggests that the transformed data most closely follows a Gamma or a log-normal distribution. See graphic below:

[Figure: distribution comparison from fitdistrplus]
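A minimal sketch of that fitdistrplus workflow, assuming the log-transformed values are in a hypothetical vector `log_h2`:

```r
library(fitdistrplus)

# Cullen-Frey skewness/kurtosis plot used to shortlist candidate families
descdist(log_h2, boot = 100)

# Fit the shortlisted candidates by maximum likelihood
fit_gamma <- fitdist(log_h2, "gamma")
fit_lnorm <- fitdist(log_h2, "lnorm")
plot(fit_gamma)  # density, CDF, Q-Q and P-P diagnostics
```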

This data set contains over 5000 data points. Is there a better way to normalise it? Can it be normalised at all? Or, because the sample is so large, can I assume it's approximately normal due to the central limit theorem?

Best Answer

I have a dataset that is highly skewed. See image below.

Beware drawing histograms with very few bins.

First, they're not very good at showing details of the shape, such as small modes.

Second, you can sometimes get quite misleading impressions. (That's less likely to happen with a data set this large, though.)

I'd suggest:

(i) If you're going to draw a histogram of so many data points, you want perhaps 4-5 times as many bins as you have now; you might consider several displays at somewhat different bin widths.

(ii) Consider a kernel density estimate on the log scale (a sketch of both suggestions follows).
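A minimal R sketch of both suggestions, again assuming the raw values are in a hypothetical vector `h2`:

```r
log_h2 <- log(h2)  # work on the log scale

# (i) histograms at several bin counts, well above the default
par(mfrow = c(1, 3))
for (k in c(30, 60, 120)) {
  hist(log_h2, breaks = k, main = paste(k, "bins"))
}
par(mfrow = c(1, 1))

# (ii) a kernel density estimate on the log scale
plot(density(log_h2), main = "KDE of log(hydrogen)")
```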

When I transform the data I get the following histogram that makes it look normal:

Doesn't look normal to me; it looks right-skewed. But you need more bins.

Using the fitdistrplus package in R suggests that the transformed data most closely follows a Gamma or a log-normal distribution.

I bet you it isn't either of those. That's not to say it would be bad to use a gamma or lognormal model (such a model might be useful) - only that you'd be wrong to think your model was actually correct.
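One way to act on that is to compare the candidate fits rather than treating either as correct. A sketch, assuming `fit_gamma` and `fit_lnorm` from the fitdistrplus calls above:

```r
# Side-by-side goodness-of-fit statistics (KS, AD, CvM, AIC, BIC)
gofstat(list(fit_gamma, fit_lnorm), fitnames = c("gamma", "lnorm"))

# With n > 5000, formal tests will flag even tiny departures from either
# family; judge usefulness from the statistics and diagnostic plots.
```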

Why is the x-axis (and the binning) in your third plot different from that in your second plot?

This data set contains over 5000 data points. Is there a better way to normalise it?

To what end?

Or, because the sample is so large, can I assume it's approximately normal due to the central limit theorem?

The central limit theorem is about standardized averages as $n\to\infty$, not about the raw data. Making the sample size large makes the ECDF approach the population CDF; it doesn't change that CDF at all, and it will remain non-normal all the way.
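A quick simulation makes the distinction concrete: means of repeated samples from a skewed population look approximately normal, while the raw data stays skewed no matter how large the sample gets.

```r
set.seed(1)

# Raw data from a skewed (lognormal) population: still skewed at n = 100000
raw <- rlnorm(1e5)

# Means of 10000 samples of size 50: approximately normal, per the CLT
means <- replicate(1e4, mean(rlnorm(50)))

par(mfrow = c(1, 2))
hist(raw,   breaks = 100, main = "Raw data, n = 1e5")
hist(means, breaks = 100, main = "Means of samples of 50")
par(mfrow = c(1, 1))
```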

The most important question is: what are you trying to achieve?
