R – How to Test Large Dataset for Normality and Reliability

large data, normal distribution, normality-assumption, r

I'm examining a part of my dataset containing 46,840 double values ranging from 1 to 1690, split into two groups. To analyze the differences between these groups, I started by examining the distribution of the values so that I could pick the right test.

Following a guide on testing for normality, I did a qqplot, histogram & boxplot.

[Figure: Q-Q plot, histogram, and boxplot of the values]

This doesn't seem to be a normal distribution. Since the guide states, somewhat correctly, that a purely graphical examination isn't sufficient, I also want to test the distribution for normality.

Considering the size of the dataset and the limitation of the Shapiro-Wilk test in R, how should the given distribution be tested for normality, and, given the size of the dataset, would such a test even be reliable? (See accepted answer to this question)

Edit:

The limitation of the Shapiro-Wilk test I'm referring to is that the dataset to be tested is limited to 5000 points.
To cite another good answer concerning this topic:

An additional issue with the Shapiro-Wilk's test is that when you feed
it more data, the chances of the null hypothesis being rejected
becomes larger. So what happens is that for large amounts of data even
very small deviations from normality can be detected, leading to
rejection of the null hypothesis even though for practical purposes
the data is more than normal enough.

[…] Luckily shapiro.test protects the user from the above described
effect by limiting the data size to 5000.
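
The 5000-point limit mentioned above can be seen directly. The following is a minimal sketch, assuming simulated gamma-like values as a stand-in for the real data (the parameters are illustrative, not taken from the dataset). It shows that `shapiro.test()` stops with an error on the full vector, and that a random subsample of 5000 values can be tested instead. Subsampling only works around the limit; it does not address the large-sample concern raised in the quote.

```r
set.seed(1)

# Stand-in for the real data (assumption: skewed, gamma-like values)
x <- rgamma(46840, shape = 2.13, rate = 0.0085)

# shapiro.test() refuses more than 5000 values; capture the error message
full <- tryCatch(shapiro.test(x), error = function(e) conditionMessage(e))
print(full)

# A random subsample of 5000 values is within the allowed size
sub <- shapiro.test(sample(x, 5000))
print(sub$p.value)
```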

As to why I am testing for normal distribution in the first place:

Some hypothesis tests assume normal distribution of the data. I want to know whether or not I can use these tests.

Best Answer

I don't see why you'd bother. It's plainly not normal – in this case, graphical examination appears sufficient to me. You've got plenty of observations from what appears to be a nice clean gamma distribution. Just go with that. Run a Kolmogorov-Smirnov test on it if you must – I'll recommend a reference distribution.

x <- rgamma(46840, shape = 2.13, rate = 0.0085)
qqnorm(x); qqline(x, col = 'red')

hist(rgamma(46840,2.13,.0085))

boxplot(rgamma(46840,2.13,.0085))
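
If a Kolmogorov-Smirnov test is wanted, a rough sketch could look like the following. It assumes the same simulated gamma values as a stand-in for the real data and compares them with a gamma and a normal reference distribution via `ks.test()`. One caveat: when the reference parameters are estimated from the same data being tested (as with the normal mean and standard deviation here), the reported p-value is no longer exact, so treat the result as descriptive.

```r
set.seed(1)

# Stand-in for the real data (assumption: same gamma parameters as above)
x <- rgamma(46840, shape = 2.13, rate = 0.0085)

# Compare against the suggested gamma reference distribution
ks_gamma <- ks.test(x, "pgamma", shape = 2.13, rate = 0.0085)

# Compare against a normal distribution with matching mean and sd
ks_norm <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

# The KS statistic is the largest gap between the ECDF and the reference CDF
print(c(gamma = unname(ks_gamma$statistic), normal = unname(ks_norm$statistic)))
print(c(gamma = ks_gamma$p.value, normal = ks_norm$p.value))
```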

As I always say, "See Is normality testing 'essentially useless'?," particularly @MånsT's answer, which points out that different analyses have different sensitivities to different violations of normality assumptions. If your distribution is as close to mine as it looks, you've probably got skew $\approx1.4$ and kurtosis $\approx5.9$ ("excess kurtosis" $\approx2.9$). That's liable to be a problem for a lot of tests. If you can't just find a test with more appropriate parametric assumptions or none at all, maybe you could transform your data, or at least conduct a sensitivity analysis of whatever analysis you have in mind.
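
For reference, the skewness and kurtosis of a gamma distribution depend only on its shape parameter: skewness is 2/sqrt(shape) and excess kurtosis is 6/shape. The sketch below, assuming the shape of 2.13 used in the simulated reference distribution, computes those theoretical values (about 1.37 and 5.82) and simple moment-based estimates from one simulated sample. The sample estimates vary from run to run, which is one reason they need not match the theoretical values exactly.

```r
shape <- 2.13

# Theoretical values for a gamma distribution with this shape
skew_theory <- 2 / sqrt(shape)   # skewness
kurt_theory <- 3 + 6 / shape     # kurtosis (excess kurtosis is 6 / shape)

# Moment-based estimates from a simulated sample
set.seed(1)
x <- rgamma(46840, shape = shape, rate = 0.0085)
z <- (x - mean(x)) / sd(x)
skew_sample <- mean(z^3)
kurt_sample <- mean(z^4)

print(round(c(skew_theory = skew_theory, skew_sample = skew_sample,
              kurt_theory = kurt_theory, kurt_sample = kurt_sample), 2))
```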

Related Solutions

Solved – Normal approximation for large data set

I have a dataset that is highly skewed. See image below: Histogram of Hydrogen Gas Untransformed

Beware drawing histograms with very few bins.

First, they're not very good at showing details of the shape, such as small modes.

Second, you can sometimes get quite misleading impressions. (It shouldn't be likely to happen with this large a data set, though)

I'd suggest:

(i) if you're going to do a histogram with so many data points, you want perhaps 4-5 times as many bins as you have; you might consider several displays at somewhat different bin-widths.

(ii) consider a kernel density estimate on the log-scale
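
Both suggestions can be sketched in a few lines. The data here are hypothetical (a simulated right-skewed sample, since the original values are not available), and the bin counts are arbitrary choices for illustration. Note that a single number passed to `breaks` in `hist()` is treated as a suggestion, so the actual number of bins may differ slightly.

```r
set.seed(1)

# Hypothetical right-skewed data standing in for the real measurements
x <- rlnorm(5000, meanlog = 2, sdlog = 1)

op <- par(mfrow = c(2, 2))

# (i) Several histograms at different bin widths
for (b in c(10, 40, 80)) {
  hist(x, breaks = b, main = paste(b, "bins requested"))
}

# (ii) Kernel density estimate on the log scale
d <- density(log(x))
plot(d, main = "Kernel density of log(x)")

par(op)
```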

When I transform the data I get the following histogram that makes it look normal:

Doesn't look normal to me. It looks right skew. But you need more bins.

Using the package fitdistrplus in R shows that the transformed data is most likely a Gamma or a Log Normal distribution.

I bet you it isn't either of those. That's not to say it would be bad to use a gamma or lognormal model (such a model might be useful) - only that you'd be wrong to think your model was actually correct.

Why in your third plot is the x-axis (and the binning) different to your second plot?

This data set contains over 5000 data points, is there a better way to normalise this?

To what end?

Or because it's so large can I assume its approximately normal due to the central limit theorem?

The central limit theorem is about standardized averages as $n\to\infty$, not about the raw data. Making the sample size large makes the ECDF approach the CDF; it doesn't change the CDF at all, which will be non-normal all the way.
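
A small simulation can illustrate the distinction. This sketch uses exponential data as an arbitrary skewed example (an assumption, not the data from the question): the raw values stay about as skewed however many are drawn, while means of samples of size 100 are much closer to symmetric.

```r
set.seed(1)

# Simple moment-based sample skewness
skew <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

# Raw data: skewness does not shrink as the sample grows
raw_small <- rexp(100)
raw_large <- rexp(100000)

# Averages: 10,000 means, each of 100 exponential draws
means <- replicate(10000, mean(rexp(100)))

print(c(raw_small = skew(raw_small),
        raw_large = skew(raw_large),
        means     = skew(means)))
```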

The most important question is: what are you trying to achieve?

Normal Distribution – Conducting Normality Testing with Very Large Sample Sizes

Continuation from comment: If you are using simulated normal data from R, then you can be quite confident that what purport to be normal samples really are. So there shouldn't be 'quirks' for the Shapiro-Wilk test to detect.

Checking 100,000 standard normal samples of size 1000 with the Shapiro-Wilk test, I got rejections just about 5% of the time, which is what one would expect from a test at the 5% level.

set.seed(2019)
pv = replicate( 10^5, shapiro.test(rnorm(1000))$p.value )  # p-value of each test
mean(pv <= .05)  # proportion rejected at the 5% level
[1] 0.05009

Addendum. By contrast, the distribution $\mathsf{Beta}(20,20)$ "looks" very much like a normal distribution, but isn't exactly normal. If I do the same simulation for this approximate model, Shapiro-Wilk rejects about 7% of the time. Looked at from the perspective of power, that's not great. But it seems Shapiro-Wilk is sometimes able to detect that the data aren't exactly normal.

This is a long way from "always," but I think $\mathsf{Beta}(20,20)$ is closer to normal than a lot of real-life "normal" data are. (And the link says "always" may be "a bit strongly stated." I suspect the greatest trouble may come with samples a lot bigger than 1000, and for some normal approximations that are quite useful, even if imperfect.) "Not every statistically significant difference is a difference of practical importance." Sometimes, people who should know better seem to forget that when doing goodness-of-fit tests.

set.seed(2019)
pv = replicate( 10^5, shapiro.test(rbeta(1000, 20,20))$p.value )  # p-value of each test
mean(pv <= .05)  # proportion rejected at the 5% level
[1] 0.07152
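
To explore the remark about samples larger than 1000, the same simulation can be repeated at different sample sizes. This is a sketch only: `reject_rate` is a hypothetical helper, the number of replications is cut to 500 to keep the run short, and the sample sizes are arbitrary (5000 being the largest that `shapiro.test()` accepts). One would expect the rejection rate to rise with the sample size, but no output is shown here because it was not run for this write-up.

```r
set.seed(2019)

# Hypothetical helper: share of Shapiro-Wilk rejections at the 5% level
# for samples of size n drawn from Beta(20, 20)
reject_rate <- function(n, reps = 500) {
  pv <- replicate(reps, shapiro.test(rbeta(n, 20, 20))$p.value)
  mean(pv <= 0.05)
}

sizes <- c(1000, 5000)
rates <- sapply(sizes, reject_rate)
names(rates) <- paste0("n=", sizes)
print(rates)
```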
