Solved – Methods to check if the data fits a distribution function

normal distribution

My shortened data is:

y <- c(2,2,1,5,6,7,1,2,1,6,6,7,3,2,4,4,4,4,3,3,9,1,1,9)

First, I normalize my data:

y_scale <- scale(y)

Then I generate a model dataset from a normal distribution based on y_scale's mean and standard deviation:

y_norm <- rnorm(n = 24, mean = mean(y_scale), sd = sd(y_scale))

To check whether my data fits the normal distribution, I run

ks.test(y_scale, y_norm)

The result is as follows:

Two-sample Kolmogorov-Smirnov test

data:  y_scale and y_norm 
D = 0.2083, p-value = 0.6749
alternative hypothesis: two-sided 

Warning message:
In ks.test(y_scale, y_norm) : cannot compute correct p-values with ties

Here, my question is:

(1) My real data set has ~700,000 values, so I cannot use shapiro.test:

shapiro.test(y_scale)
Error in shapiro.test(y_scale) : sample size must be between 3 and 5000

(2) Is the p-value calculated above by ks.test wrong? How can I fix this problem with the p-values?

Warning message:
In ks.test(y_scale, y_norm) : cannot compute correct p-values with ties

(3) The reason I tried ks.test instead of other methods is that I want to compare against model datasets drawn from other distribution functions. It seems to me that I can simply replace y_norm with another model dataset and compare the resulting p-values or D statistics (the smaller, the better) to choose which distribution function fits my data best.
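For illustration only, a minimal R sketch of the comparison described in (3). The Poisson candidate and the name `y_pois` are my own hypothetical additions, not from the original post; note that the answer below explains why comparing KS results in this way is problematic.

```r
set.seed(1)
y <- c(2,2,1,5,6,7,1,2,1,6,6,7,3,2,4,4,4,4,3,3,9,1,1,9)

# Model datasets drawn from two candidate distributions,
# each parameterized from the observed data
y_norm <- rnorm(length(y), mean = mean(y), sd = sd(y))
y_pois <- rpois(length(y), lambda = mean(y))  # hypothetical second candidate

# Compare the two-sample KS D statistics (warnings about ties are
# expected, because data like y contain many repeated values)
ks.test(y, y_norm)$statistic
ks.test(y, y_pois)$statistic
```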

(4) Is it necessary to normalize my data first?

Best Answer

Why are you testing for normality?

As stated by @user603, the discreteness of your data suggests that it is not normal. But are you interested in whether it is normal enough?

With 700,000 data points you will have the power to detect departures from normality that are very minor and probably unimportant (I just compared a sample from a t distribution with 100 df to the normal and got a p-value of 0.011). If you are doing the normality test in preparation for normal-theory inference (t-tests, confidence intervals, etc.), then for most population distributions 700,000 observations should be enough for the central limit theorem to justify that inference; knowledge about the population is probably more important here than any test of normality.

Also, if your data really is discrete, then normalizing ruins the discreteness and could be misleading. For meaningful results you should not normalize.

Computing the mean and standard deviation after normalizing is meaningless, since scale() has already set the mean to 0 and the standard deviation to 1.

The KS test is designed to test against a fully specified distribution; normalizing the data first is equivalent to comparing against a distribution with estimated parameters, which also invalidates the p-value.
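To make the estimated-parameter point concrete, here is a sketch (my code, not the answerer's). The parameter values 4 and 2 below are hypothetical stand-ins for parameters known in advance; when the parameters must be estimated from the same data, the Lilliefors correction of the KS test, `lillie.test` in the CRAN package `nortest`, is one standard alternative and has no 5000-point limit like `shapiro.test`:

```r
y <- c(2,2,1,5,6,7,1,2,1,6,6,7,3,2,4,4,4,4,3,3,9,1,1,9)

# Valid one-sample KS usage: parameters fully specified in advance,
# NOT estimated from y (mean = 4 and sd = 2 are hypothetical)
ks.test(y, "pnorm", mean = 4, sd = 2)

# If parameters are estimated from the data, use the Lilliefors
# correction instead (install.packages("nortest") first)
library(nortest)
lillie.test(y)
```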