Solved – How far can the median, mode and mean be from each other while still being able to say the data has a normal distribution?

histogram, normal distribution

I'm working on the Boston housing project of the Udacity ML Nanodegree. A histogram of the data set looks like this:

The mean, median and mode are:

mean:   454342.944785
median: 438900.0
mode:   525000

Is it correct to say that it has a normal distribution?

Best Answer

How far apart can the median, mode and mean be while we can still say that it is a normal distribution?

This gets things backward: even if they were exactly equal, that would be no basis on which to claim you have a normal distribution. Note that:

- the population values are equal for many distributions that are not normal (if the population values were unequal you would of course have non-normality, but their all being equal does not even tell you that you have symmetry);
- the sample values could be equal, or very close to it, even if the population values differ (indeed, exact equality would suggest the distribution is discrete, and therefore not normal).
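To make the first point concrete, here is a small sketch of an asymmetric distribution whose mean, median and mode all coincide. The three-point distribution is a made-up illustration (it is not taken from the housing data), and the median is computed as the smallest value at which the cumulative probability reaches one half.

```python
from fractions import Fraction

# Hypothetical discrete distribution (value -> probability), chosen only
# to illustrate the point; it has nothing to do with the housing data.
pmf = {-3: Fraction(1, 10), 0: Fraction(6, 10), 1: Fraction(3, 10)}

mean = sum(x * p for x, p in pmf.items())
mode = max(pmf, key=pmf.get)

# Median: smallest value whose cumulative probability reaches 1/2.
cumulative = Fraction(0)
for x in sorted(pmf):
    cumulative += pmf[x]
    if cumulative >= Fraction(1, 2):
        median = x
        break

# A symmetric distribution has a zero third central moment,
# so a nonzero value here means the distribution is asymmetric.
third_central_moment = sum((x - mean) ** 3 * p for x, p in pmf.items())

print(mean, median, mode)       # all three equal 0
print(third_central_moment)     # -12/5, i.e. not symmetric
```

Exact fractions are used so that the equalities hold exactly rather than up to floating-point error.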

If you're using the data I think you are, for the variable you're referring to you have discretized and censored data, so normality would be moot. We can also see that it can't be normal because house values can't be negative.

So one thing you can say with confidence is that the values you have are not drawn from a normal distribution.

Leaving that specific data aside, what we can do, instead of trying to say that data come from a normal distribution when those location values are close together, is to ask "how far apart would they have to be to say that they're inconsistent with normality?".

That we can do something with, at least with respect to mean and median. (The sample mode is a bit tricky with continuous distributions; it would depend on how you obtain it; I suggest we leave that issue aside.)

The amount by which the sample mean and median tend to differ will depend on scale and sample size. So one way to assess that difference independently of scale is to measure how many standard deviations apart they are.

Note that (mean-median)/s.d. is one third of the second Pearson skewness; it's also (apparently) sometimes called the nonparametric skew.

So let's define that statistic, $$S=\frac{\bar{x}-\tilde{x}}{s}$$ (where $\tilde{x}$ is the sample median), which is one on which we can base a test.
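As a rough sketch of how $S$ could be computed, the following uses only the Python standard library. The function name `nonparametric_skew` and the simulated samples are assumptions made for illustration; `stdev` here is the usual $n-1$ sample standard deviation.

```python
import random
from statistics import mean, median, stdev

def nonparametric_skew(data):
    """Return S = (sample mean - sample median) / sample standard deviation."""
    return (mean(data) - median(data)) / stdev(data)

# Simulated data for illustration only (fixed seed for reproducibility).
rng = random.Random(0)
normal_sample = [rng.gauss(0, 1) for _ in range(1000)]
skewed_sample = [rng.expovariate(1.0) for _ in range(1000)]

print(nonparametric_skew(normal_sample))   # should be close to 0
print(nonparametric_skew(skewed_sample))   # should be clearly positive

# S does not depend on location or scale: shifting and rescaling the data
# leaves it unchanged (up to floating-point error).
rescaled = [10 * x + 5 for x in normal_sample]
print(nonparametric_skew(rescaled))
```

For an exponential population, (mean - median)/s.d. is $1-\ln 2 \approx 0.31$, so the second value should land somewhere near that.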

Doane & Seward (2011)[1] offer a brief table for a test of $3S$ (the second Pearson skewness) at the normal.

Cabilio and Masaro (1996)[2] use $S$ as a test statistic for a test of symmetry (based on the values at the normal).

[In their case the test is asymptotic; you'd reject symmetry if $|S|>0.7555 \,Z_{\alpha/2}/\sqrt{n}$. Simulations suggest the asymptotic values aren't too bad once you get some way beyond the end of Doane and Seward's table; I'd consider using it from about $n=400$ upward, though there's only about two-figure accuracy in the critical values.]
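The asymptotic rejection rule above could be sketched as follows. This is a minimal illustration rather than the authors' own implementation: the function name `cabilio_masaro_test`, its return layout and the simulated sample are assumptions, $Z_{\alpha/2}$ is taken to be the upper $\alpha/2$ quantile of the standard normal, and the p-value is obtained from the same normal approximation.

```python
import math
import random
from statistics import NormalDist, mean, median, stdev

def cabilio_masaro_test(data, alpha=0.05):
    """Asymptotic test of symmetry based on S = (mean - median) / s.

    Returns (S, critical value, approximate two-sided p-value, reject flag).
    """
    n = len(data)
    s_stat = (mean(data) - median(data)) / stdev(data)

    # Reject symmetry when |S| > 0.7555 * Z_{alpha/2} / sqrt(n).
    z_upper = NormalDist().inv_cdf(1 - alpha / 2)
    critical = 0.7555 * z_upper / math.sqrt(n)

    # Equivalent standardized statistic and its normal-approximation p-value.
    z = math.sqrt(n) * s_stat / 0.7555
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    return s_stat, critical, p_value, abs(s_stat) > critical

# Simulated right-skewed data for illustration only.
rng = random.Random(0)
sample = [rng.expovariate(1.0) for _ in range(1000)]

s_stat, critical, p_value, reject = cabilio_masaro_test(sample)
print(f"S = {s_stat:.4f}, critical value = {critical:.4f}, reject = {reject}")
```

Given the caveat about accuracy, the result is best treated as approximate, and for samples much smaller than about $n=400$ a tabulated or simulated critical value would be the safer choice.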

Note that using this sort of statistic to decide whether your distribution is non-normal would leave you unable to reject many other distributions (including, in spite of Cabilio & Masaro's test being a test of symmetry, some asymmetric distributions that have mean = median).

[1]: Doane, D. P. & Seward, L. E. (2011),
Measuring Skewness: A Forgotten Statistic?
Journal of Statistics Education, Volume 19, Number 2
https://ww2.amstat.org/publications/jse/v19n2/doane.pdf

[2]: Cabilio, P. & Masaro, J. (1996),
A Simple Test of Symmetry about an Unknown Median,
The Canadian Journal of Statistics, Vol. 24, No. 3 (Sep.), pp. 349-361
