Statistical Sufficiency – Mean, Variance & Kurtosis for Normal Distribution

Tags: kurtosis, normal-distribution, probability, sufficient-statistics

I have seen it stated many times that a normal distribution is fully specified by its mean and variance. It is clear that the third moment is not needed, since it is 0 for a perfect normal distribution. What I would like to know is: if the mean and variance are a sufficient statistic for the normal distribution, why do we have a positive kurtosis (the 4th moment) to describe the tails of the normal distribution?
Related Solutions
What you are thinking of here is something like a philosopher's stone of statistics.
The strict answer is:
No, it is impossible to express skewness or kurtosis via the mean and variance.
@Macro gave a counterexample: distributions with the same mean and variance but different skewness and kurtosis. The question of constructing a distribution for a given set of moments has occupied statisticians since the earliest days of the field, and Pearson's system of frequency curves is one example of how one could come up with a continuous distribution for the numeric values of the first four moments. You could also look at the moment generating function $m(t)={\rm E}[\exp(Xt)]$, the characteristic function $\phi(t)={\rm E}[\exp(iXt)]$, or the cumulant generating function $\psi(t) = \ln \phi(t)$. With some luck, you can try putting your four moments into them and inverting these functions to obtain an explicit expression for the density. Finally, you can always find a distribution with discrete support on five points that satisfies the five equations for the moments of order 0 through 4 by solving a corresponding system of nonlinear equations.
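As a minimal sketch of that last construction (illustrative only: the five support points below are fixed in advance, which turns the moment equations into a linear system in the probabilities; the fully general version, where the points are unknown as well, is the nonlinear problem described above; the target moments are those of a standard normal):

```python
import numpy as np

# Target moments of orders 0..4; here those of a standard normal: 1, 0, 1, 0, 3.
target = np.array([1.0, 0.0, 1.0, 0.0, 3.0])

# Five support points fixed in advance (an illustrative choice, not part of the
# quoted answer).  With the points fixed, the moment equations are linear in the
# probabilities.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Row k of the system: sum_i p_i * x_i**k = target[k], for k = 0..4.
A = np.vander(x, 5, increasing=True).T
p = np.linalg.solve(A, target)

print("probabilities:", p)                        # [1/12, 1/6, 1/2, 1/6, 1/12]
print("matched moments:", [float(p @ x**k) for k in range(5)])
```

Any choice of distinct points that yields non-negative probabilities will do; if one of them comes out negative, spread the support points further apart.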
To express the higher-order moments via the lower-order moments, you need to know the shape of the distribution and its parameters. For one-parameter (Poisson, exponential, geometric) or two-parameter (normal, gamma, binomial) distributions, you can express the higher-order moments via the natural parameters of these distributions; e.g., for a Poisson with rate $\lambda$, the skewness is $\lambda^{-1/2}$ and the excess kurtosis is $\lambda^{-1}$ (sanity check: both go to zero as $\lambda \to \infty$, consistent with the normal approximation to the Poisson for large $\lambda$). But these exceptions should not fool you; for more interesting distributions, including anything from the real world, you can just forget about doing anything meaningful with the kurtosis.
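As a quick numerical sanity check of those two formulas (a sketch using scipy, which reports excess kurtosis, the convention used above):

```python
from scipy import stats

# Skewness and excess kurtosis of Poisson(lambda), compared with the closed forms.
for lam in (0.5, 2.0, 10.0, 100.0):
    skew, exkurt = stats.poisson.stats(lam, moments="sk")
    print(f"lambda={lam:6.1f}  skew={float(skew):.4f} vs {lam ** -0.5:.4f}   "
          f"excess kurtosis={float(exkurt):.4f} vs {1.0 / lam:.4f}")
```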
You'd probably benefit from reading about sufficiency in any textbook on theoretical statistics, where most of these questions will be covered in detail. Briefly ...
Not necessarily. Those are special cases: among distributions whose support (the range of values the data can take) doesn't depend on the unknown parameter(s), only those in the exponential family have a sufficient statistic of the same dimensionality as the number of parameters. So for estimating the shape & scale of a Weibull distribution, or the location & scale of a logistic distribution, from independent observations, the order statistic (the whole set of observations disregarding their sequence) is minimal sufficient—you can't reduce it further without losing information about the parameters. Where the support does depend on the unknown parameter(s) it varies: for a uniform distribution on $(0,\theta)$, the sample maximum is sufficient for $\theta$; for a uniform distribution on $(\theta-1,\theta+1)$, the sample minimum and maximum are together sufficient.
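To see the uniform-$(0,\theta)$ case concretely, here is a small numerical illustration (not a proof), with made-up sample values: two samples of the same size that differ everywhere except in their maximum yield exactly the same likelihood function for $\theta$.

```python
import numpy as np

def uniform_loglik(theta, x):
    """Log-likelihood of an i.i.d. U(0, theta) sample:
    -n*log(theta) if max(x) <= theta, else -inf."""
    x = np.asarray(x)
    if theta <= 0 or x.max() > theta:
        return -np.inf
    return -len(x) * np.log(theta)

# Two made-up samples that differ everywhere except in their maximum (and size).
a = np.array([0.11, 0.73, 2.40, 1.05])
b = np.array([2.39, 0.02, 2.40, 0.58])

for theta in np.linspace(2.0, 5.0, 7):
    print(f"theta={theta:.2f}  loglik(a)={uniform_loglik(theta, a):8.4f}"
          f"  loglik(b)={uniform_loglik(theta, b):8.4f}")
# The two columns are identical: the likelihood sees the data only through max(x) and n.
```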
I don't know what you mean by "direct correspondence"; the alternative you give seems a fair way to describe sufficient statistics.
Yes: trivially the data as a whole are sufficient. (If you hear someone say there's no sufficient statistic they mean there's no low-dimensional one.)
Yes, that's the idea. (What's left—the distribution of the data conditional on the sufficient statistic—can be used for checking the distributional assumption independently of the unknown parameter(s).)
Apparently not, though I gather the counter-examples are not distributions you're likely to want to use in practice. [It'd be nice if anyone could explain this without getting too heavily into measure theory.]
In response to the further questions ...
The first factor, $ \mathrm{e}^{-n\lambda}\cdot\lambda^{\sum{x_i}}$, depends on $\lambda$ only through $\sum x_i$. So any one-to-one function of $\sum x_i$ is sufficient: $\sum x_i$, $\sum x_i/n$, $(\sum x_i)^2$†, & so on.
The second factor, $\tfrac{1}{x_1! x_2! \ldots x_n!}$, doesn't depend on $\lambda$ & so won't affect the value of $\lambda$ at which $f(x;\lambda)$ is a maximum. Derive the MLE & see for yourself.
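For concreteness, here is a sketch of that derivation; the factorial term involves no $\lambda$ and vanishes on differentiation:

$$
\ell(\lambda) = \ln f(x;\lambda) = -n\lambda + \Big(\sum_i x_i\Big)\ln\lambda - \sum_i \ln(x_i!),
\qquad
\frac{d\ell}{d\lambda} = -n + \frac{\sum_i x_i}{\lambda} = 0
\;\Longrightarrow\;
\hat\lambda = \frac{\sum_i x_i}{n},
$$

so the MLE is a one-to-one function of $\sum_i x_i$, as claimed.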
The sample size $n$ is a known constant rather than a realized value of a random variable‡, so isn't considered part of the sufficient statistic; the same goes for known parameters other than the ones you want to infer things about.
† In this case squaring is one-to-one because $\sum x_i$ is always positive.
‡ When $n$ is a realized value of the random variable $N$, then it will be part of the sufficient statistic, $(\sum x_i,n)$. Say you choose a sample size of 10 or 100 by tossing a coin: $n$ tells you nothing about the value of $\theta$ but does affect how precisely you can estimate it; in this case it's called an ancillary complement to $\sum x_i$ & inference can proceed by conditioning on its realized value—in effect ignoring that it might have come out different.
Best Answer
There's quite a large amount of confusion in this question.
In the first place, most probabilists who are not statisticians have never even heard of the concept of a sufficient statistic, but all of them know that a normal distribution is uniquely characterized among the family of normal distributions by its expected value and variance. That is the sense in which the mean and variance are "sufficient" to identify a normal distribution. It is not about what statisticians call sufficient statistics at all; that is an altogether different concept.

The latter concept concerns an i.i.d. sample, and no i.i.d. sample is in any way involved in the statement that the mean and the variance characterize a normal distribution within the family of normal distributions. To say that the sample mean and the sample variance constitute a sufficient statistic for the family of normal distributions means that the conditional distribution of the $n$-tuple of observations, given the value of the sample mean and the sample variance, does not depend on which normal distribution the sample was drawn from, i.e. does not depend on the mean and variance of that distribution.
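One standard way to make that concrete is the Fisher–Neyman factorization criterion; sketched for an i.i.d. sample $x_1,\dots,x_n$ from $N(\mu,\sigma^2)$, with $\bar x$ the sample mean and $s^2$ the sample variance:

$$
f(x_1,\dots,x_n;\mu,\sigma^2)
= (2\pi\sigma^2)^{-n/2}\exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\right)
= (2\pi\sigma^2)^{-n/2}\exp\!\left(-\frac{(n-1)s^2 + n(\bar x-\mu)^2}{2\sigma^2}\right),
$$

which depends on the data only through $(\bar x, s^2)$; by the factorization criterion (with $h(x)=1$), that pair is sufficient for $(\mu,\sigma^2)$.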
Now notice that I said "among the family of normal distributions." The mean and variance do not characterize a normal distribution without that or something equivalent to it. In other words, there are many non-normal distributions that have the same mean and the same variance as a particular normal distribution. To say that the mean and the variance are enough to determine a normal distribution means only that they are enough to separate it from other normal distributions.
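As a small numerical illustration of that point (a sketch with scipy; the rescaled Laplace is just one convenient example of a non-normal distribution matching the normal's first two moments):

```python
import numpy as np
from scipy import stats

# A standard normal and a Laplace rescaled to have the same mean (0) and variance (1):
# the Laplace variance is 2*scale**2, so scale = 1/sqrt(2) matches the normal's variance.
norm = stats.norm(loc=0, scale=1)
lap = stats.laplace(loc=0, scale=1 / np.sqrt(2))

for name, d in (("normal", norm), ("laplace", lap)):
    m, v, s, k = (float(x) for x in d.stats(moments="mvsk"))
    print(f"{name:8s} mean={m:.3f}  var={v:.3f}  skew={s:.3f}  excess kurtosis={k:.3f}")
```

Same mean and variance, yet the Laplace has excess kurtosis 3 where the normal has 0: the first two moments alone do not pin down the tails.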
Next, why shouldn't a normal distribution have higher moments? The $n$th moment of a distribution is just $\operatorname E(X^n)$, where $X$ is a random variable with that distribution. It exists if, and only if, $\operatorname E\left(\left| X \right|^n \right) < +\infty$ (note the absolute value). That's all it means.
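And for the normal distribution every moment does exist. The standard formula for the even central moments of $X \sim N(\mu,\sigma^2)$ is

$$
\operatorname E\big[(X-\mu)^{2k}\big] = \sigma^{2k}\,(2k-1)!! = \sigma^{2k}\,(2k-1)(2k-3)\cdots 3 \cdot 1,
$$

so the fourth central moment is $3\sigma^4$ and the kurtosis is $3$ (excess kurtosis $0$) for every normal distribution. The higher moments are perfectly well defined; within the normal family they simply carry no information beyond $\mu$ and $\sigma^2$.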