The height of the students in a school class is said to follow a Normal distribution. We know that the support of a random variable following a Normal distribution is the whole real line, from minus infinity to plus infinity. But height can never take a negative value. So how can we say that height follows a Normal distribution?
Solved – Why does the distribution of height follow a Normal distribution?
Related Solutions
I think you might be better off treating this as a mixture of two distributions rather than trying to apply the standard normal-theory tools. So instead I'm going to outline a bit about the zero-inflated Gamma distribution, including computing its first two moments, to give you a sense of how this goes. You could pretty easily swap the Gamma out for a different continuous distribution if you'd like (e.g., a Beta distribution scaled to be in $[0,100]$). Happy to add updates later if this is not helpful.
Let $Z_1,\dots,Z_n\stackrel{\text{iid}}\sim\text{Bern}(\theta)$ and consider $X_1,\dots,X_n$ where $$ X_i \vert Z_i \stackrel{\text{iid}}\sim \begin{cases}\Gamma(\alpha, \beta) & Z_i = 1 \\ 0 & Z_i = 0\end{cases} $$
so each $X_i$ is a mixture of a point mass at $0$ with probability $1-\theta$ and a $\Gamma(\alpha,\beta)$ with probability $\theta$. We interpret this as $Z_i$ being a hidden latent variable that determines whether or not the student studies, and then $X_i$ is the observed value.
This is a bit formal but I'm going to mention it for the sake of completeness. $X_i$ does not have a pdf in the usual sense because it's neither discrete nor continuous, but if we consider the measure $\nu = \lambda + \delta_0$, i.e. the Lebesgue measure plus a point mass at $0$, then $\nu(A) = 0 \implies P_X(A) = 0$ for any measurable $A$ so we can get a pdf $f_X := \frac{\text dP_X}{\text d\nu}$ w.r.t. $\nu$.
But what does this pdf look like? We can work out the CDF $F$ using some rules of conditional probability. $$ F(x) = P(X\leq x \cap Z = 0) + P(X\leq x \cap Z = 1) \\ = P(X\leq x | Z = 0)P(Z=0) + P(X\leq x | Z=1)P(Z=1) \\ = 1 \cdot (1 - \theta) + F_\Gamma(x; \alpha, \beta) \theta \\ = 1 - \theta + \theta F_\Gamma(x; \alpha, \beta) $$ where $F_\Gamma$ denotes the CDF of an actual Gamma distribution.
So we want a function $f_X$ such that $$ F(x) = \int_{[0, x]} f_X\,\text d\nu. $$ Note that $$ \int_{[0, x]} f_X\,\text d\nu = \int_{\{0\}} f_X\,\text d\delta_0 + \int_{(0, x]} f_X\,\text d\lambda \\ = f_X(0) + \int_{(0, x]} f_X\,\text d\lambda $$ so, matching the point mass at $0$ and the continuous part of $F$ separately, I can take $$ f_X(x) = (1 - \theta)\,\mathbf 1(x = 0) + \theta f_\Gamma(x; \alpha, \beta)\,\mathbf 1(x > 0). $$
Let's check that this is a valid pdf: $$ \int_{[0,\infty)} f_X\,\text d\nu = 1 - \theta + \theta \int_0^\infty f_\Gamma \,\text d\lambda = 1 $$ so this is indeed a valid pdf (w.r.t. $\nu$).
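As a quick numerical sanity check on that normalization, here's a sketch in Python with SciPy (mirroring the R code further down; the parameter values are the same arbitrary ones used in that simulation). It verifies that the point mass plus the continuous part integrates to $1$, and evaluates the mixture CDF $F(x) = 1 - \theta + \theta F_\Gamma(x;\alpha,\beta)$ at an arbitrary point:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Arbitrary parameter values, matching the simulation below
theta, a, b = 0.76, 5.4, 1.2

# Continuous part: theta times the Gamma(a, rate=b) density over (0, inf)
cont_mass, _ = quad(lambda x: theta * stats.gamma.pdf(x, a, scale=1/b), 0, np.inf)

# Total mass = point mass at 0 plus continuous mass; should equal 1
total_mass = (1 - theta) + cont_mass

# Mixture CDF at an arbitrary point x0: F(x0) = 1 - theta + theta * F_Gamma(x0)
x0 = 3.0
F_x0 = 1 - theta + theta * stats.gamma.cdf(x0, a, scale=1/b)
```

Note that $F(x) \geq 1 - \theta$ for all $x \geq 0$, reflecting the atom at zero.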
Now I'll work out the first two moments of $X_i$.
$$ E(X_i) = \int_{[0,\infty)} x f_X(x)\,\text d\nu(x) \\ = 0 \cdot \int_{\{0\}} f_X\,\text d\delta_0 + \int_{(0,\infty)} x f_X(x)\,\text d\lambda(x) \\ = 0 + \theta \int_0^\infty x f_\Gamma(x) \,\text d\lambda(x) = \frac{\theta\alpha}\beta := \mu < \infty. $$ Next $$ E(X_i^2) = \int_{[0,\infty)} x^2 f_X(x)\,\text d\nu(x) \\ = 0 + \theta \int_0^\infty x^2 f_\Gamma(x)\,\text d\lambda(x) \\ = \frac{\theta\alpha(1 + \alpha)}{\beta^2}. $$ This means $$ \sigma^2 := E(X_i^2) - \mu^2 < \infty. $$
At long last I have confirmed the following facts: we have a collection of iid RVs $X_1, X_2,\dots$ with finite mean and variance, so we can happily apply the standard CLT to conclude
$$ \sqrt{n}\,(\bar X_n - \mu) \stackrel{\text d}\to \mathcal N(0, \sigma^2). $$
Now as for how good this is, you'll probably want to do some simulations. Also I'm not saying this is actually a good model.
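One way to do such a simulation: draw many replicate samples from the zero-inflated Gamma, standardize each sample mean, and check that the result looks standard normal. A sketch in Python with NumPy (the parameters match the R code below; the sample size and replicate count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
theta, a, b = 0.76, 5.4, 1.2                   # same arbitrary parameters as the R code

mu = theta * a / b                             # E(X) from the moment calculation above
sigma2 = theta * a * (1 + a) / b**2 - mu**2    # Var(X) = E(X^2) - mu^2

n, reps = 400, 4000
# reps independent samples of size n from the zero-inflated Gamma
z = rng.binomial(1, theta, size=(reps, n))
x = z * rng.gamma(a, 1/b, size=(reps, n))      # zero wherever z = 0

# Standardized sample means; by the CLT these should look ~ N(0, 1)
t = np.sqrt(n) * (x.mean(axis=1) - mu) / np.sqrt(sigma2)
```

A histogram or QQ-plot of `t` against the standard normal shows how good the approximation is at this sample size.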
I'll check my math with the following simulation:
theta <- .76
a <- 5.4
b <- 1.2
n <- 1e6
set.seed(42)
z <- rbinom(n, 1, theta)
x <- numeric(n)
x[z==1] <- rgamma(sum(z), shape=a, rate=b)
hist(x, main="Zero inflated Gamma simulations")
mean(x)
theta * a / b # agrees
mean(x^2)
theta * a * (1 + a) / b^2 # agrees
Also note that I'm not using a KDE to show the distribution, as it looks like you are. KDEs typically aren't appropriate for distributions with a point mass like this one. Moreover, if the KDE puts a small Gaussian kernel at each data point, it implicitly assumes the support is all of $\mathbb R$, so you can also get positive probability on impossible regions, as happened in your plot.
If you choose to use this model and want to estimate the parameters, the EM algorithm is the usual way to go.
In this case, though, there's no doubt about which component a particular $X_i$ belongs to: if $X_i = 0$ then $Z_i = 0$ almost surely, since a Gamma random variable is positive with probability one. So you can do
mean(x > 0) # compare with theta
mu.x <- mean(x[x > 0])
s2.x <- var(x[x > 0])
(b.hat <- mu.x / s2.x)
(a.hat <- mu.x^2 / s2.x)
and these agree. But I have a massive sample size and $\theta$ isn't particularly close to $0$ or $1$ here so it's not impressive to be so accurate with this conditional approach.
Some points:
The choice of a normal distribution for the random effects in linear mixed models (i.e., models for a normally distributed outcome) is typically made for mathematical convenience. That is, the normal distribution of $[Y \mid b]$ combines nicely with the normal distribution for the random effects $[b]$, and you get a marginal distribution for the outcome $[Y]$ that is multivariate normal.
In that regard it helps to see a mixed model as a hierarchical Bayesian model. Namely, in the linear mixed model, assuming a normal distribution for the random effects amounts to a conjugate prior that gives you back a posterior in closed form. Hence, you can do the same for other distributions. If you have binomial outcome data, the conjugate prior for the random effects is a Beta distribution, and you get the Beta-Binomial model. Likewise, if you have Poisson outcome data, the conjugate prior for the random effects is a Gamma distribution, and you get the Gamma-Poisson model. To be clear, in these examples the distribution of the random effects is placed on the scale of the mean of the outcome conditional on the random effects, not on the scale of the linear predictor (e.g., in the Gamma-Poisson example, on the linear-predictor scale the assumed distribution would be a log-Gamma distribution).
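The Gamma-Poisson conjugacy can be checked by simulation: if the random effect on the mean is $\lambda \sim \Gamma(\alpha, \beta)$ and $Y \mid \lambda \sim \text{Poisson}(\lambda)$, the marginal of $Y$ is negative binomial with size $\alpha$ and success probability $\beta/(\beta+1)$. A sketch in Python (the parameter values are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, beta = 3.0, 2.0                 # arbitrary Gamma parameters for the random effect

# Gamma random effect on the mean scale, then Poisson counts given that mean
lam = rng.gamma(alpha, 1/beta, size=200_000)
y = rng.poisson(lam)

# Marginally, Y ~ NegBinom(size = alpha, p = beta / (beta + 1))
p = beta / (beta + 1)
emp = np.mean(y == 2)                  # empirical P(Y = 2)
theo = stats.nbinom.pmf(2, alpha, p)   # theoretical negative binomial pmf at 2
```

The empirical and theoretical probabilities agree up to Monte Carlo error, which is exactly the closed-form marginal that conjugacy buys you.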
There is nothing stopping you from changing the distribution. For example, in the linear mixed model you could use a Student's t distribution for the random effects, or, for categorical outcomes, a normal distribution. But then you lose the computational advantage of having a closed-form posterior. There is considerable literature on the impact of changing the random-effects distribution. Many people have proposed flexible models for it, for example using splines or mixtures to capture random-effects distributions that are multimodal. However, the general consensus has been that the normal distribution works pretty well. Namely, even if you simulate data from a bimodal or skewed random-effects distribution and assume in your mixed model that it is normal, the results (i.e., parameter estimates and standard errors) are almost identical to those from a flexible model that captures this distribution more appropriately.
Hence, the choice of the normal distribution has dominated, even though other options do exist. With regard to your point on whether the choice of a normal distribution is defensible for categorical data, as Ben mentioned, note that the distribution of the random effects is placed not on the outcome but rather on the transformed mean of the outcome. For example, for Poisson data you assume a normal distribution for the random effects in $\log(\mu)$, where $\mu$ denotes the expected count of the outcome variable $Y$, not for the observed counts $Y$ themselves.
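To illustrate that last point, here is a sketch in Python (all parameter values are arbitrary) of a Poisson model with a normal random effect on the $\log(\mu)$ scale. The random effect $b$ is normal, but the induced subject-specific mean $\mu = \exp(\beta_0 + b)$ is log-normal, so the normality assumption never applies to the counts $Y$ directly:

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, sigma_b = 1.0, 0.5                  # arbitrary fixed intercept and random-effect SD

b = rng.normal(0, sigma_b, size=100_000)   # normal random effects on the log(mu) scale
mu = np.exp(beta0 + b)                     # subject-specific expected counts (log-normal)
y = rng.poisson(mu)                        # observed counts given the random effects

# b is normal, but mu is log-normal: E(mu) = exp(beta0 + sigma_b^2 / 2)
emp_mean = mu.mean()
theo_mean = np.exp(beta0 + sigma_b**2 / 2)
```

A histogram of `b` is symmetric while a histogram of `mu` is right-skewed, which is the distinction between placing the distribution on the linear-predictor scale versus the outcome scale.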
Best Answer
It is just an approximation, a generalization made so that high-school statistics classes can take samples of their friends in an interactive classroom activity. It also serves as a nice introductory lesson for these classes, showing the prevalence of the normality assumption in real life and how close a Gaussian actually comes to modeling real datasets.
According to Wikipedia...
Also, according to this site (which also makes sweeping approximations and generalizations like the one you are concerned with)...
Here is a link to a mathematically grounded article criticizing the very assumption you are questioning. It's quite a nice article, especially the opening sections.