Solved – Normal Distribution With Many Zero Values

distributionsnormal distribution

I have here a probability density function representing the amount of hours spent studying for a class final. Unsurprisingly, many of the students do not study for the final.

enter image description here

However, because so many students stack up on 0 hours studied, it appears that the curve extends into a negative domain, which is not possible because you can't study for negative hours.

Here are some questions:

  • How can one interpret the results for a curve like this?
  • Can values for probability, z-score, std, cdf, etc be reliably calculated?
  • Is there a way to transform this so that all values of the domain are greater than 0?

Best Answer

I think you might be better off treating this as a mixture of two distributions rather than trying to apply the standard normal-theory tools, so instead I'm going to outline a bit about the zero inflated Gamma distribution, including computing its first two moments, to give you a sense of how this goes. You could pretty easily swap the Gamma out for a different continuous distribution if you'd like (e.g. a Beta distribution scaled to be in $[0,100]$). Happy to add updates later if this is not helpful.

Let $Z_1,\dots,Z_n\stackrel{\text{iid}}\sim\text{Bern}(\theta)$ and consider $X_1,\dots,X_n$ where $$ X_i \vert Z_i \stackrel{\text{iid}}\sim \begin{cases}\Gamma(\alpha, \beta) & Z_i = 1 \\ 0 & Z_i = 0\end{cases} $$

so each $X_i$ is a mixture of a point mass at $0$ with probability $1-\theta$ and a $\Gamma(\alpha,\beta)$ with probability $\theta$. We interpret this as $Z_i$ being a hidden latent variable that determines whether or not the student studies, and then $X_i$ is the observed value.

This is a bit formal but I'm going to mention it for the sake of completeness. $X_i$ does not have a pdf in the usual sense because it's neither discrete nor continuous, but if we consider the measure $\nu = \lambda + \delta_0$, i.e. the Lebesgue measure plus a point mass at $0$, then $\nu(A) = 0 \implies P_X(A) = 0$ for any measurable $A$ so we can get a pdf $f_X := \frac{\text dP_X}{\text d\nu}$ w.r.t. $\nu$.

But what does this pdf look like? We can work out the CDF $F$ using some rules of conditional probability. $$ F(x) = P(X\leq x \cap Z = 0) + P(X\leq x \cap Z = 1) \\ = P(X\leq x | Z = 0)P(Z=0) + P(X\leq x | Z=1)P(Z=1) \\ = 1 \cdot (1 - \theta) + F_\Gamma(x; \alpha, \beta) \theta \\ = 1 - \theta + \theta F_\Gamma(x; \alpha, \beta) $$ where $F_\Gamma$ denotes the CDF of an actual Gamma distribution.

So we want a function $f_X$ such that $$ F(x) = \int_{[0, x]} f_X\,\text d\nu. $$ Note that $$ \int_{[0, x]} f_X\,\text d\nu = \int_{\{0\}} f_X\,\text d\delta_0 + \int_{(0, x)} f_X\,\text d\lambda \\ = f_X(0) + \int_{(0, x)} f_X\,\text d\lambda $$ so I can take $$ f_X(x) = 1 - \theta + \theta f_\Gamma(x; \alpha, \beta). $$

Let's check that this is a valid pdf: $$ \int_{[0,\infty)} f_X\,\text d\nu = 1 - \theta + \theta \int_0^\infty f_\Gamma \,\text d\lambda = 1 $$ so this is indeed a valid pdf (w.r.t. $\nu$).


Now I'll work out the first two moments of $X_i$.

$$ E(X_i) = \int_{[0,\infty)} x f_X(x)\,\text d\nu(x) \\ = 0 \cdot \int_{\{0\}} f_X\,\text d\delta_0 + \int_{(0,\infty)} x f_X(x)\,\text d\lambda(x) \\ = 0 + \theta \int_0^\infty x f_\Gamma(x) \,\text d\lambda(x) = \frac{\theta\alpha}\beta := \mu < \infty. $$ Next $$ E(X_i^2) = \int_{[0,\infty)} x^2 f_X(x)\,\text d\nu(x) \\ = 0 + \theta \int_0^\infty x^2 f_\Gamma(x)\,\text d\lambda(x) \\ = \frac{\theta\alpha(1 + \alpha)}{\beta^2}. $$ This means $$ \sigma^2 := E(X_i^2) - \mu^2 < \infty. $$

At long last I have confirmed the following facts: we have a collection of iid RVs $X_1, X_2,\dots$ with finite means and variances, so we can happily apply the standard CLT to conclude

$$ \frac{\bar X_n - \mu_n}{\sqrt n} \stackrel{\text d}\to \mathcal N(0, \sigma^2). $$

Now as for how good this is, you'll probably want to do some simulations. Also I'm not saying this is actually a good model.


I'll check my math with the following simulation:

theta <- .76
a <- 5.4
b <- 1.2
n <- 1e6

set.seed(42)
z <- rbinom(n, 1, theta)
x <- numeric(n)
x[z==1] <- rgamma(sum(z), shape=a, rate=b)

hist(x, main="Zero inflated Gamma simulations")

mean(x)
theta * a / b  # agrees

mean(x^2)
theta * a * (1 + a) / b^2  # agrees

enter image description here

Also note that I'm not using a KDE to show the distribution like (it looks like) you are. Those typically aren't appropriate for distributions that have a point mass like this. Plus if you're using one that puts a mini gaussian at each data point then it's implicitly assumed that the support is all of $\mathbb R$ so you can also get positive probability on impossible areas like you did.

If you choose to use this model and want to estimate the parameters, the EM algorithm is the usual way to go.

In this case though there's no doubt as to which class a particular $X_i$ belongs as if $X_i = 0$ then $Z_i = 0$ almost surely. So you can do

mean(x > 0)  # compare with theta

mu.x <- mean(x[x > 0])
s2.x <- var(x[x > 0])
(b.hat <- mu.x / s2.x)
(a.hat <- mu.x^2 / s2.x)

and these agree. But I have a massive sample size and $\theta$ isn't particularly close to $0$ or $1$ here so it's not impressive to be so accurate with this conditional approach.