Using Histograms To Study Conditional Distributions

regression

I have the following question about Conditional Probability Distributions vs Marginal Distributions. To provide some context, I will post my current understanding of this issue in two parts – followed by my question.

Part 1: Suppose we take a standard statistical regression model:

\begin{equation}
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
\end{equation}

where:

  • $Y_i$ is the response variable
  • $X_i$ is the predictor variable
  • $\beta_0$ and $\beta_1$ are the coefficients
  • $\epsilon_i$ is the error term.

The error terms $\epsilon_i$ are assumed to follow a normal distribution with mean 0 and constant variance $\sigma^2$, i.e.,

\begin{equation}
\epsilon_i \sim N(0, \sigma^2)
\end{equation}

Given this information (and treating the $X_i$ as fixed), we can find the Expected Value and Variance of $Y_i$:

\begin{align*}
E(Y_i) &= E(\beta_0 + \beta_1 X_i + \epsilon_i) \\
&= \beta_0 + \beta_1 X_i
\end{align*}

\begin{align*}
Var(Y_i) &= Var(\beta_0 + \beta_1 X_i + \epsilon_i) \\
&= Var(\beta_0) + Var(\beta_1 X_i) + Var(\epsilon_i) \\
&= 0 + 0 + \sigma^2 \\
&= \sigma^2
\end{align*}

since $\beta_0$ and $\beta_1 X_i$ are non-random constants, their variances are zero.

Using the mathematical properties of the Normal Distribution (i.e. linear combinations of Normal Distributions are also Normally Distributed), we usually write the following:

\begin{equation}
Y_i \mid \mathbf{x}_i \sim N(\mathbf{x}_i^T \boldsymbol{\beta}, \sigma^2)
\end{equation}

This means that if I want to fit a standard regression model to some data, I can check whether the errors (i.e. Observed Value – Predicted Value) of my model are Normally Distributed. If the errors are "more or less" Normally Distributed, then I can (in part) conclude that this assumption/requirement has been satisfied. In Statistics, there are several methods which can be used to test whether the errors are Normally Distributed (e.g. Quantile-Quantile Plots, the Kolmogorov-Smirnov Test, etc.).
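As a concrete illustration of this check, here is a minimal Python sketch. The simulated data, the least-squares fit via `np.polyfit`, and the choice of the KS test are all illustrative assumptions (and, strictly, estimating $\sigma$ from the residuals first makes the KS p-value only approximate):

```python
# Sketch: checking whether OLS residuals look Normal (simulated data is illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.5, n)  # true model: beta0=2, beta1=0.5, sigma=1.5

# Fit by least squares and compute residuals (observed - fitted)
beta1, beta0 = np.polyfit(x, y, 1)
resid = y - (beta0 + beta1 * x)

# Kolmogorov-Smirnov test against N(0, sigma_hat); p-value is approximate
# because sigma_hat is estimated from the same residuals.
stat, pval = stats.kstest(resid, "norm", args=(0, resid.std(ddof=2)))
print(f"KS statistic = {stat:.3f}, p-value = {pval:.3f}")
```

A Q-Q plot (e.g. `stats.probplot(resid, dist="norm")`) would give the corresponding graphical check.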

Part 2: Now, consider a Poisson Regression Model.

A Poisson Regression Model assumes that the response variable $Y$ has a Poisson distribution, and that the logarithm of its expected value can be modeled by a linear combination of unknown parameters.

\begin{align*}
\log(\mu_i) &= \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} \\
y_i \mid x_i &\sim \text{Poisson}(\mu_i)
\end{align*}

where:

  • $\mu_i$ is the expected value of the response variable for the $i$-th observation
  • $x_{ij}$ represents the $j$-th predictor variable for the $i$-th observation
  • $\beta_0, \ldots, \beta_k$ are the parameters of the model estimated from the data

Thus, just as in Part 1 – the conditional distribution follows a Poisson Distribution:

$$Y_i | x_i \sim \text{Poisson}(\mu_i), \quad \text{where} \quad \mu_i = e^{\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \cdots}$$
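To make this concrete, here is a numpy-only sketch that fits the Poisson log-linear model above by iteratively reweighted least squares. The simulated data and the hand-rolled IRLS loop are illustrative assumptions — in practice one would use a GLM routine:

```python
# Sketch: fitting a Poisson regression (log link) by IRLS, numpy only.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(0, 2, n)
X = np.column_stack([np.ones(n), x])        # design matrix with intercept
beta_true = np.array([0.3, 0.8])
y = rng.poisson(np.exp(X @ beta_true))      # Y_i | x_i ~ Poisson(mu_i)

# Iteratively reweighted least squares for the Poisson GLM with log link
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)                    # current fitted means
    z = X @ beta + (y - mu) / mu             # working response
    beta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (mu * z))

print(beta)  # estimates should be close to beta_true
```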

To my initial surprise, I realized that the "errors" (i.e. Observed Value – Predicted Value) in a Poisson Model do not have a Poisson Distribution. Not only that – in a Poisson Model, the errors do not follow any standard named Probability Distribution! This is because the Observed Value is a discrete count whereas the Predicted Value will generally not be an integer. Thus, it is not immediately clear what the distribution of the errors in a Poisson Regression Model is.

This leads me to my question: In the case of a Poisson Regression Model – how can we specifically check the assumption whether the Conditional Distribution indeed follows a Poisson Distribution?

After doing some reading and some thinking, this is my current understanding of the issue (i.e. how to verify the conditional distribution follows a Poisson Distribution):

  • Step 1: Given some data, first we can fit a Poisson Regression Model
  • Step 2: Then – for each data point we observed, we can use the Regression Model to simulate multiple values of $Y_i$ for that single point. Ideally, these simulated $Y_i$'s should be close to the actual $Y_i$.
  • Step 3: After you repeat this simulation for all observed points – you can now make a histogram for all the simulations. In the ideal case, this histogram should look similar to a histogram made only using the observed values of $Y_i$.
  • Step 4: I think we can then compare the Empirical Distribution Function (EDF) of the simulations vs the EDF of the theoretical Poisson Distribution (e.g. with a Kolmogorov-Smirnov Test). If the EDF of the simulations and the EDF of the theoretical Poisson Distribution are similar – then we can conclude that the conditional distribution assumption/requirement is satisfied.
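The four steps above can be sketched as follows (numpy/scipy only; the simulated data, the hand-rolled IRLS fit, and the pooled two-sample comparison are all illustrative assumptions — note also that the KS test is designed for continuous distributions, so on counts it is only a rough check):

```python
# Sketch of Steps 1-4: fit, simulate per observation, compare distributions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_sim = 400, 200

# Step 1: data from a true Poisson model, fitted by IRLS (log link)
x = rng.uniform(0, 2, n)
X = np.column_stack([np.ones(n), x])
y = rng.poisson(np.exp(0.2 + 0.7 * x))
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)
    beta = np.linalg.solve(X.T @ (mu[:, None] * X),
                           X.T @ (mu * (X @ beta + (y - mu) / mu)))

# Step 2: simulate n_sim replicate Y_i's per observation from the fitted model
mu_hat = np.exp(X @ beta)
y_sim = rng.poisson(mu_hat, size=(n_sim, n))

# Steps 3-4: compare the pooled simulated counts with the observed counts.
# (KS assumes continuous data, so for counts this is only a rough screen.)
stat, pval = stats.ks_2samp(y_sim.ravel(), y)
print(f"KS statistic = {stat:.3f}, p-value = {pval:.3f}")
```

If the model is correctly specified, the pooled simulated histogram and the observed histogram should look alike and the KS statistic should be small.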

Can someone please tell me if my understanding is correct?

Thanks!

  • Note: I know there are other methods which can be used to check the overall model fit (e.g. Goodness of Fit Tests, Pearson Residuals, Deviance Residuals) … but I am specifically interested in learning how to check if the conditional distribution follows a Poisson Distribution.

Best Answer

Yes - the residuals from your Poisson regression are not Poisson distributed overall. The responses are only Poisson distributed conditional on the linear predictor.

The DHARMa package implements something akin to your closing steps. https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html. It is very useful for assessing the fit of GLMs with non-Gaussian errors.

Given some fitted model, DHARMa simulates from the model for each $x_i$. This gives the Poisson/whatever distribution conditional upon the linear predictor.

Then, the conditional distribution for observation $i$ is compared to the actual observed data point $y_i$. Specifically, this comparison uses the empirical cumulative distribution function of the simulated values, evaluated at $y_i$. For example, if $y_i = 5$ and 20% of the simulated conditional distribution is less than 5, the comparison value is 0.2 (this is the "DHARMa simulated residual").

The key theoretical result underpinning DHARMa is that if your model is correct, then the simulated residuals will have a $\text{Uniform}(0,1)$ distribution. This gets around pesky issues like the variance increasing with the mean, which is the case for models like the Poisson, Negative Binomial, etc.
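A minimal Python sketch of this idea (DHARMa itself is an R package; the data, the assumption that $\hat\mu_i$ is known, and the number of simulations here are all illustrative):

```python
# Sketch: DHARMa-style simulated (randomised PIT) residuals, numpy only.
import numpy as np

rng = np.random.default_rng(3)
n, n_sim = 500, 250

# A correctly specified Poisson model; pretend mu_hat came from a fitted model
x = rng.uniform(0, 2, n)
mu_hat = np.exp(0.1 + 0.9 * x)
y = rng.poisson(mu_hat)

# Simulate from the fitted model, then compute, for each observation, the
# proportion of simulated values below y_i (randomised between < and <=
# to handle the discreteness of counts)
y_sim = rng.poisson(mu_hat, size=(n_sim, n))
lower = (y_sim < y).mean(axis=0)
upper = (y_sim <= y).mean(axis=0)
resid = rng.uniform(lower, upper)

# If the model is correct, these residuals should be ~ Uniform(0, 1),
# i.e. mean ~ 0.5 and sd ~ sqrt(1/12) ~ 0.289
print(f"mean = {resid.mean():.3f}, sd = {resid.std():.3f}")
```

A uniform Q-Q plot of `resid` then plays the role that the normal Q-Q plot plays in OLS.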

Usually with OLS, you would use a normal quantile-quantile plot to assess the residuals. With DHARMa, you use a uniform quantile plot to assess the residuals. But the philosophy behind it is the same.

The vignette, which I linked above, has a number of worked examples, links, and more theory which I strongly suggest reading - it's very good.
