Solved – Sample Size for Poisson Regression

generalized-linear-model · logistic · poisson-regression · regression-coefficients · sample-size

Recently, I was tasked with a sample size calculation for a study in which the outcome is to be modeled using a Poisson regression (i.e. a generalized linear model). For quick and simple calculations of this nature, I often use PASS, a statistical software package dedicated to power/sample size calculations. However, I noticed that in the "Poisson Regression" menu for PASS, one of the options is specifying the distribution of the PREDICTOR variable (X). I was rather confused by this, since regression models don't make assumptions about the distribution of the covariates, so I checked the online documentation (here).

On page 870-2 of that document, they give a formula for sample size calculations. On that page, they state, "The variance [of the regression parameter estimate $\hat{\beta}$] for the non-null case depends on the underlying distribution of $X$."

I always learned rather unequivocally during my education that regression models do NOT make ANY assumptions about the distribution of the covariates themselves, and that even assumptions about the distribution of regression parameters are purely inferential (i.e. the parameter estimates themselves will be unbiased regardless of the underlying distribution of the covariate, assuming a properly specified model, but assumptions about the distribution of that parameter are useful for building confidence intervals, etc.). And it doesn't seem straightforward to me to draw a direct relationship between the distribution of a covariate and the variance of a corresponding parameter estimate, since the latter should be driven by the functional relationship between the covariate and the outcome.

The PASS documentation quoted above cites a 1991 Biometrika paper called "Sample Size for Poisson Regression", which I investigated in an effort to better understand what is going on here. That paper is available online here (I don't believe there is a paywall, but I'm on an institutional network at the moment so I may be wrong on that). As with the PASS documentation, this paper talks extensively about the maximum likelihood estimates for the regression parameters as functions of the distribution of the covariates.

At the bottom of page 1 of this paper, they write the likelihood function of the Poisson model (using their notation) as:

$L(\beta_0,\beta) = \prod_{i=1}^{N}f_X(x_i)f_T(t_i)\lambda_i^{y_i}\exp(-\lambda_i)/y_i!$

where

$\lambda_i=t_i\exp(\beta_0+\beta^Tx_i)$

This is, of course, not the usual likelihood function for a Poisson model, which typically would not include $f_X(x)$ or $f_T(t)$ (where $f_X$ is the distribution of the covariates $X$, and $f_T$ is the distribution of the exposure times, i.e. the 'offset' term in a Poisson model).

Granted, the paper notes that they are treating $X$ and $T$ as random variables, but I am struggling to understand why they are doing this, since it is so radically different from the traditional approach to estimating regression parameters using maximum likelihood.
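What adds to my confusion is that, as far as I can tell, the extra factors do not even change the estimates: writing out the log-likelihood,

$$\log L(\beta_0,\beta) = \sum_{i=1}^{N} \log f_X(x_i) + \sum_{i=1}^{N} \log f_T(t_i) + \sum_{i=1}^{N}\left\{ y_i\log\lambda_i - \lambda_i - \log y_i! \right\},$$

the terms involving $f_X$ and $f_T$ do not depend on $(\beta_0, \beta)$, so differentiating with respect to the regression parameters leaves only the last sum, i.e. the usual Poisson score equations.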

This paper further cites a 1989 JASA paper that makes similar calculations for the case of logistic regression, again with the variance of the parameter estimates as a function of the distribution of the covariates themselves (and with a similar expression for the likelihood, including some $f_X(x)$). Now THIS paper (available here) also includes a table (Table 1, on page 28) which seems to parameterize the distribution of $X$ in terms of the regression parameters!

I am having a very difficult time understanding this. I have always learned that regression models do not make assumptions about the distribution of the covariates, or even of the regression parameters, yet it seems that these methods for sample size calculations under logistic regression and Poisson regression (both of these papers I linked to are fairly well cited) are explicitly making such assumptions.

Can anybody shed any light on this subject?

1) Do we ever need to make assumptions about the distribution of covariates in a regression model? Am I simply incorrect in believing we never need to make these assumptions, especially if we are assuming we have a correctly specified regression model?

2) If we DON'T need to make these assumptions, then what is the utility of doing so for the purposes of these sample size calculations? I will note further that using the formulas in these papers produces some RADICALLY different sample size estimates for the same effect sizes, based only on changing the assumed distribution of the covariates. I don't see how estimates using this method can be valid if we answer "NO" to question 1.
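To illustrate how large these differences can be, here is a rough numerical sketch I put together (my own illustration, not taken from either paper; the effect sizes $\beta_0=0.5$, $\beta_1=0.3$ are arbitrary). It computes the expected per-observation Fisher information $E_X[e^{\beta_0+\beta_1 X}(1,X)(1,X)^T]$ for a Poisson model with unit exposure under two different covariate distributions, and compares the implied asymptotic variances of $\hat{\beta}_1$ (which drive the sample size):

```python
import numpy as np

# Hypothetical effect sizes, chosen arbitrarily for illustration.
b0, b1 = 0.5, 0.3

def info_per_obs(xs, probs):
    """Expected per-observation Fisher information for Poisson regression
    with log link, linear predictor b0 + b1*x, and unit exposure time:
    I = E[ exp(b0 + b1*X) * (1, X)(1, X)^T ]."""
    lam = np.exp(b0 + b1 * xs)
    design = np.stack([np.ones_like(xs), xs], axis=1)
    # Weighted sum of outer products over the support of X.
    return (design.T * (lam * probs)) @ design

# Case 1: X ~ N(0, 1), approximated on a fine grid.
xs = np.linspace(-8, 8, 4001)
w = np.exp(-xs**2 / 2)
w /= w.sum()
I_norm = info_per_obs(xs, w)

# Case 2: X ~ Bernoulli(0.5).
I_bern = info_per_obs(np.array([0.0, 1.0]), np.array([0.5, 0.5]))

# Asymptotic variance of beta1-hat is [I^{-1}]_{11} / n, so the ratio of
# these values is the ratio of sample sizes needed for equal precision.
v_norm = np.linalg.inv(I_norm)[1, 1]
v_bern = np.linalg.inv(I_bern)[1, 1]
print(v_norm, v_bern, v_bern / v_norm)
```

For these (arbitrary) parameter values the Bernoulli covariate requires several times the sample size of the standard normal covariate for the same precision, which matches my experience with the formulas in the papers.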

Best Answer

You do not need to make assumptions about the distribution of the predictors $X$ in order to estimate the regression model or to draw inference, but the variance of the resulting estimators will depend on aspects of the distribution of $X$, in particular its spread, as measured for example by its variance. This should be quite clear in the case of simple linear regression:

*[Figure: two simple linear regression fits of the same model, with different spreads of the $x$'s]*

This shows a situation with simple linear regression: the same model ($y=1+x+\text{error}$) in both panels, but in the left panel the observations were planned with a well-spread-out $x$, while in the right panel the same number of observations were planned with a more concentrated distribution of the $x$'s. It is intuitively clear that the design in the left panel gives better information for estimating the model, and the algebra of simple linear regression confirms that.

In the following we will show this for simple linear regression, but the argument and conclusion are equally valid for multiple linear regression. The simple linear regression model is $$ y_i = \alpha + \beta x_i + \epsilon_i $$ with the usual assumptions of homoskedastic and independent error terms. The estimator $\hat{\beta}$ of the slope can be written $$ \hat{\beta} = r_{xy} \frac{s_y}{s_x} $$ with the usual definitions. Its estimated variance can be written $$ \DeclareMathOperator{\V}{\mathbb{V}} \V \hat{\beta} = \frac{\frac1{n-2}\sum \hat{\epsilon}_i^2}{\sum (x_i-\bar{x})^2} $$ and it is obvious from that formula that increasing the spread of the $x$'s will decrease the variance of $\hat{\beta}$.

There are some subtle nuances here. In the usual theory for (simple) linear regression, the $x$'s are simply known constants, not random variables. Hence the name design matrix: the values of $x$ are not observed values of a random variable; they are designed, that is, chosen, by the statistician. That reflects the origin of this terminology in the design of experiments. In the majority of applications this is probably not the case: we do observe $x$ as values of some random variable. But even then, inference is done conditional on the observed values. So, in this sense, since $x$ is not modeled as a random variable, it does not have a distribution, and inference cannot depend on something that does not exist (in the model). But, as the variance formula above shows, the variance of the estimator $\hat{\beta}$ depends on the empirical variance of the $x$'s.

So, if you are able to influence data collection, it is better to plan for well spread out values of $x$.

All of the above (except the concrete formulas) is equally valid for Poisson regression; I will not repeat the arguments for that case.
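For completeness, the same phenomenon can be checked by Monte Carlo in the Poisson case. The sketch below (my own illustration, with arbitrary parameter values) fits the Poisson regression by Newton-Raphson and compares the empirical variability of $\hat{\beta}_1$ under a wide versus a concentrated normal covariate:

```python
import numpy as np

rng = np.random.default_rng(1)

def poisson_mle(x, y, iters=25):
    """Newton-Raphson for Poisson regression with log link:
    log E[y] = b0 + b1*x, unit exposure."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        lam = np.exp(X @ beta)
        grad = X.T @ (y - lam)              # score vector
        hess = X.T @ (X * lam[:, None])     # expected information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def slope_sd_pois(x_sd, n=200, reps=400, b0=0.5, b1=0.3):
    """Empirical SD of beta1-hat when X ~ N(0, x_sd^2)."""
    slopes = np.empty(reps)
    for r in range(reps):
        x = rng.normal(0.0, x_sd, n)
        y = rng.poisson(np.exp(b0 + b1 * x))
        slopes[r] = poisson_mle(x, y)[1]
    return slopes.std()

sd_wide = slope_sd_pois(1.0)    # well-spread-out covariate
sd_conc = slope_sd_pois(0.25)   # concentrated covariate
print(sd_wide, sd_conc)
```

A more concentrated covariate gives a markedly larger standard deviation for $\hat{\beta}_1$, just as in the linear case.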
