Solved – Linear regression on an exponentially distributed dependent variable

exponential-family, generalized linear model, linear model, normal distribution, regression

Suppose that I want to use linear regression on data where the independent variables x1, x2, …, xn are all more or less normally distributed, while the dependent variable y is almost exponentially distributed. What should I do to y (or to the whole dataset) so that it is okay to use a linear regression model? I applied a log transformation to y and then used OLS, which seemed fine to me until a friend argued strongly against using a log transform under such conditions, but he did not provide any alternative. So I am here to ask for help.

If I am not limited to the ordinary linear regression model, what model can I use to get a better fit?

Best Answer

I want to use linear regression on [...] independent variables x1,x2,...xn [...]
while the dependent variable y is almost exponentially distributed

If you expect the relationship between y and the x's to be linear, then a nonlinear transformation of y will make the relationship between it and the x's nonlinear. It will also alter the spread about the model (if the data had constant variance before transformation, it won't have it afterward).
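A small simulation can make both effects concrete. The sketch below assumes an illustrative model, $y = 5 + 3x + \varepsilon$ with constant noise standard deviation 1 (these numbers are made up for the example, not taken from the question), and compares means and spreads of $y$ and $\log(y)$ at three values of $x$.

```python
import math
import random
import statistics

rng = random.Random(1)

def simulate_y(x, n=4000):
    """Draw y = 5 + 3x + noise with constant noise sd of 1 (illustrative values)."""
    return [5.0 + 3.0 * x + rng.gauss(0.0, 1.0) for _ in range(n)]

summary = {}
for x in (1.0, 5.5, 10.0):
    y = simulate_y(x)
    log_y = [math.log(v) for v in y]  # y is far from zero here, so the log is defined
    summary[x] = {
        "mean_y": statistics.fmean(y),
        "sd_y": statistics.stdev(y),
        "mean_log_y": statistics.fmean(log_y),
        "sd_log_y": statistics.stdev(log_y),
    }

for x, stats in summary.items():
    print(x, {name: round(value, 3) for name, value in stats.items()})

# Departure from linearity: value at the middle x minus the average of the two ends.
gap_y = summary[5.5]["mean_y"] - (summary[1.0]["mean_y"] + summary[10.0]["mean_y"]) / 2
gap_log_y = summary[5.5]["mean_log_y"] - (
    summary[1.0]["mean_log_y"] + summary[10.0]["mean_log_y"]
) / 2
print("gap on y scale:", round(gap_y, 3), "gap on log scale:", round(gap_log_y, 3))
```

On the original scale the middle mean sits on the straight line between the two end means and the spread is the same at every $x$; on the log scale the middle mean sits above that line and the spread shrinks as $x$ grows.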

Note further that in regression, there's no assumption about the distribution of the dependent variable itself (unconditionally). That is, there's little value in looking at say a histogram of the $y$ values -- it doesn't directly relate to any regression assumption. The assumption of normality applies when you're using normal based tests or intervals, and applies to the conditional distribution, which you can't usually assess until you look at residuals.
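To illustrate why a histogram of $y$ says little, here is a minimal sketch under assumed conditions: a single, deliberately skewed predictor and normal errors with constant variance (the coefficients and the skewed predictor are choices made for the example). The marginal distribution of $y$ comes out strongly skewed even though the conditional distribution is exactly normal, which only shows up in the residuals.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)

def ols_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

rng = random.Random(0)
n = 5000
x = [rng.expovariate(1.0) for _ in range(n)]            # skewed predictor (assumption)
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]  # normal errors given x

intercept, slope = ols_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

skew_y = skewness(y)
skew_resid = skewness(residuals)
print(f"skewness of y:         {skew_y:.2f}")
print(f"skewness of residuals: {skew_resid:.2f}")
```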

If you're not interested in hypothesis tests or confidence intervals, an ordinary regression with non-normal conditional distribution may in some situations be reasonable (non-constant variance may be more of an issue than distribution-shape anyway). If you do want to perform inference as well, there are several ways of going about it (some approximate) that may be suitable.

If you thought that the conditional distribution $Y \mid x_1, x_2, \ldots$ was exponential, and that the relationship between $Y$ and the $x$'s was linear, you could use a GLM with identity link. There's advice relating to fitting exponential models in this way on this site.
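As a rough sketch of what such a fit involves, the code below hand-rolls iteratively reweighted least squares (IRLS) for a single predictor on simulated data. Everything specific here is an assumption for illustration: the true line $E(Y \mid x) = 2 + 0.5x$, the range of $x$, and the helper names `weighted_line` and `fit_exponential_identity`. In practice you would normally use a GLM routine from a statistics package instead. For an exponential response the variance is $\mu^2$, so with the identity link each IRLS step is a weighted least-squares fit of $y$ on $x$ with weights $1/\hat\mu^2$.

```python
import random

def weighted_line(x, y, w):
    """Weighted least-squares intercept and slope for a single predictor."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def fit_exponential_identity(x, y, max_iter=100, tol=1e-10):
    """IRLS for E[y | x] = b0 + b1*x with an exponential response (variance = mean^2)."""
    b0, b1 = weighted_line(x, y, [1.0] * len(x))  # start from ordinary least squares
    for _ in range(max_iter):
        mu = [b0 + b1 * xi for xi in x]
        if min(mu) <= 0.0:
            # The identity link does not keep fitted means positive by itself.
            raise ValueError("fitted mean became non-positive")
        weights = [1.0 / m ** 2 for m in mu]
        new_b0, new_b1 = weighted_line(x, y, weights)
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

rng = random.Random(3)
n = 4000
x = [rng.uniform(0.0, 10.0) for _ in range(n)]
# expovariate takes a rate, so the rate is 1 / mean.
y = [rng.expovariate(1.0 / (2.0 + 0.5 * xi)) for xi in x]

b0, b1 = fit_exponential_identity(x, y)
print(f"intercept: {b0:.3f}, slope: {b1:.3f}")
```

One practical caveat with the identity link is visible in the code: nothing forces the fitted means to stay positive, so the fit can fail when the line approaches zero within the range of the data.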

independent variables [...] are all more or less normally distributed, while

The distribution of the independent variables doesn't matter, since you condition on them in regression. No assumption about their distribution is made. The only way it's relevant is that sometimes the joint distribution can help inform us how to interpret the marginal distribution of the dependent variable, y (e.g. jointly normal x's would not produce an exponential y from conditionally normal y, so it would lead us to doubt the y's were conditionally normal).

I applied log transformation to y and then used ols, which seems fine to me, until a friend argued strongly against using log transform under such condition, but he did not provide any solution.

If it makes sense to model $E(\log(y))$ as a linear function of the predictors, that may be fine, but note that if you exponentiate such a fit, you don't get a suitable estimate of $E(Y|X=x)$ out (unless there's almost no variation about the model, in which case the bias may sometimes be small enough to ignore). An alternative would be to use a GLM with log link, in which case you'd be modelling $\log(E(y))$ as a linear function of the parameters, and expected values do come straight out of that model.
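The difference between the two approaches can be seen in a simulation. The sketch below assumes, purely for illustration, that $\log(y) = 0.5 + 0.8x + \varepsilon$ with $\varepsilon \sim N(0, 1)$, so that the true conditional mean is $E(Y \mid x) = \exp(1.0 + 0.8x)$. It compares the exponentiated OLS fit on $\log(y)$ with a log-link fit obtained by a hand-rolled IRLS that assumes variance proportional to the squared mean (a Gamma-type variance function, chosen here as an assumption). The helper names are hypothetical, and a real analysis would normally use a GLM routine from a statistics package.

```python
import math
import random
import statistics

def ols_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def fit_log_link(x, y, start, max_iter=100, tol=1e-10):
    """IRLS for log(E[y | x]) = b0 + b1*x with variance proportional to mean^2.

    With this link and variance function the IRLS weights are all equal, so
    each step is an ordinary least-squares fit to the working response z.
    """
    b0, b1 = start
    for _ in range(max_iter):
        eta = [b0 + b1 * xi for xi in x]
        # Working response: z = eta + (y - mu) / mu, with mu = exp(eta).
        z = [e + yi / math.exp(e) - 1.0 for e, yi in zip(eta, y)]
        new_b0, new_b1 = ols_line(x, z)
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

rng = random.Random(2)
n = 5000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [math.exp(0.5 + 0.8 * xi + rng.gauss(0.0, 1.0)) for xi in x]

log_b0, log_b1 = ols_line(x, [math.log(yi) for yi in y])     # models E(log y)
glm_b0, glm_b1 = fit_log_link(x, y, start=(log_b0, log_b1))  # models log(E y)

x0 = 1.0
true_mean = math.exp(1.0 + 0.8 * x0)
naive_pred = math.exp(log_b0 + log_b1 * x0)
glm_pred = math.exp(glm_b0 + glm_b1 * x0)
print(f"true E(Y | x=1):       {true_mean:.2f}")
print(f"exp of log-scale OLS:  {naive_pred:.2f}")
print(f"log-link fit:          {glm_pred:.2f}")
```

In this setup the exponentiated OLS fit targets $\exp(0.5 + 0.8x)$, the conditional median, and so falls short of the conditional mean by a factor of about $e^{0.5}$, while the log-link fit targets the mean directly.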

You should consider the spread about the relationship; if you already know something about it, that may help inform your choice of model (but beware your inferences if you're using the same data to identify the model as to make inferences about it).

There are many alternatives to least squares for fitting linear relationships, and some may be more suitable for particular non-normal conditional distributions.

You should clarify what it is that you expect to be linearly related to the x's, and how you expect the variability about the line to behave on that scale (for example, as a function of the mean: would the spread tend to increase as the mean increased, or not?).