Interpretation of linear regression results where dependent variable is transformed using ln(y+1)

back-transformation, data transformation, interpretation, linear model, regression

Similar questions have been asked before, e.g.

Back-transformation and interpretation of log(X+1) estimates in multiple linear regression

However, this question is a little different because (a) I'm interested in a transformed dependent variable only and (b) I'm trying to work out if the model allows me to predict values for each group.

I'm using linear regression to understand how various factors predict length of hospital admission (in days). The data is highly skewed and includes zeros. Regression with untransformed data doesn't work at all. If I transform the admission duration by ln(y + 1), examination of the residuals suggests the model works well. But I'm struggling to interpret the output!

The formula is (I'm using R):

lm(duration ~ sex:diagnosis + diagnosis + age_group)

duration is the transformed variable. sex is male/female. diagnosis is a 4-level categorical variable. age_group is a 3-level categorical variable. sex:diagnosis + diagnosis is just another way of parameterizing the interaction model sex*diagnosis: the fit is the same, but the output shows stratum-specific effects of sex rather than interaction terms.
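The equivalence of the two parameterizations can be checked on simulated data. The sketch below is only illustrative: the data frame d, the raw variable days, and the label of the youngest age group are invented, since the real data are not shown.

```r
# Simulated data: every value here is invented for illustration only.
set.seed(1)
n <- 400
d <- data.frame(
  sex       = factor(sample(c("female", "male"), n, replace = TRUE)),
  diagnosis = factor(sample(c("lung", "heart", "liver", "kidney"), n, replace = TRUE),
                     levels = c("lung", "heart", "liver", "kidney")),
  age_group = factor(sample(c("0-29", "30-59", "60+"), n, replace = TRUE))
)
d$days     <- rpois(n, lambda = 3)   # raw length of stay, zeros allowed
d$duration <- log1p(d$days)          # ln(y + 1)

# Stratum-specific sex effects (the formula used in the question)
fit_nested <- lm(duration ~ sex:diagnosis + diagnosis + age_group, data = d)
# Conventional main effects plus interaction terms
fit_inter  <- lm(duration ~ sex * diagnosis + age_group, data = d)

# Both models have the same number of coefficients and the same fitted values
same_size <- length(coef(fit_nested)) == length(coef(fit_inter))
same_fit  <- all.equal(fitted(fit_nested), fitted(fit_inter))

# A stratum-specific effect equals the main effect plus the interaction term
heart_nested <- unname(coef(fit_nested)["sexmale:diagnosisheart"])
heart_inter  <- unname(coef(fit_inter)["sexmale"] +
                       coef(fit_inter)["sexmale:diagnosisheart"])
print(c(heart_nested, heart_inter))
```

In the nested form each sexmale:diagnosis coefficient is the male-versus-female difference within that diagnosis, which is why the output has four sex rows and no separate sexmale row.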

And the output is something like:

                            Estimate   Std. Error   t value   Pr(>|t|)        
 ------------------------- ---------- ------------ --------- ---------- ----- 
  (Intercept)                  0.739        0.002     381.9   <0.001     ***  
  diagnosisheart              -0.208        0.003     -80.2   <0.001     ***  
  diagnosisliver               0.257        0.002     119.5   <0.001     ***  
  diagnosiskidney              0.856        0.004     213.9   <0.001     ***  
  age_group30-59               0.054        0.002      25.2   <0.001     ***  
  age_group60+                 0.100        0.002      47.6   <0.001     ***  
  sexmale:diagnosislung        0.478        0.006      76.4   <0.001     ***  
  sexmale:diagnosisheart       0.340        0.007      48.7   <0.001     ***  
  sexmale:diagnosisliver       0.037        0.008       4.6   <0.001     ***  
  sexmale:diagnosiskidney      0.163        0.024       6.8   <0.001     ***  

I am trying to work out (a) exactly what the results mean – particularly the intercept, and (b) if I can use the results to predict the duration of stay for each group.

Many thanks

Best Answer

You have fit a formula of the form

$$\log(y+1) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$$

where $y$ is the dependent variable (duration), $x_i$ are the covariates (indicators for sex, diagnosis, etc.), $\beta_0$ the intercept, and $\beta_1, \beta_2, \ldots$ are effects associated with the covariates.

Take the exponential of both sides and subtract one to isolate the original dependent variable:

$$\begin{align} y &= \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots) - 1 \\ &= \exp(\beta_0)\exp(\beta_1 x_1)\exp(\beta_2 x_2) \cdot \ldots - 1 \end{align}$$

Assuming $y$ is much bigger than 1, the minus one in this equation is negligible, and you can see that $y$ depends exponentially on each of the covariates $x_i$, with a rate constant $\beta_i$.

For a binary variable like sex (let's say sex is encoded in $x_1$, with male as 1 and female as 0), this means the duration for males is approximately $\exp(\beta_1)$ times the duration for females, all else being equal. Strictly speaking, the factor $\exp(\beta_1)$ applies to $y+1$ rather than to $y$.
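As a rough illustration with the numbers from the question, the sex coefficients can be exponentiated to get these multiplicative factors. Because the model was fit with stratum-specific sex effects, there is one factor per diagnosis. The vector name sex_effect is made up for this sketch, and the values are copied by hand from the output table.

```r
# Sex coefficients copied from the question's output (log(y + 1) scale)
sex_effect <- c(lung = 0.478, heart = 0.340, liver = 0.037, kidney = 0.163)

# Multiplicative effect of being male on (duration + 1), within each diagnosis
ratio <- exp(sex_effect)
print(round(ratio, 2))
```

For the lung group, for example, the factor is about 1.6, so (duration + 1) for males is estimated to be roughly 60% larger than for females with the same diagnosis and age group.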

If $y$ is much smaller than 1, then $\log(y+1) \approx y$ and the original formula behaves more like a traditional linear regression model.

When $y$ is around one, neither very large nor very small, there is no simple interpretation of your model, but you can still use the second formula above to predict duration.

It looks like your variables are all binary/categorical so this should suffice to interpret the results.
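To make the prediction step concrete, here is a minimal sketch that applies the back-transformation to every combination of sex, diagnosis and age group, using the coefficients from the question. It assumes the reference group is female, lung diagnosis, youngest age group, which is inferred from the levels that have no row of their own in the output. The label "baseline" for the youngest age group is a placeholder because the real label is not shown.

```r
# Coefficients copied by hand from the output in the question
b0       <- 0.739
diag_eff <- c(lung = 0, heart = -0.208, liver = 0.257, kidney = 0.856)
age_eff  <- c(baseline = 0, "30-59" = 0.054, "60+" = 0.100)  # "baseline" is a placeholder label
male_eff <- c(lung = 0.478, heart = 0.340, liver = 0.037, kidney = 0.163)

# One row per combination of the three factors
grid <- expand.grid(sex = c("female", "male"),
                    diagnosis = names(diag_eff),
                    age_group = names(age_eff),
                    stringsAsFactors = FALSE)

# Linear predictor on the ln(y + 1) scale
grid$linear_predictor <- b0 +
  unname(diag_eff[grid$diagnosis]) +
  unname(age_eff[grid$age_group]) +
  ifelse(grid$sex == "male", unname(male_eff[grid$diagnosis]), 0)

# Back-transform: exp(linear predictor) - 1
grid$predicted_days <- expm1(grid$linear_predictor)
print(grid)
```

The first row is the reference group, so the intercept alone gives its prediction: exp(0.739) - 1, a little over one day. With the fitted model object available, the same numbers should come from expm1(predict(fit, newdata = ...)). Note that a back-transformed prediction of this kind is closer to a typical (median-like) stay than to the arithmetic mean stay, because the exponential of a mean of logs is not the mean of the original values.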