Solved – Interpretation of linear regression results where dependent variable is transformed using ln(y+1)

Tags: back-transformation, data transformation, interpretation, linear model, regression

Similar questions have been asked before, e.g.

Back-transformation and interpretation of log(X+1) estimates in multiple linear regression

However, this question is a little different because (a) I'm interested in a transformed dependent variable only and (b) I'm trying to work out if the model allows me to predict values for each group.

I'm using linear regression to understand how various factors predict length of hospital admission (in days). The data is highly skewed and includes zeros. Regression with untransformed data doesn't work at all. If I transform the admission duration by ln(y + 1), examination of the residuals suggests the model works well. But I'm struggling to interpret the output!

The formula is (I'm using R):

lm(duration ~ sex:diagnosis + diagnosis + age_group)

duration is the transformed variable, ln(y + 1). sex is male/female. diagnosis is a 4-level categorical variable. age_group is a 3-level categorical variable. sex:diagnosis + diagnosis fits the same model as the interaction sex*diagnosis, but the output shows stratum-specific effects of sex (one per diagnosis) rather than a main effect of sex plus interaction terms.
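A minimal sketch on simulated data suggests how the two formulas relate: the fitted values agree and only the parameterization differs. The data frame `d`, the variable `log_dur` and the random values are hypothetical stand-ins, not the hospital data.

```r
# Simulated stand-in data; none of this is the original hospital data.
set.seed(1)
n <- 400
d <- data.frame(
  sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  diagnosis = factor(
    sample(c("lung", "heart", "liver", "kidney"), n, replace = TRUE),
    levels = c("lung", "heart", "liver", "kidney")
  )
)
d$log_dur <- rnorm(n, mean = 1)  # plays the role of ln(y + 1)

# Stratum-specific sex effects (the form used in the question)
m_nested <- lm(log_dur ~ sex:diagnosis + diagnosis, data = d)
# Main effect of sex plus interaction terms
m_inter <- lm(log_dur ~ sex * diagnosis, data = d)

# Same fitted values, different coefficient layout
same_fit <- isTRUE(all.equal(fitted(m_nested), fitted(m_inter)))
print(same_fit)
print(names(coef(m_nested)))
```

In the nested form, each sexmale:diagnosis coefficient is the male-minus-female difference within that diagnosis, on the ln(y + 1) scale.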

And the output is something like:

                            Estimate   Std. Error   t value   Pr(>|t|)        
 ------------------------- ---------- ------------ --------- ---------- ----- 
  (Intercept)                  0.739        0.002     381.9   <0.001     ***  
  diagnosisheart              -0.208        0.003     -80.2   <0.001     ***  
  diagnosisliver               0.257        0.002     119.5   <0.001     ***  
  diagnosiskidney              0.856        0.004     213.9   <0.001     ***  
  age_group30-59               0.054        0.002      25.2   <0.001     ***  
  age_group60+                 0.100        0.002      47.6   <0.001     ***  
  sexmale:diagnosislung        0.478        0.006      76.4   <0.001     ***  
  sexmale:diagnosisheart       0.340        0.007      48.7   <0.001     ***  
  sexmale:diagnosisliver       0.037        0.008       4.6   <0.001     ***  
  sexmale:diagnosiskidney      0.163        0.024       6.8   <0.001     ***

I am trying to work out (a) exactly what the results mean – particularly the intercept, and (b) if I can use the results to predict the duration of stay for each group.

Many thanks

Best Answer

You have fit a formula of the form

$$\log(y+1) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$$

where $y$ is the dependent variable (duration), $x_i$ are the covariates (indicators for sex, diagnosis, etc.), $\beta_0$ the intercept, and $\beta_1, \beta_2, \ldots$ are effects associated with the covariates.

Take the exponential of both sides and subtract one to isolate the original dependent variable:

$$\begin{align} y &= \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots) - 1 \\ &= \exp(\beta_0)\exp(\beta_1 x_1)\exp(\beta_2 x_2) \cdot \ldots - 1 \end{align}$$

Assuming $y$ is much bigger than 1, the minus one in this equation is negligible, and you can see that $y$ depends exponentially on each of the covariates $x_i$, with a rate constant $\beta_i$.

For a binary variable like sex (say it is encoded in $x_1$, with male as 1 and female as 0), this means that, when $y$ is much bigger than 1, the duration for males is approximately $\exp(\beta_1)$ times the duration for females, all else being equal.
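Strictly, the multiplicative effect applies to $y + 1$ rather than to $y$. As a worked example, take the male effect among lung patients from the question's output (0.478) as $\beta_1$, purely for illustration:

$$\frac{y_{\text{male}} + 1}{y_{\text{female}} + 1} = \exp(\beta_1) = \exp(0.478) \approx 1.61$$

So duration plus one day is estimated to be about 61% higher for males in that group. The ratio of the durations themselves is close to this figure only when durations are long.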

If $y$ is much smaller than 1, then $\log(y+1) \approx y$ and the original formula behaves more like a traditional linear regression model.

When $y$ is around one, neither very large nor very small, there is no simple interpretation of your model, but you can still use the second formula above to predict duration.
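The three regimes can be checked numerically. The sketch below uses hypothetical values of the linear predictor (called `eta` here, a name not in the original) and compares the exact back-transformation with the small-$y$ and large-$y$ approximations.

```r
# Hypothetical values of the linear predictor eta = log(y + 1)
eta <- c(0.01, 0.1, 1, 3, 6)

y_exact <- exp(eta) - 1  # exact back-transformation
y_small <- eta           # approximation when y is much smaller than 1
y_large <- exp(eta)      # approximation when y is much bigger than 1

# Relative error of each approximation against the exact value
regimes <- data.frame(
  eta,
  y_exact,
  rel_err_small = abs(y_small - y_exact) / y_exact,
  rel_err_large = abs(y_large - y_exact) / y_exact
)
print(round(regimes, 3))
```

Near `eta = 1` both approximations are poor, which is why the exact formula is the safer choice for prediction.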

It looks like your variables are all binary/categorical so this should suffice to interpret the results.
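As a sketch of how this could be applied to the output in the question: the intercept is the estimated ln(y + 1) for the reference group, and each group's prediction adds the relevant coefficients before back-transforming. The helper `predict_days` is hypothetical. The reference levels are assumed to be female, lung, and the youngest age group, labelled `"under30"` here because its real label does not appear in the output.

```r
# Coefficients copied from the output in the question (ln(y + 1) scale)
b <- c(
  "(Intercept)" = 0.739,
  "diagnosisheart" = -0.208,
  "diagnosisliver" = 0.257,
  "diagnosiskidney" = 0.856,
  "age_group30-59" = 0.054,
  "age_group60+" = 0.100,
  "sexmale:diagnosislung" = 0.478,
  "sexmale:diagnosisheart" = 0.340,
  "sexmale:diagnosisliver" = 0.037,
  "sexmale:diagnosiskidney" = 0.163
)

# Hypothetical helper. Assumed reference levels: female, lung, "under30".
predict_days <- function(sex, diagnosis, age_group = "under30") {
  eta <- b[["(Intercept)"]]
  if (diagnosis != "lung") eta <- eta + b[[paste0("diagnosis", diagnosis)]]
  if (age_group != "under30") eta <- eta + b[[paste0("age_group", age_group)]]
  if (sex == "male") eta <- eta + b[[paste0("sexmale:diagnosis", diagnosis)]]
  exp(eta) - 1  # back-transform from ln(y + 1) to days
}

female_lung <- predict_days("female", "lung")           # intercept only
male_kidney_60 <- predict_days("male", "kidney", "60+")
print(c(female_lung = female_lung, male_kidney_60 = male_kidney_60))
```

With a fitted model object, `exp(predict(fit, newdata)) - 1` would do the same job. Either way, back-transforming the mean of ln(y + 1) does not give the arithmetic mean duration in days, so these values are best read as typical durations rather than means.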

Related Solutions

Solved – Interpretation of log transformed negatively skewed dependent variable

Your original x and your transformed x are inversely related, so, naturally, any relationship that is positive on one will be negative on the other. One way to see this inverse relationship is

x <- rnorm(100)
x2 <- log(1 + max(x) - x)
plot(x, x2)

(I used a normally distributed X here, but it does not matter for these purposes; you could substitute your variable and its transformation).

Further explanation after reading one of @Beka's comments above.

x2 is a transformed version of x, using the transformation you used. Then I plotted x vs. x2. When x goes up, x2 goes down. So, any relationship between x and some other variable will be reversed between x2 and that variable.

In other words, your findings do not contradict earlier work.
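The sign reversal can also be seen in a regression. In this sketch, `z` is a hypothetical variable constructed to be positively related to `x`; it is not part of the original question.

```r
set.seed(42)
x <- rnorm(100)                # stand-in for the original variable
z <- x + rnorm(100, sd = 0.5)  # hypothetical variable positively related to x
x2 <- log(1 + max(x) - x)      # the reflect-and-log transformation

# x2 is a strictly decreasing function of x, so the rank correlation is -1
rank_cor <- cor(x, x2, method = "spearman")

# The slope changes sign when x is replaced by x2
slope_x <- coef(lm(z ~ x))[["x"]]
slope_x2 <- coef(lm(z ~ x2))[["x2"]]
print(c(rank_cor = rank_cor, slope_x = slope_x, slope_x2 = slope_x2))
```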

Solved – Beta confidence intervals in transformed linear regression

Actually, the interval carries over just fine. The transformation is monotonic; the probability statement that applies on the log-scale transforms directly to the original scale, so as long as the assumptions under which the original interval was computed do apply, then it works as an interval for the original population parameter after transformation.

It's the estimate that may be problematic (but may be okay, depending on what you want). Note that $E[\exp(X)]\neq \exp[E(X)]$ if $\sigma_X^2>0$. If the log-scale estimate is unbiased, the transformed estimate is biased.

If you're happy to have an estimate that's median-unbiased, then the back-transformed estimate is also okay, for the same reason that the interval works.

If you seek mean-unbiasedness there are some choices. For example, if you're prepared to assume a normal distribution on $\hat\beta$ you can remove the bias by using the properties of the lognormal. Alternatively, you can use a Taylor expansion to get an approximate adjustment (details are also in a number of posts on this site). If the standard error of the estimate is small, it won't matter much. Other adjustments are also in use.
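A minimal sketch of these options, assuming a normal sampling distribution for $\hat\beta$ and treating its standard error as known. The values of `beta_hat` and `se` are illustrative only.

```r
# Illustrative log-scale estimate and standard error (hypothetical values)
beta_hat <- 0.478
se <- 0.006

# The 95% interval back-transforms directly because exp() is monotonic
ci_log <- beta_hat + c(-1, 1) * qnorm(0.975) * se
ci_ratio <- exp(ci_log)

# Plain back-transformation: median-unbiased under the normal assumption
est_median <- exp(beta_hat)

# Lognormal mean correction, using E[exp(X)] = exp(mu + sigma^2 / 2)
est_mean <- exp(beta_hat - se^2 / 2)

print(c(lower = ci_ratio[1], upper = ci_ratio[2],
        est_median = est_median, est_mean = est_mean))
```

With a standard error this small the two point estimates are nearly identical, which matches the remark that the adjustment matters little in that case.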
