Solved – How to interpret logarithmically transformed coefficients in linear regression

data-transformation, logarithm, regression, regression-coefficients

My situation is:

I have one continuous dependent variable and one continuous predictor variable, both of which I've log-transformed so that the residuals from a simple linear regression are closer to normal.

I would appreciate any help on how I can relate these transformed variables to their original context.

I want to use a linear regression to predict the number of days that pupils missed school in 2011 based on the number of days they missed in 2010. Most pupils miss 0 days or only a few, so the data are strongly positively skewed (a long right tail). Therefore, a transformation seemed necessary before using linear regression.

I've used log10(var + 1) for both variables (the +1 is for pupils who had missed 0 days of school). I'm using regression because I also want to add categorical factors such as gender and ethnicity.
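For concreteness, this is roughly the model I have set up, sketched here in Python with statsmodels rather than SPSS (the file name and the column names days_2010, days_2011 and gender are placeholders for my actual variables):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder data file and column names: days_2010, days_2011, gender.
df = pd.read_csv("absences.csv")

# log10(var + 1) on both sides, as described above; C() marks gender as categorical.
fit = smf.ols(
    "np.log10(days_2011 + 1) ~ np.log10(days_2010 + 1) + C(gender)",
    data=df,
).fit()
print(fit.summary())
```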

My problem is:

The audience I want to report back to wouldn't understand a model of the form log10(y + 1) = a + b * log10(x + 1) (and frankly neither do I).

My questions are:

a) Are there better ways of interpreting transformed variables in regression? E.g. "for every 1 day missed in 2010 they will miss 2 days in 2011", as opposed to "for every 1 log-unit change in 2010 there will be x log-units of change in 2011"?
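To make (a) concrete, this is the back-transformation I think applies, assuming the fitted model uses the log10(var + 1) transform described above with intercept a and slope b:

$$
\log_{10}(y + 1) = a + b\,\log_{10}(x + 1)
\quad\Longleftrightarrow\quad
y + 1 = 10^{a}\,(x + 1)^{b},
$$

so b seems to act like an elasticity: doubling (x + 1) multiplies (y + 1) by 2^b, and a 1% increase in (x + 1) corresponds roughly to a b% increase in (y + 1). Is that what the slope is telling me?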

b) Specifically, regarding the following passage quoted from this source:

"This is the negative binomial regression estimate for a one unit
increase in math standardized test score, given the other variables
are held constant in the model. If a student were to increase her
mathnce test score by one point, the difference in the logs of
expected counts would be expected to decrease by 0.0016 unit, while
holding the other variables in the model constant."

I would like to know:

  • Is this passage saying that a one-unit increase in the UNTRANSFORMED maths score leads to a 0.0016 decrease from the constant (a), so that if the UNTRANSFORMED maths score goes up by two points I subtract 0.0016 * 2 from the constant a?
  • Does that mean I get the geometric means by taking exponential(a) and exponential(a + beta*2), and that I then need to calculate the percentage difference between these two to say what effect the predictor variable(s) has/have on the dependent variable? (I've tried to sketch the arithmetic I mean just after this list.)
  • Or have I got that totally wrong?
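For (b), here is the arithmetic I have in mind, assuming the quoted coefficient of -0.0016 and that exponentiating moves from the log scale back to the count scale:

```python
import math

beta = -0.0016  # coefficient quoted in the passage for the mathnce score

# The coefficient is additive on the log scale, so exponentiate it to get a
# multiplicative effect (a rate ratio) on the expected count itself.
ratio_1pt = math.exp(beta)      # ~0.9984, i.e. about a 0.16% decrease
ratio_2pt = math.exp(2 * beta)  # ~0.9968, i.e. about a 0.32% decrease

print(f"{(ratio_1pt - 1) * 100:+.2f}% change per 1-point increase")
print(f"{(ratio_2pt - 1) * 100:+.2f}% change per 2-point increase")
```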

I'm using SPSS v20. Sorry for framing this in a long question.


Best Answer

I think the more important point is suggested in @whuber's comment. Your whole approach is ill-founded, because by taking logarithms you are effectively throwing out of the dataset any students with zero missed days in either 2010 or 2011. It sounds like there are enough of these pupils to be a problem, and I am confident that the results from your current approach will be wrong.

Instead, you need to fit a generalized linear model with a Poisson response. SPSS can't do this unless you have paid for the appropriate module, so I'd suggest upgrading to R.
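For illustration only (R is the recommendation above; the file name and column names below are placeholders), a Poisson GLM of this kind could be sketched in Python with statsmodels roughly like this:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Placeholder data file and column names: days_2011, days_2010, gender.
df = pd.read_csv("absences.csv")

# A Poisson GLM models the expected count directly, so pupils with zero days
# missed stay in the data without any log(var + 1) workaround.
pois = smf.glm(
    "days_2011 ~ days_2010 + C(gender)",
    data=df,
    family=sm.families.Poisson(),
).fit()
print(pois.summary())

# Exponentiated coefficients are multiplicative effects on the expected count.
print(np.exp(pois.params))
```

The exponentiated coefficients are then multiplicative effects on the expected number of days missed, which is usually easier to report than effects on a log10(days + 1) scale.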

You will still have the problem of interpreting coefficients, but this is secondary to the importance of having a model that is basically appropriate.