Solved – How to predict with a log-transformed variable

data-transformation, prediction, prediction-interval, r, regression

I know a question of this nature has been posted previously, but I am unsure whether my data are suited to my prediction model.

I have noise data (dependent variable) which I sampled at different intervals of distance (independent variable) from the road.

My data look something like this; I took multiple measurements at each interval:

    > dput(data.frame(dist, leq))
structure(list(dist = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 20L, 20L, 
20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 40L, 40L, 40L, 
40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L), leq = c(65.15, 
68.55, 66.06, 65.99, 64.86, 66.77, 65.25, 67.65, 66.74, 66.71, 
64.43, 65.61, 65.02, 66.04, 61.54, 62.84, 64.05, 67.57, 64.42, 
62.58, 63.24, 62.86, 62.07, 66.08, 61.2, 65.73, 60.95, 65.77, 
59.45, 60.13, 62.82, 64.96, 61.13, 62.45, 60.43, 61.58, 65.25, 
60.04, 61.21, 59.44, 62.57, 62.41, 58.78, 58.88, 64.04, 59.37, 
57.89, 61.42, 58.4, 62.13, 58.93, 63.95, 56.25, 56.37, 56.93, 
57.37, 66.04, 57.07, 58.54, 57.65)), .Names = c("dist", "leq"
), row.names = c(NA, -60L), class = "data.frame")

I log-transformed my independent variable

logdist<-log(dist+0.1)

after running a Kolmogorov–Smirnov test (statistically significant: the data are not normally distributed) and a Shapiro–Wilk test (statistically significant: the residuals are not normally distributed).

This is what it looks like plotted:

[plot of log distance versus noise (dBA)]

I then ran a linear regression for noise and distance:

  Call:
lm(formula = logdist ~ leq)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0943 -0.7311  0.0163  0.8371  3.6267 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 29.88748    3.83556   7.792 1.37e-10 ***
leq         -0.45159    0.06129  -7.368 7.07e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.535 on 58 degrees of freedom
Multiple R-squared:  0.4834,    Adjusted R-squared:  0.4745 
F-statistic: 54.28 on 1 and 58 DF,  p-value: 7.066e-10

This is where I'm lost: I want to predict the mean noise level from 0 to 100 m in steps of 10 m, and then identify the mean predicted noise level at 50 m:

pred<-predict(reg2, data.frame(logdist=seq(0, 100, 10)))
dist<-seq(0,100,10)
cbind(dist,pred)
pred2<-predict(reg2,data.frame(logdist=50),interval="predict")
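An aside that may explain part of the confusion (a minimal sketch with made-up variables `x`, `y`, `z`): `predict()` silently falls back on the variables used at fit time when `newdata` does not contain a column matching the model's predictor names, so with a model fitted as `lm(logdist ~ leq)`, a `data.frame(logdist = ...)` argument has no effect on the predictions.

```r
# Hypothetical illustration: newdata is ignored when its columns
# do not match the model's predictor names.
x <- 1:10
y <- 2 * x
fit <- lm(y ~ x)

# 'z' is not a predictor in 'fit', so predict() looks up 'x' in the
# environment of the formula and returns all 10 fitted values,
# not a single prediction at z = 5.
p <- predict(fit, newdata = data.frame(z = 5))
length(p)  # 10, not 1
```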

My results:

> pred2
           fit         lwr      upr
1   0.46661760 -2.64902534 3.582261
2  -1.06877685 -4.25518137 2.117628
3   0.05567379 -3.07361956 3.184967
4   0.08728485 -3.04081911 3.215389
5   0.59757771 -2.51454427 3.709700
6  -0.26495270 -3.40761110 2.877706
7   0.42145894 -2.69549138 3.538409
8  -0.66234891 -3.82482580 2.500128
9  -0.25140510 -3.39345106 2.890641
10 -0.23785750 -3.37929522 2.903580

I'm honestly beyond confused about how to interpret my data. I feel like I missed a step. I know that I need to reverse the transformation to make practical sense of my predictions, but even then the values seem well below what I would expect.

Best Answer

There are two parts to this answer. I will consider the utility of transformations for these data. Then I will suggest a quite different analysis.

Transformation needed and useful here? No

I see no reason whatsoever to transform distance or indeed noise either.

There is no requirement that responses or predictors in regression follow a normal (Gaussian) distribution. As a thought experiment, imagine $x$ is uniform on the integers and $y$ is $a + bx$. Then $y$ is also uniform; any regression program will retrieve the linear relation and produce the best possible figures of merit. Is it a problem in any sense that neither variable is normally distributed? No.
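The thought experiment is easy to check directly (a sketch, with arbitrary values $a = 2$, $b = 3$):

```r
# x uniform on the integers 0..99; y exactly linear in x.
# Neither variable is remotely normal, yet regression is untroubled.
x <- 0:99
y <- 2 + 3 * x

fit <- lm(y ~ x)
coef(fit)                # intercept 2 and slope 3 recovered exactly
summary(fit)$r.squared   # R^2 = 1: the best possible figure of merit
```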

Looking more closely at the data, here are some normal quantile plots of the original variables and of Senun's transform $\log(\text{dist} + 0.1)$. I find these immensely more useful than (e.g.) Kolmogorov-Smirnov or Shapiro-Wilk tests: they show not only how well data fit a normal but also in what ways they fall short.

[normal quantile plots of dist, log(dist + 0.1), and leq]

The labelled values on the vertical axes are those of the five-number summary, minimum, lower quartile, median, upper quartile and maximum. In the case of the distances, there are five distinct values with equal frequency, so they are reported as just those distinct values.

Note. The quantile plots here include only minor variations on conventional axis labelling and titling, but anyone interested in the details, or in a Stata implementation, may consult this paper.

The distances thus follow a distribution with 5 spikes and cannot get close to normal; any one-to-one transformation must yield another distribution with 5 spikes. If there were a problem with mild skewness, the chosen transformation makes it worse by flipping the skewness from positive to negative and increasing its magnitude. This is shown by calculation of both moments-based and L-moments-based measures. If there were a problem with mildly non-normal kurtosis (there isn't), the transformation leaves it a little closer to the normal state.

Those unfamiliar with, but interested by, L-moments should start with the Wikipedia entry; they might like to know that the L-skewness $\tau_3$ is 0 for every symmetric distribution, including the normal, while the L-kurtosis $\tau_4 \approx$ 0.123 for the normal. Below is Stata output using moments and lmoments from SSC; Stata uses the definition of kurtosis for which the normal yields 3. The first L-moment measures location and is identical to the mean; the second is a measure of scale. Location and scale detail is naturally just context here and not otherwise germane to discussing transformations.

----------------------------------------------------------------
         n = 60 |       mean          SD    skewness    kurtosis
----------------+-----------------------------------------------
           dist |     15.000      14.261       0.795       2.263
log(dist + 0.1) |      1.666       2.118      -1.113       2.758
            leq |     62.494       3.261      -0.192       1.979
----------------------------------------------------------------
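The moment-based row for distance is easy to reproduce in base R (a sketch; population moments with divisor $n$ are assumed, matching the skewness definition Stata uses):

```r
# The five sampling distances, 12 measurements at each
dist <- rep(c(0, 5, 10, 20, 40), each = 12)

# Population (divisor-n) moment-based skewness: m3 / m2^(3/2)
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew(dist)             # about  0.795: mild positive skewness
skew(log(dist + 0.1))  # about -1.113: the transform flips the sign
                       # and increases the magnitude of the skewness
```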

----------------------------------------------------------------
         n = 60 |        l_1         l_2         t_3         t_4
----------------+-----------------------------------------------
           dist |     15.000       7.729       0.229       0.012
log(dist + 0.1) |      1.666       1.087      -0.301       0.084
            leq |     62.494       1.887      -0.057       0.027
----------------------------------------------------------------

Noise is close to normally distributed, so even anyone worried about non-normality should leave it alone.

That deals with the mistaken stance that the transformation here is a good idea because the marginal distribution of distance is not normal. There is no problem; if there were one, the distribution's being a set of spikes would make it at least hard to solve; and in practice the chosen transformation makes the situation worse even on its own criteria.

I'll flag a further detail. The ad hoc constant $0.1$ added before taking logarithms minimally needs a rationale; the absence of one makes the transformation even more unsatisfactory.

That still leaves scope for a transformation to make sense because the relationship on the new scale(s) would be closer to linear (or, a much smaller deal, because scatter around the relationship would then be closer to equal).

Here the main evidence lies in the first instance in scatter plots. The plot in the question shows that the transformation just splits the data into two groups, which doesn't seem physically or statistically sensible. The scatter plot below doesn't indicate to me that transformation would help, but it's more crucial to think about what kind of model makes sense anyway.

A different analysis

We need more physical thinking. There is no doubt a substantial literature here which is being ignored. As amateur arm-waving, I postulate that the noise here is road noise locally raised above some background level, and that it should diminish rapidly at first and then more slowly with distance from the road. In fact some such thought may lie behind the unequal spacing in the sample design. So, one model matching those ideas is $\text{noise} = \alpha + \beta \exp(\gamma\ \text{distance})$, where we expect $\alpha, \beta > 0$ and $\gamma < 0$. Such models are a little tricky to fit, as nonlinear least squares is implied, but I'd assert that they make more sense than any linear model implying a constant slope.

I get $\alpha = 58.75, \beta = 7.3565, \gamma = -0.06770$ using nl in Stata.
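The same model can be fitted in R with nls (a sketch; the starting values here are guessed by eye from the data, and dist and leq are as in the question):

```r
dist <- rep(c(0, 5, 10, 20, 40), each = 12)
leq <- c(65.15, 68.55, 66.06, 65.99, 64.86, 66.77, 65.25, 67.65, 66.74,
         66.71, 64.43, 65.61, 65.02, 66.04, 61.54, 62.84, 64.05, 67.57,
         64.42, 62.58, 63.24, 62.86, 62.07, 66.08, 61.2, 65.73, 60.95,
         65.77, 59.45, 60.13, 62.82, 64.96, 61.13, 62.45, 60.43, 61.58,
         65.25, 60.04, 61.21, 59.44, 62.57, 62.41, 58.78, 58.88, 64.04,
         59.37, 57.89, 61.42, 58.4, 62.13, 58.93, 63.95, 56.25, 56.37,
         56.93, 57.37, 66.04, 57.07, 58.54, 57.65)

# noise = alpha + beta * exp(gamma * distance): noise decays from
# alpha + beta at the road towards the background level alpha
fit <- nls(leq ~ a + b * exp(g * dist),
           start = list(a = 58, b = 7, g = -0.07))
coef(fit)  # close to the Stata values a = 58.75, b = 7.36, g = -0.0677
```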

The larger variability at lower noise levels needs some discussion, but presumably quite different conditions may be found at equal distances from the road. Clearly, what matters is not the distance alone but what else is in the gap (e.g. buildings and other structures, uneven topography).

[scatter plot of noise versus distance with the fitted exponential decay curve]
