Solved – Model with non-linear transformation

data transformationfittingmodelingnonlinear regressionr

I don't understand this concept well and need help.

I was choosing whether to use a linear model or apply a non-linear transformation in my model formula. To do a diagnostic, I quickly plotted my data:

plotalldaily <- ggplot(amsd, aes(ImpressionsA, Leads.T)) + geom_point(color="orange")+geom_smooth()

enter image description here

From the plot, I guessed that a cubic polynomial transformation of my x variable should give me a better model fit. I referred this: http://www3.nd.edu/~rwilliam/stats2/l61.pdf . On Page 5, there is an explanation of polynomial models with cubic terms.

So I checked the model fit using two formulae- one with the non-linear transformation & another simple linear:

test1 <- lm(Leads.T~ImpressionsA, amsd)

test2 <- lm(Leads.T~I(ImpressionsA^3), amsd)

Strangely, the linear relationship is giving me a better model fit: Lower Standardized Error, Higher R-squared and Better Residuals Distribution.

TEST 1 Residuals

enter image description here

TEST 2 Residuals

enter image description here

I don't know what to make of it. Which model should I fit and what other kind of transformations should I try?

Best Answer

You haven't used the full form of the cubic representation and are missing two terms (i.e. have unintentionally constrained their parameters to equal zero):

$$Leads = \beta_{0} + \beta_{ImpA}ImpA + \beta_{ImpA^{2}}ImpA^{2} + \beta_{ImpA^{3}}ImpA^{3} + \varepsilon$$

Per the currently highest voted answer in Fitting polynomial model to data in R, you can do either

lm(Leads ~ ImpA + I(ImpA^2) + I(ImpA^{3}))

(as you indicate in your comment, but you had a missing parenthesis), or:

lm(Leads ~ poly(ImpA, 3, raw=TRUE))

Related Solutions

Regression – How to Interpret Polynomial Regression Bootstrap Output?

Did I correctly identify that a poly(Age,3) model was most appropriate?

I don't think so. There may be some issues of multiple testing here because you've done several likelihood ratio tests.

A better approach would be to have a linear model vs a spline model. Here is how to do this. I'll fit a model using natural splines with knots placed at the percentiles 5, 27.5, 50, 72.5, and 95 as per Frank Harrell's suggestions in his book Regression Modelling Strategies.

library(splines)

# Place the knots at the indicated quantiles, as per Frank Harrell's suggestions

knot_locs = quantile(d$Age, c(0.05, 0.275, 0.5, 0.725, 0.95))
fit1 = lm(Value ~ Age, data = d)
fit2 = lm(Value ~ ns(Age, knots = knot_locs), data = d)

anova(fit1, fit2)
Analysis of Variance Table

Model 1: Value ~ Age
Model 2: Value ~ ns(Age, knots = knot_locs)
  Res.Df     RSS Df Sum of Sq      F   Pr(>F)    
1    323 0.36426                                 
2    318 0.31802  5  0.046232 9.2456 3.12e-08 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Using a spline is a lower bias way to estimate a non-linear effect while keeping complexity of the model low. Here, I've used 5 knots for the spline, but you are free to use fewer or more. The F test here favors the spline model. We can plot it easily.

library(tidyverse)
library(modelr)

data_grid(d, Age) %>% 
  add_predictions(fit2) %>% 
  ggplot(aes(Age, pred))+
  geom_point(data=d, aes(Age, Value))+
  geom_line(color = 'red', size=1)

I think this approach, should the assumptions of the linear model be met (looks like they are to me from residual diagnostics), would eliminate the need for bootstrapping.

Lastly, can I state something like:

"We are 95% confident that in the true population, adults show a non-linear trend in Value overtime, with a larger decline in late life."

Uh...no, that is not typically something we would say. I might say something like...

"A spline model with 5 degrees of freedom is favoured over a model with a linear effect of age (F Value 9.24 on 6 and 318 degrees of freedom, p<0.001). Our model estimates that the expected Value should decline as individuals become older than 60".

Making statements about the decline specifically is...doable, but may take some care. I think here is where I would bootstrap and perhaps look at the derivative of the models. You could say "In X bootstrap replications, Y% of models had a negative derivative after age Z" or something, but I'm not confident. In any case, you would just have to be very careful about what you say. I'm not sure it would be worth it.

Edit: I agree with Whuber that the bootstrap is not necessary and that a joint confidence region gives information about the trend, but disagree that a confidence band is a good way to visualize the uncertainty about this particular problem. Shown below is a plot of the confidence band. Note that the model is more uncertain about the prediction in the right tail, that is not surprising. But from this plot alone, we can't tell if that increase in uncertainty is consistent with all models from the sampling distribution having negative derivative in the tail. For that, the bootstrap could help.

Here, I have replayed the model fit (including selection of knots, at the indicated percentiles) 250 times and plotted only the relevant part of the model. The overwhelming majority of samples have negative derivative, but a few don't. This plot, I argue, is more informative than the confidence band alone because we could visualize (and if we wanted to, estimate) how many bootstrap samples have negative first derivative in the tail. I leave that to you should you find that approach enticing.

Code


library(splines)
library(tidyverse)
library(rsample)
library(modelr)


d = # Your data here

do_fit<-function(data){
  model_data = assessment(data)
  knot_locs = quantile(model_data$Age, c(0.05, 0.275, 0.5, 0.725, 0.95))
  model = lm(Value ~ ns(Age, knots = knot_locs), data = model_data)
  model
}

do_pred<-function(data, model){
  data_grid(data, Age=seq(60, 100, 0.1))  %>% 
    add_predictions(model) 
}

# Estimate the derivative using finite differences
do_deriv<-function(data, model){
 data_grid(data, Age=seq(60, 100, 0.1))  %>% 
    add_predictions(model) %>% 
    mutate(deriv = (pred - lag(pred))/0.1)
}


straps = bootstraps(d, times=250) %>% 
  mutate(
    fits = map(splits, do_fit),
    preds = map(fits, ~do_pred(d, .x)),
    deriv = map(fits, ~do_deriv(d, .x))
  )


straps %>% 
  unnest(preds) %>% 
  ggplot(aes(Age, pred, group=id))+
  geom_line(alpha = 0.5)
  


straps %>% 
  unnest(deriv) %>% 
  ggplot(aes(Age, deriv, group=id))+
  geom_line(alpha = 0.5)
```

Best Answer

Related Solutions

Regression – How to Interpret Polynomial Regression Bootstrap Output?

Code

Related Question