Solved – How to use the predict function on an lm object where the IVs have been dynamically scaled

r, regression

Note: I've updated the example case code; there were some errors in the previous version.

Cross-posted to R-help, because I half suspect this is 'unexpected behaviour'.

I want to predict values from an existing lm (linear model, e.g.
lm.obj) result in R using a new set of predictor variables (e.g.
newdata). Specifically, I am interested in the predicted y value at the mean, 1 SD above the mean, and 1 SD below the mean for each predictor. However, it seems that because my linear model was made by calling scale() on the target predictors, predict() exits with an error: "Error in scale(xxA, center = 9.7846094491829, scale =
0.959413568556403) : object 'xxA' not found". By debugging predict(), I
can see that the error occurs in a call to model.frame(). By debugging
model.frame(), I can see the error occurs with this command: variables
<- eval(predvars, data, env); it seems likely that the error arises
because predvars looks like this:

list(scale(xxA, center = 10.2058714830537, scale = 0.984627257169526),
scale(xxB, center = 20.4491690881149, scale = 1.13765718273923))
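That predvars list can be inspected directly from the fitted model. A sketch, reusing the simulated data from the example below (the exact center/scale numbers will differ from those above, since the data are random):

```r
set.seed(1)  # any seed; the stored center/scale values depend on the data
dat <- data.frame(xxA = rnorm(20, 10), xxB = rnorm(20, 20))
dat$out <- with(dat, xxA + xxB + xxA * xxB + rnorm(20, 20))
lm.res.scale <- lm(out ~ scale(xxA) * scale(xxB), data = dat)

# The "predvars" attribute of the model's terms stores scale() calls with
# the original center and scale values baked in; predict() re-evaluates
# these calls on newdata:
attr(terms(lm.res.scale), "predvars")
```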

An example case:

   dat <- data.frame(xxA = rnorm(20,10), xxB = rnorm(20,20))
   dat$out <- with(dat,xxA+xxB+xxA*xxB+rnorm(20,20))
   lm.res.scale <- lm(out ~ scale(xxA)*scale(xxB),data=dat)
   my.data <- lm.res.scale$model #load the data from the lm object
   newdata <- expand.grid(X1=c(-1,0,1),X2=c(-1,0,1))
   names(newdata) <- c("scale(xxA)","scale(xxB)")
   newdata$Y <- predict(lm.res.scale,newdata) #errors: object 'xxA' not found

Is there something I could do before passing newdata or lm.obj to
predict() that would prevent the error? From the help file it looks
like I might be able to do something with the terms argument, but I
haven't quite figured out what I would need to do. Alternatively, is
there a fix for model.frame() that would prevent the error? Should
predict() behave this way?

Additional Details:
I really want a solution that, in one step, will provide values like:

   coef(lm.res.scale)[1]+
   coef(lm.res.scale)[2]*newdata[,1]+
   coef(lm.res.scale)[3]*newdata[,2]+
   coef(lm.res.scale)[4]*newdata[,1]*newdata[,2]

I think that should be exactly what predict() should do. That is, I think my example code should be equivalent to:

   dat <- data.frame(xxA = rnorm(20,10), xxB = rnorm(20,20))
   dat$out <- with(dat,xxA+xxB+xxA*xxB+rnorm(20,20))
   #rescaling outside of lm
   X1 <- with(dat,as.vector(scale(xxA)))
   X2 <- with(dat,as.vector(scale(xxB)))
   y  <- with(dat,out)
   lm.res.correct <- lm(y~X1*X2)
   my.data <- lm.res.correct$model #load the data from the lm object
   newdata <- expand.grid(X1=c(-1,0,1),X2=c(-1,0,1))
   #No need to rename newdata as it matches my lm object already
   newdata$Y <- predict(lm.res.correct,newdata)

Notably, adjusting my formula to include as.vector() does not solve the problem with my attempt to use predict() directly with newdata.

Best Answer

When you use predict() with the newdata argument, you must supply a data.frame whose column names match the variables in the model formula. In your code you have

newdata <- expand.grid(X1=c(-1,0,1),X2=c(-1,0,1))
names(newdata) <- c("scale(xxA)","scale(xxB)")

But the formula supplied to the lm object is

out ~ scale(xxA)*scale(xxB)

So when you call predict(), it tries to find objects xxA and xxB in your data and apply the function scale() to them, as per your initial request. But all R finds are columns named scale(xxA) and scale(xxB), so naturally it produces the error.
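One way to see the mismatch (a sketch, reusing the question's setup): the raw variable names the formula refers to are not among the renamed columns of newdata:

```r
set.seed(1)
dat <- data.frame(xxA = rnorm(20, 10), xxB = rnorm(20, 20))
dat$out <- with(dat, xxA + xxB + xxA * xxB + rnorm(20, 20))
lm.res.scale <- lm(out ~ scale(xxA) * scale(xxB), data = dat)

newdata <- expand.grid(X1 = c(-1, 0, 1), X2 = c(-1, 0, 1))
names(newdata) <- c("scale(xxA)", "scale(xxB)")

# Variables predict() must find in newdata, versus what it actually gets:
setdiff(all.vars(delete.response(terms(lm.res.scale))), names(newdata))
# xxA and xxB are missing, hence "object 'xxA' not found"
```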

Now if you supply a correctly named newdata

newdata <- expand.grid(xxA=c(-1,0,1),xxB=c(-1,0,1)) 

and try to use it for prediction

newdata$Y <- predict(lm.res.scale,newdata)

R will remember how it scaled the original data and apply the same scaling to your new data. In this case the supplied value -1 for xxA will have the original mean of xxA subtracted and be divided by the original standard deviation of xxA. If you want the prediction at 1 SD below the mean, you need to supply that value on the original scale. In your case newdata should then look like this:

newdata <- expand.grid(xxA=mean(dat$xxA)+sd(dat$xxA)*c(-1,0,1),
                       xxB=mean(dat$xxB)+sd(dat$xxB)*c(-1,0,1))
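A quick sanity check of this behaviour (a sketch, assuming the dat and lm.res.scale from the question's example): at the original means, scale() maps both predictors to 0, so the prediction should reduce to the intercept:

```r
set.seed(1)
dat <- data.frame(xxA = rnorm(20, 10), xxB = rnorm(20, 20))
dat$out <- with(dat, xxA + xxB + xxA * xxB + rnorm(20, 20))
lm.res.scale <- lm(out ~ scale(xxA) * scale(xxB), data = dat)

# At the original means both scaled predictors (and their interaction)
# are 0, so the fitted value equals the intercept:
at.means <- data.frame(xxA = mean(dat$xxA), xxB = mean(dat$xxB))
all.equal(unname(predict(lm.res.scale, at.means)),
          unname(coef(lm.res.scale)[1]))  # TRUE
```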

I gathered all the solutions in one place to compare:

##Prepare data
set.seed(1)
dat <- data.frame(xxA = rnorm(20,10), xxB = rnorm(20,20))
dat$out <- with(dat,xxA+xxB+xxA*xxB+rnorm(20,20))
dat <- within(dat,{
    X1 <- as.numeric(scale(xxA))
    X2 <- as.numeric(scale(xxB))
    })

##Estimate the models
lm.res.scale <- lm(out ~ scale(xxA)*scale(xxB),data=dat)
lm.res.correct <- lm(out~X1*X2,data=dat)
lm.mod <- lm(out ~ I(scale(xxA))*I(scale(xxB)), data=dat)
library(rms) #Glm() and Predict() come from the rms package
rms.res <- Glm(out ~ scale(xxA)*scale(xxB),data=dat)


##Build data for prediction
newdata <- expand.grid(xxA=c(-1,0,1),xxB=c(-1,0,1))
newdata$X1<-newdata$xxA
newdata$X2<-newdata$xxB

##Gather the predictions
newdata$Yscaled <- predict(lm.res.scale,newdata)
newdata$Ycorrect <- predict(lm.res.correct,newdata)
newdata$YwithI <- predict(lm.mod,newdata)
newdata$Ywithrms <- Predict(rms.res,xxA=c(-1,0,1),xxB=c(-1,0,1),conf.int=FALSE)[,3]

##Build alternative data for prediction
newdata2 <- expand.grid(xxA=mean(dat$xxA)+sd(dat$xxA)*c(-1,0,1),
                        xxB=mean(dat$xxB)+sd(dat$xxB)*c(-1,0,1))

#Predict
newdata$Yorigsc <- predict(lm.res.scale,newdata2)

I used set.seed() so the results should be the same if you repeat this. The resulting newdata looks like this:

> newdata
  xxA xxB X1 X2  Yscaled Ycorrect   YwithI Ywithrms  Yorigsc
1  -1  -1 -1 -1 25.79709 225.9562 221.7517 221.7517 225.9562
2   0  -1  0 -1 25.63030 244.5181 243.0404 243.0404 244.5181
3   1  -1  1 -1 25.46351 263.0800 264.3291 264.3291 263.0800
4  -1   0 -1  0 25.36341 234.6981 231.7012 231.7012 234.6981
5   0   0  0  0 26.21499 254.0704 254.0704 254.0704 254.0704
6   1   0  1  0 27.06657 273.4427 276.4396 276.4396 273.4427
7  -1   1 -1  1 24.92972 243.4400 241.6507 241.6507 243.4400
8   0   1  0  1 26.79967 263.6227 265.1004 265.1004 263.6227
9   1   1  1  1 28.66962 283.8054 288.5501 288.5501 283.8054

As expected, Yscaled is not the result we need, since the original scaling is applied to the supplied values. When we scale the data before calling lm (Ycorrect) and when we supply values on the original, unscaled scale (Yorigsc), the results coincide and are the ones needed.

Now the other prediction methods give different results. This happens because R is forced to forget the original scaling by the formula

out ~ I(scale(xxA))*I(scale(xxB))

or by the rms package. When we use predict(), the values are still scaled, but now according to the supplied values of xxA and xxB. This is best illustrated by the following statement, which mimics what predict() does with the data:

> eval(expression(cbind(scale(xxA),scale(xxB))),envir=as.list(newdata))
           [,1]      [,2]
 [1,] -1.154701 -1.154701
 [2,]  0.000000 -1.154701
 [3,]  1.154701 -1.154701
 [4,] -1.154701  0.000000
 [5,]  0.000000  0.000000
 [6,]  1.154701  0.000000
 [7,] -1.154701  1.154701
 [8,]  0.000000  1.154701
 [9,]  1.154701  1.154701

We can see that in this case the scaling does not change the original values very much, which is arguably even worse: the values from predict() look reasonable when in fact they are wrong.
