Note: I've updated the example case code; there were some errors in the previous version.
Cross posted to R-help, because I half suspect this is 'unexpected behaviour'.
I want to predict values from an existing lm (linear model) result in R, e.g. lm.obj, using a new set of predictor variables, e.g. newdata. Specifically, I am interested in the predicted y value at the mean, 1 SD above the mean, and 1 SD below the mean for each predictor. However, because my linear model was fitted by calling scale() on the target predictors, predict() exits with an error: "Error in scale(xxA, center = 9.7846094491829, scale = 0.959413568556403) : object 'xxA' not found". By debugging predict(), I can see that the error occurs in a call to model.frame(). By debugging model.frame(), I can see that the error occurs with this command: variables <- eval(predvars, data, env). It seems likely that the error arises because predvars looks like this:
list(scale(xxA, center = 10.2058714830537, scale = 0.984627257169526),
scale(xxB, center = 20.4491690881149, scale = 1.13765718273923))
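For reference, predvars can also be inspected on a fitted model without stepping through the debugger. This is a minimal sketch, assuming a model fitted as in the example case; the seed and the object names fit and pv are illustrative only.

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible
dat <- data.frame(xxA = rnorm(20, 10), xxB = rnorm(20, 20))
dat$out <- with(dat, xxA + xxB + xxA * xxB + rnorm(20, 20))
fit <- lm(out ~ scale(xxA) * scale(xxB), data = dat)

# Stored calls for the predictors; center/scale hold the training mean and SD
pv <- attr(delete.response(terms(fit)), "predvars")
print(pv)

# These calls are evaluated inside newdata, so the raw variable names are needed
print(all.vars(delete.response(terms(fit))))
```

The stored calls refer to xxA and xxB, not to columns named "scale(xxA)" or "scale(xxB)".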
An example case:
dat <- data.frame(xxA = rnorm(20,10), xxB = rnorm(20,20))
dat$out <- with(dat,xxA+xxB+xxA*xxB+rnorm(20,20))
lm.res.scale <- lm(out ~ scale(xxA)*scale(xxB),data=dat)
my.data <- lm.res.scale$model #load the data from the lm object
newdata <- expand.grid(X1=c(-1,0,1),X2=c(-1,0,1))
names(newdata) <- c("scale(xxA)","scale(xxB)")
newdata$Y <- predict(lm.res.scale,newdata)
Is there something I could do before passing newdata or lm.obj to
predict() that would prevent the error? I tried:
From the help file, it looks like I might be able to do something with the terms argument, but I haven't quite figured out what I would need to do. Alternatively, is there a fix for model.frame that would prevent the error? Should predict() behave this way?
Additional Details:
However, I really want a solution that, in one step, will provide values like:
coef(lm.res.scale)[1]+
coef(lm.res.scale)[2]*newdata[,1]+
coef(lm.res.scale)[3]*newdata[,2]+
coef(lm.res.scale)[4]*newdata[,1]*newdata[,2]
I think that should be exactly what predict() should do. That is, I think my example code should be equivalent to:
dat <- data.frame(xxA = rnorm(20,10), xxB = rnorm(20,20))
dat$out <- with(dat,xxA+xxB+xxA*xxB+rnorm(20,20))
#rescaling outside of lm
X1 <- with(dat,as.vector(scale(xxA)))
X2 <- with(dat,as.vector(scale(xxB)))
y <- with(dat,out)
lm.res.correct <- lm(y~X1*X2)
my.data <- lm.res.correct$model #load the data from the lm object
newdata <- expand.grid(X1=c(-1,0,1),X2=c(-1,0,1))
#No need to rename newdata as it matches my lm object already
newdata$Y <- predict(lm.res.correct,newdata)
Notably, adjusting my formula to include as.vector() does not solve the problem with my attempt to use predict() directly with newdata.
Best Answer
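When you use predict with the newdata argument, you must supply a data.frame whose column names match the variables used in the formula. In your code the columns of newdata are named "scale(xxA)" and "scale(xxB)", but the formula supplied to lm is out ~ scale(xxA)*scale(xxB).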
So when you call predict, it tries to find the objects xxA and xxB in your data and apply the function scale to them, as per your initial request. But all R finds are objects named scale(xxA) and scale(xxB), so naturally it produces the error.
Now if you supply a correctly named newdata and use it for prediction, R will remember how it scaled the original data and apply the same scaling to your new data. In this case the supplied value -1 for xxA will have the original mean of xxA subtracted from it and will then be divided by the original standard deviation of xxA. If you want the prediction at 1 SD below the mean, you need to supply that value in the original units. In your case newdata should therefore hold, for each predictor, its mean minus one SD, its mean, and its mean plus one SD.
I gathered all the solutions in one place to compare.
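A sketch of how such a comparison could be set up, reusing the names from the question. The seed value and the helper object newdata.orig are assumptions made for this illustration; the column names Yscaled, Ycorrect and Yorigsc follow the discussion here.

```r
set.seed(123)  # arbitrary seed chosen for this sketch
dat <- data.frame(xxA = rnorm(20, 10), xxB = rnorm(20, 20))
dat$out <- with(dat, xxA + xxB + xxA * xxB + rnorm(20, 20))

# Model with scale() inside the formula
lm.res.scale <- lm(out ~ scale(xxA) * scale(xxB), data = dat)

# Model with the predictors rescaled before calling lm()
X1 <- with(dat, as.vector(scale(xxA)))
X2 <- with(dat, as.vector(scale(xxB)))
y <- dat$out
lm.res.correct <- lm(y ~ X1 * X2)

# Grid in standardised units: -1, 0, +1
newdata <- expand.grid(xxA = c(-1, 0, 1), xxB = c(-1, 0, 1))

# Same grid in original units: mean - SD, mean, mean + SD (hypothetical helper)
newdata.orig <- expand.grid(
  xxA = mean(dat$xxA) + c(-1, 0, 1) * sd(dat$xxA),
  xxB = mean(dat$xxB) + c(-1, 0, 1) * sd(dat$xxB)
)

# Raw -1/0/1 passed as xxA/xxB: rescaled again with the training mean and SD
newdata$Yscaled <- predict(lm.res.scale, newdata)
# Reference values from the model fitted on pre-scaled predictors
newdata$Ycorrect <- predict(lm.res.correct,
                            data.frame(X1 = newdata$xxA, X2 = newdata$xxB))
# Original-unit values passed to the scale() model
newdata$Yorigsc <- predict(lm.res.scale, newdata.orig)

print(newdata)
print(all.equal(unname(newdata$Ycorrect), unname(newdata$Yorigsc)))
```

Under these assumptions Ycorrect and Yorigsc agree up to floating-point error, while Yscaled corresponds to predictor values roughly eleven and twenty-one SDs below the means.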
I used set.seed, so the results should be the same if you try to repeat it. In the resulting newdata, Yscaled, as expected, does not give the result we need, since the original scaling is applied to the supplied values. When we scale the data before calling lm (Ycorrect) and when we supply the alternative unscaled values (Yorigsc), the results coincide and are the ones needed.
Now the other prediction methods give different results. This happens because R is forced to forget the original scaling, either through the formula or through package rms. When we then use predict, the values are still scaled, but now according to the supplied values of xxA and xxB. This is best illustrated by a statement that rescales the supplied values by their own mean and standard deviation, which in some way mimics what predict does with the data. In this case the scaling does not change the original values too much, but this is even worse, since the values from predict look reasonable when in fact they are wrong.
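The rescaling effect can be sketched as follows, assuming the standardised grid from the question is passed as xxA and xxB and scale() is re-evaluated on it without the stored center and scale. The column names xxA.rescaled and xxB.rescaled are made up for this illustration.

```r
# The grid of standardised values used in the question
newdata <- expand.grid(xxA = c(-1, 0, 1), xxB = c(-1, 0, 1))

# scale() with no stored parameters uses the grid's own mean (0) and SD (about 0.866)
rescaled <- data.frame(
  xxA.rescaled = as.vector(scale(newdata$xxA)),
  xxB.rescaled = as.vector(scale(newdata$xxB))
)

print(cbind(newdata, rescaled))
```

The supplied -1, 0, 1 become roughly -1.155, 0, 1.155: close enough to look plausible, yet not the values that were asked for.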