Solved – How to use the predict function on an lm object where the IVs have been dynamically scaled

r, regression

Note: I've updated the example case code; there were some errors in the previous version.

Cross-posted to R-help, because I half suspect this is 'unexpected behaviour'.

I want to predict values from an existing lm (linear model, e.g.
lm.obj) result in R using a new set of predictor variables (e.g.
newdata). Specifically, I am interested in the predicted y value at the mean, 1 SD above the mean, and 1 SD below the mean for each predictor. However, it seems that because my linear model was made by calling scale() on the target predictors, predict() exits with an error: "Error in scale(xxA, center = 9.7846094491829, scale =
0.959413568556403) : object 'xxA' not found". By debugging predict(), I
can see that the error occurs in a call to model.frame(). By debugging
model.frame(), I can see the error occurs with this command: variables
<- eval(predvars, data, env); it seems likely that the error arises
because predvars looks like this:

list(scale(xxA, center = 10.2058714830537, scale = 0.984627257169526),
scale(xxB, center = 20.4491690881149, scale = 1.13765718273923))
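That predvars list can be inspected directly from the fitted model. A sketch, reusing the simulated data from the example below (the exact center/scale numbers will differ from those above, since the data are random):

```r
set.seed(1)  # any seed; the stored center/scale values depend on the data
dat <- data.frame(xxA = rnorm(20, 10), xxB = rnorm(20, 20))
dat$out <- with(dat, xxA + xxB + xxA * xxB + rnorm(20, 20))
lm.res.scale <- lm(out ~ scale(xxA) * scale(xxB), data = dat)

# The "predvars" attribute of the model's terms stores scale() calls with
# the original center and scale values baked in; predict() re-evaluates
# these calls on newdata:
attr(terms(lm.res.scale), "predvars")
```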

An example case:

   dat <- data.frame(xxA = rnorm(20,10), xxB = rnorm(20,20))
   dat$out <- with(dat,xxA+xxB+xxA*xxB+rnorm(20,20))
   lm.res.scale <- lm(out ~ scale(xxA)*scale(xxB),data=dat)
   my.data <- lm.res.scale$model #load the data from the lm object
   newdata <- expand.grid(X1=c(-1,0,1),X2=c(-1,0,1))
   names(newdata) <- c("scale(xxA)","scale(xxB)")
   newdata$Y <- predict(lm.res.scale,newdata) #errors: object 'xxA' not found

Is there something I could do before passing newdata or lm.obj to
predict() that would prevent the error? From the help file it looks
like I might be able to do something with the terms argument, but I
haven't quite figured out what I would need to do. Alternatively, is
there a fix for model.frame() that would prevent the error? Should
predict() behave this way?

Additional Details:
I really want a solution that, in one step, will provide values like:

   coef(lm.res.scale)[1]+
   coef(lm.res.scale)[2]*newdata[,1]+
   coef(lm.res.scale)[3]*newdata[,2]+
   coef(lm.res.scale)[4]*newdata[,1]*newdata[,2]

I think that should be exactly what predict() should do. That is, I think my example code should be equivalent to:

   dat <- data.frame(xxA = rnorm(20,10), xxB = rnorm(20,20))
   dat$out <- with(dat,xxA+xxB+xxA*xxB+rnorm(20,20))
   #rescaling outside of lm
   X1 <- with(dat,as.vector(scale(xxA)))
   X2 <- with(dat,as.vector(scale(xxB)))
   y  <- with(dat,out)
   lm.res.correct <- lm(y~X1*X2)
   my.data <- lm.res.correct$model #load the data from the lm object
   newdata <- expand.grid(X1=c(-1,0,1),X2=c(-1,0,1))
   #No need to rename newdata as it matches my lm object already
   newdata$Y <- predict(lm.res.correct,newdata)

Notably, adjusting my formula to include as.vector() does not solve the problem with my attempt to use predict() directly with newdata.

Best Answer

When you use predict() with the newdata argument, you must supply a data.frame whose column names match the variables in the model formula. In your code you have

newdata <- expand.grid(X1=c(-1,0,1),X2=c(-1,0,1))
names(newdata) <- c("scale(xxA)","scale(xxB)")

But the formula supplied to the lm object is

out ~ scale(xxA)*scale(xxB)

So when you call predict(), it tries to find objects xxA and xxB in your data and apply the function scale() to them, as per your initial request. But all R finds are columns named scale(xxA) and scale(xxB), so naturally it produces the error.
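One way to see the mismatch (a sketch, reusing the question's setup): the raw variable names the formula refers to are not among the renamed columns of newdata:

```r
set.seed(1)
dat <- data.frame(xxA = rnorm(20, 10), xxB = rnorm(20, 20))
dat$out <- with(dat, xxA + xxB + xxA * xxB + rnorm(20, 20))
lm.res.scale <- lm(out ~ scale(xxA) * scale(xxB), data = dat)

newdata <- expand.grid(X1 = c(-1, 0, 1), X2 = c(-1, 0, 1))
names(newdata) <- c("scale(xxA)", "scale(xxB)")

# Variables predict() must find in newdata, versus what it actually gets:
setdiff(all.vars(delete.response(terms(lm.res.scale))), names(newdata))
# xxA and xxB are missing, hence "object 'xxA' not found"
```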

Now if you supply a correctly named newdata

newdata <- expand.grid(xxA=c(-1,0,1),xxB=c(-1,0,1)) 

and try to use it for prediction

newdata$Y <- predict(lm.res.scale,newdata)

R will remember how it scaled the original data and apply the same scaling to your new data. In this case the supplied value -1 for xxA will have the original mean of xxA subtracted and be divided by the original standard deviation of xxA. If you want the prediction at 1 SD below the mean, you need to supply that value on the original scale. In your case newdata should then look like this:

newdata <- expand.grid(xxA=mean(dat$xxA)+sd(dat$xxA)*c(-1,0,1),
                       xxB=mean(dat$xxB)+sd(dat$xxB)*c(-1,0,1))
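A quick sanity check of this behaviour (a sketch, assuming the dat and lm.res.scale from the question's example): at the original means, scale() maps both predictors to 0, so the prediction should reduce to the intercept:

```r
set.seed(1)
dat <- data.frame(xxA = rnorm(20, 10), xxB = rnorm(20, 20))
dat$out <- with(dat, xxA + xxB + xxA * xxB + rnorm(20, 20))
lm.res.scale <- lm(out ~ scale(xxA) * scale(xxB), data = dat)

# At the original means both scaled predictors (and their interaction)
# are 0, so the fitted value equals the intercept:
at.means <- data.frame(xxA = mean(dat$xxA), xxB = mean(dat$xxB))
all.equal(unname(predict(lm.res.scale, at.means)),
          unname(coef(lm.res.scale)[1]))  # TRUE
```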

I gathered all the solutions in one place to compare:

##Prepare data
set.seed(1)
dat <- data.frame(xxA = rnorm(20,10), xxB = rnorm(20,20))
dat$out <- with(dat,xxA+xxB+xxA*xxB+rnorm(20,20))
dat <- within(dat,{
    X1 <- as.numeric(scale(xxA))
    X2 <- as.numeric(scale(xxB))
    })

##Estimate the models
lm.res.scale <- lm(out ~ scale(xxA)*scale(xxB),data=dat)
lm.res.correct <- lm(out~X1*X2,data=dat)
lm.mod <- lm(out ~ I(scale(xxA))*I(scale(xxB)), data=dat)
library(rms) #Glm() and Predict() come from the rms package
rms.res <- Glm(out ~ scale(xxA)*scale(xxB),data=dat)


##Build data for prediction
newdata <- expand.grid(xxA=c(-1,0,1),xxB=c(-1,0,1))
newdata$X1<-newdata$xxA
newdata$X2<-newdata$xxB

##Gather the predictions
newdata$Yscaled <- predict(lm.res.scale,newdata)
newdata$Ycorrect <- predict(lm.res.correct,newdata)
newdata$YwithI <- predict(lm.mod,newdata)
newdata$Ywithrms <- Predict(rms.res,xxA=c(-1,0,1),xxB=c(-1,0,1),conf.int=FALSE)[,3]

##Build alternative data for prediction
newdata2 <- expand.grid(xxA=mean(dat$xxA)+sd(dat$xxA)*c(-1,0,1),
                        xxB=mean(dat$xxB)+sd(dat$xxB)*c(-1,0,1))

#Predict
newdata$Yorigsc <- predict(lm.res.scale,newdata2)

I used set.seed() so the results should be the same if you repeat this. The resulting newdata looks like this:

> newdata
  xxA xxB X1 X2  Yscaled Ycorrect   YwithI Ywithrms  Yorigsc
1  -1  -1 -1 -1 25.79709 225.9562 221.7517 221.7517 225.9562
2   0  -1  0 -1 25.63030 244.5181 243.0404 243.0404 244.5181
3   1  -1  1 -1 25.46351 263.0800 264.3291 264.3291 263.0800
4  -1   0 -1  0 25.36341 234.6981 231.7012 231.7012 234.6981
5   0   0  0  0 26.21499 254.0704 254.0704 254.0704 254.0704
6   1   0  1  0 27.06657 273.4427 276.4396 276.4396 273.4427
7  -1   1 -1  1 24.92972 243.4400 241.6507 241.6507 243.4400
8   0   1  0  1 26.79967 263.6227 265.1004 265.1004 263.6227
9   1   1  1  1 28.66962 283.8054 288.5501 288.5501 283.8054

As expected, Yscaled is not the result we need, since the original scaling is applied to the supplied values. When we scale the data before calling lm (Ycorrect) and when we supply values on the original, unscaled scale (Yorigsc), the results coincide and are the ones needed.

Now the other prediction methods give different results. This happens because R is forced to forget the original scaling by the formula

out ~ I(scale(xxA))*I(scale(xxB))

or by the rms package. When we use predict(), the values are still scaled, but now according to the supplied values of xxA and xxB. This is best illustrated by the following statement, which mimics what predict() does with the data:

> eval(expression(cbind(scale(xxA),scale(xxB))),envir=as.list(newdata))
           [,1]      [,2]
 [1,] -1.154701 -1.154701
 [2,]  0.000000 -1.154701
 [3,]  1.154701 -1.154701
 [4,] -1.154701  0.000000
 [5,]  0.000000  0.000000
 [6,]  1.154701  0.000000
 [7,] -1.154701  1.154701
 [8,]  0.000000  1.154701
 [9,]  1.154701  1.154701

We can see that in this case the scaling does not change the original values very much, which is arguably even worse: the values from predict() look reasonable when in fact they are wrong.
