I think this is a bug in the predict function (and hence my fault), which in fact nlme does not share. (Edit: this should be fixed in the most recent R-forge version of lme4.) See below for an example ...
I think your understanding of orthogonal polynomials is probably just fine. The tricky thing you need to know about them, if you are trying to write a predict method for a class of models, is that the basis for the orthogonal polynomials is defined with respect to a particular set of data. If you naively (like I did!) use model.matrix to generate the design matrix for a new set of data, you get a new basis -- which no longer makes sense with the old parameters. Until I get this fixed, I may need to add a check that warns people that predict doesn't work with orthogonal polynomial bases (or spline bases, which have the same property).
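Here is a minimal illustration of that data-dependence (not part of the original example; the object names are just for demonstration). Recomputing poly() on new data gives a different basis, whereas calling predict() on the stored poly object evaluates the original basis at the new points:
x.old <- 1:10
x.new <- seq(2, 5, by = 0.5)
b <- poly(x.old, 2)                 ## basis defined by x.old
head(poly(x.new, 2))                ## a new, different basis
head(predict(b, newdata = x.new))   ## the old basis evaluated at x.new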
## simulate data: 10 groups (f) with a quadratic trend in x and
## group-specific deviations in intercept, slope, and quadratic term
d <- expand.grid(x=seq(0,1,length=50),f=LETTERS[1:10])
set.seed(1001)
u.int   <- rnorm(10,sd=0.5)   ## random intercepts
u.slope <- rnorm(10,sd=0.2)   ## random slopes
u.quad  <- rnorm(10,sd=0.1)   ## random quadratic coefficients
d <- transform(d,
               ypred = (1+u.int[f]) +
                   (2+u.slope[f])*x -
                   (1+u.quad[f])*x^2)
d$y <- rnorm(nrow(d),mean=d$ypred,sd=0.2)
library(ggplot2)
ggplot(d,aes(x=x,y=y,colour=f))+geom_line()+
    geom_line(aes(y=ypred),linetype=2)   ## dashed lines = true group means
library(lme4)
## fixed effect as a raw quadratic polynomial:
fm1 <- lmer(y~poly(x,2,raw=TRUE)+(1|f)+(0+x|f)+(0+I(x^2)|f),
            data=d)
## fixed effect as an orthogonal quadratic polynomial:
fm2 <- lmer(y~poly(x,2)+(1|f)+(0+x|f)+(0+I(x^2)|f),
            data=d)
newdat <- data.frame(x=unique(d$x))
## population-level predictions (random effects set to zero);
## in current lme4 the argument is re.form (older development
## versions called it REform)
plot(predict(fm1,newdata=newdat,re.form=NA))
lines(predict(fm2,newdata=newdat,re.form=NA),col=2)
detach("package:lme4")
library(nlme)
fm3 <- lme(y~poly(x,2,raw=TRUE),
random=list(~1|f,~0+x|f,~0+I(x^2)|f),
data=d)
VarCorr(fm3)
fm4 <- lme(y~poly(x,2),
random=list(~1|f,~0+x|f,~0+I(x^2)|f),
data=d)
newdat <- data.frame(x=unique(d$x))
lines(predict(fm3,newdata=newdat,level=0),col=4)
lines(predict(fm4,newdata=newdat,level=0),col=5)
That is not a necessary result, but it is certainly plausible. If you turn a quantitative predictor into a binary categorical predictor you lose a lot of information: with the categorical predictor you only know whether an observation is below or above a certain threshold (e.g. the mean or median), while with a quantitative predictor you also know how far below or above the threshold that observation is. It is not unreasonable to suspect that if you feed your model more information (i.e. add your variable as a quantitative predictor), you will get more precise results.
One of the reasons why this is not necessarily true is that if you add a variable to a regression model as a quantitative variable, you assume its effect to be linear. If the effect is strongly non-linear, that may undo the advantage of the quantitative predictor. There are, however, easy ways to check whether that is the case (plots of residuals against predictors), and easy ways to solve it (adding your variables as splines or polynomials is probably the easiest solution).
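As a rough sketch of that check and fix (not from the original answer; it assumes a data frame dat with response y and continuous predictor x):
fit.lin <- lm(y ~ x, data = dat)
plot(dat$x, resid(fit.lin))        ## curvature here suggests non-linearity
abline(h = 0, lty = 2)
library(splines)
fit.spl <- lm(y ~ ns(x, df = 3), data = dat)   ## natural cubic spline in x
anova(fit.lin, fit.spl)            ## does the spline improve the fit?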
Best Answer
It's a little hard to answer without a specific example, but in general you can use orthogonal polynomials on the continuous variables and still include the categorical variables. Here's how it might work on some random data:
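Here's one way it might look with simulated data (the variable names are illustrative, not from the original question):
set.seed(101)
dd <- data.frame(x = runif(200),
                 g = factor(sample(c("a","b","c"), 200, replace = TRUE)))
dd$y <- with(dd, 1 + 2*x - 3*x^2 + as.numeric(g) + rnorm(200, sd = 0.3))
## orthogonal quadratic in the continuous x, categorical g entered as usual
m <- lm(y ~ poly(x, 2) + g, data = dd)
summary(m)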