Solved – How to encode factors as dummy variables when using stepPlr

logistic, multiple-regression, r, regression

When using the step.plr() function in the stepPlr package, if my predictors are factors, do I need to encode them as dummy variables manually before passing them to the function? I know that I can specify "level", but how the "level" parameter works is confusing to me.
My understanding is that I need to tell step.plr() explicitly which levels of each factor should be encoded as dummy variables, intentionally leaving one level out.

Let's consider a simple example. Suppose I have one categorical predictor with 4 levels and a binary response. Normally, if I use glm() to fit a logistic regression model, glm() automatically converts the categorical predictor into 3 dummy variables. Now in step.plr(), do I specify the "level" parameter for that predictor with 4 levels or 3 levels? The "Help" section is vague, and says:

If the j-th column of x is discrete, level[[ j ]] is the set of levels for the categorical factor.

Does it mean I should tell step.plr() about all 4 levels, or I should make an intelligent decision myself and tell step.plr() to use only 3 levels?
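For reference, the glm() behaviour described above can be checked with base R alone, because glm() builds its design matrix through model.matrix(). The following is a minimal sketch on made-up data (the factor `f` and its level names are hypothetical); it assumes R's default treatment contrasts and does not involve step.plr() at all.

```r
# Hypothetical toy data: one factor with 4 declared levels.
set.seed(1)
f <- factor(sample(c("a", "b", "c", "d"), 40, replace = TRUE),
            levels = c("a", "b", "c", "d"))

# With the default treatment contrasts, the design matrix has an
# intercept plus 3 dummy columns; level "a" is the reference level.
mm <- model.matrix(~ f)
colnames(mm)
```

The reference level "a" gets no column of its own; its effect is absorbed into the intercept.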

==============UPDATE (16 Oct 2012)=============

The following example demonstrates the problem with step.plr()'s automatic dummy variable encoding. It is a slight modification of the code in the function's help section.

library(stepPlr)
set.seed(100)

n <- 100
p <- 3
z <- matrix(sample(seq(3),n*p,replace=TRUE),nrow=n)
x <- data.frame(x1=factor(z[ ,1]),x2=factor(z[ ,2]),
                x3=factor(sample(seq(3), n, replace=TRUE, prob=c(0.2, 0.5, 0.3))),
                x4=factor(sample(seq(3), n, replace=TRUE, prob=c(0.1, 0.3, 0.6))))
y <- sample(c(0,1),n,replace=TRUE)
fit <- step.plr(x,y, cp="aic")
summary(fit)

And here's an excerpt of the result:

Call:
plr(x = ix0, y = y, weights = weights, offset.subset = offset.subset, 
    offset.coefficients = offset.coefficients, lambda = lambda, 
    cp = cp)

Coefficients:
          Estimate Std.Error z value Pr(>|z|)
Intercept  0.91386   5.04780   0.181    0.856
x4.1       1.33787   4.61089   0.290    0.772
x4.2      -1.70462   4.91240  -0.347    0.729
x4.3       0.36675   3.18857   0.115    0.908
x3.1:x4.1  7.04901  14.35112   0.491    0.623
x3.1:x4.2 -5.50973  15.53674  -0.355    0.723
x3.1:x4.3 -0.50012   7.95651  -0.063    0.950

You can see that all levels, that is, (1,2,3), are used to fit the model. But normally you only need two dummy variables to encode a predictor with 3 levels.
On the other hand, if you use glm():

glm(y~.^2, data=x, family=binomial)

you will get the usual dummy variable encoding, with only two dummy variables for each three-level factor.
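The difference between the two encodings can be reproduced in base R without stepPlr. The sketch below uses a small, hypothetical balanced design (not the simulated data above) and builds both a reference-level design matrix and a full indicator matrix that keeps every level. The full matrix has more columns than its rank, which is why an unpenalized glm() fit drops one level per factor. The stepPlr package describes itself as fitting L2-penalized logistic regression, and a quadratic penalty generally makes such a redundant coding estimable; that this is the reason step.plr() keeps all levels is an assumption here, not something stated in the help text quoted above.

```r
# Hypothetical balanced toy design: two 3-level factors, 9 rows.
x <- expand.grid(x3 = factor(1:3), x4 = factor(1:3))

# Reference-level coding, as glm() uses by default:
# 1 intercept + 2 + 2 dummy columns.
mm_ref <- model.matrix(~ x3 + x4, data = x)

# Full indicator coding: keep every level of every factor.
mm_full <- model.matrix(~ x3 + x4, data = x,
                        contrasts.arg = lapply(x, contrasts, contrasts = FALSE))

ncol(mm_ref)        # number of columns under reference-level coding
ncol(mm_full)       # number of columns when all levels are kept
qr(mm_full)$rank    # rank is lower than the column count: redundant columns
```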

Best Answer

See the first example given in the help for step.plr():

n <- 100
p <- 3
z <- matrix(sample(seq(3),n*p,replace=TRUE),nrow=n)
x <- data.frame(x1=factor(z[ ,1]),x2=factor(z[ ,2]),x3=factor(z[ ,3]))
y <- sample(c(0,1),n,replace=TRUE)
fit <- step.plr(x,y)
# 'level' is automatically generated. Check 'fit$level'.

Does that answer your question?
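If "level" ever has to be supplied by hand, the help text quoted in the question suggests a list with one entry per column of x, holding the full set of levels for each factor. A minimal base-R sketch of that structure follows; the data frame and column names are hypothetical, the NULL entry for a continuous column is an assumption, and whether step.plr() accepts the list in exactly this form should be checked against `fit$level` from an automatic fit.

```r
# Hypothetical data frame: two factors and one continuous column.
# x2 declares a level "c" that is never observed in the data.
x <- data.frame(x1 = factor(c(1, 2, 3, 1)),
                x2 = factor(c("a", "b", "a", "b"), levels = c("a", "b", "c")),
                x3 = c(0.5, 1.2, -0.3, 2.0))

# One list entry per column: every declared level for a factor,
# NULL for a non-factor column (assumed convention).
level <- lapply(x, function(col) if (is.factor(col)) levels(col) else NULL)
str(level)
```

Read literally, the quoted help text points to listing all levels of a factor (all 4 in the question's example) rather than a hand-picked subset.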
