Solved – Regression selection using all possible subsets selection and automatic selection techniques

model selection, r, regression, stepwise regression

Given the dataset cars.txt, we want to build a good regression model for the Midrange Price using the variables Horsepower, Length, Luggage, Uturn, Wheelbase, and Width, in two ways:

  1. using all possible subsets selection, and
  2. using an automatic selection technique.

For the first part, we do in R:

library(leaps)  # provides leaps()

cars <- read.table(file=file.choose(), header=TRUE)
names(cars)

# all possible subsets: best 3 models of each size, ranked by R^2
x <- cbind(cars$Horsepower, cars$Length, cars$Luggage, cars$Uturn, cars$Wheelbase, cars$Width)
vars <- c("horsep","length","Luggage","Uturn","Wheelbase","Width")
leap <- leaps(x=x, y=cars$MidrangePrice, method="r2", nbest=3)
combine <- cbind(leap$which, leap$size, leap$r2)
n <- length(leap$size)
dimnames(combine) <- list(1:n, c(vars,"size","r2"))
round(combine, digits=3)

# same search, ranked by Mallows' Cp
leap.cp <- leaps(x=x, y=cars$MidrangePrice, method="Cp", nbest=3)
combine.cp <- cbind(leap.cp$which, leap.cp$size, leap.cp$Cp)
dimnames(combine.cp) <- list(1:n, c(vars,"size","cp"))
round(combine.cp, digits=3)

# Cp against the number of parameters (size includes the intercept)
plot(leap.cp$size, leap.cp$Cp, ylim=c(1,7))
abline(a=0, b=1)  # reference line Cp = p

Am I correct in my interpretation that the most adequate model is one with 4 parameters (the three variables Horsepower, Wheelbase and Width) because it has the lowest Mallows' Cp value?
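
As a reference point for reading the Cp column: Mallows' Cp for a submodel with p parameters (intercept included) is RSS_p / s^2 - n + 2p, where s^2 is the residual variance estimate from the full model. The sketch below computes it by hand in base R for every subset. It uses the built-in mtcars data as a stand-in, since cars.txt is not included here, and the choice of columns is purely illustrative.

```r
# Stand-in data (assumption): built-in mtcars, NOT cars.txt
dat <- mtcars[, c("mpg", "hp", "wt", "disp", "qsec")]
response <- "mpg"
predictors <- setdiff(names(dat), response)
n <- nrow(dat)

# Residual variance estimate s^2 from the full model
full <- lm(reformulate(predictors, response), data = dat)
sigma2 <- sum(resid(full)^2) / df.residual(full)

# Every non-empty subset of the predictors
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Mallows' Cp for each subset model
cp_table <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response), data = dat)
  p <- length(vars) + 1                 # parameters, including the intercept
  rss <- sum(resid(fit)^2)
  data.frame(model = paste(vars, collapse = " + "),
             p = p,
             Cp = rss / sigma2 - n + 2 * p,
             stringsAsFactors = FALSE)
}))

cp_table[order(cp_table$Cp), ]
```

By construction the full model always has Cp equal to p, so it sits exactly on the reference line drawn by abline(a=0, b=1). For the smaller models, the distance from that line is informative, not only the raw Cp value.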

For the second part, we can choose between forward, backward, and stepwise selection:

#stepwise selection methods
#forward: start from the intercept-only model and add terms
slm.forward <- step(lm(cars$MidrangePrice ~ 1, data=cars), scope=~cars$Horsepower + cars$Length + cars$Luggage + cars$Uturn + cars$Wheelbase + cars$Width, direction="forward")

#backward: start from the full model and drop terms
reg.lm1 <- lm(cars$MidrangePrice ~ cars$Horsepower + cars$Length + cars$Luggage + cars$Uturn + cars$Wheelbase + cars$Width)
slm.backward <- step(reg.lm1, direction="backward")

#stepwise: start from the full model; terms can be dropped and later re-added
slm.stepwise <- step(reg.lm1, direction="both")

How do I interpret the results I get from this R code?

Best Answer

For the second part, read the output as the sequence of steps that leads to your final model.

For example, in the forward case you begin with:

Start:  AIC=377.95
cars$MidrangePrice ~ 1

                  Df Sum of Sq    RSS    AIC
+ cars$Horsepower  1    4979.3 3054.9 300.66
+ cars$Wheelbase   1    3172.3 4862.0 338.76
+ cars$Length      1    2448.8 5585.4 350.14
+ cars$Width       1    1969.2 6065.0 356.89
+ cars$Uturn       1    1450.2 6584.0 363.63
+ cars$Luggage     1    1079.6 6954.7 368.12
<none>                         8034.2 377.95

At this point your current model contains only the constant term: cars$MidrangePrice ~ 1.

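Each row in the table shows what you would get if you added that variable (for example, Horsepower) to the current model: the reduction in the residual sum of squares (Sum of Sq), the resulting RSS (Residual Sum of Squares), and the resulting AIC (Akaike Information Criterion). The <none> row is the current model left unchanged. The rows are sorted by AIC, and step() takes the change with the lowest AIC, so here Horsepower (AIC 300.66) is added first. The procedure then repeats from the new model and stops once <none> has the lowest AIC.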
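
The AIC column can be reproduced from the RSS column. For lm fits, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * p, with p the number of estimated coefficients. The sample size is not stated in this thread; n = 82 below is an assumption, chosen because it reproduces the printed values. The helper name step_aic is made up for this illustration.

```r
# Assumption: n = 82 observations (inferred from the printed output,
# not stated anywhere in the thread)
n <- 82

# AIC as reported by step() for lm fits: n * log(RSS / n) + 2 * p
step_aic <- function(rss, p, n) n * log(rss / n) + 2 * p

aic_null <- step_aic(rss = 8034.2, p = 1, n = n)  # intercept only (<none> row)
aic_hp   <- step_aic(rss = 3054.9, p = 2, n = n)  # intercept + Horsepower

round(c(null = aic_null, horsepower = aic_hp), 2)
```

Up to rounding of the printed RSS, these should agree with the 377.95 and 300.66 shown in the table. Note that this AIC is only defined up to an additive constant, so only differences between rows are meaningful.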

In the other cases you read the results the same way, except that rows starting with - show the effect of removing a variable from the current model.
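
If the step-by-step trace is hard to follow, the path can also be inspected after the fact. The sketch below is a minimal illustration on the built-in mtcars data (a stand-in, not cars.txt, with arbitrarily chosen predictors). It writes the formulas with bare column names plus data=, and reads the anova component that step() attaches to the returned model.

```r
# Stand-in data (assumption): built-in mtcars, NOT cars.txt
dat <- mtcars

null_fit <- lm(mpg ~ 1, data = dat)                       # intercept only
full_fit <- lm(mpg ~ hp + wt + disp + qsec, data = dat)   # all candidates

# trace = 0 suppresses the per-step tables printed to the console
fwd <- step(null_fit, scope = ~ hp + wt + disp + qsec,
            direction = "forward", trace = 0)
bwd <- step(full_fit, direction = "backward", trace = 0)

fwd$anova      # one row per step: the term added and the AIC reached
formula(fwd)   # final model from forward selection
formula(bwd)   # final model from backward elimination
```

Forward and backward selection are not guaranteed to end at the same model, so comparing formula(fwd) with formula(bwd) is a quick sanity check.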

Hope this helps :)
