Solved – Improving the SVM classification of diabetes

classification, e1071, feature selection, r, svm

I am using an SVM to predict diabetes from the BRFSS data set. The data set has dimensions $432607 \times 136$ and is skewed: Ys make up $11\%$ of the target variable, while Ns constitute the remaining $89\%$.

I am using only 15 of the 136 independent variables. One reason for reducing the data set was to retain more training samples once rows containing NAs are omitted.

These 15 variables were selected by running methods such as random trees and logistic regression and identifying which variables are significant in the resulting models. For example, after running logistic regression we ordered the variables by p-value.
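Concretely, the logistic-regression step looked roughly like this (a sketch; it assumes DIABETE2 has already been recoded to a two-level factor, as in the implementation below):

fit <- glm(DIABETE2 ~ ., data = diabetes, family = binomial);

# order predictors by the p-values of their coefficients
pvals <- summary(fit)$coefficients[-1, "Pr(>|z|)"];  # drop the intercept row
print(sort(pvals));                                  # smallest p-value = most significant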

Is my method of variable selection correct? Any suggestions are greatly welcome.

The following is my R implementation.

library(e1071) # Support Vector Machines

#--------------------------------------------------------------------
# read brfss file (huge 135 MB file)
#--------------------------------------------------------------------
y <- read.csv("http://www.hofroe.net/stat579/brfss%2009/brfss-2009-clean.csv")
indicator <- c("DIABETE2", "GENHLTH", "PERSDOC2", "SEX", "FLUSHOT3", "PNEUVAC3", 
    "X_RFHYPE5", "X_RFCHOL", "RACE2", "X_SMOKER3", "X_AGE_G", "X_BMI4CAT", 
    "X_INCOMG", "X_RFDRHV3", "X_STATE");
target <- "DIABETE2";
diabetes <- y[, indicator];

#--------------------------------------------------------------------
# recode DIABETE2
#--------------------------------------------------------------------
x <- diabetes$DIABETE2;
x[x > 1]  <- 'N';    # codes 2 and above (non-diabetic and unknown) -> N
x[x != 'N'] <- 'Y';  # remaining code 1 ('yes') -> Y
diabetes$DIABETE2 <- factor(x, levels = c('N', 'Y'));  # factor target => classification
rm(x);

#--------------------------------------------------------------------
# remove NA
#--------------------------------------------------------------------
diabetes <- na.omit(diabetes);  # drop rows containing NAs

#--------------------------------------------------------------------
# reproducible research 
#--------------------------------------------------------------------
set.seed(1612);
nsamples <- 1000;  # small subsample so tuning and training run quickly
sample.diabetes <- diabetes[sample(nrow(diabetes), nsamples), ]; 

#--------------------------------------------------------------------
# split the dataset into training and test
#--------------------------------------------------------------------
ratio <- 0.7;
train.samples <- ratio*nsamples;
train.rows <- sample(nrow(sample.diabetes), trunc(train.samples));

train.set  <- sample.diabetes[train.rows, ];
test.set   <- sample.diabetes[-train.rows, ];

train.result <- train.set[ , which(names(train.set) == target)];
test.result  <- test.set[ , which(names(test.set) == target)];

#--------------------------------------------------------------------
# SVM 
#--------------------------------------------------------------------
formula <- DIABETE2 ~ .;  # DIABETE2 is already a two-level factor, so svm() classifies
svm.tune <- tune.svm(formula, data = train.set, 
    gamma = 10^(-3:0), cost = 10^(-1:1));
# fit with the same (default radial) kernel that tune.svm searched;
# a linear kernel would ignore the tuned gamma
svm.model <- svm(formula, data = train.set, 
    gamma = svm.tune$best.parameters$gamma, 
    cost  = svm.tune$best.parameters$cost);

#--------------------------------------------------------------------
# Confusion matrix
#--------------------------------------------------------------------
train.pred <- predict(svm.model, train.set);
test.pred  <- predict(svm.model, test.set);
svm.table <- table(pred = test.pred, true = test.result);
print(svm.table);

I ran with $1000$ samples (training = $700$, test = $300$) since that is faster on my laptop. The confusion matrix I get for the test data ($300$ samples) is quite bad:

    true
pred   N   Y
   N 262  38
   Y   0   0
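Computing the per-class accuracy straight from svm.table makes the failure explicit: the model never predicts Y at all.

sensitivity <- svm.table["Y", "Y"] / sum(svm.table[, "Y"]);  # 0/38 = 0.00
specificity <- svm.table["N", "N"] / sum(svm.table[, "N"]);  # 262/262 = 1.00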

I need to improve my prediction of the Y class. In fact, I need to be as accurate as possible on Y even if that means performing poorly on N. Any suggestions for improving the classification would be greatly appreciated.

Best Answer

I have 4 suggestions:

  1. How are you choosing the variables to include in your model? Maybe you are missing some of the key indicators from the larger data set.
  2. Almost all of the indicators you are using (such as sex, smoker, etc.) should be treated as factors. Treating these variables as numeric is wrong, and is probably contributing to the error in your model (see the sketch after this list).
  3. Why are you using an SVM? Did you try any simpler methods, such as linear discriminant analysis or even linear regression? Maybe a simple approach on a larger dataset will yield a better result.
  4. Try the caret package. It will help you cross-validate model accuracy, it is parallelized which will let you work faster, and it makes it easy to explore different types of models.
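For point 2, here is a minimal sketch of the factor conversion. The list of categorical columns is my reading of the variables named above; adjust it to match the BRFSS codebook, and do the conversion before the train/test split so both sets share the same factor levels.

cat.vars <- c("GENHLTH", "PERSDOC2", "SEX", "FLUSHOT3", "PNEUVAC3",
    "X_RFHYPE5", "X_RFCHOL", "RACE2", "X_SMOKER3", "X_AGE_G",
    "X_BMI4CAT", "X_INCOMG", "X_RFDRHV3", "X_STATE")

#convert the numeric survey codes to factors
diabetes[cat.vars] <- lapply(diabetes[cat.vars], factor)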

Here is some example code for caret:

library(caret)

#Parallelize (doParallel replaces the defunct doSMP package)
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

#Build model
X <- train.set[,-1]  # predictors; DIABETE2 is the first column
Y <- factor(train.set[,1],levels=c('N','Y'))  # response as a two-level factor
model <- train(X,Y,method='lda')

#Evaluate model on test set
print(model)
predY <- predict(model,test.set[,-1])
confusionMatrix(predY,test.set[,1])
stopCluster(cl)

This LDA model beats your SVM, and I didn't even fix your factors. I'm sure if you recode Sex, Smoker, etc. as factors, you will get better results.
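One more knob worth trying, since you care far more about catching Y than N: e1071's svm() takes a class.weights argument that makes errors on a chosen class cost more. A rough sketch on top of the tuned model above (the 8:1 weight mirrors the 89:11 class imbalance and is only a starting point, not a tuned value):

svm.weighted <- svm(DIABETE2 ~ ., data = train.set,
    gamma = svm.tune$best.parameters$gamma,
    cost  = svm.tune$best.parameters$cost,
    class.weights = c(N = 1, Y = 8))  #mistakes on Y cost 8 times as much

test.pred.w <- predict(svm.weighted, test.set)
table(pred = test.pred.w, true = test.set$DIABETE2)

This shifts the decision boundary toward Y, trading N accuracy for Y sensitivity, which is exactly the trade-off you describe.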