Solved – Improving the SVM classification of diabetes

classification, e1071, feature selection, r, svm

I am using an SVM to predict diabetes from the BRFSS data set. The data set has dimensions $432607 \times 136$ and is imbalanced: Ys make up $11\%$ of the target variable, while Ns constitute the remaining $89\%$.

I am using only 15 out of 136 independent variables from the data set. One of the reasons for reducing the data set was to have more training samples when rows containing NAs are omitted.

These 15 variables were selected after running statistical methods such as random trees, logistic regression and finding out which variables are significant from the resulting models. For example, after running logistic regression we used p-value to order the most significant variables.

Is my method of variable selection correct? Any suggestions are welcome.

The following is my R implementation.

library(e1071) # Support Vector Machines

#--------------------------------------------------------------------
# read brfss file (huge 135 MB file)
#--------------------------------------------------------------------
y <- read.csv("http://www.hofroe.net/stat579/brfss%2009/brfss-2009-clean.csv")
# NOTE: X_RFDRHV3 appears twice below, so only 14 distinct predictors are selected
indicator <- c("DIABETE2", "GENHLTH", "PERSDOC2", "SEX", "FLUSHOT3", "PNEUVAC3", 
    "X_RFHYPE5", "X_RFCHOL", "RACE2", "X_SMOKER3", "X_AGE_G", "X_BMI4CAT", 
    "X_INCOMG", "X_RFDRHV3", "X_RFDRHV3", "X_STATE");
target <- "DIABETE2";
diabetes <- y[, indicator];

#--------------------------------------------------------------------
# recode DIABETE2
#--------------------------------------------------------------------
# codes greater than 1 become 'N', everything else becomes 'Y'; NAs stay NA
diabetes$DIABETE2 <- ifelse(diabetes$DIABETE2 > 1, 'N', 'Y');

#--------------------------------------------------------------------
# remove NA
#--------------------------------------------------------------------
diabetes <- na.omit(diabetes);

#--------------------------------------------------------------------
# reproducible research 
#--------------------------------------------------------------------
set.seed(1612);
nsamples <- 1000; 
sample.diabetes <- diabetes[sample(nrow(diabetes), nsamples), ]; 

#--------------------------------------------------------------------
# split the dataset into training and test
#--------------------------------------------------------------------
ratio <- 0.7;
train.samples <- ratio*nsamples;
train.rows <- c(sample(nrow(sample.diabetes), trunc(train.samples)));

train.set  <- sample.diabetes[train.rows, ];
test.set   <- sample.diabetes[-train.rows, ];

train.result <- train.set[ , which(names(train.set) == target)];
test.result  <- test.set[ , which(names(test.set) == target)];

#--------------------------------------------------------------------
# SVM 
#--------------------------------------------------------------------
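formula <- as.formula(factor(DIABETE2) ~ . );
# tune.svm() runs this grid search with svm()'s default radial kernel
svm.tune <- tune.svm(formula, data = train.set, 
    gamma = 10^(-3:0), cost = 10^(-1:1));
# the final model uses a linear kernel, which ignores gamma,
# so only the tuned cost carries over from the search above
svm.model <- svm(formula, data = train.set, 
    kernel = "linear", 
    gamma = svm.tune$best.parameters$gamma, 
    cost  = svm.tune$best.parameters$cost);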

#--------------------------------------------------------------------
# Confusion matrix
#--------------------------------------------------------------------
train.pred <- predict(svm.model, train.set);
test.pred  <- predict(svm.model, test.set);
svm.table <- table(pred = test.pred, true = test.result);
print(svm.table);

I ran with $1000$ samples (training = $700$ and test = $300$) since that is faster on my laptop. The confusion matrix I get for the test data ($300$ samples) is quite bad: every case is predicted as N.

    true
pred   N   Y
   N 262  38
   Y   0   0
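
To make the problem with this table concrete, here is a minimal base-R sketch that recomputes accuracy, sensitivity and specificity from the four counts shown above. The matrix is typed in by hand from the printed table; the variable names are illustrative only.

```r
# Confusion matrix from the printed table (rows = predicted, columns = true)
svm.table <- matrix(c(262, 0, 38, 0), nrow = 2,
                    dimnames = list(pred = c("N", "Y"), true = c("N", "Y")))

# Share of all test cases that were classified correctly
accuracy <- sum(diag(svm.table)) / sum(svm.table)

# Share of true Y cases that were predicted as Y
sensitivity <- svm.table["Y", "Y"] / sum(svm.table[, "Y"])

# Share of true N cases that were predicted as N
specificity <- svm.table["N", "N"] / sum(svm.table[, "N"])

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 3))
```

Accuracy alone looks acceptable because of the class imbalance, while the sensitivity for Y shows that no diabetic case was detected.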

I need to improve my prediction for the Y class. In fact, I need to be as accurate as possible with Y even if I perform poorly with N. Any suggestions to improve the accuracy of classification would be greatly appreciated.

Best Answer

I have 4 suggestions:

1. How are you choosing the variables to include in your model? Maybe you are missing some of the key indicators from the larger dataset.
2. Almost all of the indicators you are using (such as sex, smoker, etc.) should be treated as factors. Treating these variables as numeric is wrong, and is probably contributing to the error in your model.
3. Why are you using an SVM? Did you try any simpler methods, such as linear discriminant analysis or even linear regression? Maybe a simple approach on a larger dataset will yield a better result.
4. Try the caret package. It will help you cross-validate model accuracy, it is parallelized, which will let you work faster, and it makes it easy to explore different types of models.
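
The suggestion about factors can be sketched in base R as follows. The toy data frame and the choice of which columns count as categorical are assumptions made for illustration; in the real data the BRFSS codebook should decide which codes are categories rather than quantities.

```r
# Toy stand-in for a few BRFSS columns (values are made up for illustration)
toy <- data.frame(SEX       = c(1, 2, 2, 1),
                  X_SMOKER3 = c(1, 4, 3, 4),
                  X_AGE_G   = c(2, 6, 5, 3))

# Assumed list of columns whose numeric codes are category labels
categorical <- c("SEX", "X_SMOKER3")

# Convert only those columns to factors; the others stay numeric
toy[categorical] <- lapply(toy[categorical], factor)

str(toy)
```

Once the columns are factors, model functions that take a formula generally expand them into indicator variables instead of treating the codes as magnitudes.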

Here is some example code for caret:

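library(caret)

# Parallelize (doSMP is no longer on CRAN; doParallel is used here instead)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)

# Build model (column 1 of train.set is DIABETE2)
X <- train.set[,-1]
Y <- factor(train.set[,1],levels=c('N','Y'))
model <- train(X,Y,method='lda')

# Evaluate model on test set
print(model)
predY <- predict(model,test.set[,-1])
# confusionMatrix() expects two factors with the same levels
confusionMatrix(predY,factor(test.set[,1],levels=c('N','Y')))
stopCluster(cl)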

This LDA model beats your SVM, and I didn't even fix your factors. I'm sure if you recode Sex, Smoker, etc. as factors, you will get better results.
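
If you stay with the SVM, the class imbalance itself is worth addressing. A minimal sketch, assuming you want weights inversely proportional to class frequency: the helper `inverse_class_weights` and the toy label vector are hypothetical, while `class.weights` is the e1071 `svm()` argument that accepts a named weight vector.

```r
# Weight each class inversely to its frequency; the majority class gets weight 1
inverse_class_weights <- function(y) {
  counts <- table(y)
  setNames(as.numeric(max(counts) / counts), names(counts))
}

# Toy labels mirroring the 89% / 11% split described in the question
y_toy <- c(rep("N", 89), rep("Y", 11))
w <- inverse_class_weights(y_toy)
print(w)

# With e1071 loaded, the weights could be passed to svm() like this
# (not run here; train.set is the data frame from the question):
# svm.model <- svm(factor(DIABETE2) ~ ., data = train.set, class.weights = w)
```

Larger weights on Y make misclassifying a Y case more costly during training, which usually trades some accuracy on N for better recall on Y. The right ratio is something to tune, not a fixed rule.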

Related Solutions

Cross-validation – How to Apply Cross-Validation for Selecting SVM Parameters

If you learn the hyper-parameters in the full training data and then cross-validate, you will get an optimistically biased performance estimate, because the test data in each fold will already have been used in setting the hyper-parameters, so the hyper-parameters selected are selected in part because they suit the data in the test set. The optimistic bias introduced in this way can be unexpectedly large. See Cawley and Talbot, "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation", JMLR 11(Jul):2079−2107, 2010. (Particularly section 5.3). The best thing to do is nested cross-validation. The basic idea is that you cross-validate the entire method used to generate the model, so treat model selection (choosing the hyper-parameters) as simply part of the model fitting procedure (where the parameters are determined) and you can't go too far wrong.

If you use cross-validation on the training set to determine the hyper-parameters and then evaluate the performance of a model trained using those parameters on the whole training set, using a separate test set, that is also fine (provided you have enough data for reliably fitting the model and estimating performance using disjoint partitions).
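
A minimal sketch of nested cross-validation, under loud assumptions: it swaps the SVM for k-nearest neighbours (`knn()` from the `class` package that ships with standard R installs) and uses the built-in `iris` data, purely so the example runs without extra packages. The helper names `make_folds` and `cv_accuracy` are illustrative. The structure is what matters: the hyper-parameter is chosen inside each outer training fold, never using the outer test fold.

```r
library(class)  # provides knn()

set.seed(1)
data <- iris[iris$Species != "setosa", ]   # two-class toy problem
X <- as.matrix(data[, 1:4])
y <- droplevels(data$Species)

# Randomly assign each of n rows to one of k folds
make_folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

# Mean k-NN accuracy over k_folds folds for a given number of neighbours
cv_accuracy <- function(X, y, neighbours, k_folds) {
  folds <- make_folds(nrow(X), k_folds)
  mean(sapply(seq_len(k_folds), function(f) {
    pred <- knn(X[folds != f, , drop = FALSE], X[folds == f, , drop = FALSE],
                y[folds != f], k = neighbours)
    mean(pred == y[folds == f])
  }))
}

grid <- c(1, 3, 5, 7)                      # candidate hyper-parameter values
outer_folds <- make_folds(nrow(X), 5)

outer_acc <- sapply(1:5, function(f) {
  tr <- outer_folds != f
  # Inner loop: select the hyper-parameter using outer-training rows only
  inner <- sapply(grid, function(g) cv_accuracy(X[tr, ], y[tr], g, 5))
  best <- grid[which.max(inner)]
  # Outer loop: score the selected setting on the untouched outer fold
  pred <- knn(X[tr, ], X[!tr, , drop = FALSE], y[tr], k = best)
  mean(pred == y[!tr])
})

print(mean(outer_acc))  # performance estimate for the whole procedure
```

The same skeleton applies to an SVM by replacing the `knn()` calls with a fit and predict step and letting `grid` hold the candidate `cost` and `gamma` values.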

Solved – Plotting the decision boundary of a kernel SVM (RBF)

I figured out what needed to be done. Actually, it's something simple, but it seems I had a MATLAB-related bug... Here is the code and the resulting figure for the "XOR" binary classification problem.

gamma     = getGamma();
b         = getB();
points_x1 = linspace(xLimits(1), xLimits(2), 100);
points_x2 = linspace(yLimits(1), yLimits(2), 100);
[X1, X2]  = meshgrid(points_x1, points_x2);

% Initialize f with the bias term; rows index x2 and columns index x1,
% matching the layout returned by meshgrid
f = ones(length(points_x2), length(points_x1))*b;

% Iterate over all support vectors
for i=1:N_sv
    alpha_i = getAlpha(i);
    sv_i    = getSV(i);
    y_i     = getY(i);  % label (+1/-1) of the i-th SV; getY is a hypothetical accessor
    for j=1:length(points_x1)
        for k=1:length(points_x2)
            x = [points_x1(j);points_x2(k)];
            f(k,j) = f(k,j) + alpha_i*y_i*kernel_func(gamma, x, sv_i);
        end
    end    
end

% Decision surface
surf(X1,X2,f);
shading interp;
lighting phong;
alpha(.6)

% Decision regions in a separate figure
figure;
contourf(X1, X2, f, 1);

where the function

function k = kernel_func(gamma, x, x_i)
    k = exp(-gamma*norm(x - x_i)^2);
end

just produces the kernel function (RBF kernel), $k(\mathbf{x},\mathbf{x}')=\operatorname{exp}\left(-\gamma\lVert\mathbf{x}-\mathbf{x}'\rVert^2\right)$.
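
For readers working in R, the same grid evaluation can be sketched in base R. Everything below is an assumption made for illustration: the four support vectors, their labels, the multipliers `alpha`, and the bias `b` are made-up values for a symmetric XOR layout, not the output of a trained model.

```r
# RBF kernel between two points
rbf_kernel <- function(gamma, x, x_i) exp(-gamma * sum((x - x_i)^2))

# Decision value f(x) = sum_i alpha_i * y_i * k(x, sv_i) + b
decision_value <- function(x, sv, alpha, y, b, gamma) {
  k <- apply(sv, 1, function(s) rbf_kernel(gamma, x, s))
  sum(alpha * y * k) + b
}

# Hypothetical support vectors for an XOR layout (one per row)
sv    <- rbind(c(1, 1), c(-1, -1), c(1, -1), c(-1, 1))
y     <- c(1, 1, -1, -1)
alpha <- rep(1, 4)
b     <- 0
gamma <- 4

# Evaluate f on a grid; f[i, j] corresponds to (grid_x1[i], grid_x2[j])
grid_x1 <- seq(-2, 2, length.out = 50)
grid_x2 <- seq(-2, 2, length.out = 50)
f <- outer(grid_x1, grid_x2, Vectorize(function(p, q) {
  decision_value(c(p, q), sv, alpha, y, b, gamma)
}))

# The decision boundary is the zero-level contour:
# contour(grid_x1, grid_x2, f, levels = 0)
```

Because `outer()` puts the first grid along the rows, the matrix can be passed to `contour()` or `image()` directly, with no transpose.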

Here is the result for the XOR problem. Here $\gamma=4$.

[Figure: decision function of the RBF-kernel SVM for the XOR problem]
