MATLAB: Classification by logistic regression

classification, regression

I am new to classification and am stuck on a problem while implementing logistic regression:

My data set consists of about 300 measurements with 20 features. I fitted a logistic regression model using glmfit and obtained the predicted probability (Y) values. Next, I used the model output (Y) to generate an ROC curve, which gives me the sensitivity and specificity of the model.

(1) I am using the entire data set for both training and testing. Is that correct? If not, how can I validate my model? Is there a way to tell whether I am overfitting by using all the features?

(2) I have tried to implement k-fold cross-validation (k = 10) by running logistic regression and computing the sensitivity/specificity on the test set 10 times. My concern is that I am creating a new model for each of the 10 training sets, so in the end I do not have a single classifier.

Thanks,

Vikrant

Best Answer

Because logistic regression is a simple linear model and because you have more than 10 times as many observations as predictors (about 300 versus 20), the classification error measured on the training set should not be far off the true value. Even so, it is best to validate your model on data not used for training. 300 observations are not a lot, so rather than setting aside a separate test set you would likely be better off cross-validating the classification error and ROC curve.

10-fold stratified cross-validation is a good rule of thumb. This is what you get from function CROSSVAL by default. Several runs of 10-fold cross-validation would be even better.
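On the concern about ending up with ten models: one common approach is to use the folds only to estimate performance, and then refit a single model on all of the data as the classifier to keep. The sketch below assumes that approach. The synthetic X and y merely stand in for the 300-by-20 data set, and names such as scores and bFinal are illustrative, not from the original post.

```matlab
% Synthetic stand-in for the asker's data: 300 observations, 20 features.
rng(1);                                   % reproducible example
n = 300; p = 20;
X = randn(n,p);
eta = X(:,1) - 0.5*X(:,2);                % assumed "true" linear predictor
y = double(rand(n,1) < 1./(1+exp(-eta))); % binary 0/1 labels

% Passing the labels as the grouping variable makes the folds stratified.
cp = cvpartition(y,'kfold',10);
scores = zeros(n,1);                      % out-of-fold predicted probabilities

for j = 1:cp.NumTestSets
    tr = training(cp,j);                  % logical index of training rows
    te = test(cp,j);                      % logical index of held-out rows
    b = glmfit(X(tr,:), y(tr), 'binomial', 'link', 'logit');
    scores(te) = glmval(b, X(te,:), 'logit');
end

% One pooled ROC curve built only from predictions on held-out data.
[fpr,tpr,~,auc] = perfcurve(y, scores, 1);
plot(fpr,tpr)
xlabel('False positive rate'); ylabel('True positive rate')

% Single final classifier, refit on all of the data.
bFinal = glmfit(X, y, 'binomial', 'link', 'logit');
```

Repeating the loop with several different partitions and averaging the results would give the "several runs" variant mentioned above.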

The Hosmer-Lemeshow goodness-of-fit test is often used for logistic regression models. It is described in many places.

You can use SEQUENTIALFS (with cross-validation) to see if you need all predictors.
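A minimal sketch of how SEQUENTIALFS might be combined with cross-validation for a logistic model is shown here. The data are synthetic, and the 0.5 threshold and the criterion function critfun are assumptions made for illustration.

```matlab
% Synthetic data: only the first two of 20 predictors carry signal.
rng(1);
n = 300; p = 20;
X = randn(n,p);
eta = 2*X(:,1) - 2*X(:,2);
y = double(rand(n,1) < 1./(1+exp(-eta)));

% Criterion: number of misclassified test observations at a 0.5 threshold.
% sequentialfs sums this over the folds and divides by the total number
% of test observations, which gives a misclassification rate.
critfun = @(Xtr,ytr,Xte,yte) ...
    sum(yte ~= (glmval(glmfit(Xtr,ytr,'binomial'),Xte,'logit') > 0.5));

cp = cvpartition(y,'kfold',10);
[keep,history] = sequentialfs(critfun, X, y, 'cv', cp);
selected = find(keep)                     % indices of the retained predictors
```

If only a few columns are selected and the cross-validated criterion stops improving, that suggests the remaining predictors are not needed.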

Logistic regression and cross-validation are described in many textbooks, by the way.

Related Solutions

MATLAB: How to create an ROC plot from the set of cross-validation models using Statistics Toolbox 7.5 (R2011a)

To plot an ROC curve, use the posterior probabilities computed by 'NaiveBayes' as the classification scores; the curve is traced out by varying the decision threshold over these scores. You can plot ROC curves using PERFCURVE, which computes the ROC curve for a vector of classifier predictions, given the true class. More details about PERFCURVE are available via the following link:

http://www.mathworks.com/help/toolbox/stats/perfcurve.html

Here is a small example of how to plot an ROC curve without cross-validation:

       load fisheriris
       x = meas(51:end,1:2);
       y = species(51:end);
       nb = NaiveBayes.fit(x,y);
       p = posterior(nb,x);
       [X,Y] = perfcurve(y,p(:,1),'versicolor');
       plot(X,Y)
       xlabel('False positive rate'); ylabel('True positive rate')
       title('ROC for classification by naïve Bayes')

To use 10-fold cross-validation, you can fit the model on 90% of the data, and compute predicted posteriors for the remaining 10% of data which was not used for fitting. You can then loop over each of the 10 subsets to plot the ROC curves for individual runs.
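The loop described above could be sketched as follows. It reuses the NaiveBayes.fit and posterior calls from the example above, so it assumes a release in which that class is available; newer releases provide fitcnb and predict for the same purpose. The variable names tr and te are illustrative.

```matlab
load fisheriris
x = meas(51:end,1:2);
y = species(51:end);

cp = cvpartition(y,'kfold',10);           % stratified 10-fold partition
figure; hold on
for j = 1:cp.NumTestSets
    tr = training(cp,j);                  % 90% of the data used for fitting
    te = test(cp,j);                      % remaining 10% held out
    nb = NaiveBayes.fit(x(tr,:), y(tr));
    p = posterior(nb, x(te,:));           % posteriors for held-out rows only
    [X,Y] = perfcurve(y(te), p(:,1), 'versicolor');
    plot(X,Y)                             % one ROC curve per fold
end
hold off
xlabel('False positive rate'); ylabel('True positive rate')
```

With only 10 test observations per fold the individual curves are coarse; pooling the held-out posteriors into one vector and calling PERFCURVE once is an alternative.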

MATLAB: How do you obtain a vector of predicted classes generated after cross-validation of a decision tree

Consider this:

load fisheriris
cp = cvpartition(species,'k',10);
F = @(xtr,ytr,xtest,ytest){ytest, eval(classregtree(xtr,ytr),xtest)};
result = crossval(F,meas,species,'partition',cp)

This uses the other syntax of crossval. The function F packages up the observed and fitted Y values for each fold of the 10-fold cross-validation. The result is a cell array with ten rows, one for each fold, with each row containing the corresponding observed and fitted values.

Alternatively you could write the cross-validation loop yourself. Loop over j and get test(cp,j) and training(cp,j) each time.
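A hand-written loop along those lines might look like the sketch below. It keeps the classregtree and eval calls used above, and the names predicted and cvError are illustrative. Because each observation falls in exactly one test fold, the loop fills a single vector of predicted classes in the original row order.

```matlab
load fisheriris
cp = cvpartition(species,'kfold',10);
predicted = cell(size(species));          % one predicted class per observation

for j = 1:cp.NumTestSets
    tr = training(cp,j);                  % logical index of training rows
    te = test(cp,j);                      % logical index of test rows
    t = classregtree(meas(tr,:), species(tr));
    predicted(te) = eval(t, meas(te,:));  % predictions for held-out rows
end

% Cross-validated misclassification rate
cvError = mean(~strcmp(predicted, species))
```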
