MATLAB: Classification by logistic regression

classification, regression

I am new to classification and am stuck on a problem while implementing logistic regression:
My data set consists of about 300 measurements with 20 features. I fitted a logistic regression model using glmfit and obtained the predicted probabilities (Y). Next, I used the model output (Y) to generate an ROC curve, which gives me the sensitivity and specificity of the model/technique.
(1) I am using the entire data set for both training and testing. Is that correct? If not, how can I validate my model? Is there a way to check whether I am overfitting by using all the features?
(2) I have tried to implement k-fold cross-validation (k = 10) by running logistic regression and computing the sensitivity/specificity on the test set 10 times. My concern is that I am creating a new model for each of the 10 training sets, so in the end I do not have a single classifier.
Thanks,
Vikrant

Best Answer

Because logistic regression is a simple linear model and because you have more than 10 times as many observations as predictors (300 versus 20), the classification error measured on the training set should not be far off the true value. Even so, it is best to validate your model on data not used for training. 300 observations are not a lot, so rather than setting aside a separate test set you would likely be better off cross-validating the classification error and ROC curve.
10-fold stratified cross-validation is a good rule of thumb. This is what you get from function CROSSVAL by default. Several runs of 10-fold cross-validation would be even better.
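As a rough illustration of cross-validating the error and ROC curve, here is a minimal sketch. It assumes synthetic stand-in data (X, y), a 0.5 probability cutoff, and a manual loop over a stratified cvpartition with glmfit, glmval and perfcurve rather than a call to CROSSVAL itself; the variable names are made up for the example. The folds are used only to estimate performance, and the classifier you keep is the one fitted to all of the data at the end.

```matlab
% Synthetic stand-in data (assumption): 300 observations, 20 predictors.
rng(1);                                   % reproducible data and folds
X = randn(300, 20);
y = double(X(:,1) - 0.5*X(:,2) + randn(300,1) > 0);   % binary 0/1 response

% Passing the class labels to cvpartition gives stratified folds,
% i.e. roughly the same class proportions in every fold.
cvp = cvpartition(y, 'KFold', 10);

scores = zeros(size(y));                  % out-of-fold predicted probabilities
for k = 1:cvp.NumTestSets
    trainIdx = training(cvp, k);          % logical index of training rows
    testIdx  = test(cvp, k);              % logical index of held-out rows
    b = glmfit(X(trainIdx,:), y(trainIdx), 'binomial', 'link', 'logit');
    scores(testIdx) = glmval(b, X(testIdx,:), 'logit');
end

% One ROC curve from the pooled out-of-fold scores; class 1 is "positive".
[fpr, tpr, ~, auc] = perfcurve(y, scores, 1);
cvError = mean((scores > 0.5) ~= y);      % misclassification rate at 0.5 cutoff
fprintf('CV AUC = %.3f, CV error = %.3f\n', auc, cvError);

% The single classifier to report is the fit on all observations.
bFinal = glmfit(X, y, 'binomial', 'link', 'logit');
```

To follow the suggestion of several runs, wrap the partition and loop in an outer loop (creating a new cvpartition each time) and average the resulting error and AUC values.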
The Hosmer-Lemeshow goodness-of-fit test is often used for logistic regression models. It is described in many places.
You can use SEQUENTIALFS (with cross-validation) to see if you need all predictors.
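A minimal sketch of that feature-selection step, again on synthetic stand-in data. The criterion function critfun, the 0.5 cutoff, and the choice of misclassification count as the criterion are assumptions made for the example; other criteria (for instance deviance) would work as well.

```matlab
% Synthetic stand-in data (assumption): 300 observations, 20 predictors.
rng(2);
X = randn(300, 20);
y = double(X(:,1) - 0.5*X(:,2) + randn(300,1) > 0);   % binary 0/1 response

% Criterion for sequentialfs: fit on the training part, then return the
% NUMBER of misclassified test observations. sequentialfs sums these
% over the folds and divides by the total number of test observations.
critfun = @(Xtr, ytr, Xte, yte) ...
    sum(yte ~= (glmval(glmfit(Xtr, ytr, 'binomial'), Xte, 'logit') > 0.5));

% Stratified 10-fold partition used to evaluate each candidate subset.
cvp = cvpartition(y, 'KFold', 10);

% Forward selection: predictors are added while the criterion improves.
[selected, history] = sequentialfs(critfun, X, y, 'cv', cvp);

disp(find(selected));   % column indices of the predictors that were kept
```

If only a few predictors are selected and the cross-validated error is no worse than with all 20, that suggests the remaining predictors are not needed.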
Logistic regression and cross-validation are described in many textbooks, by the way.