Solved – Comparing different classifiers (using scikit-learn cross-validation values)

cross-validation, python, scikit-learn, statistical-significance

Thanks for taking the time to read this. I'm new to machine learning, so I'm working through a Kaggle competition to help me improve, but I have a question: how can I compare different classifiers?
My Python isn't as neat as I would like it to be, but I think it's correct. Please let me know if I'm doing something strange.

import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import cross_validation

# train, target, pca() and neighbours are defined earlier in the script
start = time.clock()

train = pca(train)
cfr = KNeighborsClassifier(n_neighbors=neighbours, algorithm="kd_tree")
cv = cross_validation.KFold(len(train), n_folds=10, indices=True)

results = []
count = 0
for traincv, testcv in cv:
    # fit on the training split, predict on the held-out fold
    ClassPred = cfr.fit(train[traincv], target[traincv]).predict(train[testcv])

    # count how many predictions in this fold match the true labels
    correct = 0
    for j in range(len(testcv)):
        if ClassPred[j] == target[testcv[j]]:
            correct += 1
    accuracy = (correct / float(len(testcv))) * 100

    results.append(accuracy)
    count += 1
    print "accuracy for fold", count, ":", accuracy, "%"
    print "time after fold", count, ":", (time.clock() - start), "s"

elapsed = time.clock() - start
# print out the mean of the cross-validated results
print "mean cross-validated accuracy:", np.array(results).mean()
print "Time taken is %ds" % elapsed

As you can see, I'm doing 10-fold cross-validation on the training data and hopefully producing an 'accuracy' value out of the other end. My question here is: how can I compare different classifiers?
If I choose a different classifier for cfr (e.g. a random forest) and get a value for that, how do I statistically compare the two?
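For what it's worth, I assume swapping the estimator is just a matter of changing the cfr line, something like this (a sketch only; the RandomForestClassifier settings here are placeholders, not from my script):

from sklearn.ensemble import RandomForestClassifier

# same cross-validation loop as above, just a different estimator bound to cfr
cfr = RandomForestClassifier(n_estimators=100)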

My initial thought is to use a two-tailed t-test on the per-fold values (so 10 values per classifier) rather than just the single value it currently outputs, which I think is an average over all the folds. I would then need a p-value to see whether the differences are significant. I am not sure how I would go about implementing this, however, or whether it is the correct thing to do; my rough guess is sketched below. My stats is a little patchy (which I'm working on), but any help anyone can give me would be much appreciated.
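From a quick look at SciPy, scipy.stats.ttest_rel looks like the paired test I have in mind. This is only a sketch with made-up accuracy values (acc_knn and acc_rf stand in for my real per-fold results):

from scipy import stats

# per-fold accuracies for the two classifiers (made-up placeholder values),
# paired so that entry j in each list comes from the same fold j
acc_knn = [91.2, 90.5, 92.0, 89.8, 91.5, 90.9, 92.3, 90.1, 91.0, 90.7]
acc_rf  = [93.1, 92.4, 93.8, 91.9, 93.0, 92.7, 94.0, 92.2, 93.3, 92.5]

# paired (related-samples) t-test across the 10 folds
t_stat, p_value = stats.ttest_rel(acc_knn, acc_rf)
print "paired t-test: t = %.3f, p = %.4f" % (t_stat, p_value)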

So to clarify: using Python and scikit-learn, what would be the best way to compare the performance of different classifiers on my data?
Thanks!

EDIT: The tutorial's assessment criterion is that 10-fold cross-validation must be used.

Best Answer

As you loop over the 10 folds, each fold returns an accuracy. That accuracy varies a little bit, depending on how you slice your data for each fold. The range of accuracies that you obtain represents the range of variability in the performance of your model that you might expect to see, if you brought that model to bear upon a brand new set of test data.

From the 10 accuracies that you obtain for each classifier, you can calculate a mean and a corrected sample standard deviation. You would like to know, for two classifiers with two different average accuracies, whether those differences are "significant", i.e., whether they are meaningful, or are simply due to the same random fold-to-fold fluctuation that you have already observed in the first place.
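As a concrete sketch (the variable names and numbers here are illustrative, not taken from your script), the mean and corrected sample standard deviation of one classifier's fold accuracies would be:

import numpy as np

# the 10 per-fold accuracies for one classifier (illustrative values)
acc = np.array([91.2, 90.5, 92.0, 89.8, 91.5, 90.9, 92.3, 90.1, 91.0, 90.7])

mean_acc = acc.mean()
std_acc = acc.std(ddof=1)   # ddof=1 gives the corrected (n-1) sample standard deviation
print "mean = %.2f, sd = %.2f" % (mean_acc, std_acc)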

If you have an average accuracy $A_{1}$ and standard deviation $\sigma_{1}$ for classifier number 1, and the same for some other classifier number 2, you can estimate whether the difference in their relative performance is meaningfully different from zero by calculating $$\Delta_{12} = \frac{A_{1} - A_{2}}{\sqrt{\sigma_{1}^{2} + \sigma_{2}^{2}}}$$ This quantity can be interpreted effectively as a kind of Z score. If the score is large (a value greater than 3 standard deviations is a common choice of cutoff), you may declare the performance of the two classifiers to be significantly different; if not, they are essentially equivalent.
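In code, that score is a one-liner once you have the two means and standard deviations from the previous step (again, the values below are only placeholders to substitute with your own fold statistics):

import numpy as np

# mean accuracy and corrected sample standard deviation for each classifier
# (placeholder numbers)
A1, sigma1 = 91.0, 0.8
A2, sigma2 = 92.9, 0.7

delta_12 = (A1 - A2) / np.sqrt(sigma1 ** 2 + sigma2 ** 2)
print "Delta_12 = %.2f" % delta_12
# |Delta_12| > 3 would suggest a significant difference under this rule of thumb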
