Solved – My test accuracy is pretty bad compared to my cross-validation accuracy

cross-validation, machine learning, random forest, regression

I am doing multi-class document classification. I divided the original dataset (18,8334 documents, given as a list of strings where each element of the list is one document) into 70% training and 30% test.

Then, on the 70% training set, I used scikit-learn's 5-fold cross-validation to train and evaluate three models: Gaussian Naive Bayes, Random Forests, and a multi-class SVM trained with stochastic gradient descent (SGD).

Stochastic gradient descent gave the highest cross-validated accuracy, 0.85, but the same model, when evaluated on the 30% test set, gives only 9% accuracy. Why is that? Isn't the cross-validation error an estimate of the test/generalization error?

Thanks

Edit:

This is how I created the 70/30 train/test split:

def split(docs_list, target_recoded):
    """This function splits the dataset into 70% training and 30% test sets."""
    # Splitting into training and test.
    from sklearn.cross_validation import train_test_split
    train_X, test_X, train_Y, test_Y = train_test_split(docs_list, target_recoded, test_size=0.30, random_state=42)

    return train_X, test_X, train_Y, test_Y
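
For illustration, a call to this function might look like this (docs and labels are placeholder names, not from my actual code):

# Hypothetical call: docs is the list of document strings, labels the recoded targets.
train_X, test_X, train_Y, test_Y = split(docs, labels)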

After initial NLP preprocessing (stop-word removal, stemming, etc.), I have a cleaned list of document strings. On that, I used the following function to create the bag-of-words vectors. The 70% training data was passed to it first, and then the 30% test data was passed in a separate call.

def bagofWords(X, Y, max_feature=5000, type="count"):
    """This function creates bag-of-features vectors from the original documents"""

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    # Initialize the vectorizer object, which is scikit-learn's
    # bag-of-words tool. Choose between a count and a tf-idf model.
    if type == "count":
        vectorizer = CountVectorizer(analyzer="word", max_features=max_feature)
    else:
        vectorizer = TfidfVectorizer(analyzer="word", max_features=max_feature)

    # Learn the vocabulary from X and return the document-term matrix.
    X = vectorizer.fit_transform(X)
    return X, np.array(Y)
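
For illustration, the two calls described above would look something like this (the variable names are placeholders). Note that fit_transform learns a fresh vocabulary on every call, so vectorizing the training and test documents in two separate calls only yields comparable feature columns if the two learned vocabularies happen to coincide:

# Placeholder usage; each call fits its own vectorizer, i.e. its own
# vocabulary, on whatever documents it receives.
train_bow, train_labels = bagofWords(train_X, train_Y)  # vocabulary learned from training docs
test_bow, test_labels = bagofWords(test_X, test_Y)      # a second, independent vocabulary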

This is how I train the SGD classifier:

def SGD(self):
    """Method to implement a multi-class SVM using Stochastic Gradient Descent"""

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    scores_sgd = []

    # self.k_fold is a 5-fold CV iterator over the training set (see below).
    for train_indices, test_indices in self.k_fold:
        train_X_cv = self.train_X[train_indices].todense()
        train_Y_cv = self.train_Y[train_indices]

        test_X_cv = self.train_X[test_indices].todense()
        test_Y_cv = self.train_Y[test_indices]

        # A fresh classifier is fitted on every fold, so after the loop
        # self.sgd holds the model trained on the last fold only.
        self.sgd = SGDClassifier(loss='hinge', penalty='l2')
        scores_sgd.append(self.sgd.fit(train_X_cv, train_Y_cv).score(test_X_cv, test_Y_cv))

    print("The mean accuracy of the Stochastic Gradient Descent classifier on CV data is:", np.mean(scores_sgd))

And this is how I check performance on the held-out test data:

def test_performance(self, test_X, test_Y):
    """This method checks the performance of each algorithm on the test data."""

    from sklearn import metrics

    # For SGD
    print("The accuracy of SGD on test data is:", self.sgd.score(test_X, test_Y))
    print("Classification metrics for SGD")
    print(metrics.classification_report(test_Y, self.sgd.predict(test_X)))
    print("Confusion matrix")
    print(metrics.confusion_matrix(test_Y, self.sgd.predict(test_X)))
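
Putting the pieces together, an end-to-end run of the steps above would look roughly like this (using the hypothetical Trainer class sketched earlier; all names are placeholders):

# Placeholder end-to-end flow, mirroring the question's description, including
# the two separate bagofWords() calls for the training and test documents.
train_X, test_X, train_Y, test_Y = split(docs, labels)
train_bow, train_labels = bagofWords(train_X, train_Y)
test_bow, test_labels = bagofWords(test_X, test_Y)

trainer = Trainer(train_bow, train_labels)
trainer.SGD()                                    # prints the mean 5-fold CV accuracy
trainer.test_performance(test_bow, test_labels)  # prints the test-set accuracy and metrics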

Best Answer

Hastie et al. discuss this precise issue in their book, The Elements of Statistical Learning. They conclude that cross-validation is NOT an estimate of the test error conditional on the training set. Rather, they argue it is an estimate of the unconditional test error; in other words, it estimates the expected test error when you also randomize over the world of possible training sets, rather than conditioning on the precise training set you've been given.
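
A small simulation sketch of that claim, under my own illustrative assumptions (synthetic data, an SGD classifier, 50 replications; none of this is from the book): if CV estimated the error conditional on each training set, its estimate should correlate strongly with that training set's true test error across replications. Typically the correlation is weak, while the average CV estimate still tracks the average (unconditional) error:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-class problem; a large held-out set approximates the true error.
X_pool, y_pool = make_classification(n_samples=60000, n_features=20,
                                     n_informative=10, n_classes=3,
                                     random_state=0)
X_test, y_test = X_pool[:20000], y_pool[:20000]
X_rest, y_rest = X_pool[20000:], y_pool[20000:]

rng = np.random.RandomState(1)
cv_acc, cond_acc = [], []
for rep in range(50):
    # Draw a fresh training set from the "world" of possible training sets.
    idx = rng.choice(len(X_rest), size=500, replace=False)
    X_tr, y_tr = X_rest[idx], y_rest[idx]
    clf = SGDClassifier(loss='hinge', penalty='l2', random_state=rep)
    cv_acc.append(cross_val_score(clf, X_tr, y_tr, cv=5).mean())  # CV estimate
    cond_acc.append(clf.fit(X_tr, y_tr).score(X_test, y_test))    # conditional accuracy

print("mean CV accuracy:          ", np.mean(cv_acc))
print("mean conditional accuracy: ", np.mean(cond_acc))
print("corr(CV, conditional):     ", np.corrcoef(cv_acc, cond_acc)[0, 1])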