Solved – K-fold cross validation and F1 score metric

Tags: cross-validation, precision-recall, scikit-learn

I have to classify and validate my data with 10-fold cross validation. Then I have to compute the F1 score for each class. To do that, I divided my data X into X_train (80% of X) and X_test (20% of X), and divided the target Y into y_train (80% of Y) and y_test (20% of Y). I have the following questions about this:

  1. Is it correct to run cross validation on only the training data, or do I have to run it on all of the data X?
  2. Is it correct to split the data into training and test parts to compute the F1 score, or is there a way to obtain the F1 score for each class using all of the data?
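
For completeness, this is roughly how such an 80/20 split can be produced (a sketch using sklearn's train_test_split; in my case the split data were already saved to the CSV files loaded below, and the stratify option is just an assumption to keep the class proportions similar in both parts):

from sklearn.model_selection import train_test_split

# 80% training / 20% test split; stratify keeps the class distribution
# of Y roughly the same in y_train and y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=0)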

For reference, here is the code I wrote:

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_df = pd.read_csv('X.csv', skipinitialspace=True, sep=',', header=None)
X = X_df.values
Y_df = pd.read_csv('Y.csv', header=None)
Y = Y_df[0].values

X_train_df = pd.read_csv('X_train.csv', skipinitialspace=True, sep=',', header=None)
X_train = X_train_df.values
y_train_df = pd.read_csv('Y_train.csv', header=None)
y_train = y_train_df[0].values

X_test_df = pd.read_csv('X_test.csv', skipinitialspace=True, sep=',', header=None)
X_test = X_test_df.values
y_test_df = pd.read_csv('Y_test.csv', header=None)
y_test = y_test_df[0].values


######################## RandomForest #################################

# Fit a random forest on the training data only
clf = RandomForestClassifier(n_estimators=100, n_jobs=1, criterion="gini")
clf.fit(X_train, y_train)

# Mean accuracy over 10-fold cross validation on the training data
cv = np.mean(cross_val_score(clf, X_train, y_train, cv=10))
print("Accuracy using RF with 10-fold cross validation: {}%".format(round(cv * 100, 2)))

# Predict on the held-out test set
y_predict_test = clf.predict(X_test)

# F1 score per class on the test set (average=None returns one score per label)
score_test = metrics.f1_score(y_test, y_predict_test,
                              labels=list(set(y_test)), average=None)

print(score_test)

The code runs without errors, but I'm not certain the results are correct, so I wanted to verify my approach with you.

Best Answer

  1. It is correct to run cross validation on only the training data. You want to keep the test set completely separate from the training set, which is used to tune the model. That way you get an unbiased estimate of model performance, because the model has never been exposed to the test data. The test set should not be used to tune the model any further.

  2. It is correct to split the data into training and test parts and compute the F1 score for each; you want to compare these scores. As I said in answer 1, the point of using a test set is to evaluate the model on truly unseen data so you have an idea of how it will perform in production. If you see a large drop in performance between the training score and the test score, it is likely due to the model overfitting. A minimal sketch of this comparison is given after this list.
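
To make that concrete, here is a minimal sketch of the workflow described above, assuming the same X_train, y_train, X_test, y_test arrays and RandomForestClassifier settings as in your code; cross_val_predict is used here only as one convenient way to get per-class F1 from the cross validation folds.

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

clf = RandomForestClassifier(n_estimators=100, n_jobs=1, criterion="gini")

# Per-class F1 estimated by 10-fold cross validation on the training data only:
# each training sample is predicted by the model fit on the other 9 folds.
y_cv_pred = cross_val_predict(clf, X_train, y_train, cv=10)
f1_train_cv = metrics.f1_score(y_train, y_cv_pred, average=None)

# Per-class F1 on the held-out test set, which was never used for training or tuning.
clf.fit(X_train, y_train)
f1_test = metrics.f1_score(y_test, clf.predict(X_test), average=None)

print("Per-class F1 (10-fold CV on training data):", f1_train_cv)
print("Per-class F1 (held-out test set):", f1_test)

A large gap between the cross-validated scores and the test scores for a given class is a sign that the model is overfitting on that class.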
