Solved – When to split into training and test sets

cross-validation, exploratory-data-analysis, logistic, model-selection

I want to split my sample into a training set and a testing set in order to cross-validate my findings. However, I am not sure at which point in the process I should do the split.

Here is the order I want to do the analysis:

  1. Descriptive analysis on the whole sample
  2. Split into training and testing set
  3. Correlation analysis only on the training set (kick out highly correlated variables)
  4. Logistic regression only on the training set
  5. Use the resulting logistic model on the testing set and validate

Is this correct?

Thanks in advance for your help.

Best Answer

The primary purpose of splitting into training and test sets is to check how well your model performs on unseen data: train the model on the training set and evaluate its performance on the test set.

Since you don't build your logistic regression model until step 4, you need not split the data until you reach that step.

Exploratory analysis and removal of redundant variables (steps 1 to 3) can therefore be performed on the entire dataset.
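
For concreteness, here is a minimal sketch of that workflow in Python with pandas and scikit-learn. The file name `data.csv`, the `outcome` column, the assumption that all predictors are numeric, and the 0.9 correlation threshold are placeholders for illustration only, not part of the original question:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("data.csv")  # hypothetical dataset with numeric predictors

# Steps 1-3: descriptive and correlation analysis on the whole sample
print(df.describe())
corr = df.drop(columns="outcome").corr().abs()
# Drop one variable from each highly correlated pair (threshold is illustrative)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X = df.drop(columns=["outcome"] + to_drop)
y = df["outcome"]

# Step 4: split, then fit the logistic regression on the training set only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 5: validate on the held-out test set
print("Test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```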

I would additionally recommend using $n$-fold cross-validation and using its results to better understand how your model would behave on unseen data. For this, split the dataset into $n$ random folds; then, one by one, set aside each fold, train the model on the remaining data, and check its performance on the set-aside fold. Evaluate the model using the $n$ performance metrics obtained this way.
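
A minimal sketch of that procedure using scikit-learn's `StratifiedKFold` and `cross_val_score`, reusing the hypothetical `X` and `y` from the sketch above (the choice of 5 folds and AUC as the metric are assumptions, not requirements):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# 5-fold cross-validation: each fold is held out once while the model is
# trained on the remaining folds, yielding one performance score per fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print("Per-fold AUC:", scores)
print("Mean AUC: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```

The spread of the per-fold scores gives a rough sense of how stable the model's performance is across different subsets of the data, which a single train/test split cannot show.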