Data Preprocessing – Should Data Be Split Before or After Cleaning?

data preprocessing, missing data

I have a dataset in Excel with thousands of rows and columns and a lot of missing data. Should I split the data while it still contains the missing values, or clean it first? I read here that the dataset is split before cleaning. However, in this repository the data is not split at all.

Is there a correct way or does it depend on the particular data?

Best Answer

It depends on what you mean by "cleaning". If you are only removing rows with missing data or using some automated script to fix things like typos, the order doesn't really matter. The biggest difference in that case is that if you split first and then remove the rows with missing data, the train/test proportions could change.

On the other hand, say that your cleaning step initially consisted of removing the rows with missing data, and you did the splitting after that step. After some consideration, you decide to impute the missing data instead of removing the rows. If you make that change, you are now leaking data: the imputation is done before splitting, so it "knows" some of the characteristics of the whole dataset, including the rows that end up in the test set.
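A small sketch of that leak, assuming mean imputation on a single column (the answer does not name a specific imputation method, so this is only one possible case). The values and the helper `observed_mean` are hypothetical.

```python
from statistics import mean

# Hypothetical feature column, already divided into the rows that will
# become train and test. None marks a missing value.
train = [1.0, 2.0, None, 4.0]
test = [100.0, None]

def observed_mean(values):
    """Mean of the non-missing entries."""
    return mean(v for v in values if v is not None)

# Imputing before the split: the fill value is computed from all rows,
# so the test value 100.0 influences what gets written into train.
leaky_fill = observed_mean(train + test)

# Imputing after the split: the fill value comes from train rows only.
clean_fill = observed_mean(train)

print(leaky_fill)  # 26.75
print(clean_fill)  # 7/3, roughly 2.33
```

The gap in the training data gets a very different value depending on whether the test rows were visible when the fill value was computed, which is exactly the information that should not reach the training step.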

Splitting the data as the first step prevents the potential problems described above. If your code runs on the train set independently of the test set, there is no chance of a leak, no matter what the code does.
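As a rough illustration of the split-first workflow, the sketch below splits, learns the fill value from the training part only, and then reuses that value on the test part. Everything here (the `split`, `fit_mean`, and `impute` helpers, the 25% test size, the seed, the toy column) is an assumption made for the example, not a prescribed recipe.

```python
import random
from statistics import mean

def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_mean(column):
    """Learn the fill value from the non-missing entries of one column."""
    return mean(v for v in column if v is not None)

def impute(column, fill):
    """Replace missing entries with a previously learned fill value."""
    return [fill if v is None else v for v in column]

# Hypothetical single-column dataset; None marks a missing value.
data = [1.0, None, 3.0, 5.0, None, 7.0, 9.0, 11.0]

# Step 1: split before any cleaning.
train, test = split(data)

# Step 2: learn the cleaning parameters from train only.
fill = fit_mean(train)

# Step 3: apply the same learned value to both parts.
train_clean = impute(train, fill)
test_clean = impute(test, fill)
```

Because `fit_mean` only ever receives `train`, swapping it for a more elaborate cleaning step later cannot introduce a leak, as long as the new step also sees nothing but the training rows.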