Do I need data separation in KNN?

cross-validation, k-nearest-neighbour, machine-learning

I am trying to use KNN with cancer data.

I have split my data into train and test sets.

Is this right, or should I use LOOCV?

Best Answer

A test set and a validation set serve completely different purposes.

1. Testing your model

You test your model to measure how well it performs on data it has never seen. For this, set aside a portion of data drawn from the same distribution as the training data and keep it separate. Do not touch this set, or tune any parameters against it, during training.
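As a minimal sketch of holding out a test set, assuming the records fit in a Python list: the helper name hold_out_split, the 20% fraction, and the toy data are all hypothetical, chosen only for illustration.

```python
import random

def hold_out_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # seeded, so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: (features, label) pairs standing in for the cancer records.
data = [((float(i), float(i % 3)), i % 2) for i in range(20)]

train_rows, test_rows = hold_out_split(data, test_fraction=0.2, seed=42)
print(len(train_rows), len(test_rows))
```

After this point, test_rows is put aside and only train_rows is used for training and validation.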

2. Validating your model

This is a subtask of the training process. Validating the model while you train lets you detect and reduce overfitting, i.e. poor generalization. Overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to new data. Here you use a small fraction of your training data set for validation. Leave-One-Out Cross-Validation (LOOCV) is one way of doing that.

Final note: testing on the test set is done at the very end of the pipeline, while validation is done during the training process. Never use test data for training purposes. Also, the test data must come from the same distribution as the data you trained on.

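Training using cross-validation (K-fold and LOOCV)

You can divide your training dataset into K bins (K folds). Leave out the first fold and train your model on the other K-1 folds, then validate the model on the fold you left out. In the next iteration, leave out the second fold, train on the other K-1 folds, and validate on the second fold. Repeat this K times, so that every fold is used for validation exactly once, and average the K validation scores. As an example: divide your dataset into 10 folds. Inside a loop, in every iteration leave out one fold, train on the other nine, then validate on the fold you left out. Repeat this 10 times. This is K-fold cross-validation. LOOCV is the special case where K equals the number of training samples, so each fold contains exactly one sample. Note that this K is unrelated to the k in KNN, which is the number of neighbours.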
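The loop described above can be sketched in plain Python. This is only an illustration under stated assumptions: knn_predict and cross_val_accuracy are hypothetical helpers, the toy data is made up, and the KNN here is a bare majority vote with Euclidean distance rather than any particular library's implementation. It needs Python 3.8 or later for math.dist.

```python
import math
from collections import Counter

def knn_predict(train_rows, query, k):
    """Majority vote among the k training points closest to query (Euclidean)."""
    nearest = sorted(train_rows, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def cross_val_accuracy(rows, k, n_folds):
    """Mean validation accuracy over n_folds folds.

    Setting n_folds == len(rows) gives leave-one-out cross-validation.
    """
    # Interleaved folds: fold i takes rows i, i + n_folds, i + 2 * n_folds, ...
    folds = [rows[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        correct = sum(knn_predict(train, x, k) == y for x, y in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

# Hypothetical toy training set: two well-separated clusters, one per class.
train_rows = (
    [((0.1 * i, 0.0), 0) for i in range(10)]
    + [((5.0 + 0.1 * i, 5.0), 1) for i in range(10)]
)

ten_fold = cross_val_accuracy(train_rows, k=3, n_folds=10)
loocv = cross_val_accuracy(train_rows, k=3, n_folds=len(train_rows))
print(ten_fold, loocv)
```

Only the training data goes through this loop; the held-out test set stays untouched until the final evaluation. On real data you would normally shuffle the rows before forming the folds, and you could call cross_val_accuracy for several values of k to choose the number of neighbours.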

Hope you get it
