Solved – Partitioning with cross validation

cart, classification, cross-validation, partitioning

I am new to data analytics, having only started exploring the field this week. I have downloaded KNIME and am working with a single dataset to try out different classification algorithms.

I am currently trying out the decision tree algorithm and would like to include cross-validation. At the moment I partition the dataset 50/50, with the training data going to the Learner node and the test data to the Predictor. Now for the part where I need help with my understanding: if I want to use the cross-validation nodes in KNIME to estimate the test error rate, do I still need to partition the data before giving it to the X-Partitioner node?

Initially I assumed I did, as my understanding of the test set is that it is used to assess the model's classification ability on records that are not in the training data, whereas with cross-validation every record in the dataset is used as both training and test data at least once. However, I have since seen the cross-validation example metaworkflow here: https://www.knime.org/introduction/examples.
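To sanity-check that picture, I put together a tiny Python sketch (scikit-learn here is purely for illustration; I am assuming its KFold splits work the way KNIME's cross-validation does) showing that the test folds together cover the whole dataset, with no prior split:

```python
# Sketch (my assumption of how k-fold splitting works; not KNIME code):
# across the folds, every record lands in a test set exactly once.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)  # 20 dummy records

kf = KFold(n_splits=5, shuffle=True, random_state=42)
test_indices = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: train on {len(train_idx)} records, test on {len(test_idx)}")
    test_indices.extend(test_idx)

# The union of the test folds is the full dataset -- no separate split needed.
assert sorted(test_indices) == list(range(len(X)))
```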

Any advice to help clear up my confusion would be greatly appreciated.

Best Answer

The X-Partitioner is intended to replace the Partitioning node, and the X-Aggregator is intended to replace the Scorer.


In this setup, the Learner and Predictor are re-executed once per fold according to the settings in the X-Partitioner. If you are new to KNIME this may be your first loop, and if so it's worth reading more about on the KNIME website (or YouTube).
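If it helps to see the loop spelled out, here is a rough Python/scikit-learn equivalent of what the nodes do (a sketch of the idea, not KNIME's actual code): KFold plays the X-Partitioner, fit/predict play the Learner/Predictor pair, and averaging the per-fold error rates plays the X-Aggregator.

```python
# Rough equivalent of the KNIME cross-validation loop (illustrative sketch).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=42)  # ~ X-Partitioner settings

error_rates = []
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])  # ~ Learner
    predictions = model.predict(X[test_idx])                          # ~ Predictor
    error_rates.append(np.mean(predictions != y[test_idx]))          # per-fold error

print(f"estimated error rate: {np.mean(error_rates):.3f}")           # ~ X-Aggregator
```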

One limitation here is that the X-Aggregator only reports error rates. If you want to validate against a different quality metric, such as R² for a regression or ROC AUC for a binary classifier, you'll need to roll your own loop.
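For illustration, rolling your own loop might look like this in Python with scikit-learn (again an assumption-laden sketch rather than KNIME code; in KNIME itself you would build the equivalent from generic loop nodes), here computing ROC AUC per fold on a binary problem:

```python
# Hand-rolled cross-validation loop with a custom metric (ROC AUC) --
# a sketch of the idea, not KNIME's implementation.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)  # binary classification dataset
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

aucs = []
for train_idx, test_idx in skf.split(X, y):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]  # class-1 probabilities
    aucs.append(roc_auc_score(y[test_idx], scores))  # any metric fits here

print(f"mean ROC AUC over folds: {np.mean(aucs):.3f}")
```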

Good luck and have fun!
