Solved – When to use training and test data sets

cross-validation, multiple regression

So, I have a data set (Data set 1) with 2000 data points and 70 covariates on the sale prices of houses. All these properties were in the same area of the USA.

I also have information on 200 other properties in a different area but near the first area (Data set 2).

I have no missing data.

I have been asked to produce a predictive linear regression model using the first data set (whilst using as few variables as necessary), and then to see how well this model performs on data set 2.

Since my model is based only on the information in data set 1, should I split this into testing and training data sets before I decide on my model? Also, do I need to split the data into training and testing sets if I use cross-validation?

Best Answer

  1. Should I split Dataset 1 into testing and training data sets before I decide on my model? Yes. For feature selection and model selection you'll need to use cross-validation on Dataset 1, so that each candidate model is judged on data it was not fitted to. Once you have selected the appropriate features and model parameters, you can refit the model on all of Dataset 1 and then use it to predict sale prices for Dataset 2.

  2. Do I need to split the data into training and testing sets if I use cross-validation? Cross-validation itself works by dividing the data into training and test sets. If by cross-validation you mean k-fold cross-validation, then you don't necessarily need an extra layer of training/test splits, since each fold already serves as the test set in turn.
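
The workflow in the two points above can be sketched in Python with NumPy. This is only an illustration under stated assumptions, not the answerer's method. The data are synthetic stand-ins with the same shapes as the question (2000 x 70 and 200 x 70). Ranking features by absolute correlation with the price is one simple selection rule chosen for brevity. The helper names (`fit_ols`, `top_features`, `cv_rmse`) are hypothetical. Note that the feature ranking is redone inside every fold using training rows only, so the held-out fold never influences the selection.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X (intercept first)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def top_features(X, y, n_keep):
    """Indices of the n_keep columns most correlated (in absolute value) with y."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(corr)[::-1][:n_keep]

def cv_rmse(X, y, n_keep, k=5, seed=0):
    """Mean k-fold RMSE of a linear model restricted to n_keep features."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Select features and fit on the training folds only.
        cols = top_features(X[train_idx], y[train_idx], n_keep)
        coef = fit_ols(X[train_idx][:, cols], y[train_idx])
        scores.append(rmse(y[test_idx], predict(coef, X[test_idx][:, cols])))
    return float(np.mean(scores))

# Synthetic stand-ins (assumption): only the first 5 of 70 covariates matter.
rng = np.random.default_rng(42)
true_coef = np.zeros(70)
true_coef[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
X1 = rng.normal(size=(2000, 70))           # Dataset 1
y1 = X1 @ true_coef + rng.normal(size=2000)
X2 = rng.normal(size=(200, 70))            # Dataset 2
y2 = X2 @ true_coef + rng.normal(size=200)

# Step 1: choose the number of features by cross-validation on Dataset 1.
candidates = [1, 2, 5, 10, 20, 70]
cv_scores = {n: cv_rmse(X1, y1, n) for n in candidates}
best_n = min(cv_scores, key=cv_scores.get)

# Step 2: refit on ALL of Dataset 1 with the chosen number of features.
cols = top_features(X1, y1, best_n)
coef = fit_ols(X1[:, cols], y1)

# Step 3: evaluate once on Dataset 2, which played no part in steps 1 and 2.
test_rmse = rmse(y2, predict(coef, X2[:, cols]))
print("CV RMSE by number of features:", cv_scores)
print("chosen number of features:", best_n, "| Dataset 2 RMSE:", test_rmse)
```

Because Dataset 2 is touched only in the last step, its error is an honest estimate of how the model transfers to the neighbouring area. If Dataset 2 were also used to pick the features, it would stop being a fair test.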