I have a data set (Data set 1) with 2000 data points and 70 covariates on the sale price of houses. All of these properties are in the same area of the USA.
I also have information on 200 other properties in a different area but near the first area (Data set 2).
I have no missing data.
I have been asked to produce a predictive linear regression model using the first data set (whilst using as few variables as necessary), and then to see how well this model performs on Data set 2.
Since my model is based only on the information in Data set 1, should I split this into testing and training data sets before I decide on my model? Also, do I need to split the data into training and testing sets if I use cross-validation?
Best Answer
Should I split Dataset 1 into testing and training data sets before I decide my model?
Yes! For feature selection and model selection you'll need to use cross-validation. However, once you have selected the appropriate features and parameters of the model, you can train the model on all of the data in Dataset 1. That final model is then used to predict sale prices for the samples in Dataset 2.
Do I need to split the data into training and testing if I use cross validation?
Cross-validation itself means dividing the data into training and test sets. If by cross-validation you mean k-fold cross-validation, then you don't necessarily need to add an extra layer of training and test set splits, because every fold serves as the held-out test set exactly once.
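
As a rough illustration of that workflow, here is a minimal sketch in Python with scikit-learn. Several things in it are assumptions and not part of the question: the data are synthetic stand-ins generated by a hypothetical helper `make_synthetic` (both sets are drawn from the same distribution, which the two real areas may not share), and the lasso via `LassoCV` is just one possible way to pick a small set of variables in a linear model. `LassoCV` chooses its penalty by 5-fold cross-validation on Dataset 1 and then refits on all of Dataset 1, so no extra train/test split of Dataset 1 is made. Dataset 2 is only touched at the very end.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)


def make_synthetic(n_rows, n_cols=70):
    """Hypothetical stand-in for the housing data (NOT the real data sets).

    Only the first five columns actually influence the response.
    """
    X = rng.normal(size=(n_rows, n_cols))
    coef = np.zeros(n_cols)
    coef[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
    y = X @ coef + rng.normal(scale=0.5, size=n_rows)
    return X, y


X1, y1 = make_synthetic(2000)  # plays the role of Dataset 1
X2, y2 = make_synthetic(200)   # plays the role of Dataset 2

# Feature/model selection: 5-fold CV on Dataset 1 picks the lasso penalty.
# After the CV search, LassoCV refits on all of Dataset 1 with that penalty.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X1, y1)

# Columns with a non-zero coefficient are the ones the lasso kept.
lasso = model.named_steps["lassocv"]
selected = np.flatnonzero(lasso.coef_)

# Final, one-off check on the data that played no part in selection.
r2_dataset2 = r2_score(y2, model.predict(X2))

print("number of selected columns:", len(selected))
print("R^2 on Dataset 2:", round(r2_dataset2, 3))
```

If you prefer stepwise or best-subset selection, the same pattern applies: do the selection inside cross-validation on Dataset 1, refit the chosen model on all of Dataset 1, and only then score it on Dataset 2.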