First, in terms of usage: you can get predictions in statsmodels in much the same way as in scikit-learn, except that we use the results instance returned by fit
predictions = results.predict(X_test)
Given the predictions, we can calculate statistics based on the prediction error
prediction_error = y_test - predictions
statsmodels has a separate list of functions for calculating goodness-of-prediction statistics, but it is not integrated into the models, nor does it include R squared. (I've never heard of R squared being used for out-of-sample data.) Calculating those statistics requires a bit more work from the user, and statsmodels does not have the same set of statistics, especially not for classification or for models with a binary response variable.
To your other two points:
Linear regression is, in its basic form, the same in statsmodels and in scikit-learn. However, the implementations differ, which might produce different results in edge cases, and scikit-learn in general has more support for larger models. For example, statsmodels currently uses sparse matrices in very few parts.
The most important difference is in the surrounding infrastructure and the use cases that are directly supported.
Statsmodels follows largely the traditional model where we want to know how well a given model fits the data, and what variables "explain" or affect the outcome, or what the size of the effect is.
Scikit-learn follows the machine learning tradition, where the main supported task is choosing the "best" model for prediction.
As a consequence, the emphasis in the supporting features of statsmodels is on analysing the training data, which includes hypothesis tests and goodness-of-fit measures, while the emphasis in the supporting infrastructure of scikit-learn is on model selection for out-of-sample prediction, and therefore on cross-validation with "test data".
This points to the distinction, but there is still quite a lot of overlap in usage: statsmodels also does prediction, and additionally forecasting in a time-series context.
But when we want to do cross-validation for prediction in statsmodels, it is currently still often easier to reuse the cross-validation setup of scikit-learn together with the estimation models of statsmodels.
The difference between the scores can be explained as follows:
In your first model, you are performing cross-validation. When cv=None, or when it is not passed as an argument, GridSearchCV will default to cv=3. With three folds, each model trains on 66% of the data and tests on the other 33%. Since you already split the data 70%/30% before this, each model built during the grid search uses only about 0.7*0.66=0.462 (46.2%) of the original data.
In your second model, there is no k-fold cross-validation. You have a single model that is trained on 70% of the original data and tested on the remaining 30%. Since this model has been given much more data, the higher score is expected.
In your last model, you train another single model on 70% of the data. However, this time you do not test it using the 30% of the data you reserved for testing. As you suspected, you are looking at the training error, not the test error. The training error is almost always better than the test error, so the higher score is, again, expected.
When and how can we use GridSearchCV on a regression model?
GridSearchCV should be used to find the optimal parameters with which to train your final model. Typically, you should run GridSearchCV and then look at the parameters that gave the model with the best score. You should then take these parameters and train your final model on all of the data. It is important to note that if you have trained your final model on all of your data, you cannot test it. For any correct test, you must reserve some of the data.
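A minimal sketch of this workflow, assuming a Ridge regression and a synthetic dataset (the alpha grid and data are illustrative choices, not a recommendation):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data, purely for illustration.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Search for the best regularisation strength on the training data only.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=3)
search.fit(X_train, y_train)

# Train the final model with the best parameters, keeping the
# test split untouched so that a correct test remains possible.
final = Ridge(**search.best_params_).fit(X_train, y_train)
print(final.score(X_test, y_test))
```

Note that the grid search itself never sees the test split; that split exists solely to score the final model.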
There is no difference.
Principal component regression (PCR) is linear regression after principal component analysis (PCA) is done on the set of predictors and (usually) only a small subset of principal components is retained.
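In scikit-learn terms, PCR is literally a PCA step followed by a linear regression step; a minimal sketch on synthetic data (the choice of three retained components and the dataset are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic data, purely for illustration.
X, y = make_regression(n_samples=150, n_features=10, noise=5.0, random_state=0)

# PCR: regress y on only the first few principal components of X.
pcr = make_pipeline(PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
predictions = pcr.predict(X)
```

Setting n_components equal to the number of features would recover ordinary linear regression exactly, which is the sense in which there is "no difference" in the model itself, only in which components are retained.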
For more on this topic, you might be interested in the thread "How can top principal components retain the predictive power on a dependent variable (or even lead to better predictions)?" and the links therein.