Solved – GridSearchCV Regression vs Linear Regression vs Statsmodels OLS

machine learning · python · r-squared · regression · scikit learn

I am trying to build a multiple linear regression model with 3 different methods, and I am getting different results for each one. I think I should get the same results, so where does this difference come from?

Using GridSearchCV

from sklearn import cross_validation, linear_model
from sklearn.grid_search import GridSearchCV
import numpy as np

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    data, ground_truth_data, test_size=0.3, random_state=1)
model = linear_model.LinearRegression()
parameters = {'fit_intercept': [True, False], 'normalize': [True, False],
              'copy_X': [True, False]}
grid = GridSearchCV(model, parameters, cv=None)
grid.fit(X_train, y_train)
print("r2 / variance : ", grid.best_score_)
# note: np.mean(...) is actually the mean squared error, not the residual *sum* of squares
print("Residual sum of squares: %.2f"
      % np.mean((grid.predict(X_test) - y_test) ** 2))

The output is:

r2 / variance : 0.823041227357

Residual sum of squares: 0.18

Using Linear Regression without GridSearchCV

from sklearn import cross_validation, linear_model
import numpy as np

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    data, ground_truth_data, test_size=0.3, random_state=1)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("r2/variance : ", model.score(X_test, y_test))
print("Residual sum of squares: %.2f"
      % np.mean((model.predict(X_test) - y_test) ** 2))

The output is:

r2 / variance : 0.883799174674

Residual sum of squares: 0.18

Using Statsmodel OLS method

import statsmodels.api as sm
from sklearn import cross_validation

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    data, ground_truth_data, test_size=0.3, random_state=1)

# statsmodels does not add an intercept automatically
X_train_sm = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train_sm)
results = model.fit()
print("r2/variance : ", results.rsquared)

The output is:

r2/variance : 0.893686634315

I am confused about three points.

  1. Why does using GridSearchCV not increase the r2 score, and why is the sum of errors the same?

> My guess is that GridSearchCV performs some cross-validation (maybe k-fold), so the r2 score decreases when we use it. But I am not clear on this point.

  2. What is the difference between the scikit-learn and statsmodels OLS?

> My guess is that statsmodels OLS looks at the training error while scikit-learn looks at the test error, so using the scikit-learn OLS seems more rational.

  3. When and how can we use GridSearchCV on a regression model?

> I do not have much of a guess here.

Thanks for any ideas.

Best Answer

The difference between the scores can be explained as follows:

In your first model, you are performing cross-validation. When cv=None, or when it is not passed as an argument, GridSearchCV defaults to 3-fold cross-validation (newer scikit-learn versions default to 5-fold). With three folds, each model trains on roughly 67% of the data and is validated on the remaining 33%. Since you already split the data 70%/30% before this, each model built inside GridSearchCV uses only about 0.7 × 0.67 ≈ 0.47 (47%) of the original data.
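The fold arithmetic can be checked directly by counting split sizes on a hypothetical dataset of 100 samples (the data here is made up purely for counting; any array of that length would do):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Hypothetical data: 100 samples, used only to count split sizes
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# 70/30 outer split, as in the question
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)
print(len(X_train))  # 70 samples remain for GridSearchCV

# 3-fold CV inside GridSearchCV: each model trains on ~2/3 of those 70
fold_sizes = [(len(tr), len(va))
              for tr, va in KFold(n_splits=3).split(X_train)]
print(fold_sizes)  # [(46, 24), (47, 23), (47, 23)]
```

So each model inside the grid search trains on only 46–47 of the original 100 samples, which is where the ~47% figure comes from.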

In your second model, there is no k-fold cross-validation. You have a single model that is trained on 70% of the original data and tested on the remaining 30%. Since this model has been given much more training data, a higher score is to be expected.

In your last model, you again train a single model on 70% of the data. However, this time you never evaluate it on the 30% you reserved for testing. As you suspected, you are looking at the training error, not the test error. The training error is almost always better than the test error, so the higher score is, again, to be expected.
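A minimal sketch of this train-vs-test gap, on made-up noisy data (the size of the gap depends on the noise level and the seed, but in-sample R² for OLS with an intercept is always between 0 and 1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: a known linear signal plus Gaussian noise
rng = np.random.RandomState(1)
X = rng.rand(200, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)
model = LinearRegression().fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # analogous to statsmodels' results.rsquared
test_r2 = model.score(X_test, y_test)     # what the second snippet reports
print(train_r2, test_r2)
```

The first number is the optimistic in-sample score that statsmodels' `rsquared` reports; the second is the honest held-out score, which is what you should compare models on.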

When and how we can use GridSearchCv on Regression model ?

GridSearchCV should be used to find the optimal hyperparameters for your final model. Typically, you run GridSearchCV, then look at the parameters that produced the model with the best score. You then take those parameters and train your final model on all of the data. It is important to note that if you have trained your final model on all of your data, you cannot test it: for any valid test, you must reserve some of the data.
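A sketch of that workflow, on hypothetical data (Ridge is used here purely as an example, since unlike plain LinearRegression it has a hyperparameter, `alpha`, that is actually worth searching over; note that with `refit=True`, the default, GridSearchCV automatically retrains the best model on all of the data passed to `fit`):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical data: a known linear signal plus noise
rng = np.random.RandomState(0)
X = rng.rand(150, 4)
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=150)

# Reserve a test set FIRST; GridSearchCV must never see it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

grid = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)  # the winning hyperparameters
# refit=True (default): best_estimator_ is already retrained on all of X_train
print(grid.best_estimator_.score(X_test, y_test))  # honest held-out score
```

The key point is the order of operations: split off the test set before the search, let GridSearchCV cross-validate only within the training portion, and report the score on the untouched test set.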
