Solved – GridSearchCV Regression vs Linear Regression vs Statsmodels OLS

machine learning · python · r-squared · regression · scikit learn

I am trying to build a multiple linear regression model with 3 different methods, and I am getting different results for each one. I think I should get the same results, so where does this difference come from?

Using GridSearchCV

from sklearn import cross_validation, linear_model
from sklearn.grid_search import GridSearchCV
import numpy as np

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    data, ground_truth_data, test_size=0.3, random_state=1)
model = linear_model.LinearRegression()
parameters = {'fit_intercept': [True, False], 'normalize': [True, False],
              'copy_X': [True, False]}
grid = GridSearchCV(model, parameters, cv=None)
grid.fit(X_train, y_train)
print("r2 / variance : ", grid.best_score_)
# note: np.mean(...) is actually the mean squared error, not the residual *sum* of squares
print("Residual sum of squares: %.2f"
      % np.mean((grid.predict(X_test) - y_test) ** 2))

The output is:

r2 / variance : 0.823041227357

Residual sum of squares: 0.18

Using Linear Regression without GridSearchCV

from sklearn import cross_validation, linear_model
import numpy as np

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    data, ground_truth_data, test_size=0.3, random_state=1)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("r2/variance : ", model.score(X_test, y_test))
print("Residual sum of squares: %.2f"
      % np.mean((model.predict(X_test) - y_test) ** 2))

The output is:

r2 / variance : 0.883799174674

Residual sum of squares: 0.18

Using Statsmodel OLS method

import statsmodels.api as sm
from sklearn import cross_validation

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    data, ground_truth_data, test_size=0.3, random_state=1)

# statsmodels does not add an intercept automatically
X_train_sm = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train_sm)
results = model.fit()
print("r2/variance : ", results.rsquared)

The output is:

r2/variance : 0.893686634315

I am confused about three points.

  1. Why does using GridSearchCV not increase the r2 score, and why is the sum of errors the same?

> My guess is that GridSearchCV performs some cross-validation (maybe k-fold), so the r2 score decreases when we use it. But I am not clear on this point.

  2. What is the difference between the scikit-learn and statsmodels OLS?

> My guess is that statsmodels OLS looks at the training error while scikit-learn looks at the test error, so using the scikit-learn OLS seems more rational.

  3. When and how can we use GridSearchCV on a regression model?

> I do not have much of a guess here.

Thanks for any ideas.

Best Answer

The difference between the scores can be explained as follows:

In your first model, you are performing cross-validation. When cv=None, or when it is not passed as an argument, GridSearchCV defaults to 3-fold cross-validation (newer scikit-learn versions default to 5-fold). With three folds, each model trains on roughly 67% of the data and is validated on the remaining 33%. Since you already split the data 70%/30% before this, each model built inside GridSearchCV uses only about 0.7 × 0.67 ≈ 0.47 (47%) of the original data.
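The fold arithmetic can be checked directly by counting split sizes on a hypothetical dataset of 100 samples (the data here is made up purely for counting; any array of that length would do):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Hypothetical data: 100 samples, used only to count split sizes
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# 70/30 outer split, as in the question
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)
print(len(X_train))  # 70 samples remain for GridSearchCV

# 3-fold CV inside GridSearchCV: each model trains on ~2/3 of those 70
fold_sizes = [(len(tr), len(va))
              for tr, va in KFold(n_splits=3).split(X_train)]
print(fold_sizes)  # [(46, 24), (47, 23), (47, 23)]
```

So each model inside the grid search trains on only 46–47 of the original 100 samples, which is where the ~47% figure comes from.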

In your second model, there is no k-fold cross-validation. You have a single model that is trained on 70% of the original data and tested on the remaining 30%. Since this model has been given much more training data, a higher score is to be expected.

In your last model, you again train a single model on 70% of the data. However, this time you never evaluate it on the 30% you reserved for testing. As you suspected, you are looking at the training error, not the test error. The training error is almost always better than the test error, so the higher score is, again, to be expected.
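A minimal sketch of this train-vs-test gap, on made-up noisy data (the size of the gap depends on the noise level and the seed, but in-sample R² for OLS with an intercept is always between 0 and 1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: a known linear signal plus Gaussian noise
rng = np.random.RandomState(1)
X = rng.rand(200, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)
model = LinearRegression().fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # analogous to statsmodels' results.rsquared
test_r2 = model.score(X_test, y_test)     # what the second snippet reports
print(train_r2, test_r2)
```

The first number is the optimistic in-sample score that statsmodels' `rsquared` reports; the second is the honest held-out score, which is what you should compare models on.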

When and how we can use GridSearchCv on Regression model ?

GridSearchCV should be used to find the optimal hyperparameters for your final model. Typically, you run GridSearchCV, then look at the parameters that produced the model with the best score. You then take those parameters and train your final model on all of the data. It is important to note that if you have trained your final model on all of your data, you cannot test it: for any valid test, you must reserve some of the data.
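A sketch of that workflow, on hypothetical data (Ridge is used here purely as an example, since unlike plain LinearRegression it has a hyperparameter, `alpha`, that is actually worth searching over; note that with `refit=True`, the default, GridSearchCV automatically retrains the best model on all of the data passed to `fit`):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical data: a known linear signal plus noise
rng = np.random.RandomState(0)
X = rng.rand(150, 4)
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=150)

# Reserve a test set FIRST; GridSearchCV must never see it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

grid = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)  # the winning hyperparameters
# refit=True (default): best_estimator_ is already retrained on all of X_train
print(grid.best_estimator_.score(X_test, y_test))  # honest held-out score
```

The key point is the order of operations: split off the test set before the search, let GridSearchCV cross-validate only within the training portion, and report the score on the untouched test set.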
