Solved – sklearn Linear Regression vs Batch Gradient Descent

Tags: gradient-descent, linear-regression, scikit-learn

tldr: Why would sklearn LinearRegression give a different result than gradient descent?

My understanding is that LinearRegression computes the closed-form solution for linear regression (described well here: Why use gradient descent for linear regression, when a closed-form math solution is available?). LinearRegression is not good if the data set is large, in which case stochastic gradient descent needs to be used. I have a small data set and wanted to use Batch Gradient Descent (self-written) as an intermediate step for my own edification.

I get different regression weights using LinearRegression and Batch Gradient Descent. Shouldn't the solution be unique?
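For reference, the closed-form solution referred to above is the normal-equation estimate, which is unique whenever the design matrix (with a column of ones prepended for the intercept) has full column rank. A minimal NumPy sketch:

import numpy as np

def normal_equation(X, y):
    X_b = np.c_[np.ones(len(X)), X]                 # prepend an intercept column of ones
    # Solve (X'X) beta = X'y; assumes X_b has full column rank, so the solution is unique
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

Applied to the X.values and y.values defined in the code below, this should reproduce LinearRegression's intercept and coefficients up to numerical precision.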

Code:

import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt

data=pd.read_csv(r'') #Data set attached
X=data[['Size','Floor','Broadband  Rate']]
y=data['Rental Price']

#Sklearn Linear Regression
ols=linear_model.LinearRegression(fit_intercept=True, normalize=False)
LR=ols.fit(X,y)
Res_LR=y.values-LR.predict(X) #Residuals 
print('Intercept', LR.intercept_, 'Weights', LR.coef_)

#Batch Gradient Descent
def error_delta(x,y,p,wn):
    #Sum over all rows of (observed - predicted) * x[i,wn-1]; for wn=0 the "feature" is a
    #constant 1, so this is the (negative) gradient of the squared error w.r.t. weight wn
    total=0
    row,column=np.shape(x)

    for i in range(0,row):
        if wn!=0: total+=(y[i]-(p[0]+np.dot(p[1:len(p)],x[i,:])))*x[i,wn-1]
        else: total+=(y[i]-(p[0]+np.dot(p[1:len(p)],x[i,:])))*1
    return total

def weight_update(x,y,p,alpha):
    #One batch gradient descent step: update every weight using the full data set
    old=p
    new=np.zeros(len(p))
    for i in range(0,len(p)): new[i]=old[i]+alpha*error_delta(x,y,old,i)
    return new

weight=[-.146,.185,-.044,.119] #random starting conditions
alpha=.00000002 #learning rate

for i in range(0,500): #Seems to have converged by 100
    weight=weight_update(X.values,y.values,weight,alpha)

Res_BGD=np.zeros(len(X.values)) #Residuals from the BGD fit
for i in range(0,len(X.values)): Res_BGD[i]=y.values[i]-(weight[0]+np.dot(weight[1::],X.values[i,:]))

print('Intercept', weight[0], 'Weights', weight[1:len(weight)])

plt.plot(np.arange(0,len(X.values)),Res_LR,color='b')
plt.plot(np.arange(0,len(X.values)), Res_BGD,color='g')
plt.legend(['Res LR', 'Res BGD'])
plt.show()
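For reference, the same batch update can also be written without the explicit loops; a vectorized sketch (assuming the parameter vector p is a NumPy array rather than a list):

def weight_update_vec(x, y, p, alpha):
    r = y - (p[0] + x @ p[1:])     # residuals under the current parameters
    new = p.copy()
    new[0] += alpha * r.sum()      # intercept update (its "feature" is a constant 1)
    new[1:] += alpha * (x.T @ r)   # one update per remaining feature
    return new

Calling weight_update_vec(X.values, y.values, np.asarray(weight), alpha) gives the same step as weight_update, up to floating-point rounding.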

The data set is below (10 points)

Size,Floor,Broadband  Rate,Energy Rating,Rental Price
"  500   ","    4    ","    8    ","    C ","   320 "
"  550   ","    7    ","    50   ","    A ","   380 "
"  620   ","    9    ","    7    ","    A ","   400 "
"  630   ","    5    ","    24   ","    B ","   390 "
"  665   ","    8    ","    100  ","    C ","   385 "
"  700   ","    4    ","    8    ","    B ","   410 "
"  770   ","    10   ","    7    ","    B ","   480 "
"  880   ","    12   ","    50   ","    A ","   600 "
"  920   ","    14   ","    8    ","    C ","   570 "
"  1000  ","    9    ","    24   ","    B ","   620 "

When you plot the residuals, the performance is comparable despite very different weights.

Sklearn LR Intercept 19.5615588974 Weights [ 0.54873985 4.96354677 -0.06209515]

BGD Intercept -0.145402077197 Weights [ 0.62549182 -0.0344091 0.11473203]

Thoughts? Also, if there's any programming feedback, I'm open to more efficient ways to code this as well.

Best Answer

There are some problems in your question.

  1. Can you make sure the iterative solver converges?

Note that we can solve linear regression / minimize the squared loss in different ways. In my experience with python scikit-learn, the default settings of an iterative solver often do not produce a converged result: the number of iterations is capped, so the solver can stop early. If we stop early the work is only half done, and the result will not be the same as the optimal solution you get from other algorithms.
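One way to check this is to keep iterating until the parameters essentially stop changing, rather than stopping after a fixed 500 steps. A minimal sketch reusing the question's weight_update, X, y and alpha (the tolerance and iteration cap are arbitrary choices):

weight = np.array([-.146, .185, -.044, .119])
for it in range(200000):
    new = weight_update(X.values, y.values, weight, alpha)
    if np.max(np.abs(new - weight)) < 1e-10:  # parameters have essentially stopped moving
        break
    weight = new
print('Stopped after', it + 1, 'iterations')

With this learning rate and unscaled features, the intercept moves very slowly: its gradient is just the sum of residuals, while the Size gradient is scaled by feature values in the hundreds, which is why the reported BGD intercept (-0.145) is still essentially the starting value (-0.146).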

  2. I would not agree with

LinearRegression is not good if the data set is large, in which case stochastic gradient descent needs to be used.

If we use QR decomposition, then even when the data is on the level of millions of rows (hopefully this is large enough) and the number of features is not big, we can solve it in about a second. Check this R code. You may be surprised that we can fit a linear regression on a million data points in less than 1 second.

x=matrix(runif(2e6),ncol=2)
y=runif(1e6)
stime = proc.time()
lm(y~x)
print(proc.time()-stime)
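For comparison, roughly the same experiment in NumPy; np.linalg.lstsq uses an SVD-based LAPACK least-squares routine rather than lm's QR, but the point about speed is the same:

import time
import numpy as np

X = np.random.rand(1000000, 2)     # one million rows, two features
X_b = np.c_[np.ones(len(X)), X]    # intercept column
y = np.random.rand(1000000)

start = time.time()
beta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(beta, 'in', time.time() - start, 'seconds')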