Scikit-learn does not have a combined implementation of PCA and regression like, for example, the pls package in R. But I think one can do something like the following, or choose PLS regression instead.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score, KFold
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
df = pd.read_csv('multicollinearity.csv')
X = df.iloc[:,1:6]
y = df.response
Scikit-learn PCA
pca = PCA()
Scale and transform data to get Principal Components
X_reduced = pca.fit_transform(scale(X))
Variance (% cumulative) explained by the principal components
np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
array([ 73.39, 93.1 , 98.63, 99.89, 100. ])
Seems like the first two components indeed explain most of the variance in the data.
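Since multicollinearity.csv is not shown here, the same effect can be reproduced on made-up collinear data; X_demo and all constants below are my own illustration, not the original dataset:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.RandomState(0)
base = rng.normal(size=(100, 2))  # two underlying factors
X_demo = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=100),  # near-copy of the first column
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=100),  # near-copy of the third column
    rng.normal(size=100),                     # one independent column
])

pca = PCA()
pca.fit(scale(X_demo))
cumvar = np.cumsum(pca.explained_variance_ratio_) * 100
print(np.round(cumvar, 2))  # first two components dominate, by construction
```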
10-fold CV, with shuffle
n = len(X_reduced)
kf_10 = KFold(n_splits=10, shuffle=True, random_state=2)
regr = LinearRegression()
mse = []
Do one CV to get MSE for just the intercept (no principal components in regression)
score = -1 * cross_val_score(regr, np.ones((n, 1)), y, cv=kf_10, scoring='neg_mean_squared_error').mean()
mse.append(score)
Do CV for the 5 principal components, adding one component to the regression at a time
for i in np.arange(1, 6):
    score = -1 * cross_val_score(regr, X_reduced[:, :i], y, cv=kf_10, scoring='neg_mean_squared_error').mean()
    mse.append(score)
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,5))
ax1.plot(mse, '-v')
ax2.plot([1,2,3,4,5], mse[1:6], '-v')
ax2.set_title('Intercept excluded from plot')
for ax in fig.axes:
    ax.set_xlabel('Number of principal components in regression')
    ax.set_ylabel('MSE')
    ax.set_xlim((-0.2, 5.2))
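The whole PCR cross-validation can also be sketched end-to-end on synthetic data, so it runs without the CSV; X_demo, y_demo and the collinearity construction are assumptions for illustration. It also shows picking the component count with the lowest CV MSE programmatically:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import scale

rng = np.random.RandomState(2)
X_demo = rng.normal(size=(100, 5))
X_demo[:, 3] = X_demo[:, 0] + 0.05 * rng.normal(size=100)  # induce collinearity
y_demo = X_demo[:, 0] - X_demo[:, 1] + 0.1 * rng.normal(size=100)

X_red = PCA().fit_transform(scale(X_demo))  # principal component scores
kf = KFold(n_splits=10, shuffle=True, random_state=2)
regr = LinearRegression()

# CV MSE for regressions on the first 1..5 principal components
mse = [-cross_val_score(regr, X_red[:, :i], y_demo, cv=kf,
                        scoring='neg_mean_squared_error').mean()
       for i in range(1, 6)]
best = int(np.argmin(mse)) + 1  # component count with the lowest CV MSE
print(np.round(mse, 3), best)
```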
Scikit-learn PLS regression
mse = []
kf_10 = KFold(n_splits=10, shuffle=True, random_state=2)
for i in np.arange(1, 6):
    pls = PLSRegression(n_components=i, scale=False)
    score = cross_val_score(pls, X_reduced, y, cv=kf_10, scoring='neg_mean_squared_error').mean()
    mse.append(-score)
plt.plot(np.arange(1, 6), np.array(mse), '-v')
plt.xlabel('Number of principal components in PLS regression')
plt.ylabel('MSE')
plt.xlim((-0.2, 5.2))
What is tuned_parameters?
If I train SVC with default parameters on your dataset, it works fine with 61% accuracy, predicting both classes.
model.predict(X_test) with the model trained with your parameters outputs both 0 and 1 for me, with 98% accuracy:
model = svm.SVC(kernel='rbf', C=10, gamma=10)
model.fit(X_train, y_train)
print(model.predict(X_test))
print(model.score(X_test, y_test))
So the question is, how do you check what the model outputs on your test set? You may have a mistake there.
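A quick way to check is to look at which labels the model actually emits on the test set, e.g. with np.unique. A sketch with synthetic data (make_classification stands in for your dataset):

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's data.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = svm.SVC(kernel='rbf', C=10, gamma=10)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print(np.unique(pred, return_counts=True))  # which labels are actually emitted?
print(model.score(X_test, y_test))
```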
Best Answer
Regularization encompasses techniques aimed at restricting model complexity. The Support Vector Machine is usually $\ell_2$-regularized, except for the intercept term: the cost function is amended with the penalty $\|w\|_2^2=\sum_{i=1}^p w_i^2$, which shrinks the coefficients towards zero.
As you can see, the regularization penalty depends on the magnitude of the coefficients, which in turn depends on the scale of the features themselves. So there you have it: when you change the scale of the features, you also change the scale of the coefficients, which are then penalized differently, resulting in diverging solutions.
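To see this concretely, one can fit the same linear-kernel SVM on raw and rescaled copies of the data and compare coefficients; a minimal sketch on made-up data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Same estimator, same data; only the feature scale differs.
svc_raw = SVC(kernel='linear', C=1.0).fit(X, y)
svc_scaled = SVC(kernel='linear', C=1.0).fit(X * 100, y)

# Map the scaled-data solution back to the original feature scale.
# Without the penalty the two would coincide (w -> w/100 compensates
# exactly in the hinge loss); with it, they do not.
print(svc_raw.coef_)
print(svc_scaled.coef_ * 100)
```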