Solved – Partial Least Squares Using Python – Understanding Predictions

Tags: partial-least-squares, python, regression, scikit-learn

I am having trouble constructing and applying a regression equation from a PLS model so that my manual predictions match the values returned by the model.predict() method.

I'm downloading and using the example data set from here: https://openmv.net/info/blender-efficiency
Here's some code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import math
from sklearn.cross_decomposition import PLSRegression as PLSR
from scipy.stats import linregress

#Load Data (path points to the downloaded blender-efficiency CSV)
df = pd.read_csv(path)
Y = df[['BlendingEfficiency']].values
X = df[['ParticleSize', 'MixerDiameter', 'MixerRotation', 'BlendingTime']].values

#Model
plsModel = PLSR(n_components=2, scale=True)
plsModel.fit(X,Y)
yPred = plsModel.predict(X, copy=True)


#Plotting Model Results:
linearmodel = linregress(Y[:,0], yPred[:,0])
modelY = [linearmodel.slope*x + linearmodel.intercept for x in Y[:,0]]
residuals = modelY-Y[:,0]

fig,ax = plt.subplots(2,2, figsize=(15,10))
ax[0,0].scatter(Y[:,0], yPred[:,0], marker="o")
ax[0,0].plot(Y[:,0], modelY, color="grey", linestyle="-")
ax[0,1].scatter(Y[:,0], modelY-Y[:,0])
ax[0,0].set_title("Model")
ax[0,0].set_ylabel("Predicted Y")
ax[0,0].set_xlabel("Observed Y")
ax[0,1].set_title("Residuals")
ax[0,1].set_xlabel("Observed Y")
ax[1,0].plot(Y[:,0])
ax[1,0].plot(modelY, color="red")
ax[1,0].set_title("Y Run Chart")
ax[1,1].plot(residuals)
text = ax[1,1].set_title("Residuals Run Chart")

[Figure: model results — predicted vs. observed, residuals, and run charts]

#Attempt to do a manual prediction with a regression equation:
y_intercept = plsModel.y_mean_ - np.dot(plsModel.x_mean_, plsModel.coef_)
y2 = np.dot(X[2,:], plsModel.coef_[:,0]) + y_intercept
print("Value from model.predict() = " + str(yPred[2][0]))
print("Value from constructed regression equation = " + str(y2))

Results:

Value from model.predict() = 88.66871049240711

Value from constructed regression equation = 105.14650668685694

Questions:

(1) Why don't I get the same result for y2 as the model.predict() method when I try to use the y-intercept and regression coefficients from the model? What am I doing wrong?

(2) Why do the model residuals look so 'weird'? The Residual vs Y is an almost-perfect linear relationship, and in the Residuals Run Chart, the shape of the Residuals is the same as the Y values reflected around the x-axis (which you can see if you plot the residuals*-1).

Update:

I'm still trying to figure this out, but I've been able to get the regression equation to work, though only if I standardize the data myself (zero mean, unit standard deviation) before fitting, pass scale=False to the model constructor, apply the regression equation, and then rescale the predictions back to the original units.

My original (perhaps mistaken) understanding was that the model handles the scaling and rescaling internally. Perhaps that is only true when calling the model.fit() and model.predict() methods, and the model.coef_ coefficients are expressed in the scaled/standardized variable "space"?

Best Answer

You are correct. You need to scale the independent variables of the data to be predicted using the standard deviation and mean obtained from the training set. Similarly, you need to back-scale the prediction and add back the mean of the dependent variable to obtain a final prediction that matches the predict() function. Alternatively, with a few algebraic steps, you can fold the centering and scaling into the coefficients and intercept.

The residuals plot looks wrong because the predictions it is built on are wrong: they do not account for the centering and scaling.
