Solved – Replicating a linear regression example from Hastie, Tibshirani and Friedman

python, regression

One of the running examples in the book The Elements of Statistical Learning by Hastie, Tibshirani and Friedman (hereafter referred to as HTF) uses prostate cancer data (available here and described here) from a study by Stamey et al. (1989).

I'm trying to reproduce the results reported in the second edition of HTF for ordinary linear regression, where they give the following coefficients (Table 3.2, page 50):

Intercept:  2.45
lcavol   :  0.72
lweight  :  0.29
age      : -0.14
lbph     :  0.21
svi      :  0.31
lcp      : -0.29
gleason  : -0.02
pgg45    :  0.28

However, my own analysis gives woefully different numbers, whether I use scikit-learn's LinearRegression, statsmodels' OLS, or compute the coefficients manually via formula 3.6 in HTF. I get

Intercept:  0.429
lcavol   :  0.577
lweight  :  0.614
age      : -0.019
lbph     :  0.145
svi      :  0.737
lcp      : -0.206
gleason  : -0.030
pgg45    :  0.009

using the following (Python 3) code:

# Import data
# -----------
import pandas as pd

# skip the leading row-index column of prostate.data
df = pd.read_csv('prostate.data', sep='\t', usecols=range(1, 11))

# Extract training set used by HTF
# --------------------------------
train_mask = (df['train'] == 'T')
cols_X = [col for col in df.columns if col not in ['lpsa', 'train']]
cols_y = ['lpsa']

X_train = df.loc[train_mask, cols_X]
y_train = df.loc[train_mask, cols_y]

# Use scikit-learn's LinearRegression
# -----------------------------------
from sklearn import linear_model

# NB: normalize=True (removed in scikit-learn 1.2) scales the features
# internally but reports coef_ back on the original scale, so it is not
# equivalent to standardizing the predictors yourself
ols = linear_model.LinearRegression(normalize=True)
ols.fit(X_train, y_train)

print('Intercept: {:6.3f}'.format(ols.intercept_[0]))
for i, col in enumerate(X_train.columns):
    print('{:9s}: {:6.3f}'.format(col, ols.coef_[0, i]))

# Use statsmodels OLS
# -------------------
import statsmodels.api as sm

result = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(result.summary())

# Use formula 3.6 of HTF: beta_hat = (X^T X)^{-1} X^T y
# -----------------------------------------------------
import numpy as np

X_ext = np.hstack([np.ones((X_train.shape[0], 1)), X_train.values])
beta_hat = np.linalg.inv(X_ext.T @ X_ext) @ X_ext.T @ y_train.values
print(beta_hat)

Am I doing something wrong, or are the numbers reported in the book wrong?

EDIT: Redefining

X_train = (X_train - X_train.mean(axis=0)) / (X_train.std(axis=0))

before any of the fits leads to coefficients that are consistent with those reported in HTF:

Intercept:  2.4523
lcavol   :  0.7164
lweight  :  0.2926
age      : -0.1425
lbph     :  0.2120
svi      :  0.3096
lcp      : -0.2890
gleason  : -0.0209
pgg45    :  0.2773
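
For reference, the full standardized refit is only a few lines. This is a minimal sketch continuing from the code above; note that pandas' .std() uses the sample standard deviation (ddof=1):

# Standardize the predictors to zero mean and unit variance, then refit
X_std = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

ols_std = linear_model.LinearRegression()
ols_std.fit(X_std, y_train)

print('Intercept: {:7.4f}'.format(ols_std.intercept_[0]))
for i, col in enumerate(X_std.columns):
    print('{:9s}: {:7.4f}'.format(col, ols_std.coef_[0, i]))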

The book would probably benefit from making this clearer; I've seen other people confused by the same issue. Thanks to all who responded!

Best Answer

As they say in the text:

We fit a linear model to the log of prostate-specific antigen, lpsa, after first standardizing the predictors to have unit variance. We randomly split the dataset into a training set of size 67 and a test set of size 30. We applied least squares estimation to the training set, producing the estimates, standard errors and Z-scores shown in Table 3.2.

You have not standardized the predictors.
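
If it helps, here is a minimal sketch of one way to do that, reusing X_train and y_train from your question. Note that StandardScaler divides by the population standard deviation (ddof=0), whereas pandas' .std() uses ddof=1, so with n=67 the coefficients will differ very slightly:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Scale each predictor to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Fit ordinary least squares on the standardized predictors
ols = LinearRegression()
ols.fit(X_scaled, y_train)

print('Intercept: {:7.4f}'.format(ols.intercept_[0]))
for col, coef in zip(X_train.columns, ols.coef_[0]):
    print('{:9s}: {:7.4f}'.format(col, coef))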
