One of the running examples in the book The Elements of Statistical Learning by Hastie, Tibshirani and Friedman (hereafter referred to as HTF) uses prostate cancer data (available here and described here) from a study by Stamey et al. (1989).
I'm trying to reproduce the results quoted in the (second edition of) HTF for ordinary linear regression, for which they report the following coefficients (table 3.2, page 50):
Intercept: 2.45
lcavol : 0.72
lweight : 0.29
age : -0.14
lbph : 0.21
svi : 0.31
lcp : -0.29
gleason : -0.02
pgg45 : 0.28
However, my own analysis gives woefully different numbers, whether I use scikit-learn's LinearRegression, statsmodels' OLS, or compute the coefficients manually via formula 3.6 in HTF, beta-hat = (X^T X)^{-1} X^T y. I get
Intercept: 0.429
lcavol : 0.577
lweight : 0.614
age : -0.019
lbph : 0.145
svi : 0.737
lcp : -0.206
gleason : -0.030
pgg45 : 0.009
using the following (Python 3) code:
# Import data
# -----------
import pandas as pd
df = pd.read_csv('prostate.data', sep='\t', usecols=range(1, 11))
# Extract training set used by HTF
# --------------------------------
train_mask = (df['train'] == 'T')
cols_X = [col for col in df.columns if col not in ['lpsa', 'train']]
cols_y = ['lpsa']
X_train = df.loc[train_mask, cols_X]
y_train = df.loc[train_mask, cols_y]
# Use scikit-learn's LinearRegression
# -----------------------------------
from sklearn import linear_model
ols = linear_model.LinearRegression()  # note: the old normalize=True option rescaled by L2 norm (not unit variance) and has been removed in recent scikit-learn
ols.fit(X_train, y_train)
print('Intercept: {:6.3f}'.format(ols.intercept_[0]))
for i, col in enumerate(X_train.columns):
print('{:9s}: {:6.3f}'.format(col, ols.coef_[0, i]))
# Use statsmodels OLS
# -------------------
import statsmodels.api as sm
result = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(result.summary())
# Use formula 3.6 of HTF: beta-hat = (X^T X)^{-1} X^T y
# ----------------------
import numpy as np
X_ext = np.hstack([np.ones((X_train.shape[0], 1)), X_train.values])
print(np.matmul(np.matmul(np.linalg.inv(np.matmul(X_ext.T, X_ext)), X_ext.T), y_train))
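As an aside, explicitly inverting X^T X as in formula 3.6 can be numerically fragile; `np.linalg.lstsq` solves the same least-squares problem more stably. A minimal sketch on synthetic data (the variable names and numbers here are illustrative, not from the prostate dataset):

```python
import numpy as np

# Synthetic design matrix and noiseless response with known coefficients
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5, 3.0])   # intercept first
X_ext = np.hstack([np.ones((n, 1)), X])       # prepend intercept column
y = X_ext @ beta_true

# Formula 3.6: beta-hat = (X^T X)^{-1} X^T y
beta_normal_eq = np.linalg.inv(X_ext.T @ X_ext) @ X_ext.T @ y

# Equivalent but more stable: solve the least-squares problem directly
beta_lstsq, *_ = np.linalg.lstsq(X_ext, y, rcond=None)
```

Both approaches recover the same coefficients here; the `lstsq` route avoids forming and inverting X^T X, which matters when the predictors are nearly collinear.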
Am I doing something wrong, or are the numbers reported in the book wrong?
EDIT: Redefining
X_train = (X_train - X_train.mean(axis=0)) / (X_train.std(axis=0))
before any of the fits leads to coefficients that are consistent with those reported in HTF:
Intercept: 2.4523
lcavol: 0.7164
lweight: 0.2926
age: -0.1425
lbph: 0.2120
svi: 0.3096
lcp: -0.2890
gleason: -0.0209
pgg45: 0.2773
The book would probably benefit from making this clearer (I've seen other people confused by the same issue). Thanks to all who responded!
Best Answer
As they say in the text, the predictors were standardized to have unit variance before the model was fit. You have not standardized the predictors.
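To see why standardizing changes the slopes the way it does: dividing a predictor by its standard deviation multiplies its fitted coefficient by that same standard deviation (centering only shifts the intercept). A small self-contained check on synthetic data (nothing here comes from the prostate dataset; `ols_fit` is a throwaway helper for this sketch):

```python
import numpy as np

# Two predictors on deliberately different scales, plus a noisy linear response
rng = np.random.default_rng(42)
n = 200
X = rng.normal(loc=5.0, scale=[1.0, 3.0], size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

def ols_fit(X, y):
    """Least-squares fit with an intercept column; returns (intercept, slopes)."""
    X_ext = np.hstack([np.ones((len(X), 1)), X])
    beta, *_ = np.linalg.lstsq(X_ext, y, rcond=None)
    return beta[0], beta[1:]

# Fit on raw predictors, then on standardized predictors
_, slopes_raw = ols_fit(X, y)
sd = X.std(axis=0)
X_std = (X - X.mean(axis=0)) / sd
_, slopes_std = ols_fit(X_std, y)

# Each standardized slope equals the raw slope times that predictor's std
```

This is why the raw-scale coefficients in the question differ from Table 3.2 by exactly each predictor's standard deviation, while the fit itself (predictions, residuals) is unchanged.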