Solved – Regression produces a high coefficient of determination, but also a high MSE

mse, r-squared, regression, regression-coefficients

I've run several regression models on a dataset (the SEER cancer dataset). I'm trying to use regression to estimate how many months a cancer patient can expect to live. Each record consists of around 20 features, such as tumor size, race, and so on. The training data includes the number of months each patient survived, which I use as the target to train the regressor.

I split the dataset into training and held-out test sets, and I'm observing something interesting: I'm getting a high, satisfactory $R^2$ (scikit-learn's score function), but also a high MSE. My code (for one of the algorithms I used) is below:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score

alldata = pd.read_csv('alldata.csv')

# Use every column except the two outcome columns as features
cols = [col for col in alldata.columns if col not in ['Survival months', 'Survived']]

X = alldata[cols].values
y = alldata["Survival months"].values

Xr, Xt, yr, yt = train_test_split(X, y, random_state=6131997)
rfr = RandomForestRegressor(n_estimators=2000, oob_score=True, n_jobs=-1)
rfr = rfr.fit(Xr, yr)
ypred = rfr.predict(Xt)
acc = rfr.score(Xt, yt)
print(mean_squared_error(yt, ypred))  # MSE on the held-out test set
print(rfr.oob_score_)                 # out-of-bag R^2
print(acc)                            # R^2 on the held-out test set
scores = cross_val_score(rfr, Xr, yr, cv=10)
print(scores)                         # 10-fold cross-validation R^2 scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

The output:

76.9924308448 (MSE)
0.899907201044 (OOB score)
0.894280365688 (score / R^2 / coefficient of determination)
[ 0.9027185   0.89860441  0.90686577  0.90264802  0.90139131  0.90345359
  0.89532146  0.89681607  0.90167825  0.89130112] (10-fold cross-validation scores)
Accuracy: 0.90 (+/- 0.01) (mean of the cross-validation scores, plus or minus two standard deviations)

How can the regression model correlate with the data 90% of the time, yet have such a high MSE? That works out to around 8.8 months of error on average per prediction. I'm quite new to regression, so I may be interpreting the MSE incorrectly.

Could you give your opinion on my results: are they satisfactory or not, by what metric, and why that metric? The data includes patients with survival times ranging from 0 months to 70 months and beyond; 8.8 months seems like a significant amount to be off.

Best Answer

Setting aside the ridiculous default display format of your software, the mean square error (MSE) reported for this regression is about 77 squared months.

You quite rightly work with its square root, about 8.8 months, which we can call the root MSE or RMSE. Several other names exist. Regardless of terminology, I would assert that the MSE is usually no help at all in thinking about a model, while the RMSE can be helpful, the difference lying in the use of familiar units of measurement. But even after taking the square root, a few large errors or residuals can pull up either measure, and that is consistent with many of the errors being quite small.
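As a concrete sketch (not part of the original code), the RMSE could be computed directly from the test-set predictions, assuming the yt and ypred arrays from the question are still in scope; the median absolute error line is an extra, assumed diagnostic that is far less sensitive to a handful of large errors:

import numpy as np
from sklearn.metrics import mean_squared_error, median_absolute_error

# Assumes yt (observed survival months) and ypred (predictions) from the
# question's code are available.
mse = mean_squared_error(yt, ypred)       # about 77 squared months here
rmse = np.sqrt(mse)                       # about 8.8 months, in the original units
medae = median_absolute_error(yt, ypred)  # robust to a few very large residuals

print("MSE:  %.1f squared months" % mse)
print("RMSE: %.1f months" % rmse)
print("Median absolute error: %.1f months" % medae)

If the median absolute error turns out to be well below the RMSE, that is a strong hint that a few large residuals are doing most of the work of pulling the MSE and RMSE up.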

Without seeing your data I am not at all surprised by that result for either measure. In this kind of data, the uncertainty of predictions is high even given good information on the characteristics of the patients. Everyone has heard stories of people given 6 months to survive but lasting 2 years, and so forth. That's usually a matter of expert guesses being uncertain rather than a regression model leaving a considerable degree of scatter, but there is a link.

In broad terms, $R^2$ and RMSE answer different questions and are on quite different scales: $R^2$ is a unitless fraction of variance accounted for, while the RMSE is an absolute measure of error in the original units. If they appear to conflict, that can always be resolved by looking at the complete set of errors or residuals. Even with several predictors in a regression model, you can always inspect one or more of a histogram of residuals, a plot of residuals versus predicted values, or a plot of observed versus predicted values.
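One possible way to produce those three diagnostics is sketched below, again assuming yt and ypred from the question's code and that matplotlib is installed; the layout and styling are just one reasonable choice.

import matplotlib.pyplot as plt

# Assumes yt (observed survival months) and ypred (predicted values)
# from the question's code are in scope.
residuals = yt - ypred

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Histogram of residuals: are most errors small, with a long tail?
axes[0].hist(residuals, bins=50)
axes[0].set_xlabel("residual (months)")
axes[0].set_title("Histogram of residuals")

# 2. Residuals versus predicted values: look for systematic patterns.
axes[1].scatter(ypred, residuals, s=5, alpha=0.3)
axes[1].axhline(0, color="k", linewidth=1)
axes[1].set_xlabel("predicted survival (months)")
axes[1].set_ylabel("residual (months)")
axes[1].set_title("Residuals vs. predicted")

# 3. Observed versus predicted: points near the 45-degree line are well fit.
axes[2].scatter(ypred, yt, s=5, alpha=0.3)
lims = [min(yt.min(), ypred.min()), max(yt.max(), ypred.max())]
axes[2].plot(lims, lims, "k--", linewidth=1)
axes[2].set_xlabel("predicted survival (months)")
axes[2].set_ylabel("observed survival (months)")
axes[2].set_title("Observed vs. predicted")

plt.tight_layout()
plt.show()

Either of the last two plots makes it immediately visible whether the roughly 8.8-month RMSE is driven by a few extreme residuals or by errors that are moderately large across the board.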

An aside on two minor points of language

Even when used informally, I would advise strongly against wording such as

  1. "the regression model correlate[s] with the data 90% of the time"

It is in no sense helpful, or even meaningful, to regard $R^2$ as the fraction of the time, or even of the data, that the model shows a correlation. Correlation is a property of a dataset, not of individual observations. Moreover, correlation is a matter of degree, not a yes-or-no property.

  1. "a significant amount to be off"

The word "significant" is over-loaded in discussions of statistical applications. There are good words like "big", "large", "notable", "major" to be used when your meaning is informal. Reserve the term "significant" for discussing the results of a formal significance test.

P.S. In 45 years or so of using statistics, I don't think I have ever seen $R^2$ reported to as many as 12 decimal places. Most experienced researchers wouldn't use more than 3.