Binary Data – How to Calculate the Coefficient of Determination for Binary Responses

binary data, r-squared

D.R. Cox and Nanny Wermuth seem to suggest that the coefficient of determination (R squared) is misleading when you have binary responses. If I understand them correctly, they are saying that its maximum value is 0.36.

Does this also apply to OLS models in which some key significant independent variables are binary?

Further discussion and insights on the topic would be appreciated.

Best Answer

The coefficient of determination can be calculated in various ways which coincide for linear regression. (If that's not true, then something in the comparison is not the coefficient of determination.) Away from that case, it gets messier. Various analogues or alternatives are often (but not always) labelled "pseudo-R-squared". Watch out also for adjusted relatives, which penalise the use of several predictors.
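As a minimal sketch of the "various ways which coincide" remark, the following NumPy snippet fits an ordinary least squares line with an intercept and computes R-squared three ways: from the residual sum of squares, from the explained sum of squares, and as the squared correlation between observed and fitted values. The simulated data and all variable names are assumptions made purely for illustration.

```python
import numpy as np

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# OLS fit with an intercept column.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

sst = np.sum((y - y.mean()) ** 2)  # total sum of squares

# Three definitions that agree for OLS with an intercept.
r2_resid = 1.0 - np.sum(resid ** 2) / sst
r2_explained = np.sum((fitted - y.mean()) ** 2) / sst
r2_corr = np.corrcoef(y, fitted)[0, 1] ** 2

print(r2_resid, r2_explained, r2_corr)
```

The agreement relies on the model being linear, fitted by least squares, and including an intercept. Drop any of those and the three quantities can differ.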

I have found the paper of Zheng and Agresti (2000) to be helpful in this territory.

Zheng and Agresti (2000) discussed the correlation between the response and the fitted or predicted response as a general measure of predictive power for generalized linear models (GLMs). This measure has the advantages of referring to the original scale of measurement, of being applicable to all types of GLM and of being familiar to many users of statistics. Preferably, it should be used as a comparative measure for different models applied to the same data set, given that restrictions on values of the response may imply limitations on its value (see e.g. Cox and Wermuth, 1992).

For an arbitrary GLM, this correlation is invariant under a location-scale transformation and it is the positive square root of the average proportion of variance explained by the predictors. However, again for an arbitrary GLM, it need not equal the positive square root of other definitions of R-square (e.g. Hardin and Hilbe, 2001); and it need not be monotone increasing in the complexity of the predictors, although in practice that is common. The correlation is necessarily sensitive to outliers.
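To make the measure concrete, here is a hedged sketch that fits a logistic regression by Newton-Raphson in plain NumPy and then correlates the binary response with the fitted probabilities. The helper `fit_logistic`, the simulated data, and the chosen coefficients are all assumptions for illustration; in practice any GLM routine would supply the fitted values.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson fit of a logistic regression.

    X must already contain an intercept column. Hypothetical helper
    written for this illustration, with no safeguards for separation.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)                # variance weights
        grad = X.T @ (y - p)             # score vector
        hess = X.T @ (X * w[:, None])    # information matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Simulated binary response (an assumption for illustration only).
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x)))
y = rng.binomial(1, p_true).astype(float)

X = np.column_stack([np.ones(n), x])
beta_hat = fit_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))

# Correlation between the response and the fitted response,
# measured on the original (probability) scale.
r = np.corrcoef(y, p_hat)[0, 1]
print(r, r ** 2)
```

Because the response takes only the values 0 and 1 while the fitted values lie strictly between them, this correlation cannot reach 1 here, which is one reason to use it mainly for comparing models on the same data.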

As the predicted is a function of the observed, the correlation calculated from a sample may be expected to be biased upwards. A jackknifed correlation is recommended as one alternative. Zheng and Agresti provide more discussion of this point, including other estimators and a bootstrap approach to providing confidence intervals for the correlation and to estimating the degree of overfitting.
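One way to see the upward-bias point is to compare the in-sample correlation with one built from leave-one-out predictions, where each observation is predicted from a model fitted without it. The sketch below does this for a logistic regression. It is a simple leave-one-out illustration and not necessarily the exact jackknife estimator that Zheng and Agresti define; `fit_logistic` and the simulated data are assumptions.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson logistic fit; X includes the intercept column.

    Hypothetical helper for this illustration only.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        beta = beta + np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
    return beta

def predict(X, beta):
    """Fitted probabilities on the original scale."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * x)))).astype(float)
X = np.column_stack([np.ones(n), x])

# In-sample correlation: predictions depend on the same y they are compared to.
r_in_sample = np.corrcoef(y, predict(X, fit_logistic(X, y)))[0, 1]

# Leave-one-out predictions: observation i plays no part in its own prediction.
p_loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = fit_logistic(X[keep], y[keep])
    p_loo[i] = predict(X[i:i + 1], beta_i)[0]

r_loo = np.corrcoef(y, p_loo)[0, 1]
print(r_in_sample, r_loo)
```

The gap between the two printed values gives a rough sense of how much the in-sample figure owes to overfitting; with one predictor and a few hundred observations it would usually be small.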

Cox, D.R. and N. Wermuth. 1992. A comment on the coefficient of determination for binary responses. American Statistician 46: 1-4.

Hardin, J. and J. Hilbe. 2001 (and later editions). Generalized Linear Models and Extensions. College Station, TX: Stata Press.

Zheng, B. and A. Agresti. 2000. Summarizing the predictive power of a generalized linear model. Statistics in Medicine 19: 1771-1781.

Note: The use of binary predictors need not limit $R^2$ in linear regression. A simple example is the use of one continuous predictor and one binary predictor. As in principle the data could all lie on two straight lines, a value of 1 is achievable.
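The two-lines example can be checked directly. In this sketch the response lies exactly on two parallel lines, one for each level of a binary predictor, and an OLS fit on the continuous and binary predictors recovers $R^2 = 1$ up to floating-point error. The particular intercept, slope, and shift are arbitrary values chosen for illustration.

```python
import numpy as np

# One continuous predictor and one binary predictor.
x = np.linspace(0.0, 10.0, 50)
d = (np.arange(50) % 2).astype(float)  # alternating 0/1 indicator

# Noise-free response on two parallel lines (coefficients are arbitrary).
y = 1.0 + 0.5 * x + 3.0 * d

# OLS with intercept, continuous predictor, and binary predictor.
X = np.column_stack([np.ones_like(x), x, d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
print(beta, r2)
```

The ceiling discussed by Cox and Wermuth concerns a binary response, not binary predictors, which is why nothing of the kind appears here.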
