Solved – the intuitive explanation and derivation of Mallows’s $C_p$ statistic (model selection)

intuition · model selection · regression

$$C_p = \frac{SS_E(p)}{\hat{\sigma}^2} - n + 2p $$

This is the formula for Mallows's $C_p$ statistic, where $SS_E(p)$ is the residual sum of squares of the fitted model, $\hat\sigma^2$ is the estimated variance of the residuals, $n$ is the number of observations, and $p$ is the number of predictors in the model.
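To make the formula concrete, here is a minimal NumPy sketch of computing $C_p$ for a candidate model. The data, variable names, and helper functions are illustrative, not from the post; $\hat\sigma^2$ is taken from the full model, which is the usual convention.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 4                          # observations, candidate predictors
X_full = rng.normal(size=(n, k))
y = 1.0 + 2.0 * X_full[:, 0] - 1.5 * X_full[:, 1] + rng.normal(size=n)

def sse(X, y):
    """Residual sum of squares and parameter count for an OLS fit with intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return r @ r, Xd.shape[1]

# sigma^2 is estimated from the full model
sse_full, p_full = sse(X_full, y)
sigma2_hat = sse_full / (n - p_full)

def mallows_cp(X_sub, y):
    sse_p, p = sse(X_sub, y)
    return sse_p / sigma2_hat - n + 2 * p

# For an adequate model, C_p should be close to p; for the full model it is
# exactly p_full by construction.
print(mallows_cp(X_full[:, :2], y))
```

A useful sanity check on the formula: plugging the full model into $C_p$ gives $(n - p) - n + 2p = p$ exactly, since $\hat\sigma^2$ is defined from that same fit.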

Questions: How can we understand and explain this statistic intuitively? Why, intuitively, does it take this particular form?

And what is the interpretation of the statistic? What does its output value mean, and how should we read the result?

Best Answer

Suppose we are working with a linear model $Y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal N(0, \sigma^2 I)$. Then, up to an additive constant, the log-likelihood $l$ of $\beta$ satisfies $$ -2 \, l(\beta, \sigma^2 \mid Y) = \frac{1}{\sigma^2}\| Y - X\beta\|^2. $$
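For completeness, the "up to a constant" step is just the standard Gaussian log-likelihood algebra:

$$
\begin{aligned}
l(\beta, \sigma^2 \mid Y) &= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\,\|Y - X\beta\|^2, \\
-2\, l(\beta, \sigma^2 \mid Y) &= n\log(2\pi\sigma^2) + \frac{1}{\sigma^2}\,\|Y - X\beta\|^2,
\end{aligned}
$$

and with $\sigma^2$ held fixed the first term does not depend on $\beta$, so it can be absorbed into the constant.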

Recall the definition of the AIC: $$ AIC(\hat \beta, \hat \sigma^2) = -2\, l(\hat \beta, \hat \sigma^2 \mid Y) + 2 p $$ where $p$ is the dimension of our model.

We have that $$ C_p = \frac{1}{\hat \sigma^2} ||Y - X\hat \beta||^2 + 2p - n $$

so we can see that the AIC and $C_p$ differ only by a constant, and therefore their respective argminima are the same. As @DJohnson mentioned in the comments, $C_p$ is only ever really used for variable selection, i.e. we care about its argmin rather than its actual value. This means that (for this particular model, at least) we can interpret the argminima of $C_p$ in terms of the argminima of the AIC, and there's a whole body of work on that. See here or here, for instance.
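The "differ only by a constant" claim is easy to check numerically. Below is a small sketch (illustrative names and simulated data, not from the post) that enumerates all predictor subsets, scores each with both $C_p$ and the plug-in AIC using the same fixed $\hat\sigma^2$, and confirms both criteria select the same subset:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, k = 60, 4
X_full = rng.normal(size=(n, k))
y = 0.5 + 2.0 * X_full[:, 0] - X_full[:, 2] + rng.normal(size=n)

def fit_sse(cols):
    """SSE and parameter count for the OLS fit on the given predictor columns."""
    Xd = np.column_stack([np.ones(n), X_full[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return r @ r, Xd.shape[1]

# sigma^2 fixed at the full-model estimate for every candidate subset
sse_full, p_full = fit_sse(range(k))
sigma2 = sse_full / (n - p_full)

subsets = [c for r in range(1, k + 1) for c in itertools.combinations(range(k), r)]
cp  = {c: fit_sse(c)[0] / sigma2 - n + 2 * fit_sse(c)[1] for c in subsets}
# plug-in AIC: -2 * log-likelihood at (beta-hat, sigma2) plus 2p
aic = {c: n * np.log(2 * np.pi * sigma2) + fit_sse(c)[0] / sigma2 + 2 * fit_sse(c)[1]
       for c in subsets}

best_cp  = min(cp,  key=cp.get)
best_aic = min(aic, key=aic.get)
print(best_cp, best_aic)   # the same subset wins under both criteria
```

The per-subset gap $AIC - C_p = n\log(2\pi\hat\sigma^2) + n$ does not depend on the subset, which is exactly why the argminima coincide.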

In effect, I'm completely echoing DJohnson's comment that this isn't a particularly useful statistic and there's no point in wasting time trying to understand it by itself. I advocate framing it in terms of AIC, which is definitely worth understanding (even if you don't like it or use it), and putting your mental effort there (and on related *IC criteria like BIC, AICc, etc.).