Can someone please explain how the third line becomes the fourth line?
Introduction to Statistical Learning Eq. 4.32
linear-algebra, machine-learning, matrix
Related Solutions
I think that you are confusing the two residual sums of squares that you have. One RSS is used to estimate the $\hat{\sigma}^2$ in the formula, and this RSS is in some sense independent of the number of parameters $p$: $\hat{\sigma}^2$ should be estimated using all your covariates, giving you a baseline unit of error. The RSS in the formula for AIC would be better written $\text{RSS}_{p_i}$, meaning that it corresponds to model $i$ with $p$ parameters (there may be many models with $p$ parameters). So the RSS in the formula is calculated for a specific model, while the RSS behind $\hat{\sigma}^2$ comes from the full model.
This is also noted in the page before, where $\hat{\sigma}^2$ is introduced for $C_p$.
So the RSS in the formula for AIC is not independent of $p$; it is calculated for a given model. Introducing $\hat{\sigma}^2$ into all of this is just to have a baseline unit for the error, so that there is a "fair" comparison between the number of parameters and the reduction in error. You need to compare the number of parameters to something that is scaled w.r.t. the magnitude of the error.
If you did not scale the RSS by the baseline error, it might be that the RSS drops much more than the number of variables introduced, and thus you would become greedier about adding more variables. If you scale it to some unit, the comparison to the number of parameters is independent of the magnitude of the baseline error.
This is not the general way to calculate AIC, but in cases where it is possible to derive simpler versions of the formula, it essentially boils down to something similar to this.
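A minimal numerical sketch of the point above, assuming the ISL-style criterion $C_p = (\text{RSS}_{p_i} + 2d\hat{\sigma}^2)/n$: $\hat{\sigma}^2$ is estimated once from the full model (the baseline unit of error), while the RSS is recomputed per candidate model. The data and coefficients are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.0, 0.0, 0.0])  # only the first two covariates matter
y = X @ beta + rng.normal(size=n)

def rss(X_sub, y):
    """Residual sum of squares of an OLS fit (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X_sub])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return resid @ resid

# Baseline error scale: sigma^2 estimated from the FULL model.
rss_full = rss(X, y)
sigma2 = rss_full / (n - p - 1)

# Per-model RSS for the nested candidates with d = 1..p predictors,
# each penalized against the common baseline sigma2.
cp = {d: (rss(X[:, :d], y) + 2 * d * sigma2) / n for d in range(1, p + 1)}
best_d = min(cp, key=cp.get)
```

The RSS itself can only decrease as $d$ grows; it is the shared $\hat{\sigma}^2$ scale that makes the penalty comparable across models.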
For the first equation, it's the result of zero gradient; $$ \begin{aligned} S &= \sum_{j=1}^p (y_j-\beta_j)^2 +\lambda\sum_{j=1}^p\beta_j^2\\ \end{aligned} $$ at extrema, $$ \begin{aligned} \frac{\partial S}{\partial \beta_j} &=0\\ -2(y_j -\beta_j) +2\lambda\beta_j &= 0\\ \beta_j &= \frac{y_j}{1+\lambda}. \end{aligned} $$
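The closed form from the zero-gradient condition can be checked numerically. A minimal sketch, with arbitrary example values for $y$ and $\lambda$: the gradient should vanish at $\beta_j = y_j/(1+\lambda)$, and the objective should not decrease under small perturbations.

```python
import numpy as np

y = np.array([3.0, -1.5, 0.2])
lam = 0.7

def S(beta):
    # Objective from the derivation: squared error plus ridge penalty.
    return np.sum((y - beta) ** 2) + lam * np.sum(beta ** 2)

beta_star = y / (1 + lam)  # closed-form minimizer from the zero gradient

# Gradient of S at beta_star: -2(y - beta) + 2*lambda*beta.
grad = -2 * (y - beta_star) + 2 * lam * beta_star

# Objective values at randomly perturbed points near beta_star.
rng = np.random.default_rng(1)
perturbed = [S(beta_star + 0.01 * rng.normal(size=3)) for _ in range(100)]
```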
I think you should be able to derive the other expression using the same technique shown above and use the fact that $$ \vert \beta_j \vert = \begin{cases} \beta_j \ \text{if} \ \beta_j > 0\\ -\beta_j \ \text{if} \ \beta_j < 0\end{cases}. $$
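Carrying out that case analysis (a sketch, not shown in the answer above) gives the soft-thresholding solution $\beta_j = \operatorname{sign}(y_j)\,(|y_j| - \lambda/2)_+$ for the penalty $\lambda\lvert\beta_j\rvert$. The snippet below checks this closed form against a brute-force grid search over $\beta$, with arbitrary example values for $y_j$ and $\lambda$.

```python
import numpy as np

lam = 1.0

def soft_threshold(y_j, lam):
    """Claimed minimizer of (y_j - b)^2 + lam*|b| over b (soft thresholding)."""
    return np.sign(y_j) * max(abs(y_j) - lam / 2.0, 0.0)

def objective(b, y_j, lam):
    return (y_j - b) ** 2 + lam * abs(b)

grid = np.linspace(-5, 5, 200001)  # fine grid over b, step 5e-5
ys = [2.3, -0.8, 0.3, -0.1]
closed = [soft_threshold(y_j, lam) for y_j in ys]
brute = [grid[np.argmin(objective(grid, y_j, lam))] for y_j in ys]
```

Note the third and fourth cases: when $|y_j| \le \lambda/2$ the coefficient is set exactly to zero, which is what distinguishes the lasso solution from the ridge shrinkage above.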
Best Answer
It is an issue of expanding and tidying up. You have, for example,

$(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) = x^T\Sigma^{-1}x - 2x^T\Sigma^{-1}\mu_k + \mu_k^T\Sigma^{-1}\mu_k$,

using the symmetry of $\Sigma^{-1}$ so that $\mu_k^T\Sigma^{-1}x = x^T\Sigma^{-1}\mu_k$. Applying this to both quadratic terms,
$\log\left(\frac{\pi_k}{\pi_K}\right) -\frac12(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) +\frac12(x-\mu_K)^T\Sigma^{-1}(x-\mu_K)$
$= \log\left(\frac{\pi_k}{\pi_K}\right) -\frac12 x^T\Sigma^{-1}x+ x^T\Sigma^{-1}\mu_k- \frac12\mu_k^T\Sigma^{-1}\mu_k +\frac12 x^T\Sigma^{-1}x- x^T\Sigma^{-1}\mu_K+ \frac12\mu_K^T\Sigma^{-1}\mu_K$
$= \log\left(\frac{\pi_k}{\pi_K}\right) - \frac12(\mu_k^T\Sigma^{-1}\mu_k - \mu_K^T\Sigma^{-1}\mu_K) + x^T\Sigma^{-1}\mu_k- x^T\Sigma^{-1}\mu_K$

$= \log\left(\frac{\pi_k}{\pi_K}\right) - \frac12\left(\mu_k^T\Sigma^{-1}\mu_k -\mu_K^T\Sigma^{-1}\mu_K\right) + x^T\Sigma^{-1}(\mu_k- \mu_K)$
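The expansion above is easy to sanity-check numerically. A minimal sketch, with arbitrary example values for the priors and means and a random symmetric positive-definite matrix standing in for $\Sigma^{-1}$: both sides of the identity should agree.

```python
import numpy as np

rng = np.random.default_rng(42)
d = 4
A = rng.normal(size=(d, d))
Sigma_inv = A @ A.T + d * np.eye(d)  # symmetric positive definite; plays Sigma^{-1}
mu_k, mu_K, x = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
log_prior = np.log(0.3 / 0.5)        # example value of log(pi_k / pi_K)

# Third line: log prior ratio plus the two quadratic forms.
lhs = (log_prior
       - 0.5 * (x - mu_k) @ Sigma_inv @ (x - mu_k)
       + 0.5 * (x - mu_K) @ Sigma_inv @ (x - mu_K))

# Fourth line: the expanded and tidied form.
rhs = (log_prior
       - 0.5 * (mu_k @ Sigma_inv @ mu_k - mu_K @ Sigma_inv @ mu_K)
       + x @ Sigma_inv @ (mu_k - mu_K))
```

The cancellation of the $x^T\Sigma^{-1}x$ terms is what makes the LDA decision boundary linear in $x$.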