Solved – AIC, BIC and GCV: what is best for making decisions in penalized regression methods?

aic, bic, cross-validation, lasso, ridge regression

My general understanding is that AIC deals with the trade-off between the goodness of fit of the model and the complexity of the model.

$\mathrm{AIC} = 2k - 2\ln(L)$

$k$ = number of parameters in the model

$L$ = maximized value of the likelihood function

The Bayesian information criterion (BIC) is closely related to AIC: it replaces the $2k$ penalty with $\ln(n)\,k$, so for $n \ge 8$ the BIC penalizes the number of parameters more strongly than the AIC does. I can see these two have been used everywhere historically. But generalized cross-validation (GCV) is new to me. How does GCV relate to BIC or AIC? And how are these criteria, together or separately, used to select the penalty term in penalized regression such as ridge?
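From what I have read, for a linear smoother $\hat{y} = H_\lambda y$, GCV is defined as

$\mathrm{GCV}(\lambda) = \dfrac{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\left(1 - \mathrm{tr}(H_\lambda)/n\right)^2}$

where $\mathrm{tr}(H_\lambda)$ plays the role of the effective degrees of freedom, which makes GCV a rotation-invariant approximation to leave-one-out cross-validation.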

Edit:
Here is an example to think about and discuss:

    require(lasso2)   # for the Prostate data
    data(Prostate)
    require(rms)

    # Ordinary least squares fit that pentrace() will penalize
    ridgefits <- ols(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45,
           method = "qr", data = Prostate, se.fit = TRUE, x = TRUE, y = TRUE)

    # Trace AIC/BIC over a grid of ridge penalties
    p <- pentrace(ridgefits, seq(0, 1, by = .01))
    effective.df(ridgefits, p)
    out <- p$results.all

    par(mfrow = c(3, 2))
    plot(out$df, out$aic, col = "blue", type = "l", ylab = "AIC", xlab = "df")
    plot(out$df, out$bic, col = "green4", type = "l", ylab = "BIC", xlab = "df")
    plot(out$penalty, out$df, type = "l", col = "red",
         xlab = expression(lambda), ylab = "df")
    plot(out$penalty, out$aic, col = "blue", type = "l",
         ylab = "AIC", xlab = expression(lambda))
    plot(out$penalty, out$bic, col = "green4", type = "l", ylab = "BIC",
         xlab = expression(lambda))

    require(glmnet)
    y <- matrix(Prostate$lpsa, ncol = 1)
    x <- as.matrix(Prostate[, -length(Prostate)])   # drop the response column

    # 10-fold cross-validation for the lasso (alpha = 1)
    cv <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
    plot(cv$lambda, cv$cvm, col = "red", type = "l",
         ylab = "CVM", xlab = expression(lambda))

[Figure: AIC and BIC versus effective df, effective df versus $\lambda$, AIC and BIC versus $\lambda$, and mean cross-validated error (CVM) versus $\lambda$]
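As an aside, `cv.glmnet` also stores the cross-validation choices of the penalty directly, which gives an easy check against the plotted curve (using the `cv` object from the code above):

    cv$lambda.min   # lambda with the smallest mean cross-validated error
    cv$lambda.1se   # largest lambda within one standard error of that minimum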

Best Answer

I think of BIC as being preferred when there is a "true" low-dimensional model, which I think is never the case in empirical work. AIC is more in line with assuming that the more data we acquire, the more complex a model can be. In my experience, AIC computed on the effective degrees of freedom is a very good way to select the penalty parameter $\lambda$, because it is likely to optimize model performance in a new, independent sample.
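As a minimal sketch of that recommendation, using the `out <- p$results.all` object from the question's `pentrace` run (column names as used there):

    # Row of the penalty trace with the smallest effective-df AIC
    best <- out[which.min(out$aic), ]
    best$penalty   # AIC-optimal ridge penalty lambda
    best$df        # its effective degrees of freedom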
