Solved – Variable selection vs Model selection

Tags: aic, bic, feature-selection, model-selection

So I understand that variable selection is a part of model selection. But what exactly does model selection consist of? Is it more than the following:

1) choose a distribution for your model

2) choose explanatory variables?

I ask this because I am reading an article (Burnham & Anderson: AIC vs BIC) where they talk about AIC and BIC in model selection. Reading this article, I realize I have been thinking of 'model selection' as 'variable selection' (cf. the comments on Does BIC try to find a true model?).

An excerpt from the article, where they discuss 12 models of increasing "generality" that show "tapering effects" (their Figure 1) when KL information is plotted against the 12 models:

DIFFERENT PHILOSOPHIES AND TARGET MODELS

Despite that the target of BIC is a more general model than the target
model for AIC, the model most often selected here by BIC will be less
general than Model 7 unless n is very large. It might be Model 5 or 6.
It is known (from numerous papers and simulations in the literature)
that in the tapering-effects context (Figure 1), AIC performs better
than BIC. If this is the context of one’s real data analysis, then AIC
should be used.

I do not understand how BIC can ever choose a model more complex than the one AIC chooses! What specifically is "model selection", and when specifically does BIC choose a more "general" model than AIC?

If we are talking about variable selection, then BIC must surely always choose the model with the fewest variables, correct? For any $N \ge 8$, the $\ln(N)\,k$ term in BIC penalizes added variables more heavily than the $2k$ term in AIC. But is this not unreasonable when "the target of BIC is a more general model than the target model for AIC"?
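To see when each penalty dominates, one can compare the per-parameter penalties directly: AIC adds $2$ per parameter while BIC adds $\ln(N)$, so BIC's penalty is heavier exactly when $N \ge 8$ (since $\ln 8 \approx 2.08$). A minimal sketch (the sample sizes below are arbitrary examples):

```python
import math

# Per-parameter penalty: AIC adds 2, BIC adds ln(N).
# BIC's penalty exceeds AIC's exactly when ln(N) > 2, i.e. N >= 8.
for n in (5, 7, 8, 100, 10_000):
    aic_pen, bic_pen = 2.0, math.log(n)
    harsher = "BIC" if bic_pen > aic_pen else "AIC"
    print(f"N={n:>6}: AIC penalty/param = {aic_pen}, "
          f"BIC penalty/param = {bic_pen:.3f} -> {harsher} penalizes more")
```

So for all but very small samples, BIC does charge more per added variable, which is exactly why its seemingly "more general" target is puzzling here.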

EDIT:

The comments under Is there any reason to prefer the AIC or BIC over the other? contain a short exchange between @Michael Chernick and @user13273, which leads me to believe that this is not trivial:

I think it is more appropriate to call this discussion as "feature"
selection or "covariate" selection. To me, model selection is much
broader involving specification of the distribution of errors, form of
link function, and the form of covariates. When we talk about AIC/BIC,
we are typically in the situation where all aspects of model building
are fixed, except the selection of covariates. – user13273 Aug 13 '12
at 21:17

Deciding the specific covariates to include in a model does commonly
go by the term model selection and there are a number of books with
model selection in the title that are primarily deciding what model
covariates/parameters to include in the model. – Michael Chernick Aug
24 '12 at 14:44

Best Answer

Sometimes modelers separate variable selection into a distinct step in model development. For instance, they would first perform exploratory analysis and research the academic literature and industry practices, and then come up with a list of candidate variables. They would call this step variable selection.

Next, they'd run a bunch of different specifications with many different variable combinations, such as the OLS model $$y_i=\sum_{j_m} X_{ij_m}\beta_{j_m}+\varepsilon_i,$$ where $j_m$ denotes variable $j$ in model $m$. They'd pick the best model out of all models $m$, either manually or via an automated routine. These people would call this latter stage model selection.
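As a concrete sketch of this stage, one could fit every subset of a few candidate variables by OLS and rank the fits by $\mathrm{AIC}=n\ln(\mathrm{RSS}/n)+2k$ and $\mathrm{BIC}=n\ln(\mathrm{RSS}/n)+k\ln n$ (Gaussian errors, additive constants dropped). The simulated data and coefficients below are invented for illustration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 4))                       # candidate variables x0..x3
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)  # only x0, x1 matter

def ols_criteria(cols):
    # Design matrix: intercept plus the chosen candidate columns.
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = np.sum((y - Z @ beta) ** 2)
    k = Z.shape[1]                                # number of fitted coefficients
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + np.log(n) * k
    return aic, bic

# All 2^4 subsets of the candidate variables = 16 model specifications m.
models = [c for r in range(5) for c in itertools.combinations(range(4), r)]
best_aic = min(models, key=lambda c: ols_criteria(c)[0])
best_bic = min(models, key=lambda c: ols_criteria(c)[1])
print("AIC picks variables:", best_aic)
print("BIC picks variables:", best_bic)
```

With a decent sample size both criteria should keep the truly relevant variables; they can disagree on whether to admit spurious extras, since BIC's $k\ln n$ penalty is harsher.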

This is similar to how in machine learning people talk about feature engineering: coming up with the variables. You then plug the features into LASSO or a similar framework that builds a model from those features (variables). In this context it makes sense to separate variable selection into a distinct step, because you let the algorithm pick the right coefficients for the variables rather than eliminating any variables yourself. Your judgment (regarding which variables go into the model) is isolated in the variable selection step; the rest is up to the fitting algorithm.
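For illustration, here is a minimal LASSO fit by coordinate descent (a hand-rolled stand-in for a library implementation; the simulated data and the `lasso_cd` helper are my own invention). Features with no real effect get coefficients shrunk exactly to zero, so the algorithm, not the analyst, does the eliminating:

```python
import numpy as np

def soft_threshold(rho, lam):
    # Shrink toward zero; values within [-lam, lam] become exactly 0.
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent.
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]        # partial residual (x_j excluded)
            rho = X[:, j] @ r / n
            denom = X[:, j] @ X[:, j] / n
            b[j] = soft_threshold(rho, lam) / denom
    return b

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))                       # 5 candidate features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)
b = lasso_cd(X, y, lam=0.1)
print(np.round(b, 3))                             # b[2:] should be exactly 0
```

Note the trade-off: the fitted nonzero coefficients are slightly shrunk toward zero (by roughly `lam`), which is the price LASSO pays for doing selection and estimation in one step.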

In the context of the paper you cited, this is all irrelevant. The paper uses BIC or AIC to select between different model specifications. It doesn't matter whether variable selection was a separate step in this case. All that matters is which variables are in any particular model specification $m$; you then compare their BIC/AIC values to pick the best. These criteria account for sample size and the number of parameters.