Why can’t Bayesian variable selection be used with categorical variables with more than 2 levels?

bayesian, feature-selection, regression

I am reading this article, which presents one of the first approaches to Bayesian variable selection. The discussion section says that one of the major limitations of the method is that it cannot be used with a class variable that has more than two levels. Does anyone know why?

Best Answer

This is just how I always interpreted it, so I'd happily be corrected:

Their approach was proposed for the classic linear model $ y=\sum_j \beta_j x_j + \epsilon $, and they argue for putting a spike-and-slab prior on the $\beta_j$.

If we have a categorical variable (class variable) $C$ with more than 2 categories $c_k$, $k = 1, \dots, K$, $K>2$, we can include it in the linear model by creating $K-1$ dummy variables that contrast the $K-1$ non-reference categories with the reference category.
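
As a concrete illustration (not from the paper; the data and column names are invented), this is what such a dummy coding looks like in Python with pandas:

```python
# A categorical variable C with K = 3 levels ("a", "b", "c").
import pandas as pd

df = pd.DataFrame({"C": ["a", "b", "c", "a", "c"]})

# drop_first=True treats the first level ("a") as the reference category,
# leaving K - 1 = 2 dummy columns that contrast "b" and "c" with "a".
dummies = pd.get_dummies(df["C"], prefix="C", drop_first=True)
print(dummies)  # columns C_b and C_c, one row per observation
```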

Let $x_j$, $j = 1, \dots, p$, denote the metric variables and $d_s$, $s=1,\dots, K-1$, the dummy variables. Using just a single categorical variable, the model then becomes:

$ y=\sum_j \beta_j x_j + \sum_s\gamma_s d_s + \epsilon $

with $\beta_j$ denoting the coefficients of the metric variables and $\gamma_s$ the coefficients of the dummy variables (I use the separate symbols $\beta$ and $\gamma$ only to make the distinction explicit).

If we now set up a spike-and-slab prior for the $\beta$ and the $\gamma$, we have for the metric variables

$ P(\beta_j=0) = h_{0j} \\ P(\beta_j<b,\beta_j\neq0)=(b+f_j)h_{1j}\\ P(|\beta_j|>f_j)=0 $

and for the dummies

$ P(\gamma_s=0) = g_{0s} \\ P(\gamma_s<b,\gamma_s\neq0)=(b+r_s)g_{1s}\\ P(|\gamma_s|>r_s)=0 $

with $f_j$ and $r_s$ defined as in their paper.
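
To make the shape of this prior concrete, here is a minimal sketch (with arbitrary parameter values, not taken from the paper) of a single draw from such a spike-and-slab prior: a point mass at zero with probability $h_{0j}$ and a uniform slab on $[-f_j, f_j]$ otherwise, so the slab height is $h_{1j} = (1-h_{0j})/(2f_j)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_spike_slab(h0, f, rng):
    """One draw: point mass at 0 with probability h0, else uniform on [-f, f]."""
    if rng.random() < h0:      # spike: the coefficient is excluded
        return 0.0
    return rng.uniform(-f, f)  # slab: the coefficient is included

# Five independent draws with h0 = 0.5 and f = 2.0 (arbitrary values)
print([draw_spike_slab(0.5, 2.0, rng) for _ in range(5)])
```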

The key point is that we now have $K-1$ dummies, $K-1$ coefficients $\gamma_s$, and $K-1$ spike-and-slab priors, and the prior over the submodels (2.7 in their paper) is therefore a product over all $K-1$ coefficients. In other words, selection happens at the level of the dummy coefficients, and therefore at the level of the $K-1$ categories of $C$, not at the level of the variable $C$ itself. The shrinkage therefore refers to the difference between category $c_k$ and the reference category.

This has a number of implications:

  • Shrinkage depends on the coding scheme

  • The choice of the reference category matters

  • Selection is always relative to the currently chosen reference category (which, for a class variable, should be arbitrary)

  • The selected models are not invariant under permutations of the class labels

Consequently, $C$ is selected whenever the difference between any one of its levels and the reference category is selected. Put differently, $C$ is excluded from the model only when all $\gamma_s$ are shrunk to zero. For class variables with more than 2 categories, their approach therefore performs non-invariant coefficient selection rather than variable selection.
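
A quick back-of-the-envelope illustration of the consequence, assuming (purely for illustration) the same spike mass $g_{0s} = g_0$ for every dummy: under the product-form prior described above, $C$ is excluded with prior probability $g_0^{K-1}$, which shrinks geometrically in the number of levels:

```python
# Assumed spike mass g0 = 0.5 per dummy coefficient; C drops out of the
# model only when all K - 1 gamma_s are zero simultaneously.
g0 = 0.5
for K in (2, 3, 5, 10):
    print(f"K = {K:2d}: P(C excluded a priori) = {g0 ** (K - 1):.4f}")
# K =  2: 0.5000, K =  3: 0.2500, K =  5: 0.0625, K = 10: 0.0020
```

So the more levels $C$ has, the less likely it becomes a priori that the variable as a whole drops out, even though each individual contrast is treated exactly as any metric coefficient.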