Solved – Bayesian variable selection — does it really work

Tags: bayesian, feature selection, jags, multiple regression, regression

I thought I might toy with some Bayesian variable selection, following a nice blog post and the linked papers therein. I wrote a program in rjags (where I am quite a rookie) and fetched price data for Exxon Mobil, along with some things that are unlikely to explain its returns (e.g. palladium prices) and other things that should be highly correlated (like the SP500).

Running lm(), we see that there is strong evidence of an overparameterized model, but that palladium should definitely be excluded:

Call:
lm(formula = Exxon ~ 0 + SP + Palladium + Russell + OilETF + 
    EnergyStks, data = chkr)

Residuals:
       Min         1Q     Median         3Q        Max 
-1.663e-03 -4.419e-04  3.099e-05  3.991e-04  1.677e-03 

Coefficients:
           Estimate Std. Error t value Pr(>|t|)    
SP          0.51913    0.19772   2.626 0.010588 *  
Palladium   0.01620    0.03744   0.433 0.666469    
Russell    -0.34577    0.09946  -3.476 0.000871 ***
OilETF     -0.17327    0.08285  -2.091 0.040082 *  
EnergyStks  0.79219    0.11418   6.938 1.53e-09 ***

After converting to returns, I tried running a simple model like this

  model {
    for (i in 1:n) {
      mean[i]<-inprod(X[i,],beta)
      y[i]~dnorm(mean[i],tau)
    }
    for (j in 1:p) {
      indicator[j]~dbern(probindicator)
      betaifincluded[j]~dnorm(0,taubeta)
      beta[j] <- indicator[j]*betaifincluded[j]
    }
    tau~dgamma(1,0.01)
    taubeta~dgamma(1,0.01)
    probindicator~dbeta(2,8)
  }

but I found that, pretty much regardless of the parameters to the chosen gamma distributions, I got pretty nonsensical answers, such as an unvarying 20% inclusion probability for each variable.

I also got tiny, tiny regression coefficients, which I am willing to tolerate since this is supposed to be a selection model, but that still seemed weird.

                              Mean        SD  Naive SE Time-series SE
SP         beta[1]       -4.484e-03   0.10999  0.003478       0.007273
Palladium  beta[2]        1.422e-02   0.16646  0.005264       0.011106
Russell    beta[3]       -2.406e-03   0.08440  0.002669       0.003236
OilETF     beta[4]       -4.539e-03   0.14706  0.004651       0.005430
EnergyStks beta[5]       -1.106e-03   0.07907  0.002500       0.002647
SP         indicator[1]   1.980e-01   0.39869  0.012608       0.014786
Palladium  indicator[2]   1.960e-01   0.39717  0.012560       0.014550
Russell    indicator[3]   1.830e-01   0.38686  0.012234       0.013398
OilETF     indicator[4]   1.930e-01   0.39485  0.012486       0.013229
EnergyStks indicator[5]   2.070e-01   0.40536  0.012819       0.014505
           probindicator  1.952e-01   0.11981  0.003789       0.005625
           tau            3.845e+03 632.18562 19.991465      19.991465
           taubeta        1.119e+02 107.34143  3.394434       7.926577

Is Bayesian variable selection really that bad/sensitive? Or am I making some glaring error?

Best Answer

In the BUGS code, mean[i]<-inprod(X[i,],beta) should be mean[i]<-inprod(X[i,],beta[]).

Your priors on tau and taubeta are too informative.

You need a non-informative prior on betaifincluded; use e.g. a gamma(0.1,0.1) on taubeta. This may explain why you get tiny regression coefficients.
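One way to see what these priors imply is sketched below in base R. JAGS parameterises dgamma(shape, rate) and dnorm(mean, precision), so a value of taubeta corresponds to a slab standard deviation of 1/sqrt(taubeta). The sketch compares the prior medians under the original dgamma(1, 0.01) and the suggested dgamma(0.1, 0.1), and also computes the prior mean of probindicator under dbeta(2, 8). The variable names are illustrative only.

```r
# Prior median of the slab precision taubeta under each gamma prior
# (JAGS dgamma(shape, rate) matches R's qgamma(p, shape = , rate = )).
tau_orig  <- qgamma(0.5, shape = 1,   rate = 0.01)
tau_vague <- qgamma(0.5, shape = 0.1, rate = 0.1)

# JAGS dnorm takes a precision, so the implied slab SD is 1 / sqrt(precision).
sd_orig  <- 1 / sqrt(tau_orig)
sd_vague <- 1 / sqrt(tau_vague)
print(c(original = sd_orig, suggested = sd_vague))

# Prior mean of the inclusion probability under dbeta(2, 8): a / (a + b).
prior_incl <- 2 / (2 + 8)
print(prior_incl)  # 0.2
```

Under the original prior the median slab SD is about 0.12, while the suggested prior puts the median far higher. The prior mean inclusion probability is 0.2, close to the posterior means of roughly 0.18 to 0.21 reported in the question, which suggests the indicators have barely moved from their prior.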

Related Solutions

Solved – How exactly does Chi-square feature selection work

The chi-square test is a statistical test of independence that assesses whether two variables are associated. It shares similarities with the coefficient of determination, R². However, the chi-square test is only applicable to categorical or nominal data, while R² is only applicable to numeric data.

From the definition of chi-square we can easily deduce how the technique applies to feature selection. Suppose you have a target variable (i.e., the class label) and some other features (feature variables) that describe each sample of the data. We calculate the chi-square statistic between every feature variable and the target variable and check whether a relationship exists between them. If the target variable is independent of the feature variable, we can discard that feature variable. If they are dependent, the feature variable carries information about the target and is worth keeping.

Mathematical details are described here: http://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html

For continuous variables, chi-square can be applied after binning the variables.
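A minimal base-R sketch of the binning idea, assuming the built-in iris data as a stand-in and an arbitrary choice of three equal-width bins (neither comes from the original answer):

```r
# Bin a continuous feature, then test it against a categorical target.
# iris ships with base R; three equal-width bins is an arbitrary choice.
data(iris)
sepal_bin <- cut(iris$Sepal.Length, breaks = 3)

# Contingency table of the binned feature against the class label
tab <- table(sepal_bin, iris$Species)
print(tab)

# Pearson chi-square test of independence on the table
res <- chisq.test(tab)
print(res$statistic)
print(res$p.value)
```

The number and placement of bins affect the statistic, so it is worth checking that a feature's ranking is not an artifact of one particular binning.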

An example in R, shamelessly copied from FSelector:

# Use the HouseVotes84 data from the mlbench package
library(mlbench)    # for the data
library(FSelector)  # for chi.squared(), cutoff.k(), as.simple.formula()
data(HouseVotes84)

# Calculate the chi-square statistics
weights <- chi.squared(Class ~ ., HouseVotes84)

# Print the results
print(weights)

# Select the top five variables
subset <- cutoff.k(weights, 5)

# Print the final formula that can be used in classification
f <- as.simple.formula(subset, "Class")
print(f)

Not so much about feature selection, but the video below discusses the chi-square test in detail: https://www.youtube.com/watch?time_continue=5&v=IrZOKSGShC8

Solved – Why can’t Bayesian variable selection be used with categorical variables with more than 2 levels

This is just how I always interpreted it, so I'd happily be corrected:

Their approach was suggested for the classic linear model $ y=\sum_j \beta_j x_j + \epsilon $ and they argued for putting a spike and slab prior on the $\beta_j$.

If we have a categorical variable (class variable) $C$ with more than 2 categories $c_k, k =1, \dots, K, K>2$ we can embed them in the linear model by creating $K-1$ dummy variables that contrast $K-1$ categories with the reference category.

Let $x_j$ denote the $p$ metric variables and $d_s$ the $s=1,\dots, K-1$ dummy variables. Then the model becomes (I just use a single categorical variable):

$ y=\sum_j \beta_j x_j + \sum_s\gamma_s d_s + \epsilon $

with $\beta_j$ denoting the coefficients for the metric variables and $\gamma_s$ the coefficients for the dummy variables (note I only used $\beta$ and $\gamma$ for making their difference explicit).
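As a small illustration of the dummy coding, here is a sketch in base R using a hypothetical three-level factor. Note that R's default design matrix also adds an intercept column, which the formula above does not show.

```r
# A class variable C with K = 3 categories; "a" is the reference level.
C <- factor(c("a", "b", "c", "a", "b", "c"))

# Treatment (dummy) coding: an intercept plus K - 1 = 2 indicator columns.
D <- model.matrix(~ C)
print(D)
```

The columns Cb and Cc play the role of the $d_s$: each one contrasts a single category with the reference category "a".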

If we now set up a spike and slab prior for the $\beta$ and the $\gamma$ we have for the metric variables

$ P(\beta_j=0) = h_{0j} \\ P(\beta_j<b,\beta_j\neq0)=(b+f_j)h_{1j}\\ P(|\beta_j|>f_j)=0 $

and for the dummies

$ P(\gamma_s=0) = g_{0s} \\ P(\gamma_s<b,\gamma_s\neq0)=(b+r_s)g_{1s}\\ P(|\gamma_s|>r_s)=0 $

with the definition of the $f_j$ and $r_s$ as in their paper.
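Read as a distribution, each of these priors puts mass $h_{0j}$ (or $g_{0s}$) on zero and spreads the rest uniformly over $(-f_j, f_j)$ (or $(-r_s, r_s)$); for the probabilities to sum to one, $h_{0j} + 2 f_j h_{1j} = 1$. A minimal simulation sketch in R, where the function name and the values h0 = 0.5 and f = 2 are hypothetical choices:

```r
# Draw n values from a point mass at zero (probability h0) mixed with a
# uniform slab on (-f, f). Hypothetical helper, not from any package.
r_spike_slab <- function(n, h0, f) {
  is_zero <- runif(n) < h0
  ifelse(is_zero, 0, runif(n, min = -f, max = f))
}

set.seed(1)
beta <- r_spike_slab(10000, h0 = 0.5, f = 2)

# Share of exact zeros should be near h0; no draw lies outside [-f, f].
print(mean(beta == 0))
print(range(beta))
```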

The key is now that we have $K-1$ dummies, $K-1$ $\gamma_s$ and $K-1$ "spike and slabs" and the prior over the submodels (2.7 in their paper) is therefore a product over all $K-1$ coefficients. In other words, the selection happens on the level of the coefficients for the dummies and therefore on the level of the $K-1$ categories of $C$, not on the variable $C$ itself. The shrinkage therefore refers to the difference between category $c_k$ and the reference category.

This has a number of implications:

Shrinkage depends on the coding scheme
The choice of the reference category matters
Selection only refers to the currently chosen reference category (but in class variables that should be arbitrary)
The selected models are not invariant against permutations of class labels

$C$ is therefore always selected whenever a category difference of two levels of $C$ is selected. Put differently, $C$ is only excluded from the model when all $\gamma_s$ are shrunk to zero. For class variables with more than 2 categories, their approach therefore does non-invariant coefficient selection rather than variable selection.
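The dependence on the reference category can be seen in a toy example. The sketch below uses made-up, noise-free data and ordinary lm() rather than a spike and slab fit, so it only illustrates how the dummy coefficients change with the coding:

```r
# Noise-free toy data: levels a and b share mean 0, level c has mean 1.
C <- factor(rep(c("a", "b", "c"), each = 2))
y <- c(0, 0, 0, 0, 1, 1)

# Reference category "a": only the c-vs-a contrast is non-zero.
fit_a <- lm(y ~ C)
print(round(coef(fit_a), 10))

# Reference category "c": both remaining contrasts are non-zero.
fit_c <- lm(y ~ relevel(C, ref = "c"))
print(round(coef(fit_c), 10))
```

With "a" as the reference, a selection prior could zero out the b dummy and keep a single coefficient; with "c" as the reference, the same data need both dummies, which is the non-invariance described above.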
