Solved – Nuisance covariate or variable of no interest in machine learning

generalized linear model, machine learning, predictor, regression, svm

I'm trying to differentiate two groups of patients using various machine learning algorithms, including support-vector machines (SVM).

As far as the details of the analysis go, I would like to train the model on one group of subjects and cross-validate it on another.

The problem is that the patients differ on some categorical variables (e.g., gender) and some continuous variables (e.g., age), none of which are of interest. In regression analysis using generalized linear models, it is easy to factor out such nuisance variables. I'm wondering whether there is a way in machine learning in general, and in SVMs in particular, to factor out the effect of nuisance variables. In some papers I have seen authors include the nuisance variables as predictors to somehow normalize for them.
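To illustrate what I mean by factoring out a nuisance variable in the GLM setting: one common approach is to regress each feature on the nuisance covariates and keep only the residuals before classification. A minimal sketch in R (the data frame and column names here are hypothetical):

```r
set.seed(1)
n <- 100
dat <- data.frame(
  feature1 = rnorm(n),
  age      = rnorm(n, mean = 50, sd = 10),
  gender   = factor(sample(c("M", "F"), n, replace = TRUE))
)

# Regress the feature on the nuisance covariates and keep the residuals:
# the part of the feature not explained by age and gender.
fit <- lm(feature1 ~ age + gender, data = dat)
dat$feature1_adj <- residuals(fit)
```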

Best Answer

In general, you can run a feature selection algorithm to preprocess the data and remove irrelevant features.

SVMs are quite robust against non-contributing features (noise), so you shouldn't really need to remove features manually; it will happen "automatically". On the other hand, training time will increase, since finding a solution becomes harder. In the primal formulation (linear kernel) you can expect unimportant features to receive weights close to zero compared to the important features; in the dual there is a similar effect, although the final model is harder to interpret (you don't have feature weights, only similarities to the training data).
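As an illustration of those near-zero weights, here is a minimal sketch in R using the e1071 package; the simulated data (two informative features, three pure-noise features) is hypothetical, and the primal weight vector is recovered from the dual solution as w = Σᵢ αᵢ yᵢ xᵢ:

```r
library(e1071)  # provides svm()

set.seed(1)
n <- 200
x_signal <- matrix(rnorm(n * 2), n, 2)   # two informative features
x_noise  <- matrix(rnorm(n * 3), n, 3)   # three pure-noise features
x <- cbind(x_signal, x_noise)
y <- factor(ifelse(x_signal[, 1] + x_signal[, 2] > 0, "A", "B"))

model <- svm(x, y, kernel = "linear", scale = TRUE)

# Recover the primal weight vector from the dual solution:
# w = sum_i alpha_i * y_i * x_i (model$coefs already holds alpha_i * y_i).
w <- t(model$coefs) %*% model$SV
print(round(w, 3))  # weights on the noise features should be near zero
```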

I assume you don't have much data, so to avoid introducing bias you should use a resampling method (e.g., bootstrapping) and combine it with a proper separation of the data into training/validation/test sets. Check the answers to "How to evaluate/select cross validation method?". Bootstrapping is implemented in the "boot" package in R.
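A minimal sketch of what that could look like, combining boot with e1071 (the data frame, column names, and injected group effect are all hypothetical): each bootstrap replicate trains an SVM on the resampled cases and evaluates it on the out-of-bag cases, so performance is not measured on the training data itself.

```r
library(boot)   # boot()
library(e1071)  # svm()

set.seed(1)
n <- 150
dat <- data.frame(
  group = factor(rep(c("patient", "control"), length.out = n)),
  f1 = rnorm(n),
  f2 = rnorm(n)
)
dat$f1 <- dat$f1 + ifelse(dat$group == "patient", 0.8, 0)  # injected group effect

# Statistic for boot(): train on the bootstrap sample, score on the
# out-of-bag cases so the accuracy estimate is not optimistically biased.
svm_oob_acc <- function(data, idx) {
  fit <- svm(group ~ ., data = data[idx, ], kernel = "linear")
  oob <- data[-unique(idx), ]
  if (nrow(oob) == 0) return(NA)
  mean(predict(fit, oob) == oob$group)
}

b <- boot(dat, svm_oob_acc, R = 200)
mean(b$t, na.rm = TRUE)  # average out-of-bag accuracy across replicates
```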