Solved – the best way to go about feature selection (I am a beginner in ML)

dataset, feature selection, machine learning, predictive-models, predictor

I was reading this tutorial about loan prediction:
https://rstudio-pubs-static.s3.amazonaws.com/190551_15f6124632824534b7e397ce7ad2f2b8.html

In the 'Preparation' section, the author cuts the dataset down from 111 variables to 18 by "selecting out irrelevant data, poorly documented data and less important features".

My question is this: is there an efficient way to go through all 111 variables and work out which ones are "irrelevant/unimportant"? When I tried to do it myself, I found no way to tell whether a given factor (e.g. the number of bank accounts a borrower owns) would be a useful predictive feature.

I have heard of feature selection algorithms, but would they be reliable for cutting 111 features down to 20-30? If so, would it be better to trim the list manually first, and then put it through a feature selection algorithm?

Best Answer

Like most aspects of statistics, variable selection is a balancing act.

Manually trimming the list of potential predictor variables can protect against overfitting, as most commonly used variable selection algorithms are context-free: they only look at relationships within the dataset and can't factor in the wider meaning of each variable. This means that an automated algorithm run over a large number of predictor variables may pick up relationships that are illusory and won't generalise outside the dataset.
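To make this concrete, here is a minimal sketch of that failure mode in Python with scikit-learn (my choice of tooling, not the tutorial's, which uses R; the data below are pure synthetic noise). A context-free univariate selector still nominates a "top 10" even though no feature is genuinely related to the target:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))   # 100 candidate features, all pure noise
y = rng.normal(size=200)          # target unrelated to every feature

# Univariate selection only looks at in-sample associations, so it will
# happily rank a "top 10" even though no real relationship exists.
selector = SelectKBest(score_func=f_regression, k=10).fit(X, y)

n_sig = (selector.pvalues_ < 0.05).sum()
print(f"noise features 'significant' at p < 0.05: {n_sig}")  # ~5 by chance
print("indices kept:", np.flatnonzero(selector.get_support()))
```

With 100 independent noise features, roughly five will clear p < 0.05 by chance alone: exactly the kind of illusory relationship that won't hold up on new data.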

This makes manual elimination of "bad" variables a good initial step in some cases. However:

  1. You introduce your own biases into the analysis. Some of this is unavoidable in any analysis, but manually deciding that particular variables are not fit for purpose can amount to meddling with the model, especially when it's done heavy-handedly or without clear justification.
  2. The context behind the variables has to "matter". In some cases the variables don't mean anything simple (for example, if you're working with principal components). There it makes no sense to eliminate variables manually, because the advantage of applying your human understanding of the context is lost.
  3. If you're specifically trying to find unexpected relationships in the data, eliminating variables that you wouldn't expect to be important would obviously undermine that aim.

So yes, there are ways of selecting variables automatically (for example stepwise selection or LASSO regression), but they should only be used where appropriate; a sketch of the LASSO route follows below. In the example case, the analyst used their knowledge of the subject matter to eliminate unimportant variables.
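For illustration, here is a hedged sketch of LASSO-based selection, again in Python with scikit-learn rather than the tutorial's R. The synthetic dataset from make_regression is a hypothetical stand-in for a real loan dataset with 111 candidate predictors:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

# Stand-in data: 111 candidate predictors, only 15 actually informative.
X, y = make_regression(n_samples=500, n_features=111, n_informative=15,
                       noise=10.0, random_state=0)

# Standardise first: the LASSO penalty is sensitive to feature scale.
X_scaled = StandardScaler().fit_transform(X)

# LassoCV picks the penalty strength by cross-validation; the L1 penalty
# then shrinks the coefficients of uninformative features to exactly zero.
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)

kept = np.flatnonzero(lasso.coef_)
print(f"LASSO retained {kept.size} of {X.shape[1]} features:", kept)
```

The design point is that the L1 penalty zeroes coefficients out rather than merely shrinking them, so the surviving nonzero coefficients double as a feature subset. On real data you would still sanity-check the kept features against domain knowledge, for the reasons given in the list above.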
