Feature selection based on mean, standard deviation and mean absolute deviation

feature selection, multi-class, multicollinearity

Suppose we have a large dataset (~ 60000 entries, 58 variables, 4 class labels).

For each variable, the mean, standard deviation, and mean absolute deviation are calculated separately for every class label.

Some of the variables are expected to be collinear and some are possibly not helpful at all in distinguishing between the classes.

Values lie on different scales, ranging from $10^{-3}$ to $10^{4}$.

The questions are:

  1. Knowing the mean, standard deviation (std), and mean absolute deviation (MAD) of each variable for each class, can we perform feature selection (or at least get a rough estimate and filter out clearly unhelpful features) by comparing these values?
  2. Which feature selection approach is recommended for the aforementioned case?

Best Answer

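With the means and standard deviations (plus the number of samples in each class), you can run a one-way ANOVA on every feature and keep only some fraction of the most significant ones. The F statistic is a ratio of variances, so the different scales of the variables do not affect it. But you should do the selection on the training set only and then apply the same selected features to the test set; selecting on data that includes the test set makes the test error look better than it really is.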
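A minimal sketch of the per-feature ANOVA in plain Python, computed from summary statistics only. It assumes the stds are sample standard deviations (divided by n - 1) and that the class counts are known. The function names `anova_f_from_summary` and `top_features`, and the numbers in the example, are made up for illustration. It ranks by the F statistic instead of a p-value; when every feature has the same class counts, the degrees of freedom are identical, so the two orderings agree.

```python
def anova_f_from_summary(means, stds, counts):
    """One-way ANOVA F statistic for one feature from per-class summaries.

    means, stds, counts: sequences with one entry per class.
    Assumption: stds are sample standard deviations (ddof=1).
    Raises ZeroDivisionError if every class has zero variance.
    """
    k = len(means)                      # number of classes
    n_total = sum(counts)               # total number of samples
    grand_mean = sum(n * m for n, m in zip(counts, means)) / n_total

    # Between-class and within-class sums of squares.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(counts, means))
    ss_within = sum((n - 1) * s ** 2 for n, s in zip(counts, stds))

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


def top_features(summaries, fraction=0.5):
    """Rank features by F and return (kept feature names, all scores).

    summaries: dict mapping feature name -> (means, stds, counts).
    """
    scores = {name: anova_f_from_summary(*stats)
              for name, stats in summaries.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, int(len(ranked) * fraction))
    return ranked[:n_keep], scores


# Hypothetical summaries computed on the TRAINING set only (3 classes here).
example = {
    "feature_a": ([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [10, 10, 10]),
    "feature_b": ([2.0, 2.0, 2.0], [1.0, 1.0, 1.0], [10, 10, 10]),
}
kept, f_scores = top_features(example, fraction=0.5)
```

Here `feature_a` has clearly separated class means and is kept, while `feature_b` has identical class means (F = 0) and is dropped. This filter looks at one feature at a time, so it says nothing about collinearity between features, and it does not use the MAD.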

I am not aware of any single recommended approach; it always depends on the problem. But with that many samples, you can afford to try a bunch of methods.