One idea would be to use the rfe function in the caret package. Use the option rfeControl = rfeControl(functions = rfFuncs) to calculate variable importance using a random forest.
The rfe algorithm is explained in detail in the caret vignette on recursive feature elimination.
If a random forest performs well on your dataset, this is usually a good way to improve it. Or maybe the random forest alone gives you sufficiently accurate predictions.
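A minimal sketch of that call, assuming the predictors sit in a data frame x and the outcome in a vector y (the subset sizes and CV settings below are arbitrary choices for illustration, not defaults):

```r
library(caret)

# RF-based recursive feature elimination; rfFuncs ranks and drops
# predictors using random forest variable importance.
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)

# x: data frame of predictors, y: outcome vector (assumed to exist).
# sizes gives the candidate subset sizes to evaluate.
rfe_fit <- rfe(x, y, sizes = c(2, 4, 8, 16), rfeControl = ctrl)

predictors(rfe_fit)  # names of the selected variables
```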
You can also use the glmnet package, which fits the elastic net for regularization/selection. This will be MUCH faster, and often performs quite well. If you've already got a glm model that you like, glmnet might improve it.
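A comparable sketch for glmnet, assuming a numeric predictor matrix x and response y; alpha = 0.5 is an arbitrary elastic-net mix between ridge (0) and lasso (1):

```r
library(glmnet)

# Cross-validated elastic net; alpha blends the ridge and lasso penalties.
fit <- cv.glmnet(as.matrix(x), y, alpha = 0.5)

# Coefficients at the lambda with minimal CV error;
# variables shrunk to exactly zero have been dropped.
coef(fit, s = "lambda.min")
```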
tl;dr: If a random forest works well on your data, try rfe with rfFuncs. If a linear model works well, try glmnet, or rfe with lmFuncs.
I think part of your confusion is about which types of variables a chi-squared test can compare. Wikipedia says the following about this:
It tests a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution.
Thus it compares frequency distributions, i.e. counts (non-negative numbers). The different frequency distributions are defined by a categorical variable: for each value of the categorical variable there needs to be a frequency distribution that can be compared to the other ones.
There are several ways to get the frequency distributions. One is from a second categorical variable, where the co-occurrences with the first categorical variable are counted to give a discrete frequency distribution. Another option is to use one or more numerical variables: for each value of the categorical variable, the values of the numerical variable are summed. In fact, if the categorical variable is binarised, the former is a special case of the latter.
Example
As an example look at these sets of variables:
x = ['mouse', 'cat', 'mouse', 'cat']
z = ['wild', 'domesticated', 'domesticated', 'domesticated']
The categorical variables x and z can be compared by counting the co-occurrences, and this is what happens in a chi-squared test:
'mouse' 'cat'
'wild' 1 0
'domesticated' 1 2
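To make this concrete, here is that contingency table run through a chi-squared test in R (the counts come from the four observations above, so they are far too small for the test's asymptotics; the point is only the mechanics):

```r
# Contingency table of z (rows) against x (columns), as counted above.
tab <- matrix(c(1, 1, 0, 2), nrow = 2,
              dimnames = list(c("wild", "domesticated"), c("mouse", "cat")))

# Warns about small expected counts with so few observations, but runs.
chisq.test(tab)
```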
However, you can also binarise the values of 'x' and get the following variables:
x1 = [1, 0, 1, 0]
x2 = [0, 1, 0, 1]
z = ['wild', 'domesticated', 'domesticated', 'domesticated']
Counting the co-occurrences is now equivalent to summing, for each value of z, the values of x1 and x2.
x1 x2
'wild' 1 0
'domesticated' 1 2
As you can see, a single categorical variable (x) or multiple numerical variables (x1 and x2) are equally well represented in the contingency table. Thus chi-squared tests can be applied to a categorical variable (the label in sklearn) combined with either another categorical variable or multiple numerical variables (the features in sklearn).
Best Answer
With means and standard deviations you can perform an ANOVA on every feature and keep only some fraction of the most significant ones. But you should fit the selection on the training set only, and then apply the same selection to the test set.
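A rough sketch of that filter in R, assuming a numeric feature matrix X_train and a factor label y_train (these names and the 10% cutoff are made up for illustration):

```r
# Per-feature one-way ANOVA p-value: does the feature's mean
# differ across the classes in y_train?
pvals <- apply(X_train, 2, function(f) {
  summary(aov(f ~ y_train))[[1]][["Pr(>F)"]][1]
})

# Keep, say, the 10% most significant features (fraction is arbitrary).
keep <- pvals <= quantile(pvals, 0.10)
X_train_sel <- X_train[, keep]
```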
I am not aware of any single recommended approach; it is always problem dependent. But with that many samples you can afford to try a bunch of methods.