Solved – Information gain with numerical data

entropy, random forest

I'm making a random forest classifier.
In every tutorial, there is a very simple example of how to calculate entropy with Boolean attributes.
In my problem, the attribute values are computed by a tf-idf scheme, so they are real numbers.
Is there some clever way of applying an information gain function so that it works directly with real-valued weights, or should I use a discretization like:

0 = 0  
(0, 0.1] = 1  
(0.1, 0.2] = 2

etc.
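
A minimal sketch of that kind of fixed-width bucketing (the width of 0.1 and the name `discretize_tfidf` are just illustrative choices):

```python
import numpy as np

def discretize_tfidf(values, width=0.1):
    """Map real-valued tf-idf weights to integer bucket labels.

    0.0 stays in bucket 0; a value in (0, width] goes to bucket 1,
    (width, 2*width] to bucket 2, and so on.
    """
    values = np.asarray(values, dtype=float)
    # ceil(v / width) sends (0, 0.1] -> 1, (0.1, 0.2] -> 2, ...; exact 0.0 stays 0
    return np.ceil(values / width).astype(int)

# a few tf-idf weights for one term across documents
print(discretize_tfidf([0.0, 0.05, 0.1, 0.13, 0.25]))  # [0 1 1 2 3]
```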

EDIT
I have this function:

$$
IG(A) = E(C) - E(C, A)
$$

$$
E(C) = -\sum\limits_{i=1}^{|C|} P(c_i)\,\log P(c_i)
$$

and
$$
E(C, A) = \sum\limits_{a \in A} P(a)\, E(a)
$$

The problem is that I have an infinite number of possible values of $A$, so I think I should discretize these values, shouldn't I?
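
For concreteness, here is a small sketch of how I would compute these quantities once the attribute is discretized (the function names are my own, and I use log base 2, which the formula above leaves open):

```python
import numpy as np

def entropy(labels):
    """E(C) = -sum_i P(c_i) * log2 P(c_i), over the class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(attribute, labels):
    """E(C, A) = sum over a in A of P(a) * E(classes within partition a)."""
    attribute, labels = np.asarray(attribute), np.asarray(labels)
    total = len(labels)
    return sum((np.sum(attribute == a) / total) * entropy(labels[attribute == a])
               for a in np.unique(attribute))

def information_gain(attribute, labels):
    """IG(A) = E(C) - E(C, A)."""
    return entropy(labels) - conditional_entropy(attribute, labels)

# toy example: bucketed tf-idf values for one term, plus binary class labels
buckets = [0, 0, 1, 2, 2, 1]
classes = ['neg', 'neg', 'pos', 'pos', 'pos', 'neg']
print(information_gain(buckets, classes))
```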

Best Answer

Yes, you want to discretize your data. In fact, it's generally good practice in machine learning, since it opens things up to more general algorithms.

Choosing the buckets is an optimization problem; you can also think of it as a clustering problem: pick a bucketing of your real values such that the resulting categories are well separated from one another while the values inside each bucket stay close together.
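
One concrete way to realize that clustering view (only a sketch, and just one option) is to run one-dimensional k-means per feature, e.g. with scikit-learn's `KBinsDiscretizer`:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# tf-idf weights of one term across documents, as a single-feature column
weights = np.array([[0.0], [0.02], [0.05], [0.31], [0.35], [0.8]])

# 1-D k-means bucketing: bin edges are placed so values inside a bin are
# close together while the bin centres are spread apart
disc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')
buckets = disc.fit_transform(weights).ravel().astype(int)

print(buckets)          # e.g. [0 0 0 1 1 2]
print(disc.bin_edges_)  # the learned cut points for each feature
```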

Smoothing the probability estimates will be very important here, because otherwise some buckets may end up with only a tiny number of data points. Make sure you smooth aggressively towards 1/J (i.e. towards a uniform distribution) for buckets that contain few examples. How to do this well is unfortunately somewhat underdocumented in the literature, and only Laplace smoothing (which is terrible and has no real justification here) appears to be widely known.
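
For example, an m-estimate does this kind of shrinkage (the function name and the prior strength `m` here are illustrative, not a prescribed recipe):

```python
def smoothed_class_probs(class_counts, m=10.0):
    """Shrink a bucket's class probabilities towards the uniform 1/J prior.

    class_counts: class label -> count inside one bucket (include every
                  class, with zero counts where needed).
    m: prior strength; small buckets get pulled strongly towards 1/J,
       large buckets stay close to their empirical frequencies.
    """
    J = len(class_counts)              # number of classes
    n = sum(class_counts.values())     # examples in this bucket
    return {c: (k + m / J) / (n + m) for c, k in class_counts.items()}

# a bucket with only 3 examples is pulled heavily towards uniform
print(smoothed_class_probs({'pos': 3, 'neg': 0}, m=10.0))
# {'pos': 0.615..., 'neg': 0.384...}
```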

As a rule of thumb / rough hack, you could insist that each bucket holds some minimum number of data points, or you could use something like a p-value.
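
A rough sketch of the minimum-count hack (greedily merging any under-populated bucket into a neighbouring bucket; the threshold of 30 is arbitrary):

```python
import numpy as np

def merge_small_buckets(bucket_labels, min_count=30):
    """Merge under-populated buckets into a neighbour until every bucket
    holds at least min_count examples (or only one bucket remains)."""
    labels = np.asarray(bucket_labels).copy()
    while True:
        ids, counts = np.unique(labels, return_counts=True)
        small = ids[counts < min_count]
        if len(small) == 0 or len(ids) == 1:
            return labels
        b = small[0]
        pos = int(np.where(ids == b)[0][0])
        # fold the small bucket into the next bucket up, or down at the end
        target = ids[pos + 1] if pos + 1 < len(ids) else ids[pos - 1]
        labels[labels == b] = target

# buckets 0 and 1 are too small, so everything collapses into bucket 2
print(merge_small_buckets([0]*5 + [1]*2 + [2]*40, min_count=10))
```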
