Solved – Why log-transform to normal distribution for decision trees

cart, machine learning

On page 304 of chapter 8 of An Introduction to Statistical Learning with Applications in R (James et al.), the authors say:

We use the Hitters data set to predict a baseball player’s Salary based on Years (the number of years that he has played in the major leagues) and Hits (the number of hits that he made in the previous year). We first remove observations that are missing Salary values, and log-transform Salary so that its distribution has more of a typical bell-shape. (Recall that Salary is measured in thousands of dollars.)

No additional motivation for the log-transform is given. Given that the data are being fed into a decision tree algorithm, why was it important to force the response into a roughly normal distribution? I thought most, if not all, decision tree algorithms were invariant to scale changes.

Best Answer

In this case, Salary is the target (dependent variable/outcome) of the decision tree, not one of the features (independent variables/predictors). You are correct that decision trees are insensitive to the scale of the predictors, but the scale of the response still matters: a regression tree chooses splits by minimizing the residual sum of squares and predicts the mean response within each leaf, and both of these are strongly influenced by extreme values. Since a small number of players have extremely large salaries, log-transforming Salary can improve predictions because the squared-error criterion is no longer dominated by those few large values.
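
A minimal sketch in R of the book's setup, using the ISLR package's Hitters data and the tree package (the same tools ISLR chapter 8 uses); comparing the two summaries is my own illustration, not something the book walks through:

```r
library(ISLR)   # provides the Hitters data set
library(tree)   # regression trees, as used in ISLR chapter 8

# Drop players with missing Salary, as the authors do
hitters <- na.omit(Hitters[, c("Salary", "Years", "Hits")])

# The raw salaries are heavily right-skewed; the log-transform
# gives them more of a bell shape
hist(hitters$Salary)
hist(log(hitters$Salary))

# Tree fit on raw salaries: splits minimize RSS, so the handful
# of very large salaries dominates the loss
fit_raw <- tree(Salary ~ Years + Hits, data = hitters)

# Tree fit on log-salaries: the squared-error criterion is no
# longer driven by the extreme values
fit_log <- tree(log(Salary) ~ Years + Hits, data = hitters)

summary(fit_raw)
summary(fit_log)
```

Plotting the two histograms shows the skew directly, and comparing the fitted trees shows how the split structure changes once the extreme salaries stop dominating the RSS.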