MATLAB: Proper use of ClassificationTree.fit for categorical variables

classification tree, Statistics and Machine Learning Toolbox

The documentation for fitting classification trees states that X needs to be a floating point array, but also indicates that X can represent categorical variables (using the 'CategoricalPredictors' Name-Value argument).

Is the proper way to handle this to

(1) take the categorical variable, e.g.

category1 = {'duck','duck','goose','squash','quartz'}';
category2 = {'animal','animal','animal','vegetable','mineral'}';

(2) run those through grp2idx()

numcat1 = grp2idx(category1);
numcat2 = grp2idx(category2);

(3) Embed those in my X:

X = [numcat1 numcat2 otherTrulyNumericalVariables]

(4) Identify those as categorical

tree = ClassificationTree.fit(X,Y,'CategoricalPredictors',[1 2])

Seems like that's probably right, but I'd love an expert to vet that idea. The documentation doesn't have a categorical example.

Best Answer

Yes, this would be one way to accomplish this. You have to be careful when you convert new data to numeric for prediction. For a cell array of strings, grp2idx numbers the levels in order of first appearance, so if the new data are missing a level (for example, 'goose' does not appear in the value set), grp2idx can return different indices for the same categorical values. One way to avoid this pitfall is to use the nominal type and specify the level order explicitly, for example:

category1 = nominal({'duck','duck','goose','squash','quartz'},...
      [],{'goose','squash','quartz','duck'})   % levels listed in a fixed order
numcat1 = double(category1)                     % position of each value in that level list
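
To make the pitfall concrete, here is a minimal sketch. The new-data vector is invented for illustration, and the ismember lookup is an alternative to the nominal approach that I am suggesting, not something from the documentation:

```matlab
% Training data: grp2idx numbers cell-array levels in order of first appearance.
trainCat = {'duck','duck','goose','squash','quartz'}';
[trainIdx, trainLevels] = grp2idx(trainCat);

% Hypothetical new data without 'goose': calling grp2idx again renumbers the
% levels, so 'squash' and 'quartz' get different indices than in training.
newCat = {'duck','squash','quartz'}';
naiveIdx = grp2idx(newCat);

% Safer: look up each new value in the level list saved from training.
[found, safeIdx] = ismember(newCat, trainLevels);
safeIdx(~found) = NaN;   % values never seen in training become NaN
```

Keeping trainLevels around with the fitted tree means the same lookup can be reused every time new data arrive.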

Depending on how you get your data, you might find it easier to put your entire data (numeric and categorical variables) into a table or, if you are not in R2013b yet, into a dataset object and then extract numeric and categorical variables from that object.
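
As a rough sketch of the table route, assuming R2013b or later (for table and categorical). The weight column and all variable names are made up for illustration:

```matlab
% Hypothetical example data (values invented for illustration).
kind   = {'duck';'duck';'goose';'squash';'quartz'};
group  = {'animal';'animal';'animal';'vegetable';'mineral'};
weight = [1.2; 1.4; 3.5; 0.9; 2.7];

% Fix the level order explicitly so the codes do not depend on the data.
tbl = table( ...
    categorical(kind,  {'duck','goose','squash','quartz'}), ...
    categorical(group, {'animal','vegetable','mineral'}), ...
    weight, 'VariableNames', {'kind','group','weight'});

% Extract a numeric predictor matrix; columns 1 and 2 hold category codes.
X = [double(tbl.kind), double(tbl.group), tbl.weight];
catCols = [1 2];   % pass these as 'CategoricalPredictors' when fitting
```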

Related Solutions

MATLAB: Obtaining the mapping of categories to indices when using ‘grp2idx’

You can use 'grp2idx' to convert an array from categorical to numeric indices.

To understand the mapping of a category to the numeric index, you can use the 'categories' function.

To demonstrate, let 'c' be our categorical vector of labels :

>> c = categorical({'Male','Female','Female','Male','Female'})

Now, convert it to numeric indices using :

>> nums = grp2idx(c);

To get the mapping of a category to the indices/integers :

>> order_cat = categories(c);

Now, if you want to get the numeric index that the 'Male' category corresponds to, you can use:

>> m = find(order_cat == "Male")  %This is basically the index of the "Male" category in 'order_cat'

or if you want to convert a new vector of index labels back into categorical labels, you can simply use the 'categorical' function as below:

>> categorical(nums, 1:numel(order_cat), order_cat)
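
As an alternative sketch, grp2idx can also return the group names as a second output, which gives the mapping in one call. The variable names below are just for illustration:

```matlab
c = categorical({'Male','Female','Female','Male','Female'});

% Second output lists the group names; names{k} is the label for index k.
[nums, names] = grp2idx(c);

% Look up the index assigned to one label.
maleIdx = find(strcmp(names, 'Male'));

% Map the indices back to their labels.
labels = names(nums);
```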

MATLAB: Decision tree non-numerical data statistics toolbox

The type of tree you need is defined by the type of output. If your output is numeric ("numeric" here means that you can do greater and less comparisons and compute a meaningful distance between values), a regression tree is the right choice. If your output is a set of labels with no such ordering, use a classification tree.

For either type of tree, you need to convert your inputs to a numeric matrix. Then you can indicate what variables are non-numeric (categorical) using the 'CategoricalPredictors' parameter; if all your variables are categorical, set it to 'all'.

You can convert your non-numeric data to numeric in many ways. One way would be to use the categorical class in MATLAB on each variable in your data, for example:

>> colors = categorical({'g' 'r' 'b'; 'b' 'r' 'g'});
>> numeric_colors = double(colors);

Then use the new numeric variables as columns in the matrix you pass to the fit function.
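
Putting it together, a minimal sketch of fitting and predicting might look like this. It assumes R2014a or later, where fitctree is available (on older releases ClassificationTree.fit takes the same arguments used here). The data and labels are invented for illustration:

```matlab
% Hypothetical data: one categorical predictor (color) and one numeric one.
levels = {'b','g','r'};                                  % fixed level order
colorNames = repmat({'g';'r';'b';'b';'r';'g'}, 2, 1);    % 12 observations
colors = categorical(colorNames, levels);
sizes  = (1:12)';
Y = repmat({'cool'}, 12, 1);
Y(colors == 'r') = {'warm'};                             % label depends on color only

% Column 1 holds category codes, so flag it as categorical.
X = [double(colors), sizes];
tree = fitctree(X, Y, 'CategoricalPredictors', 1);

% Encode new observations with the same level order before predicting.
newColor = categorical({'r'}, levels);
label = predict(tree, [double(newColor), 5]);
```

Reusing the same levels list for training and prediction is what keeps the numeric codes consistent between the two.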
