Solved – Multi-value categorical attributes in R

r, svm

I have a training data set with both numerical and categorical variables, and one class variable. I want to build a classification model (e.g., an SVM), and to do that I need to transform all variables into a suitable format. I'm confused about my categorical variables. Let me give an example of one of them.

The categorical variable in each observation represents a Google search query (usually 3-10 comma-separated words, see example below).

----------+----------------------------+-------------------+----------------
search_id | query_words (categorical)  |..(other variables)| class variable
----------+----------------------------+-------------------+----------------
1         | how,to,grow,tree           |..                 | 4
2         | smartfone,htc,buy,price    |..                 | 7
3         | buy,house,realty,london    |..                 | 6
4         | where,to,go,weekend,cinema |..                 | 4
...       | ...                        |..                 | ...
----------+----------------------------+-------------------+----------------

The words in this categorical variable are unordered, and the same word may occur in different observations (as expected).
Number of unique words across all observations: a few thousand.
Number of observations: ~150,000,000

Since this categorical variable (query_words) is very important for my classification analysis, I need to train my model with it. My question is how to represent it so that it can be used by, e.g., an SVM.

In each observation I can sort the words alphabetically to give them an order. If I use a numeric vector with a few thousand elements (one per unique word), I can represent this variable for each observation as, e.g.:

query_words[1] = (0,0,..1,..0,..1,..1,..0,...1,..0) # very big vector

But I don't believe this will work efficiently. How should I handle this categorical variable? I'm using R for the analysis.

Best Answer

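You essentially have a feature (aka attribute) for each word. In many SVM implementations this is not as inefficient as you might think, because they are generally optimized to handle sparse matrices. Each query contains only 3-10 words, so all but a handful of the few thousand entries in each row are zero and never need to be stored.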
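
To make the sparse representation concrete, here is a minimal sketch in R. It assumes the Matrix package (shipped with standard R installations) and uses the four example queries from the question as toy data; the variable names (queries, tokens, vocab, X) are made up for illustration and are not part of any library.

```r
library(Matrix)  # provides sparseMatrix()

# Toy data: the four example queries from the question.
queries <- c("how,to,grow,tree",
             "smartfone,htc,buy,price",
             "buy,house,realty,london",
             "where,to,go,weekend,cinema")

# Split each query into words; unique() drops repeats within a query,
# so every cell of the matrix ends up as 0 or 1.
tokens <- lapply(strsplit(queries, ",", fixed = TRUE), unique)

# Vocabulary: every distinct word, sorted alphabetically.
vocab <- sort(unique(unlist(tokens)))

# Row index = observation, column index = position of the word in vocab.
i <- rep(seq_along(tokens), lengths(tokens))
j <- match(unlist(tokens), vocab)

# Only the non-zero cells are stored in memory.
X <- sparseMatrix(i = i, j = j, x = 1,
                  dims = c(length(queries), length(vocab)),
                  dimnames = list(NULL, vocab))

dim(X)     # 4 observations x 15 distinct words
nnzero(X)  # 17 stored entries out of 60 cells
```

On the real data the matrix would have ~150,000,000 rows and a few thousand columns, but still only 3-10 stored entries per row. As far as I know, svm() in the e1071 package accepts sparse matrices from the Matrix package directly, but check its documentation before relying on that.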

This is very standard in text classification, usually not with a binary value for each attribute, but rather with a weight representing the word's frequency in the document, computed from Term Frequency (TF) and Inverse Document Frequency (IDF). Much of the research concerns what that calculation should be. Since you seem to be dealing with queries, where repetition of words within a query isn't an issue, you could start with a binary value and see what kind of results you get from that.
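
For comparison, here is a rough sketch of one common TF-IDF variant in R, again assuming the Matrix package and the toy queries from the question. The weighting used here (raw term count times log(N / document frequency)) is only one of many formulas in use, and the names tf, df, idf and tfidf are just illustrative.

```r
library(Matrix)  # provides sparseMatrix() and Diagonal()

# Toy data: the four example queries from the question.
queries <- c("how,to,grow,tree",
             "smartfone,htc,buy,price",
             "buy,house,realty,london",
             "where,to,go,weekend,cinema")

tokens <- strsplit(queries, ",", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Term counts: sparseMatrix() sums repeated (i, j) pairs, so a word
# occurring twice in one query would get a count of 2.
tf <- sparseMatrix(i = rep(seq_along(tokens), lengths(tokens)),
                   j = match(unlist(tokens), vocab),
                   x = 1,
                   dims = c(length(tokens), length(vocab)),
                   dimnames = list(NULL, vocab))

# Document frequency: number of queries that contain each word.
df  <- colSums(tf > 0)
idf <- log(nrow(tf) / df)

# Scale each column by its IDF weight; the result stays sparse.
tfidf <- tf %*% Diagonal(x = idf)
colnames(tfidf) <- vocab
```

In this toy example "to" and "buy" each appear in two of the four queries, so they get the weight log(2), while words that appear in only one query get log(4). With the binary starting point suggested above, you would simply skip the IDF step and use the 0/1 matrix as is.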