Solved – Using Non-numeric Features

categorical-encoding, feature selection, generalized linear model, gradient descent, machine learning

I'm just starting out with machine learning. The example I was shown in a mini course I took was predicting the sale price of a house given features like:

  • size of house
  • number of floors
  • age of house
  • number of rooms

Given those features, it's straightforward to train an algorithm to minimize some cost function, which in the course I took was the sum of the squared differences between the values in the training set and the fitted function's predictions.
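For concreteness, the cost function described above can be sketched as follows. This is a minimal illustration, not the course's actual code; the feature values, weights, and prices are made up:

```python
# Minimal sketch of a squared-error cost for a linear model.
# All weights and training examples below are hypothetical.

def predict(weights, bias, features):
    """Linear model: weighted sum of numeric features plus a bias."""
    return bias + sum(w * x for w, x in zip(weights, features))

def cost(weights, bias, data):
    """Sum of squared differences between targets and predictions."""
    return sum((y - predict(weights, bias, x)) ** 2 for x, y in data)

# Toy training set: (size_sqft, floors, age, rooms) -> sale price
data = [
    ((1500, 2, 10, 6), 300_000),
    ((900, 1, 40, 4), 150_000),
]
print(cost([200, 0, 0, 0], 0, data))  # -> 900000000
```

Gradient descent would then adjust the weights and bias to shrink this number.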

My question: how would I work with a feature that is not represented as a number?

For example, what if one of the features was, say, the name of the closest high school?

Best Answer

In most cases, you find a way to turn the non-numeric feature into a numeric one, and then go from there.

The simplest solution is to generate a set of indicator variables (often called one-hot or dummy encoding). For example, if you have $n$ different schools, you might add a set of $n$ variables $S_1, S_2, \ldots, S_n$ to each data point. To indicate that the $i$th school on your list is the closest, set $S_i = 1$ and set the rest of the variables to zero. This works well when 1) the identity of the closest school matters and 2) you can enumerate the schools present in your data set.
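A minimal sketch of this encoding, with made-up school names standing in for whatever appears in your data:

```python
# One-hot (indicator) encoding of the "closest high school" feature.
# The school names are hypothetical, enumerated from the data set.

schools = ["Central High", "North High", "West High"]

def one_hot(school):
    """Return S_1..S_n: 1 for the matching school, 0 elsewhere."""
    return [1 if s == school else 0 for s in schools]

print(one_hot("North High"))  # -> [0, 1, 0]
```

These indicator columns then replace the raw name in your feature matrix. Libraries like pandas (`get_dummies`) and scikit-learn (`OneHotEncoder`) do the same thing at scale.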

You might also think that the school identity per se doesn't actually carry much information; it's just a proxy for information about school size, test scores, student-teacher ratio, etc. You could join your data set with another data source that has that sort of information. The features would then be things like "size of the nearest high school", "average SAT score at the nearest high school", etc.
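Such a join can be sketched as a simple lookup. The external table below is entirely hypothetical; in practice it might come from a public education data set:

```python
# Replace the school name with numeric facts about the school,
# joined in from a second (hypothetical) data source.

school_info = {  # hypothetical external table keyed by school name
    "Central High": {"enrollment": 1200, "avg_sat": 1150},
    "North High":   {"enrollment": 800,  "avg_sat": 1250},
}

def school_features(name):
    """Look up numeric features for the closest school by name."""
    info = school_info[name]
    return [info["enrollment"], info["avg_sat"]]

print(school_features("North High"))  # -> [800, 1250]
```

The name itself never enters the model; it only serves as the join key.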

It's also possible that the name of the school itself carries a little bit of signal. For example, a good school system might include magnet and/or lab schools. You could design features that extract these from the string containing the school name and add them, as indicator variables, to your feature set. This process, often called feature engineering, may require some domain knowledge.
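A small sketch of that kind of name-based feature extraction; the keywords below are illustrative guesses, not a real taxonomy of school types:

```python
# Engineer indicator features from the school-name string.
# The keyword list is hypothetical, chosen for illustration.

def name_features(school_name):
    """Flag name substrings that might signal school type."""
    lower = school_name.lower()
    return {
        "is_magnet": int("magnet" in lower),
        "is_lab":    int("lab" in lower),
    }

print(name_features("Lincoln Magnet High School"))
# -> {'is_magnet': 1, 'is_lab': 0}
```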

However, in some cases, you can work directly on the non-numeric data. This is particularly true when building a discriminative classifier (or anything else that operates on distances or similarities). For example, there are special kernels for support vector machines that allow you to operate directly on strings (e.g., http://www.jmlr.org/papers/volume2/lodhi02a/lodhi02a.pdf), without first turning them into something like a bag-of-words vector.
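To make the idea concrete, here is a toy p-spectrum kernel: similarity is the number of length-p substrings two strings share, counted with multiplicity. Real string kernels, such as the subsequence kernel in the linked paper, are more sophisticated; this is only meant to show that a kernel can consume raw strings:

```python
# Toy p-spectrum string kernel: inner product of length-p substring counts.
from collections import Counter

def spectrum_kernel(s, t, p=3):
    """Count shared length-p substrings (with multiplicity) of s and t."""
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[k] * ct[k] for k in cs)

print(spectrum_kernel("magnet school", "lab school"))  # -> 5
```

Any kernel method (an SVM, kernel ridge regression, etc.) can then use this similarity in place of a dot product over numeric features.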