Solved – Using Non-numeric Features

categorical-encoding, feature selection, generalized linear model, gradient descent, machine learning

I'm just starting out with machine learning. The example I was shown in a mini course I took was predicting the sale price of a house given features like:

  • size of house
  • number of floors
  • age of house
  • number of rooms

Given those features, it's straightforward to train an algorithm to minimize some cost function, which in the course I took was the sum of the squared differences between the values in the training set and the fitted function's predictions.
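For concreteness, the cost function described above can be sketched as follows. This is a minimal illustration, not the course's actual code; the feature values, weights, and prices are made up:

```python
# Minimal sketch of a squared-error cost for a linear model.
# All weights and training examples below are hypothetical.

def predict(weights, bias, features):
    """Linear model: weighted sum of numeric features plus a bias."""
    return bias + sum(w * x for w, x in zip(weights, features))

def cost(weights, bias, data):
    """Sum of squared differences between targets and predictions."""
    return sum((y - predict(weights, bias, x)) ** 2 for x, y in data)

# Toy training set: (size_sqft, floors, age, rooms) -> sale price
data = [
    ((1500, 2, 10, 6), 300_000),
    ((900, 1, 40, 4), 150_000),
]
print(cost([200, 0, 0, 0], 0, data))  # -> 900000000
```

Gradient descent would then adjust the weights and bias to shrink this number.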

My question: how would I work with a feature that is not represented as a number?

For example, what if one of the features was, say, the name of the closest high school?

Best Answer

In most cases, you find a way to turn the non-numeric feature into a numeric one, and then go from there.

The simplest solution is to generate a set of indicator variables (often called one-hot or dummy encoding). For example, if you have $n$ different schools, you might add a set of $n$ variables $S_1, S_2, \ldots, S_n$ to each data point. To indicate that the $i$th school on your list is the closest, set $S_i = 1$ and set the rest of the variables to zero. This works well when 1) the identity of the closest school matters and 2) you can enumerate the schools present in your data set.
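A minimal sketch of this encoding, with made-up school names standing in for whatever appears in your data:

```python
# One-hot (indicator) encoding of the "closest high school" feature.
# The school names are hypothetical, enumerated from the data set.

schools = ["Central High", "North High", "West High"]

def one_hot(school):
    """Return S_1..S_n: 1 for the matching school, 0 elsewhere."""
    return [1 if s == school else 0 for s in schools]

print(one_hot("North High"))  # -> [0, 1, 0]
```

These indicator columns then replace the raw name in your feature matrix. Libraries like pandas (`get_dummies`) and scikit-learn (`OneHotEncoder`) do the same thing at scale.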

You might also think that the school identity per se doesn't actually carry much information; it's just a proxy for information about school size, test scores, student-teacher ratio, etc. You could join your data set with another data source that has that sort of information. The features would then be things like "size of the nearest high school", "average SAT score at the nearest high school", etc.
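Such a join can be sketched as a simple lookup. The external table below is entirely hypothetical; in practice it might come from a public education data set:

```python
# Replace the school name with numeric facts about the school,
# joined in from a second (hypothetical) data source.

school_info = {  # hypothetical external table keyed by school name
    "Central High": {"enrollment": 1200, "avg_sat": 1150},
    "North High":   {"enrollment": 800,  "avg_sat": 1250},
}

def school_features(name):
    """Look up numeric features for the closest school by name."""
    info = school_info[name]
    return [info["enrollment"], info["avg_sat"]]

print(school_features("North High"))  # -> [800, 1250]
```

The name itself never enters the model; it only serves as the join key.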

It's also possible that the name of the school itself carries a little bit of signal. For example, a good school system might include magnet and/or lab schools. You could design features that extract these from the string containing the school name and add them, as indicator variables, to your feature set. This process, often called feature engineering, may require some domain knowledge.
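A small sketch of that kind of name-based feature extraction; the keywords below are illustrative guesses, not a real taxonomy of school types:

```python
# Engineer indicator features from the school-name string.
# The keyword list is hypothetical, chosen for illustration.

def name_features(school_name):
    """Flag name substrings that might signal school type."""
    lower = school_name.lower()
    return {
        "is_magnet": int("magnet" in lower),
        "is_lab":    int("lab" in lower),
    }

print(name_features("Lincoln Magnet High School"))
# -> {'is_magnet': 1, 'is_lab': 0}
```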

However, in some cases, you can work directly on the non-numeric data. This is particularly true when building a discriminative classifier (or anything else that operates on distances or similarities). For example, there are special kernels for support vector machines that allow you to operate directly on strings (e.g., http://www.jmlr.org/papers/volume2/lodhi02a/lodhi02a.pdf), without first turning them into something like a bag-of-words vector.
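To make the idea concrete, here is a toy p-spectrum kernel: similarity is the number of length-p substrings two strings share, counted with multiplicity. Real string kernels, such as the subsequence kernel in the linked paper, are more sophisticated; this is only meant to show that a kernel can consume raw strings:

```python
# Toy p-spectrum string kernel: inner product of length-p substring counts.
from collections import Counter

def spectrum_kernel(s, t, p=3):
    """Count shared length-p substrings (with multiplicity) of s and t."""
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[k] * ct[k] for k in cs)

print(spectrum_kernel("magnet school", "lab school"))  # -> 5
```

Any kernel method (an SVM, kernel ridge regression, etc.) can then use this similarity in place of a dot product over numeric features.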