Conv Neural Network – Can CNNs Be Useful with Encoded Categorical Features?

conv-neural-network, machine-learning, neural-networks

Convolutional neural networks (CNNs) have been very successful on numerical features / inputs / independent variables, such as images in computer vision or audio signals. They can be viewed as extracting useful information in the frequency domain. (See What is the intuition behind convolutional neural network?)

But would they be useful with categorical features? For example, suppose all of our features are categorical: a person's gender, race, education level, etc. What does convolution mean on a data matrix (one-hot encoded) like that?


For more background: my data are not images or audio, but more similar to the UCI income data. With the increasing popularity of deep learning and CNNs, I am curious whether they are also useful on non-image data.
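To make the question concrete, here is a minimal sketch of what such a one-hot encoded data matrix looks like; the feature names and values are illustrative, not taken from any particular dataset:

```python
import numpy as np

# Two hypothetical people described by purely categorical features.
rows = [
    {"gender": "female", "race": "asian", "education": "bachelors"},
    {"gender": "male",   "race": "white", "education": "hs-grad"},
]

# One indicator column per (feature, value) pair, in a fixed sorted order.
columns = sorted({(k, r[k]) for r in rows for k in r})
X = np.array([[1.0 if r.get(k) == v else 0.0 for (k, v) in columns]
              for r in rows])

print(columns)
print(X)  # each row is one person, one-hot encoded
```

The question is what it means to slide a convolution kernel across the columns of such a matrix, since adjacent columns have no spatial relationship.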

Best Answer

I would hazard a guess that in this case your convolution kernels end up memorizing groups of features and useful correlations between consecutive pairs of features. As a concrete example, imagine we have four features: (age, gender, education, race). You can then use, say, 1x2 convolution kernels with weights $w_1, w_2$, which slide across your features. So your kernel will extract weighted pairings:

$\text{out}_1 = w_1 \cdot \text{age} + w_2 \cdot \text{gender}$,

$\text{out}_2 = w_1 \cdot \text{gender} + w_2 \cdot \text{education}$,

$\text{out}_3 = w_1 \cdot \text{education} + w_2 \cdot \text{race}$.

The simplest kernel might look like $w_1=1$, $w_2=0$, in which case the outputs would just be age, gender, and education. Another kernel might be $w_1=0, w_2=1$, which would give you "race" in the last output, out3.
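The sliding weighted sums above can be sketched in a few lines of numpy; the feature values here are arbitrary placeholders, and the kernel is the 1x2 kernel from the example:

```python
import numpy as np

# Numerically encoded features: (age, gender, education, race).
# The specific values are illustrative only.
features = np.array([35.0, 1.0, 16.0, 2.0])

def conv1x2(x, w1, w2):
    """Slide a 1x2 kernel (w1, w2) across consecutive pairs of features."""
    return np.array([w1 * x[i] + w2 * x[i + 1] for i in range(len(x) - 1)])

print(conv1x2(features, 1.0, 0.0))  # picks out the first three features
print(conv1x2(features, 0.0, 1.0))  # picks out the last three, including race
```

With $w_1=1, w_2=0$ the outputs are just the first three features passed through unchanged, which is the "simplest kernel" case described above.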

When features are correlated and interact nonlinearly, you might see weights like $w_1=0.6, w_2=-0.3$. Then, when the outputs are passed along to the fully connected layers, maybe only out1 is used to capture the correlation between age and gender, and the fully connected layers will handle the rest.

Usually, though, convolutions are useful when your features are invariant under translation (an example is detecting faces anywhere in an image), which in this case they are not. So I think your network would just use, say, out1 from the first kernel, out2 from the second kernel, and so on. If there is some correlation, the learned weights will likely be messier, especially if training gets stuck in a local minimum.

You might think this would be useful in NLP, but the evidence there is slimmer. A well-known example is a convolutional neural network used for text classification:

https://arxiv.org/abs/1502.01710

The kernels there are defined over one-hot encodings of words. However, it is difficult to tell whether the network is doing anything more than memorizing sequences of words in a very expensive way. Example: "[blah] was delicious!" would strongly correlate with the label "food."

FastText was trained to do the same task, except that instead of learning kernels, they literally enumerated billions of n-grams and used each one as a feature in a sparse logistic classifier. The result was considerably faster (we're talking minutes versus days for training) and essentially as accurate:

https://arxiv.org/abs/1607.01759
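A minimal sketch of that style of n-gram feature extraction, using only the standard library (the classifier itself, e.g. a sparse logistic regression over these counts, is omitted; the sample sentence is illustrative):

```python
from collections import Counter

def ngram_features(text, n_max=2):
    """Count word n-grams up to length n_max as sparse bag-of-ngrams features."""
    words = text.lower().split()
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            feats[" ".join(words[i:i + n])] += 1
    return feats

print(ngram_features("the soup was delicious"))
```

Each extracted n-gram (e.g. the bigram "was delicious") becomes one feature in the sparse classifier, which is exactly the kind of local word pairing a 1xN convolution kernel would otherwise have to learn.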
