Solved – How to prepare the input layer for recurrent neural network if there are many categorical variables

Tags: categorical data, data transformation, machine learning, neural networks, recurrent neural network

I am building a recurrent neural network (RNN). The feature set contains many categorical variables, such as user and item IDs. If I one-hot encode each of them and concatenate the vectors into one big input vector, the result will be extremely sparse. Is it fine to do this? I am not sure whether this is normal for an RNN.

Is there any other way to handle this case?

Best Answer

The "default" way of dealing with categorical variables in neural networks is to use embeddings. The most popular use is word embeddings, where each word is represented by a dense vector (learned or pre-trained). The advantage of this approach is that the representation has much lower dimensionality than a one-hot encoding, and the learned vectors are usually meaningful, i.e. similar words end up with similar representations in the embedding space. The same idea can be applied to any other categorical variable. Sycorax gave one reference, the paper by Guo and Berkhahn, but you can also check other references and this Medium post. Embeddings have been used for categorical variables in Kaggle competitions, and you can also find many examples of recommender systems using such representations, for example this recent post on the Google Cloud blog.
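To make the dimensionality argument concrete, here is a minimal NumPy sketch of the lookup mechanics (the vocabulary sizes and embedding dimensions below are illustrative assumptions, not from the question; in a real model the embedding tables would be trainable parameters of the network rather than fixed random matrices):

```python
import numpy as np

# Assumed sizes for illustration: two categorical features (users, items).
n_users, n_items = 10_000, 50_000   # vocabulary sizes
d_user, d_item = 16, 32             # embedding dimensions

rng = np.random.default_rng(0)
# In a real model these tables are learned jointly with the RNN;
# here they are randomly initialised just to show the lookup.
user_emb = rng.normal(size=(n_users, d_user))
item_emb = rng.normal(size=(n_items, d_item))

def encode_step(user_id: int, item_id: int) -> np.ndarray:
    """Build one RNN time-step input by concatenating embedding lookups."""
    return np.concatenate([user_emb[user_id], item_emb[item_id]])

x = encode_step(user_id=42, item_id=1234)
print(x.shape)  # (48,) -- versus 60,000 dims for concatenated one-hots
```

The input fed to the RNN at each time step is the 48-dimensional dense vector, instead of a 60,000-dimensional sparse one-hot concatenation; frameworks such as Keras (`Embedding`) and PyTorch (`nn.Embedding`) provide this lookup as a trainable layer.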
