Machine Learning – Encoding Hierarchical Relations Between Samples for Improved Feature Engineering

categorical-encoding, feature-engineering, interaction, machine learning

I have a set of samples, each with a corresponding set of continuous features, which I am using to predict a particular property of the samples. However, the samples are organized in a hierarchy, say a family tree in which each sample relates to the others by some family relation (e.g. sister, second great cousin once removed, etc.), and I am trying to work out how to also include this hierarchical information in my model (currently a neural network).

So far I have considered:

  • Ordinal encoding. This preserves some important information (e.g. siblings will be consecutive integers), but some consecutive integers won't be closely related, since the relationship isn't really ordinal.
  • Picking a certain level in the hierarchy to group samples and assign category labels (e.g. every descendant of Grandfather Jack is in Group A), which can then be transformed with one-hot encodings or embeddings; a minimal sketch of this is given after this list. This second approach is more appealing, but some of the more detailed information is lost.
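For concreteness, here is a minimal sketch of the second option (group at a chosen level, then encode). It assumes, purely for illustration, that the ancestor label at that level is already available as a `group` column; the sample and group names are hypothetical.

```python
import pandas as pd

# Toy data: each sample carries a pre-computed group label, e.g. the name
# of its ancestor at the chosen level of the family tree (hypothetical).
df = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3", "s4"],
    "group":     ["jack", "jack", "mary", "mary"],
})

# Ordinal-style encoding of the group: map each label to an integer.
df["group_ordinal"] = df["group"].astype("category").cat.codes

# One-hot encoding of the same group labels.
one_hot = pd.get_dummies(df["group"], prefix="group")
df = pd.concat([df, one_hot], axis=1)
print(df)
```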

I'm wondering if any appropriate encoding exists for this kind of data?

Best Answer

One way to do this is to compute the pairwise similarity (or distance) between all samples using some metric, for example the number of edges separating two samples in the family tree. This gives a high-dimensional (n × n) matrix that preserves much of the important information. To make it usable as model input, apply a dimensionality reduction technique (e.g. PCA) to this matrix; because there is a high degree of dependency across its columns, the dimensionality reduction is very effective, and the resulting low-dimensional coordinates can be fed to the network alongside the continuous features.
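A minimal sketch of this idea, assuming the family tree is available as an undirected graph and using shortest-path distance in the tree as the pairwise metric; the names below are illustrative, not part of the original question.

```python
import networkx as nx
import numpy as np
from sklearn.decomposition import PCA

# Toy family tree: edges connect parent and child (hypothetical names).
tree = nx.Graph()
tree.add_edges_from([
    ("jack", "alice"), ("jack", "bob"),
    ("alice", "carol"), ("alice", "dave"),
    ("bob", "erin"),
])
samples = sorted(tree.nodes())

# Pairwise distance matrix: number of edges between each pair of samples.
dist = np.array([
    [nx.shortest_path_length(tree, a, b) for b in samples]
    for a in samples
], dtype=float)

# Reduce the n x n matrix to a few coordinates per sample; because the
# columns of the distance matrix are highly correlated, a handful of
# components captures most of the structure.
coords = PCA(n_components=2).fit_transform(dist)

# coords[i] can now be appended to the continuous features of samples[i].
for name, xy in zip(samples, coords):
    print(name, xy)
```

Classical multidimensional scaling (MDS) applied to the distance matrix is a closely related alternative to PCA here and yields similar low-dimensional coordinates.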