Machine Learning – Should Data Be Separated or Combined When Using an Extra Categorical Feature?

Tags: data-transformation, dataset, feature-selection, interaction, machine-learning

I have a regression/classification problem. The dataset contains data from 4 sensors at 4 positions (1, 2, 3, 4). The processes measured at all 4 positions are equivalent, and the same label and features describe all 4 positions, but the data are not strictly identical. I am considering 3 options:

  • ignore position -> one dataset, one prediction model
  • use position as an extra categorical variable -> one dataset, one model (see the sketch after this list)
  • split the data into 4 datasets and build a pipeline for each position separately
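
For the second option, a minimal sketch of what "position as an extra categorical variable" can look like, assuming the merged data sit in a pandas DataFrame; the column names (`position`, `feature_a`, `quality`) are placeholders, not from the actual dataset:

```python
import pandas as pd

# Toy merged dataset: "position" is the sensor position (categorical),
# "quality" is the process-quality label; values are illustrative only.
df = pd.DataFrame({
    "position":  [1, 2, 3, 4, 1, 2],
    "feature_a": [0.1, 0.4, 0.3, 0.2, 0.5, 0.6],
    "quality":   [1, 0, 1, 1, 0, 1],
})

# One-hot encode position so a single model can consume it as an extra
# categorical feature; drop_first avoids a redundant dummy column.
X = pd.get_dummies(df.drop(columns="quality"),
                   columns=["position"], drop_first=True)
y = df["quality"]
```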

The dataset consists of 50-70 features (depending on feature selection) and about 200 samples, i.e. 50 per position (hopefully I will get some more). I am afraid that the difference between positions would overshadow the difference between samples, so I wouldn't be able to predict the label (quality of the process). On the other hand, splitting the dataset means fewer samples per model and a more complicated development and deployment process.

As a measure of difference between positions versus variability of samples within one position, I compared the difference of means between two positions with the standard deviation of every feature within one position. They seem to be of comparable size for most features.
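
For reference, that comparison can be computed along these lines, assuming the same hypothetical DataFrame `df` from the sketch above; all names are placeholders:

```python
# Within-position std vs. between-position mean difference, per feature.
features = [c for c in df.columns if c not in ("position", "quality")]

means = df.groupby("position")[features].mean()  # per-position feature means
stds  = df.groupby("position")[features].std()   # within-position stds

# |mean(pos 2) - mean(pos 3)| relative to the within-position std of pos 2;
# values near or above 1 suggest position differences rival sample variability.
ratio = (means.loc[2] - means.loc[3]).abs() / stds.loc[2]
print(ratio.sort_values(ascending=False))
```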

What other metrics, criteria, or factors should I use in the decision process? Is there some 4th option I haven't thought of?

(Position 1 is actually a little different, so I will at least split the dataset into position 1 versus positions 2, 3, 4.)

Pairplots: [figure: pairplots of randomly selected features]
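
Such pairplots can be drawn with seaborn along these lines, again assuming the hypothetical `df`; the feature names stand in for the randomly selected ones:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Plot a few features against each other, colored by position, to eyeball
# whether positions separate in feature space.
subset = ["feature_a", "feature_b", "feature_c"]  # placeholder names
sns.pairplot(df, vars=subset, hue="position")
plt.show()
```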

Best Answer

I think you can merge the datasets and randomly shuffle the points. If you are using a supervised learning approach, a model trained on the merged data with the categorical position variable will be more robust. If you want an unsupervised model, however, I recommend not using the categorical variable and keeping separate pipelines.
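
As a sketch of that merge-and-shuffle setup, reusing the one-hot-encoded `X` and label `y` from the question's sketch; the model choice (a random forest) is an assumption, not part of this answer:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Shuffle and split the merged data, then fit one model across all positions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```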

If you use separate pipelines for your supervised approach, you can also train on one position's data and test on another to check how the model generalizes. Positions with similar distributions should give similar results. This is a good way to understand how the model behaves in an unseen setting and what the learned feature space looks like.
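
One standard variant of this cross-position check is scikit-learn's LeaveOneGroupOut, which trains on three positions and tests on the held-out one (rather than strictly one-vs-one); the names below are again hypothetical:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Hold out one position at a time: train on the other three, test on it.
features = [c for c in df.columns if c not in ("position", "quality")]
groups = df["position"]          # which position each sample came from
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         df[features], df["quality"],
                         groups=groups, cv=LeaveOneGroupOut())
print(dict(zip(sorted(groups.unique()), scores)))  # score per held-out position
```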

There should not be any issue with predicting labels after merging the datasets: in a supervised approach the model will learn the mapping from your features (including position) to the label.
