Solved – Machine Learning model with aggregated data as training

Tags: aggregation, data transformation, machine learning, percentage

I would like to predict a LABEL: A,B or C using a classification machine learning model.

My data to train the model is like:

LABEL AGE12-18 AGE19-24 AGE25-35
A         10         30      60
B         40         20      40
C          5          5      90  

Where AGE12-18, AGE19-24 and AGE25-35 are the percentages of users aged [12, 19), [19, 25) and [25, 35) in each cluster. Then

AGE12-18+AGE19-24+AGE25-35=100%

So, I have aggregations of A,B,C instead of all the data.

I would like to transform this data so that I can predict the label of individual users, given data like:

USER AGE    AGECAT
a    24   AGE19-24
b    32   AGE25-35

I was thinking of creating a new dataset with the same percentage of users in each cluster, such as:

LABEL               AGECAT
A       AGE12-18 x 10 rows
A       AGE19-24 x 30 rows
A       AGE25-35 x 60 rows

However, I don't really like this solution, as I am not sure whether it is going to work.
I have seen a similar question with an aggregated dependent variable, but none with aggregated independent variables.

Does anybody know if this is correct, or of any other way to build a classification model with this data?
Thank you

Best Answer

If you only have the percentages of each age category within each label, then that does not let you do much to predict the labels*. You'd need the number** of people within each cell of your table. Creating a row per person would indeed work in the way you mention.

* The problem is that you do not know how common each label is overall. Even if you know the age breakdown within a particular label, your prediction still depends enormously on whether that label occurs in 0.1%, 50% or 99.9% of people overall.
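
To make the point about label frequency concrete, here is a minimal sketch in Python that applies Bayes' rule to the percentages from the question. The two sets of overall label frequencies (`uniform_prior` and `skewed_prior`) are made-up assumptions for illustration only, as the question does not give them; the function name `posterior` is likewise hypothetical.

```python
# P(age category | label), taken from the table in the question (as fractions).
AGE_GIVEN_LABEL = {
    "A": {"AGE12-18": 0.10, "AGE19-24": 0.30, "AGE25-35": 0.60},
    "B": {"AGE12-18": 0.40, "AGE19-24": 0.20, "AGE25-35": 0.40},
    "C": {"AGE12-18": 0.05, "AGE19-24": 0.05, "AGE25-35": 0.90},
}


def posterior(agecat, prior):
    """Bayes' rule: P(label | agecat) is proportional to P(agecat | label) * P(label)."""
    joint = {lab: AGE_GIVEN_LABEL[lab][agecat] * prior[lab] for lab in AGE_GIVEN_LABEL}
    total = sum(joint.values())
    return {lab: p / total for lab, p in joint.items()}


# Hypothetical overall label frequencies (assumptions, NOT given in the question).
uniform_prior = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}
skewed_prior = {"A": 0.001, "B": 0.001, "C": 0.998}

for name, prior in [("uniform", uniform_prior), ("skewed", skewed_prior)]:
    post = posterior("AGE12-18", prior)
    best = max(post, key=post.get)  # most probable label for a 12-18 year old
    print(name, best, {lab: round(p, 3) for lab, p in post.items()})
```

With the same age breakdown, the most probable label for a user in AGE12-18 changes depending on which prior is assumed, which is why the percentages alone are not enough.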

** Percentages of each label within each age category would also work for getting a prediction, but without knowing the numbers behind the percentages you would not be able to characterize the performance of your model or the uncertainty in your predictions (even if your data is a sample from the true population of interest).
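
The row-per-person idea can be sketched as follows, assuming the number of people behind each label is known. The group sizes in `GROUP_SIZE` are invented for illustration (the question does not provide them), and `predict` is a hypothetical helper that simply returns the most frequent label within an age category rather than a fitted model.

```python
from collections import Counter, defaultdict

# Percentages per label, taken from the table in the question.
PERCENT = {
    "A": {"AGE12-18": 10, "AGE19-24": 30, "AGE25-35": 60},
    "B": {"AGE12-18": 40, "AGE19-24": 20, "AGE25-35": 40},
    "C": {"AGE12-18": 5, "AGE19-24": 5, "AGE25-35": 90},
}

# Hypothetical number of people per label (an assumption, NOT in the question).
GROUP_SIZE = {"A": 200, "B": 100, "C": 1000}

# Expand every cell of the table into one (label, agecat) row per person.
rows = []
for label, dist in PERCENT.items():
    for agecat, pct in dist.items():
        n_people = round(GROUP_SIZE[label] * pct / 100)
        rows.extend([(label, agecat)] * n_people)

# Count labels within each age category; this is the training signal
# any classifier fitted on `rows` would see.
counts = defaultdict(Counter)
for label, agecat in rows:
    counts[agecat][label] += 1


def predict(agecat):
    """Return the most frequent label among people in this age category."""
    return counts[agecat].most_common(1)[0][0]


for agecat in ("AGE12-18", "AGE19-24", "AGE25-35"):
    print(agecat, predict(agecat), dict(counts[agecat]))
```

The expanded `rows` list could equally be passed to a standard classifier; the counting step is used here only to keep the sketch free of dependencies. Because the real counts are carried through, the size of each cell is available for assessing uncertainty.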
