Solved – model for machine learning on non-aggregated data, where we have a target variable, but also a grouping variable

Tags: aggregation, classification, machine learning

Background:
I am currently building a predictive model on client shopping data to see whether clients can be classified into one of nine ordinal categories according to their spending habits. Our target variable is ‘client category’.

There are many rows of data per client (anywhere from one to a thousand).
So both the training data and the testing data have multiple rows per client ID, though the number of rows differs from client to client. It would seem to me that one approach is to aggregate the data into one row per client (aggregating over the grouping variable ‘client ID number’) and then apply some modelling techniques to this aggregated dataset.

I have tried this with a multi-layer perceptron, a decision tree and random forests. With this approach I get fair predictions overall (as high as 90%), but the models perform very poorly on some of the ordinal categories (the low-frequency ones don't appear in the predictions at all). I even converted the ordinal variable to a continuous scale, and the predictions were still not as good as I would like.
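To make the gap between the overall score and the rare categories concrete, here is a minimal sketch that compares overall accuracy with per-category recall. The labels are invented for illustration (they are not my real data), and the helper names `accuracy` and `recall_per_category` are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of clients whose predicted category matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_per_category(y_true, y_pred):
    """For each true category, the fraction of its clients predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in sorted(totals)}

# Invented labels: 9 clients in category 2 and 1 client in a rare category 7.
y_true = [2] * 9 + [7]
# A model that never predicts the rare category.
y_pred = [2] * 10

overall = accuracy(y_true, y_pred)
per_category = recall_per_category(y_true, y_pred)
```

In this toy case the overall accuracy looks good while the recall for the rare category is zero, which is the pattern described above.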

I have tried to improve this model by aggregating with other summary statistics, such as the median and standard deviation and not just the mean, in order to reduce the information lost on aggregation. This has not improved the model much.
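As a rough sketch of the aggregation step, assuming each row is a `(client_id, amount)` pair (the column `amount` and the function `aggregate_by_client` are hypothetical names, and the data is invented), the per-client summary could look like this using only the standard library. With pandas, a `groupby` followed by `agg` would be the usual equivalent.

```python
from collections import defaultdict
from statistics import mean, median, pstdev

def aggregate_by_client(rows):
    """Collapse (client_id, amount) rows into one summary dict per client."""
    grouped = defaultdict(list)
    for client_id, amount in rows:
        grouped[client_id].append(amount)
    return {
        client_id: {
            "n_rows": len(amounts),
            "mean": mean(amounts),
            "median": median(amounts),
            # Population standard deviation is defined even for a single
            # row (it is 0.0), unlike the sample standard deviation.
            "std": pstdev(amounts),
        }
        for client_id, amounts in grouped.items()
    }

# Invented toy data: client "a" has three rows, client "b" has one.
rows = [("a", 10.0), ("a", 20.0), ("a", 30.0), ("b", 5.0)]
features = aggregate_by_client(rows)
```

Keeping `n_rows` as a feature is a design choice in this sketch: since clients have anywhere from one to a thousand rows, the row count itself may carry information that the mean and median discard.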

Question:
Given the assumption that we can sometimes lose information by aggregating data, is there a model for machine learning on non-aggregated data, where we have a target variable, but also a grouping variable (in this case client ID)?

My gut feeling is that time series analysis is not too different, just with time as the extra factor, so I think it should be possible to add a grouping variable as an extra factor in a similar way.

I understand this may result in a slow model (computationally).

P.S. I did try a two-step approach: first I applied machine learning algorithms to the initial (non-aggregated) dataset to produce some dummy predictor variables, then I aggregated those predictors along with the rest of the data and fed them into a new model. This did not seem to produce good results.

Best Answer

This question is ancient, but it seems like you are looking for ordinal regression. Basically, for your 9 ordinal categories, you want to build 8 binary classifiers. The first classifier would answer "is the category greater than 1, or less than or equal to 1?" The second would answer "is the category greater than 2, or less than or equal to 2?", and so on up to the eighth threshold.
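A minimal sketch of that decomposition, under some assumptions of mine: categories are coded 1 to 9, each binary classifier outputs an estimate of P(category > k), and the function names and example probabilities below are hypothetical. The first function builds the 8 binary targets; the other two show one common way to turn the 8 outputs back into a single category, by differencing the cumulative probabilities.

```python
def ordinal_to_binary_targets(labels, n_categories=9):
    """For each threshold k = 1..n_categories-1, build the target 'label > k'."""
    return {
        k: [1 if y > k else 0 for y in labels]
        for k in range(1, n_categories)
    }

def category_probabilities(p_greater, n_categories=9):
    """Turn P(y > k), k = 1..n_categories-1, into P(y = c), c = 1..n_categories."""
    # Pad with P(y > 0) = 1 and P(y > n_categories) = 0.
    cumulative = [1.0] + list(p_greater) + [0.0]
    # Independently trained classifiers can give non-monotone estimates,
    # so negative differences are clipped to 0.
    return [
        max(cumulative[c - 1] - cumulative[c], 0.0)
        for c in range(1, n_categories + 1)
    ]

def predict_category(p_greater, n_categories=9):
    """Pick the category (1-based) with the largest estimated probability."""
    probs = category_probabilities(p_greater, n_categories)
    return 1 + max(range(n_categories), key=probs.__getitem__)

# Three example labels and the 8 binary targets derived from them.
targets = ordinal_to_binary_targets([1, 3, 9])

# Invented outputs of the 8 classifiers for one client: P(y > 1), ..., P(y > 8).
p_greater = [0.9, 0.8, 0.3, 0.1, 0.0, 0.0, 0.0, 0.0]
predicted = predict_category(p_greater)
```

Each of the 8 targets can be fed to any binary classifier you already use (a random forest, for example). Because every classifier except the extreme thresholds sees a reasonable mix of both classes, this setup may cope better with the low-frequency categories than a single 9-way classifier, though that is something to check on your data.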

If you're still looking for a solution to the problem, I'd look at the literature on ordinal regression. It may fit your problem better than aggregating your data or doing more complicated techniques first.
