Solved – Clustering as a means of splitting up data for logistic regression

clustering, data mining, logistic

I'm trying to predict the success or failure of students based on some features with a logistic regression model. To improve the performance of the model, I've already thought about splitting up the students into different groups based on obvious differences and building separate models for each group. But I think it might be difficult to identify these groups by examination, so I thought of splitting the students up by clustering on their features. Is this a common practice in building such models? Would you suggest that I break it down into obvious groups (for example, first term students vs. returning students) and then perform clustering on those groups, or cluster from the start?

To try to clarify:

What I mean is that I'm considering using a clustering algorithm to break my training set for the logistic regression into groups. I would then do separate logistic regressions for each of those groups. Then when using the logistic regression to predict the outcome for a student, I would choose which model to use based on which group they best fit into.
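For concreteness, a minimal sketch of this two-stage idea (cluster the training data, then fit one logistic regression per cluster) might look like the following in scikit-learn; the feature matrix X, the labels y, and the choice of k = 3 clusters are placeholders, not my real student data:

```python
# Minimal sketch: cluster the training set, fit one logistic regression per cluster,
# and route new students to the model of their nearest cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                           # placeholder student features
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)    # placeholder pass/fail outcome

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Step 1: cluster the training data into k groups (k chosen arbitrarily here).
k = 3
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)

# Step 2: fit a separate logistic regression on each cluster.
# (In practice, check that every cluster contains both outcomes before fitting.)
models = {}
for c in range(k):
    mask = kmeans.labels_ == c
    models[c] = LogisticRegression().fit(X_scaled[mask], y[mask])

# Step 3: to score a new student, assign them to a cluster and use that cluster's model.
def predict(x_new):
    x_new = scaler.transform(x_new.reshape(1, -1))
    c = kmeans.predict(x_new)[0]
    return models[c].predict_proba(x_new)[0, 1]

print(predict(rng.normal(size=4)))
```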

Perhaps I could do the same thing by including a group identifier, for example, a 1 if the student is returning and a 0 if not.

Now you've got me thinking about whether it might be advantageous to cluster the training data set and use the cluster label as a feature in the logistic regression, rather than building separate logistic regression models for each population.

If it's useful to include a group identifier for those who are returning students vs. new students, might it also be useful to expand the list of groups? Clustering seems like a natural way to do this.
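Sketched out, that alternative might look like this: cluster once, one-hot encode the cluster label, and fit a single logistic regression on the augmented features. Again, X, y, and the number of clusters below are placeholders rather than real student data:

```python
# Minimal sketch: one logistic regression over all students, with the cluster label
# added as a one-hot encoded feature.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))                           # placeholder features
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)    # placeholder outcome

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_.reshape(-1, 1)

# One-hot encode the cluster label and append it to the original features.
enc = OneHotEncoder().fit(labels)
X_aug = np.hstack([X, enc.transform(labels).toarray()])

model = LogisticRegression().fit(X_aug, y)

# Scoring a new student: assign a cluster, encode it, then predict as usual.
x_new = rng.normal(size=(1, 4))
label_new = kmeans.predict(x_new).reshape(-1, 1)
x_new_aug = np.hstack([x_new, enc.transform(label_new).toarray()])
print(model.predict_proba(x_new_aug)[0, 1])
```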

I hope that's clear …

Best Answer

I believe that if you have a significant difference in your dependent variable between your clusters, then clustering first will definitely be helpful, regardless of your chosen learning algorithm.

It is my opinion that running a learning algorithm on the whole data set can cover up meaningful differences at a lower level of aggregation.

Has anyone heard of Simpson's paradox? It is an extreme case of a deeper problem in which different groups have different correlations that are covered up by the noise and/or weaker correlations of the larger, pooled sample.
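Here is a toy numeric illustration of that point (entirely made-up numbers): within each group the relationship between x and y is negative, but pooling the two groups flips the sign.

```python
# Toy Simpson's paradox: negative trend within each group, positive trend when pooled.
import numpy as np

rng = np.random.default_rng(2)

# Group A sits at low x and low y; group B at high x and high y.
# Within each group, y decreases with x.
x_a = rng.uniform(0, 1, 200)
y_a = 1 - 0.5 * x_a + rng.normal(0, 0.1, 200)
x_b = rng.uniform(2, 3, 200)
y_b = 3 - 0.5 * x_b + rng.normal(0, 0.1, 200)

print(np.corrcoef(x_a, y_a)[0, 1])                          # negative within group A
print(np.corrcoef(x_b, y_b)[0, 1])                          # negative within group B
print(np.corrcoef(np.r_[x_a, x_b], np.r_[y_a, y_b])[0, 1])  # positive when pooled
```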
