Solved – How to use data analysis output (e.g. clustering) in predictive regression

clustering, regression

I performed some data analysis and visualizations on my dataset and found there are likely $k$ clusters present. How can I use this in a predictive regression setting?

My first thought is to fit a separate regression model within each cluster: given a test point $x$, decide which cluster it belongs to and predict with that cluster's model. Note that each regression is completely independent of the data in the other clusters.
But this seems data-inefficient, especially when a cluster is small.
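
For concreteness, here is a minimal sketch of the per-cluster approach described above. It assumes cluster labels are already available for the training data and that a new point is assigned to the nearest cluster centroid; the question does not say how assignment is done, so that rule is only one possible choice. The helper names (`fit_per_cluster`, `predict_per_cluster`) and the synthetic data are made up for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_ols(beta, X):
    """Predict from coefficients produced by fit_ols."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ beta

def fit_per_cluster(X, y, labels):
    """Fit one independent regression per cluster and remember each centroid."""
    models = {}
    for c in np.unique(labels):
        mask = labels == c
        models[int(c)] = {
            "centroid": X[mask].mean(axis=0),
            "beta": fit_ols(X[mask], y[mask]),
        }
    return models

def predict_per_cluster(models, X_new):
    """Assign each new point to the nearest centroid (an assumed rule),
    then predict with that cluster's own regression."""
    keys = list(models)
    centroids = np.stack([models[k]["centroid"] for k in keys])
    dists = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    y_hat = np.empty(len(X_new))
    for i, k in enumerate(keys):
        mask = nearest == i
        if mask.any():
            y_hat[mask] = predict_ols(models[k]["beta"], X_new[mask])
    return y_hat

# Synthetic data (hypothetical): two well-separated clusters with different slopes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5.0, 1.0, size=(100, 1)),
               rng.normal(5.0, 1.0, size=(100, 1))])
labels = np.repeat([0, 1], 100)
y = np.where(labels == 0, 1.0 + 2.0 * X[:, 0], -3.0 - 1.0 * X[:, 0])
y = y + rng.normal(0.0, 0.1, size=200)

models = fit_per_cluster(X, y, labels)
y_hat = predict_per_cluster(models, np.array([[-5.0], [5.0]]))
```

Each cluster's model sees only its own rows, which is exactly the source of the data-inefficiency concern: a cluster with few points gets a noisy fit regardless of how much data the other clusters hold.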

On the other hand, the performance of a global regression model is not necessarily affected by the presence of clusters, so I can ignore the clusters. But this seems like a waste of knowledge.

TLDR: How can I use clusters in regression?

Best Answer

The shortest answer: it depends. The appearance of several clusters in the data is a strong hint that several data-generating processes are in play, and it is quite possible that they have different error-term properties. I disagree with @Zach, since the cluster classification adds no new information to the regression, and the differences in effects would simply be dumped into the "cluster" indicator. In my opinion, the first question to ask is: are the coefficients in the cluster-specific regressions significantly different from each other (or from the pooled regression)?
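
One way to check whether the coefficients differ is a Chow-type F-test that compares the residual sum of squares of the pooled fit with the summed residual sums of squares of the cluster-specific fits. The sketch below is only an illustration, not necessarily the test the answer has in mind. It assumes ordinary least squares with an intercept and the same error variance in every cluster, which may not hold if the clusters really do have different error-term properties. The function name `chow_f_statistic` and the synthetic data are hypothetical.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def chow_f_statistic(X, y, labels):
    """F statistic for H0: all clusters share the same coefficients.

    Returns (F, numerator df, denominator df). Assumes equal error
    variance across clusters.
    """
    clusters = np.unique(labels)
    k = len(clusters)
    n, p = len(y), X.shape[1] + 1          # p counts the intercept
    rss_pooled = rss(X, y)
    rss_split = sum(rss(X[labels == c], y[labels == c]) for c in clusters)
    df_num = (k - 1) * p
    df_den = n - k * p
    f_stat = ((rss_pooled - rss_split) / df_num) / (rss_split / df_den)
    return f_stat, df_num, df_den

# Synthetic data (hypothetical): two clusters in x.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-5.0, 1.0, size=(100, 1)),
               rng.normal(5.0, 1.0, size=(100, 1))])
labels = np.repeat([0, 1], 100)
noise = rng.normal(0.0, 0.5, size=200)

# Case 1: the clusters follow different regression lines.
y_diff = np.where(labels == 0, 1.0 + 2.0 * X[:, 0], -3.0 - 1.0 * X[:, 0]) + noise
# Case 2: both clusters follow the same line.
y_same = 1.0 + 2.0 * X[:, 0] + noise

f_diff, df_num, df_den = chow_f_statistic(X, y_diff, labels)
f_same, _, _ = chow_f_statistic(X, y_same, labels)
```

A large statistic relative to the $F(\text{df\_num}, \text{df\_den})$ distribution suggests the cluster-specific coefficients differ; a p-value can be read off that distribution, for example with `scipy.stats.f.sf(f_stat, df_num, df_den)`.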

Especially if you are severely short of degrees of freedom (few observations per cluster relative to the number of coefficients), the pooled regression looks much more promising.

So far we have been talking about in-sample properties of your data set, but since you want to use the model for prediction, it is out-of-sample performance that should ultimately drive your decision.
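
A simple way to compare out-of-sample performance is to hold out part of the data and score the pooled and the cluster-specific models on it. The sketch below is a rough illustration under several assumptions: a single random train/test split, mean squared error as the metric, and cluster labels already known for the held-out points (in practice they would have to be assigned first). `holdout_mse` and the synthetic data are hypothetical.

```python
import numpy as np

def design(X):
    """Add an intercept column."""
    return np.column_stack([np.ones(len(X)), X])

def fit(X, y):
    """OLS coefficients via least squares."""
    beta, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return beta

def holdout_mse(X, y, labels, test_frac=0.3, seed=0):
    """Test-set MSE of a pooled regression vs. per-cluster regressions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(test_frac * len(y))
    test, train = idx[:n_test], idx[n_test:]

    pooled_pred = design(X[test]) @ fit(X[train], y[train])

    # Start from the pooled prediction so that clusters with too little
    # training data fall back to the pooled model.
    cluster_pred = pooled_pred.copy()
    for c in np.unique(labels[train]):
        tr = train[labels[train] == c]
        te = labels[test] == c
        if te.any() and len(tr) > X.shape[1] + 1:
            cluster_pred[te] = design(X[test][te]) @ fit(X[tr], y[tr])

    def mse(pred):
        return float(np.mean((y[test] - pred) ** 2))

    return {"pooled": mse(pooled_pred), "per_cluster": mse(cluster_pred)}

# Synthetic data (hypothetical): two clusters with different regression lines.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-5.0, 1.0, size=(100, 1)),
               rng.normal(5.0, 1.0, size=(100, 1))])
labels = np.repeat([0, 1], 100)
y = np.where(labels == 0, 1.0 + 2.0 * X[:, 0], -3.0 - 1.0 * X[:, 0])
y = y + rng.normal(0.0, 0.1, size=200)

scores = holdout_mse(X, y, labels)
```

On data like this, where the clusters genuinely follow different lines, the per-cluster error should come out well below the pooled one; on data with shared coefficients and small clusters the ranking can flip. Repeating the split (or using cross-validation) gives a more stable comparison than a single holdout.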
