Clustering data based on regression coefficients

clustering, k-means, regression coefficients

Context: In my master's thesis, I am examining how maintainability issues evolve over time in a set of around 2000 Android applications. For every application in the dataset, I have the count of reported maintainability issues for each week of the application's lifetime.

Furthermore, for every application I am fitting different regression models on the data (linear, quadratic, cubic, quartic) to obtain fitted model coefficients for each app’s evolution of maintainability issues.

In order to cluster the apps based on their similarity, the idea is to apply the K-means clustering algorithm, with regression model coefficients as inputs. In this way, I am hoping that K-means will automatically cluster the applications based on their similarities in the fitted models.

Since this is my first encounter with machine learning techniques, my question is: is this a viable approach? Does it make sense to feed the regression coefficients to k-means? Or should I just feed the raw data points?

Thanks in advance!

Best Answer

I had a very similar question, and resources to answer it are thin on the ground. However, this paper (cited in full below) will be of use.

Basically, I believe the approach you suggested is OK. That said, I would not cluster linear models together with quadratic ones and so on, since coefficients from models of different degree are not directly comparable; clustering like with like makes more sense. While there is little information on Stack Exchange and other online statistics resources, this kind of clustering of regression outputs has seen use in the academic community.
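To make the "like with like" idea concrete, here is a minimal sketch in Python, assuming NumPy and scikit-learn are available. Everything in it is illustrative rather than taken from the question: the data are synthetic, the helper name `fit_coefficients` is made up, and rescaling each app's lifetime to [0, 1] is my own assumption for making apps with different numbers of weeks comparable. Every app is fitted with the same polynomial degree, and the resulting coefficient vectors are passed to k-means.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def fit_coefficients(counts, degree=2):
    """Fit one polynomial of a fixed degree to a single app's weekly counts.

    Hypothetical helper for this sketch. Time is rescaled to [0, 1] so that
    apps with different lifetimes share a common axis (an assumption).
    Returns coefficients ordered from the constant term upwards.
    """
    t = np.linspace(0.0, 1.0, len(counts))
    return np.polynomial.polynomial.polyfit(t, counts, degree)

# Synthetic stand-in data: 30 apps whose issue counts grow, 30 whose counts shrink.
apps = []
for i in range(60):
    n_weeks = int(rng.integers(20, 80))      # lifetimes differ between apps
    t = np.linspace(0.0, 1.0, n_weeks)
    slope = 50.0 if i < 30 else -50.0
    apps.append(100.0 + slope * t + rng.normal(0.0, 2.0, n_weeks))

# Same degree for every app, so all coefficient vectors are comparable.
coefs = np.array([fit_coefficients(c, degree=2) for c in apps])  # shape (60, 3)

# Cluster the coefficient vectors directly.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coefs)
```

Note that the coefficients are clustered unscaled here. Whether and how to rescale them is a real modelling choice, because any rescaling is a linear transformation and can change the clusters k-means finds.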

EDIT - Here's a relevant excerpt from the abstract:

Functional data can be clustered by plugging estimated regression coefficients from individual curves into the k-means algorithm. Clustering results can differ depending on how the curves are fit to the data. Estimating curves using different sets of basis functions corresponds to different linear transformations of the data. k-means clustering is not invariant to linear transformations of the data. The optimal linear transformation for clustering will stretch the distribution so that the primary direction of variability aligns with actual differences in the clusters. It is shown that clustering the raw data will often give results similar to clustering regression coefficients obtained using an orthogonal design matrix.
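The point about basis functions can be checked numerically. The following is a small NumPy sketch under my own assumptions, not code from the paper: all curves are observed on one shared time grid, the data are synthetic, and the orthonormal basis is obtained by a QR decomposition of the monomial design matrix. It shows that coefficients from an orthonormal design matrix preserve the Euclidean distances between fitted curves, while monomial coefficients do not, which is why k-means can give different answers for the two parameterisations.

```python
import numpy as np

rng = np.random.default_rng(1)

n_curves, n_points, degree = 40, 25, 2
t = np.linspace(0.0, 1.0, n_points)

# Monomial design matrix with columns 1, t, t^2.
V = np.vander(t, degree + 1, increasing=True)
# Orthonormal basis for the same column space (Q.T @ Q is the identity).
Q, R = np.linalg.qr(V)

# Synthetic curves on the shared grid, one row per curve.
true_coefs = rng.normal(size=(n_curves, degree + 1))
Y = true_coefs @ V.T + rng.normal(0.0, 0.1, size=(n_curves, n_points))

# Least-squares coefficients under each basis, one row per curve.
beta_mono = np.linalg.lstsq(V, Y.T, rcond=None)[0].T   # shape (n_curves, 3)
beta_orth = Y @ Q                                      # shape (n_curves, 3)

# The fitted curves are identical whichever basis is used.
fitted = beta_orth @ Q.T                               # shape (n_curves, n_points)

def pairwise_sq_dists(A):
    """Squared Euclidean distance between every pair of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return (diff ** 2).sum(axis=-1)

same_fit = np.allclose(beta_mono @ V.T, fitted)
orth_preserves = np.allclose(pairwise_sq_dists(beta_orth), pairwise_sq_dists(fitted))
mono_preserves = np.allclose(pairwise_sq_dists(beta_mono), pairwise_sq_dists(fitted))
```

Here `same_fit` and `orth_preserves` come out true and `mono_preserves` false: the two coefficient sets describe the same curves, but only the orthonormal ones give k-means the same geometry as the fitted curves themselves.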

The paper is Tarpey, Thaddeus. “Linear Transformations and the k-Means Clustering Algorithm: Applications to Clustering Curves.” The American Statistician 61.1 (2007): 34–40. PMC. Web. 10 Jan. 2018.
