Context: In my master's thesis, I am examining the evolution of maintainability issues over time in a set of around 2000 Android applications. For every application in the dataset, I have the counts of reported maintainability issues for each week of the application's lifetime.
Furthermore, for every application I am fitting several regression models (linear, quadratic, cubic, and quartic) to the data, to obtain fitted model coefficients describing each app's evolution of maintainability issues.
To cluster the apps based on their similarity, the idea is to apply the K-means clustering algorithm with the regression model coefficients as inputs. In this way, I am hoping that K-means will automatically group applications whose fitted models are similar.
Since this is my first encounter with machine learning techniques, my question is: Is this a viable approach? Does it make sense to feed the regression coefficients to K-means, or should I just feed it the raw data points?
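To make the proposed pipeline concrete, here is a minimal sketch in Python using NumPy and scikit-learn. The data are synthetic toy counts (the real dataset has ~2000 apps), and the variable names and cluster count are illustrative assumptions, not part of the original question:

```python
# Sketch of the proposed approach: fit one polynomial per app,
# then run K-means on the fitted coefficients.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy data: weekly maintainability-issue counts for 10 hypothetical apps
# over 52 weeks (stand-in for the real ~2000-app dataset).
apps = {f"app_{i}": rng.poisson(lam=5 + 0.3 * np.arange(52)) for i in range(10)}

degree = 2  # quadratic; the same idea applies to linear/cubic/quartic
coeffs = []
for counts in apps.values():
    weeks = np.arange(len(counts))
    # polyfit returns the fitted coefficients, highest order first
    coeffs.append(np.polyfit(weeks, counts, deg=degree))
X = np.array(coeffs)  # shape: (n_apps, degree + 1)

# Cluster apps by their coefficient vectors (k=3 chosen arbitrarily here).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)  # one cluster label per app
```

In practice one would choose the number of clusters with a criterion such as the elbow method or silhouette scores rather than fixing it at 3.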
Thanks in advance!
Best Answer
I had a very similar question, and resources to answer it are thin on the ground. However, the paper cited below will be of use.
Basically, I believe the approach you suggest is OK, although I would not cluster coefficients from linear models together with those from quadratic (or higher-order) models: clustering like with like makes more sense. While there is little information on Stack Exchange and other online statistics resources, this kind of clustering of regression outputs has seen use in the academic community.
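One practical caveat worth adding (my own assumption, not from the paper): coefficients of different orders live on very different scales (e.g. the intercept vs. a cubic term), and K-means is distance-based, so standardizing each coefficient column before clustering is usually advisable. A minimal sketch with synthetic coefficient data:

```python
# Standardize coefficient columns before K-means so no single
# coefficient dominates the Euclidean distance.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic (intercept, slope, quadratic) coefficients on wildly different scales
X = rng.normal(loc=[100.0, 1.0, 0.01], scale=[30.0, 0.5, 0.005], size=(50, 3))

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels[:10])
```

Without scaling, the intercept column (standard deviation 30) would dwarf the quadratic column (standard deviation 0.005) in every distance computation.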
EDIT - Here's a relevant excerpt from the abstract (emphasis added):
The paper is Tarpey, Thaddeus. “Linear Transformations and the k-Means Clustering Algorithm: Applications to Clustering Curves.” The American Statistician 61.1 (2007): 34–40. PMC. Web. 10 Jan. 2018.