I have a table in which each row represents a single printer model, its features, and its price. I want to know how the price is formed based on these features. What should I start with? Multiple regression, so I could cut off insignificant features? Cluster analysis to find small clusters with equal prices? What are the ways to do this task?
Solved – Cluster analysis or regression
clustering, regression
Related Solutions
Latitude and longitude are scale variables. Time is scale, too (I hope it is linear, not cyclic). Provider is a nominal variable. I see two options:
- Use Two-step cluster analysis. This is the method of choice if you have many (thousands of) objects (nodes) to cluster. This method has a nice option to detect outliers automatically; aside from that, it is quite a coarse method.
- Use Hierarchical cluster analysis based on the Gower coefficient (look here for links on how to compute it). This clustering is appropriate if the number of objects is, say, up to 500. You should choose among several agglomeration methods. Since the Gower coefficient is not euclidean/metric, only the average, single, and complete methods should be considered consistent (but not Ward, centroid, or median). You will probably choose between average and complete (or try both), since single linkage produces overly oblong, chained clusters. A minimal sketch of this option follows the list.
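Outside SPSS, option 2 can be illustrated in Python. This is only a hedged sketch: it assumes the third-party `gower` package and uses hypothetical toy columns standing in for latitude, longitude, time, and provider.

```python
# Minimal sketch: hierarchical clustering on a Gower dissimilarity matrix.
# Assumes the third-party `gower` package (pip install gower); the columns
# below are hypothetical stand-ins.
import pandas as pd
import gower
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

df = pd.DataFrame({
    "latitude":  [52.1, 52.3, 48.9, 49.0],
    "longitude": [ 4.3,  4.5,  2.3,  2.4],
    "time":      [10.0, 11.5, 23.0, 22.5],
    "provider":  ["A", "A", "B", "B"],          # nominal variable
})

dist = gower.gower_matrix(df)                   # square dissimilarity matrix
condensed = squareform(dist, checks=False)      # condensed form for scipy
Z = linkage(condensed, method="average")        # average linkage, as advised
labels = fcluster(Z, t=2, criterion="maxclust") # cut into 2 clusters
print(labels)
```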
There exist, of course, other clustering methods that are potentially appropriate (for example, a modification of K-Means that can take nominal variables), but I haven't used them, so I can't recommend any.
A good way to decide on the proper number of clusters is to use some internal clustering criterion, such as the Silhouette statistic, cophenetic correlation, BIC, etc. (if you use SPSS, find macros to compute them on my web-page). In clustering, you produce and save a range of cluster solutions (say, from a 20-cluster solution down to a 2-cluster solution) as variables of cluster membership, and then check with one or more clustering criteria which of the solutions represents the most well-separated clusters - the ideal solution is one where density inside clusters is high and density between them is low. A sketch of such a scan follows.
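If you work in Python rather than SPSS, such a scan might look like the sketch below. KMeans is used here only for illustration, and `X` is a random stand-in for whatever numeric data matrix you actually have.

```python
# Minimal sketch: pick the number of clusters by the silhouette statistic,
# scanning from a 2-cluster solution up to a 20-cluster solution.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))          # stand-in for your data

scores = {}
for k in range(2, 21):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)   # highest mean silhouette wins
print(best_k, round(scores[best_k], 3))
```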
It is obvious that by using the "option 1" dataset you virtually declare that similar respondents are those who bought (or stole) the same items on the same trip or visit. I don't believe this is your research goal. In addition, problematic NA responses arise.
"Option2" dataset is what you should use, but do recode 2 into 1, to make the data binary. You then can take these variables in TwoStep as categorical variables. But here arises another doubt. TwoStep is for nominal categorical variables only; and you probably will not want to treat the binary variables as dichotomous nominal. Treating them as nominal means that, for you, respondents who did not buy the same item are as similar with each other as those who did bought the same item. Rather, you'll want to treat those "0 and 0" respondents as neither similar nor dissimilar, - this suggests using similarity measures such as Jaccard measure. But this in turn precludes using TwoStep and requires using Hierarchical clustering or other clustering methods apt for binary data (those other, unfortunalely, are not found in SPSS).
But you can't do hierarchical clustering of 75000 respondents - it's too many, both practically and theoretically, for a hierarchical method. You see - you've got yourself trapped.
One way out is to look for clustering algorithms (outside SPSS) designed for big (large) sparse binary data. Search this site for that word combination: you'll get a few related questions to read.
Another way out may be to do hierarchical clustering on random subsets of your data (subsets of, say, 500 respondents). Clustering of subsamples with cross-validation is beneficial, as it escapes the threat of overfitting. But, in the context of clustering, it is quite a lot of work. I recommend reading papers on cluster analysis by subsampling; a sketch of one subsampling step follows.
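One possible shape for a single step of that workflow, as a hedged sketch (the helper function below is hypothetical and reuses the Jaccard example above):

```python
# Hypothetical helper: hierarchical clustering on one random subsample.
# Repeat over several seeds and compare the solutions for stability.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_subsample(X, size=500, k=10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=size, replace=False)
    d = pdist(X[idx].astype(bool), metric="jaccard")
    labels = fcluster(linkage(d, method="average"), t=k, criterion="maxclust")
    return idx, labels
```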
A third and the easiest way is to do K-means clustering of your data. It solves the problem of the big dataset. However, K-means is often seen as theoretically inappropriate for binary as well as count variables. The argument is that because the method involves computing floating-point geometric centroids, it requires interval, ideally continuous, variables. That said, people use K-means with binary or count data "all the time". It appears to me that in high-dimensional settings such as yours (1500 variables), the relative contribution of "continuity" versus "dimensionality" to the formation of centroids shifts towards dimensionality anyway - even if you had quite fine-grained Likert-scale variables. That seems to excuse, to an extent, applying K-means to a wide binary dataset such as yours.
If you choose K-means, then I recommend normalizing each row (respondent) in the dataset to unit sum of squares (= unit sum, since the data are binary) prior to clustering. Why do it? According to this, when you normalize vector magnitudes (L2 norms) you make the euclidean distance between the vectors directly reflect the cosine similarity between them: $d^2 = 2(1-\cos\theta)$. And it is cosine similarity (= the Ochiai binary measure) which is a justifiable alternative to the Jaccard measure I mentioned in the second paragraph above. Both Jaccard and Ochiai treat "0 and 0" respondents as neither similar nor dissimilar - which is what you need. So, using K-means on data normalized this way is in a sense analogous to using hierarchical clustering on the Ochiai measure. A sketch of the normalization follows.
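As a minimal Python sketch of that normalization (stand-in simulated data, smaller than your 75000 respondents):

```python
# Minimal sketch: K-means on row-normalized binary data, so that euclidean
# distance mirrors cosine (Ochiai) similarity: d^2 = 2(1 - cos).
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = (rng.random((5000, 1500)) < 0.02).astype(float)  # stand-in data

Xn = normalize(X, norm="l2")      # each row scaled to unit L2 norm
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(Xn)
```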
Best Answer
Welcome to the site.
I don't see how cluster analysis helps you with what you want to do. Regression is much more appropriate. That is, you have a dependent variable (price) and a bunch of independent variables (features) - a classic regression problem. A minimal sketch follows.
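As a hedged illustration of that setup (the printer features below are hypothetical; statsmodels OLS with a categorical term):

```python
# Minimal sketch: regress price on printer features; the column names
# are hypothetical. C() enters a nominal feature as dummy variables.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "price":  [99, 149, 199, 299, 89, 249],
    "ppm":    [10, 15, 20, 30, 8, 25],       # pages per minute
    "duplex": [0, 1, 0, 1, 0, 1],
    "type":   ["inkjet", "inkjet", "laser", "laser", "inkjet", "laser"],
})

model = smf.ols("price ~ ppm + duplex + C(type)", data=df).fit()
print(model.summary())    # t-tests help spot insignificant features
```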
Of course, problems may arise. This would depend on how many different printer models there are, how many features there are, how many levels each feature has, and so on.