Latitude and longitude are scale variables. Time is scale, too (I hope it is linear, not cyclic). Provider is a nominal variable. I see two options:
- Use TwoStep cluster analysis. This is the method of choice if you have many (thousands of) objects (nodes) to cluster. It has a nice option to detect outliers automatically; aside from that, it is quite a coarse method.
- Use hierarchical cluster analysis based on the Gower coefficient (look here for links where you could compute it). This clustering is appropriate if the number of objects is, say, up to 500. You will have to choose among several agglomeration methods. Because the Gower coefficient is not euclidean/metric, only the average, single, and complete linkage methods should be considered consistent with it (but not Ward, centroid, or median). You will probably choose between average and complete (or try both), since single linkage produces overly elongated clusters.
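The second option can be sketched outside SPSS as follows. This is only a minimal illustration, assuming NumPy and SciPy are available, no missing values, and range-normalized scale variables; the helper `gower_distance` and the toy data are hypothetical, not taken from the question.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def gower_distance(numeric, nominal):
    """Gower dissimilarity for mixed data (hypothetical helper, no missing values).

    numeric: (n, p) array of scale variables.
    nominal: (n, q) array of category codes.
    """
    numeric = np.asarray(numeric, dtype=float)
    nominal = np.asarray(nominal)
    # Range-normalize each scale variable so every per-variable distance lies in [0, 1].
    ranges = numeric.max(axis=0) - numeric.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns contribute zero distance
    num_part = np.abs(numeric[:, None, :] - numeric[None, :, :]) / ranges
    # Nominal variables contribute 0 on a match and 1 on a mismatch.
    nom_part = (nominal[:, None, :] != nominal[None, :, :]).astype(float)
    total = num_part.sum(axis=2) + nom_part.sum(axis=2)
    return total / (numeric.shape[1] + nominal.shape[1])


# Made-up toy data: latitude, longitude, time (scale) and provider (nominal).
numeric = np.array([
    [52.1, 4.3, 1.0],
    [52.2, 4.4, 1.5],
    [52.0, 4.2, 2.0],
    [40.7, -74.0, 8.0],
    [40.8, -73.9, 9.0],
    [40.6, -74.1, 8.5],
])
nominal = np.array([["A"], ["A"], ["B"], ["C"], ["C"], ["B"]])

D = gower_distance(numeric, nominal)
# linkage expects the condensed (upper-triangle) form of the distance matrix.
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Swapping `method="average"` for `"complete"` gives the other linkage suggested above.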
There exist, of course, other potentially appropriate clustering methods (for example, a modification of K-means that can take nominal variables), but I haven't used them, so I can't recommend them.
A good way to decide on the number of clusters is to use an internal clustering criterion, such as the Silhouette statistic, cophenetic correlation, BIC, etc. (if you use SPSS, find macros to compute them on my web-page). In clustering, you produce and save a range of cluster solutions (say, from the 20-cluster solution down to the 2-cluster solution) as cluster membership variables, and then check with one or more clustering criteria which of the solutions represents the most well-separated clusters. The ideal solution is one where density inside clusters is high and density between them is low.
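As a rough sketch of that procedure, the snippet below saves a range of cluster solutions from one hierarchical run and scores each with the average silhouette width. It assumes NumPy and SciPy; the data are synthetic (three made-up blobs), the range 2 to 6 is arbitrary, and `mean_silhouette` is a hand-written helper, not a library function.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, cophenet
from scipy.spatial.distance import pdist, squareform


def mean_silhouette(D, labels):
    """Average silhouette width computed from a square distance matrix D."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # the silhouette of a singleton is taken as 0
        a = D[i, own].sum() / (n_own - 1)  # mean distance within own cluster
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


# Synthetic data for illustration: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

d = pdist(X)                      # condensed euclidean distances
D = squareform(d)                 # square form, needed by mean_silhouette
Z = linkage(d, method="average")

# Cophenetic correlation: how faithfully the dendrogram preserves the distances.
coph_corr, _ = cophenet(Z, d)

# Save a range of solutions and score each one.
results = {}
for k in range(2, 7):
    labels = fcluster(Z, t=k, criterion="maxclust")
    results[k] = mean_silhouette(D, labels)

best_k = max(results, key=results.get)
print(best_k, coph_corr)
```

The same loop works with a Gower or Jaccard distance matrix in place of `D`, since the silhouette only needs pairwise distances.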
By using the "option 1" dataset you effectively declare that similar respondents are those who bought (or stole) the same items at the same turn or visit. I don't believe this is your research goal. In addition, problematic NA responses arise.
The "option 2" dataset is what you should use, but do recode 2 into 1, to make the data binary. You can then enter these variables in TwoStep as categorical variables. But here another doubt arises. TwoStep is for nominal categorical variables only, and you probably will not want to treat the binary variables as dichotomous nominal. Treating them as nominal means that, for you, respondents who did not buy the same item are as similar to each other as those who did buy the same item. Rather, you'll want to treat those "0 and 0" respondents as neither similar nor dissimilar; this suggests using a similarity measure such as the Jaccard measure. But this in turn precludes using TwoStep and requires hierarchical clustering or other clustering methods apt for binary data (those others, unfortunately, are not found in SPSS).
But you can't do hierarchical clustering of 75000 respondents: it's too many, both practically and theoretically, for a hierarchical method. You see, you've got trapped.
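To make the point about "0 and 0" respondents concrete, here is a small sketch (assuming NumPy; the four 10-item purchase vectors are made up) comparing simple matching, which is what a nominal treatment amounts to, with Jaccard, which ignores joint absences.

```python
import numpy as np


def simple_matching(x, y):
    """Share of items on which two respondents agree (0/0 matches count)."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    return float(np.mean(x == y))


def jaccard(x, y):
    """Share of items bought by both among items bought by either (0/0 ignored)."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    either = np.sum(x | y)
    return float(np.sum(x & y) / either) if either else 0.0


# Made-up respondents over 10 items.
r1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # bought items 0, 1, 2
r2 = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]   # bought items 0, 1, 3
r3 = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]   # bought item 5 only
r4 = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]   # bought item 6 only

print(simple_matching(r1, r2), jaccard(r1, r2))  # 0.8 0.5
print(simple_matching(r3, r4), jaccard(r3, r4))  # 0.8 0.0
```

Simple matching rates both pairs as equally similar, driven mostly by items neither respondent bought, while Jaccard separates the pair with shared purchases from the pair with none.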
One way out is to look for clustering algorithms (outside SPSS) designed for big (large) sparse binary data. Search this site for this word combination: you'll get a few related questions to read.
Another way out may be to do hierarchical clustering on random subsets of your data (subsets of, say, 500 respondents). Clustering subsamples and cross-validating is beneficial, as it guards against overfitting. But, in the context of clustering, it is quite a lot of work. I recommend that you read papers on cluster analysis by subsamples.
A third and easiest way is to do K-means clustering of your data. It solves the problem of the big dataset. However, K-means is often seen as theoretically inappropriate for binary as well as count variables. The argument is that because the method involves computing floating-point geometric centroids, it requires interval, ideally continuous, variables. That said, people use K-means with binary or count data "all the time". It appears to me that in high-dimensional settings such as yours (1500 variables) the relative contributions of "continuity" and "dimensionality" to the formation of centroids shift towards dimensionality anyway, even if you had quite fine-grained Likert scale variables. That seems to excuse, to an extent, applying K-means to a wide binary dataset such as yours.
If you choose K-means, I recommend that you normalize each row (respondent) in the dataset to unit sum-of-squares prior to clustering (since the data are binary, a row's sum of squares equals its sum, so this means dividing each row by the square root of its sum). Why do it? According to this, when you normalize vector magnitudes (L2 norms) you make the euclidean distance between the vectors directly reflect the cosine similarity between them: $d^2 = 2(1-\cos)$. And it is cosine similarity (= Ochiai binary measure) which is a justifiable alternative to the Jaccard measure I mentioned in the 2nd paragraph above. Both Jaccard and Ochiai treat "0 and 0" respondents as neither similar nor dissimilar, which is what you need. So, using K-means on data normalized this way is in a sense analogous to using hierarchical clustering on the Ochiai measure.
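The identity $d^2 = 2(1-\cos)$ is easy to check numerically. The following sketch assumes NumPy and uses a small made-up binary matrix; it normalizes rows to unit L2 norm and confirms that squared euclidean distances between the normalized rows match $2(1 - \text{Ochiai})$.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up sparse binary data: 6 respondents x 12 items.
X = (rng.random((6, 12)) < 0.3).astype(float)
X[X.sum(axis=1) == 0, 0] = 1.0  # avoid all-zero rows, which cannot be normalized

# For 0/1 data a row's sum of squares equals its count of 1s,
# so the L2 norm is the square root of that count.
counts = X.sum(axis=1)
Xn = X / np.sqrt(counts)[:, None]

# Ochiai (= cosine for binary data): a / sqrt((a + b) * (a + c)),
# where a is the number of items both respondents bought.
a = X @ X.T
ochiai = a / np.sqrt(np.outer(counts, counts))

# Squared euclidean distances between the normalized rows.
diff = Xn[:, None, :] - Xn[None, :, :]
d2 = (diff ** 2).sum(axis=2)

assert np.allclose(d2, 2 * (1 - ochiai))
```

The normalized matrix `Xn` is what would then be passed to a K-means routine in place of the raw 0/1 data.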
Best Answer
You can use log-likelihood distance when all the variables are continuous; in fact, it is the default.
It is difficult to say without the data why your euclidean results seem poor. Automatic detection of the number of clusters with the BIC or AIC criteria is probably somewhat more apt with log-likelihood distance because they are based on the same paradigm. With euclidean distance, I recommend that you specify various fixed numbers of clusters and check whether the clusters are meaningful to you. Also, check whether your 4 variables are highly correlated (the TwoStep cluster method assumes no or weak correlation).
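A quick way to do that last check outside SPSS is to inspect the correlation matrix. The sketch below assumes NumPy; the four variables are simulated (with the fourth deliberately built to track the first) and the 0.8 cut-off is an arbitrary choice for illustration, not a rule of the method.

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated data: 200 cases, 4 continuous variables; x4 is constructed to follow x1.
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = rng.normal(size=200)
x4 = x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3, x4])

# Pearson correlations between columns.
R = np.corrcoef(X, rowvar=False)

threshold = 0.8  # arbitrary cut-off for "highly correlated"
pairs = [
    (i, j, R[i, j])
    for i in range(X.shape[1])
    for j in range(i + 1, X.shape[1])
    if abs(R[i, j]) > threshold
]
for i, j, r in pairs:
    print(f"variables {i} and {j}: r = {r:.2f}")
```

If a pair is flagged, dropping one of the two variables or replacing them with a single combined score before clustering is one option to consider.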