Solved – How to cluster survey data

binary data, clustering, correspondence-analysis, dimensionality reduction, svd

I have written a rather long survey (250 questions) intended to uncover user clusters. The questions are chosen so that the pattern of answers should reveal those clusters, but my analyses to date have not uncovered them.

For example, a typical question might be: 'Are you more of a dog or cat person?' Or, 'Chocolate or vanilla?' etc.

I'm coding these questions in a binary format. So, if the user answered the two questions above with 1. Dog and 2. Vanilla, that user's answer vector (one row of the matrix) would look like:
[1 0 0 1]
This signifies that the user chose the first and fourth answers, where the answer space is [Dog Cat Chocolate Vanilla].
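
To make the coding concrete, here is a minimal sketch of it in plain Python. The `QUESTIONS` dictionary and the `encode_respondent` helper are hypothetical names made up for this illustration; they are not part of the original survey tooling.

```python
# Hypothetical answer space for the two example questions.
QUESTIONS = {
    "pet": ["Dog", "Cat"],
    "flavour": ["Chocolate", "Vanilla"],
}


def encode_respondent(answers, questions=QUESTIONS):
    """Return a flat 0/1 indicator list, one slot per possible answer."""
    row = []
    for question, options in questions.items():
        chosen = answers[question]
        if chosen not in options:
            raise ValueError(f"{chosen!r} is not a valid answer to {question!r}")
        # One slot per option: 1 for the chosen option, 0 for the others.
        row.extend(1 if option == chosen else 0 for option in options)
    return row


row = encode_respondent({"pet": "Dog", "flavour": "Vanilla"})
print(row)  # [1, 0, 0, 1]
```

Stacking one such row per respondent gives the [user x answer] matrix.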

I have roughly 300 respondents who have answered all 250 questions, giving around 800 possible answers, so my binary [user x answer] matrix is 300 x 800.

I have run SVD on this matrix. The first factor relates to the number of people who selected that answer (magnitude) as expected. The second factor clusters nicely into male / female (I know because I ask gender) respondents.
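
For readers who want to reproduce this step, the sketch below runs a plain (uncentered) SVD with NumPy on a synthetic stand-in matrix. The data is randomly generated, not the survey data, and the independent-answer model is an assumption made only to have something to decompose. It checks how strongly the first right singular vector tracks answer popularity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 300 x 800 [user x answer] matrix: each column
# (answer) gets its own popularity, and respondents pick it independently.
n_users, n_answers = 300, 800
popularity = rng.uniform(0.1, 0.9, size=n_answers)
X = (rng.random((n_users, n_answers)) < popularity).astype(float)

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Compare the first right singular vector with the observed column means.
# The sign of a singular vector is arbitrary, so look at |correlation|.
col_means = X.mean(axis=0)
corr = np.corrcoef(Vt[0], col_means)[0, 1]
print(f"|corr(first factor, answer popularity)| = {abs(corr):.3f}")

# Share of the total squared singular values carried by the first five factors.
explained = s**2 / np.sum(s**2)
print(np.round(explained[:5], 3))
```

On uncentered 0/1 data like this, the first factor mostly reflects how often each answer was chosen, which matches the "magnitude" factor described above.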

My problem is, all other factors are Gaussian and offer no way for me to split them into groups. A plot-matrix of the factors shows no grouping whatsoever.
A clue: when I look at the highest and lowest factor values for factors 3, 4, & 5, I can determine that there are definite personality types represented. For example, cautious/risky or conservative/outgoing or frugal/outlandish. But these are just the tails of a Gaussian. I am completely unable to separate these questions by anything but a 'random' threshold to the Gaussian tail.

The goal is to have a subset of the 250 questions that would allow me to quickly characterize a respondent, but right now, the only clustering I am able to assign is that of gender.

Best Answer

Just addressing the coding part of the question, building on the comments. Most stats packages are set up to deal with a coding like

What operating system are you using?

1 = Mac
2 = PC
3 = Linux

with a flag on that variable that it is a categorical factor and hence the 1, 2, 3 should not be interpreted as a continuous variable but just as the form of coding. This approach can also nicely accommodate a coding for NA.
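
As one possible illustration (pandas is my choice here, not something the question mentions), the coding above could be stored as a categorical variable like this. The `raw_codes` values are made up, with `None` standing in for a skipped question.

```python
import pandas as pd

# Hypothetical raw responses using the integer codes from the example;
# None marks a respondent who skipped the question.
raw_codes = [1, 3, 2, None, 1]
labels = {1: "Mac", 2: "PC", 3: "Linux"}

# Unordered categorical: the levels are labels, not points on a numeric scale.
os_factor = pd.Categorical(
    [labels.get(code) for code in raw_codes],
    categories=["Mac", "PC", "Linux"],
)

print(os_factor)
print(os_factor.codes)  # internal integer codes assigned by pandas; -1 marks NA
```

Note that pandas assigns its own zero-based internal codes, so the 1, 2, 3 from the questionnaire survive only as labels.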

This way of coding is useful because a lot of categorical data analysis involves matching one variable against another, e.g. in a contingency table or (getting a bit fancier) correspondence analysis. This is easy to get the stats package to do with the coding above, but requires a fair bit of mucking about with your binary coding.
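
A small sketch of the contingency-table case, again assuming pandas and using made-up responses to two questions:

```python
import pandas as pd

# Hypothetical responses to two categorical questions.
df = pd.DataFrame({
    "os":  ["Mac", "PC", "PC", "Linux", "Mac", "PC"],
    "pet": ["Dog", "Cat", "Dog", "Cat", "Dog", "Dog"],
})

# Cross-tabulate one categorical variable against the other.
table = pd.crosstab(df["os"], df["pet"])
print(table)
```

With one categorical column per question this is a single call; with the binary layout you would first have to work out which indicator columns belong to which question.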

Then, in much subsequent analysis, you need to specify some kind of contrasts to get back to a set of binary variables similar (but not identical) to what you started with. Commonly, one level of each variable is set as the "corner point" (e.g. "Mac") and each of the other levels of that categorical variable gets its own binary variable in the new coding. The idea is that each row of data is assumed to be a Mac unless it has a 1 in the PC or Linux column.
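
The corner-point idea can be sketched with `pandas.get_dummies` and `drop_first=True`, which drops the first category and keeps indicators for the rest. Treat this as an illustration of the coding under that assumption, not as what any particular stats package does internally.

```python
import pandas as pd

# Categorical variable with "Mac" listed first, so it becomes the corner point.
os_factor = pd.Series(
    pd.Categorical(["Mac", "PC", "Linux", "Mac"], categories=["Mac", "PC", "Linux"]),
    name="os",
)

# drop_first=True omits the first category; a row of all zeros means "Mac".
dummies = pd.get_dummies(os_factor, prefix="os", drop_first=True).astype(int)
print(dummies)
```

Compared with the coding in the question, there is one fewer column per variable, which is the "similar but not identical" part.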

It's not uncommon for machine-read survey data to come in as a set of binary variables such as you have, but the analyst then typically starts by converting them to multi-level categorical factors such as I describe above. A subsequent conversion back to multiple variables of 0s and 1s is typically done under the hood by the stats package if you ask it to do, e.g., cluster analysis, so long as the package understands that these are categorical variables and treats them accordingly.
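
Going from the binary layout back to one factor per question might look like the sketch below. The `groups` mapping is an assumption the analyst has to supply, and the `idxmax` trick relies on each respondent having exactly one 1 within each question's columns.

```python
import pandas as pd

# Hypothetical indicator columns in the layout from the question.
binary = pd.DataFrame(
    [[1, 0, 0, 1],
     [0, 1, 1, 0],
     [1, 0, 1, 0]],
    columns=["Dog", "Cat", "Chocolate", "Vanilla"],
)

# Which indicator columns belong to which question (supplied by the analyst).
groups = {"pet": ["Dog", "Cat"], "flavour": ["Chocolate", "Vanilla"]}

# For each question, take the column name holding the 1 in each row.
factors = pd.DataFrame({
    question: pd.Categorical(binary[cols].idxmax(axis=1), categories=cols)
    for question, cols in groups.items()
})
print(factors)
```

If some respondents skipped a question, their rows would be all zeros for that group and would need to be set to NA explicitly, since `idxmax` would otherwise return the first column.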

So the main advantage of this approach is that most stats packages are built around such a coding system, and hence it will be easier to try the various methods of cluster analysis or other forms of categorical data analysis, of which there are multitudes.

As an aside, a real problem you will have, whatever your approach, is that 300 participants is not really enough to explore a survey with 250 questions.
