Coding survey data for cosine similarity and Euclidean distance

distance-functions, similarities, survey

I want to know how to code survey data so that a similarity function can be applied to it.

Say I want to use cosine similarity. All the search results and Q&A threads I've found deal only with similarity between documents, with vectors consisting of word frequencies or tf-idf.

What about survey data? What is the sensible/common/useful way of coding survey data such that similarity can be compared? (Is it even sensible to use functions like cosine similarity for this?)

My data is record data, purely categorical, neither binary nor numerical. Should I code it into numerical data? My data looks like this (3 sample records):

Do you like Technology?  | Current GPA       | Institute name
Y                        | Band 1 (3.75-4)   | UUIC
N                        | Band 3 (3.0-3.5)  | ADU
N                        | Band 2 (3.5-3.75) | UUIC

etc. These are just 3 questions; my survey had a lot more, but I hope you get the idea.

Is it sensible to code the data into numerical vectors where, e.g., I represent yes/no values as binary variables and assign numbered categories to the other values? In that case the above 3 records would become:

(1, 1, 1)
(0, 3, 2)
(0, 2, 1)

Where UUIC = 1, ADU = 2, and the GPA bands are represented simply by 1, 2, 3, 4, etc.

And then apply cosine similarity or Euclidean distance? Would this make sense? I've been searching for similar examples for a while now, but everything that comes up seems to be about document similarity. There doesn't seem to be much beginner's help on how to deal with survey data.

Best Answer

  1. Both cosine similarity and Euclidean distance require scale (= metric) level data, that is, interval or ratio level. I suppose that is what you mean by "numeric". Binary data (1 vs 0) will also do (though there is theoretical controversy about this). Nominal data: convert it into dummy binary variables first. Ordinal data: decide whether to treat it as interval or as nominal.
  2. Squared Euclidean distance and cosine similarity are exactly related, so you can always transform one into the other. For vectors scaled to unit length, d^2 = 2(1 - cos); in general, d^2 = |x|^2 + |y|^2 - 2|x||y|cos.
  3. It is generally not a good idea to compute a (dis)similarity coefficient on a hodge-podge of different characteristics (different types and/or units), even if you do the recodings mentioned in point 1 and apply appropriate standardizations prior to the computation, because the issue of weighting (the relative importance of the characteristics) remains. If you are nevertheless determined to base the coefficient on mixed characteristics, Gower similarity (rather than cosine/Euclidean) is the measure of choice. It can "take" interval, ordinal, binary and nominal characteristics (with user-defined weighting, if necessary).
  4. If you are going to compute similarity based on binary characteristics only, be aware that the many binary "matching" similarity coefficients do not all suit natural binary characteristics and dummy variables (i.e. former nominal ones) equally well.
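
To make points 1 to 3 concrete, here is a minimal sketch in plain Python using the three sample records from the question. Several things in it are assumptions for illustration, not part of the answer above: the helper names (`dummy_code`, `cosine`, `sq_euclidean`, `unit`, `gower`) are made up, the GPA band is treated as nominal for the dummy coding and as an ordinal rank scaled by its observed range for the Gower-style score, all items get equal weight, and the yes/no item is treated as a symmetric binary (two "N" answers count as a match).

```python
import math

# The three sample records from the question: (likes technology, GPA band, institute)
records = [
    ("Y", "Band 1", "UUIC"),
    ("N", "Band 3", "ADU"),
    ("N", "Band 2", "UUIC"),
]
GPA_BANDS = ["Band 1", "Band 2", "Band 3"]  # category order doubles as ordinal rank
INSTITUTES = ["UUIC", "ADU"]

def dummy_code(record):
    """Point 1: one 0/1 indicator per category (GPA band treated as nominal here)."""
    likes_tech, band, institute = record
    vec = [1.0 if likes_tech == "Y" else 0.0]
    vec += [1.0 if band == b else 0.0 for b in GPA_BANDS]
    vec += [1.0 if institute == i else 0.0 for i in INSTITUTES]
    return vec

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def sq_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def unit(x):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]

def gower(r, s, band_range):
    """Point 3: equal-weight Gower-style similarity on the raw (uncoded) records.
    Nominal/binary items score 1 on a match, else 0; the band scores
    1 - |rank difference| / range (treating it as ordinal is an assumption)."""
    scores = [
        1.0 if r[0] == s[0] else 0.0,
        1.0 - abs(GPA_BANDS.index(r[1]) - GPA_BANDS.index(s[1])) / band_range,
        1.0 if r[2] == s[2] else 0.0,
    ]
    return sum(scores) / len(scores)

vectors = [dummy_code(r) for r in records]
ranks = [GPA_BANDS.index(r[1]) for r in records]
band_range = max(ranks) - min(ranks)  # observed range of band ranks in the sample

for i in range(len(records)):
    for j in range(i + 1, len(records)):
        cos_ij = cosine(vectors[i], vectors[j])
        # Point 2: on unit-length vectors, squared Euclidean distance = 2 * (1 - cosine)
        d2_ij = sq_euclidean(unit(vectors[i]), unit(vectors[j]))
        print(i + 1, j + 1, round(cos_ij, 3), round(d2_ij, 3),
              round(gower(records[i], records[j], band_range), 3))
```

Note that with dummy coding the result no longer depends on arbitrary codes such as UUIC = 1, ADU = 2, whereas applying cosine or Euclidean to the integer-coded vectors would make the answer change if the institutes were numbered differently.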