Solved – How to find correlation among multiple columns of quantitative and qualitative data in Python

clustering, correlation, mathematical-statistics, python

I have two data sets: one called user data and the other called end song data.

User data has the following information:

  1. gender of user (m or f)
  2. age range of user (7 different age ranges and some are nan)
  3. country of user (over 50 countries)
  4. account age in weeks
  5. user id

End song data has the following information:

  1. song the user played
  2. milliseconds the song was played
  3. context in which the song was played (playlist, album, collection, etc)
  4. track_id
  5. product (open or closed)
  6. end_timestamp

I am new to data science and come from more of a computer science background, so I need some help with the statistical aspect. This is a lot more data, in many more dimensions, than I have ever worked with, and I don't know where to start analyzing it.

I am trying to learn how to go about analyzing this data set. What I do know is the basic statistical measure for comparing 2 numerical columns: their correlation coefficient. I am looking to learn something higher-level for comparing correlations across different columns. I am also confused as to how to include the qualitative data in my calculations.

Does anyone have any ideas how I can look for correlations between user demographic features (or their behavior) and their overall listening, or their average session lengths using Python? Thank you for your help.

Ideas I had:
1. First split the data set by country and by gender, and then compute correlation matrices for the remaining numerical columns, separately for males and females in each country?
2. Is there some sort of way to do clustering on this?

Best Answer

Since you seem to be a novice, I'd strongly suggest familiarizing yourself with pandas and scikit-learn; these are the "tools of the trade" for people doing data science using Python.
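As a rough illustration of the kind of thing pandas is used for here, the sketch below aggregates listening time per user and joins it onto the demographic table. The column names (`user_id`, `ms_played`, `acct_age_weeks`, and so on) and the toy values are assumptions made up for the example, not the actual schema of your files.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions, not the real schema.
users = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "gender": ["m", "f", "f"],
    "country": ["US", "GB", "US"],
    "acct_age_weeks": [10, 52, 130],
})
end_song = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "ms_played": [30000, 180000, 240000, 60000, 200000, 15000],
})

# Total listening time per user, converted from milliseconds to minutes.
listening = (
    end_song.groupby("user_id")["ms_played"]
    .sum()
    .div(60_000)
    .rename("minutes_played")
    .reset_index()
)

# One row per user: demographics plus the aggregated behaviour.
per_user = users.merge(listening, on="user_id", how="left")
print(per_user)
```

Once the data is in this one-row-per-user shape, it can be fed to the scikit-learn models discussed next.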

Before starting with any analysis (such as finding "correlations"), try to be specific about what question you're trying to answer. With a data set as rich as the one you've been provided, there are many possible questions, each with different models that might be appropriate.

For instance, you might try answering the question "can user session lengths be predicted given user demographic information?". This is an example of a supervised learning problem, since the training data has already been labeled (i.e., you know the user session lengths). Furthermore, the variable you're trying to predict is continuous, which makes this a regression (and not a classification) problem. As a first attempt, I'd try building a linear regression model to predict the output given the inputs, after one-hot-encoding the categorical variables.
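A minimal sketch of that first attempt with scikit-learn might look like the following. The data is synthetic and the column names (`gender`, `country`, `acct_age_weeks`) and the target (`y`, standing in for average session length in minutes) are invented for illustration; you would substitute your own per-user table.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for a per-user table (all names and values are assumptions).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "gender": rng.choice(["m", "f"], size=n),
    "country": rng.choice(["US", "GB", "DE"], size=n),
    "acct_age_weeks": rng.integers(1, 300, size=n),
})
# Made-up target: session length depends on account age and country, plus noise.
y = (
    20
    + 0.05 * X["acct_age_weeks"]
    + 5 * (X["country"] == "DE")
    + rng.normal(0, 2, size=n)
)

# One-hot encode the categorical columns; pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["gender", "country"])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, LinearRegression())

# Hold out a test set so the score reflects unseen users.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # coefficient of determination on held-out data
print(f"held-out R^2: {r2:.2f}")
```

Putting the encoder inside the pipeline means the same encoding is applied at prediction time, and `handle_unknown="ignore"` keeps the model from failing on a country it never saw during training.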

Another, more difficult problem is "can we divide users into clusters based on their demographic information and/or listening habits?". This is an example of an unsupervised learning problem, since you don't already have labels for the clusters. Since your features will end up being a mix of categorical and continuous, simple k-means clustering probably won't work (it relies on Euclidean distances and cluster means, which are not meaningful for categories such as country), and models of the type discussed in this answer might be useful.
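One common way to handle mixed features, which is not necessarily the approach the linked answer describes, is to compute a Gower-style distance (simple mismatch for categorical columns, range-scaled absolute difference for numeric ones) and run hierarchical clustering on the resulting distance matrix. The sketch below does this with NumPy and SciPy; the helper `gower_matrix`, the column names, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical toy users; column names and values are assumptions.
users = pd.DataFrame({
    "gender": ["m", "m", "f", "f", "m", "f"],
    "country": ["US", "US", "DE", "DE", "US", "DE"],
    "acct_age_weeks": [5, 8, 120, 150, 10, 140],
})
cat_cols = ["gender", "country"]
num_cols = ["acct_age_weeks"]


def gower_matrix(df, cat_cols, num_cols):
    """Pairwise Gower-style distance: mean of per-feature distances in [0, 1]."""
    n = len(df)
    total = np.zeros((n, n))
    for col in cat_cols:
        vals = df[col].to_numpy()
        # Categorical feature: 0 if the two values match, 1 otherwise.
        total += (vals[:, None] != vals[None, :]).astype(float)
    for col in num_cols:
        vals = df[col].to_numpy(dtype=float)
        span = vals.max() - vals.min()
        if span > 0:
            # Numeric feature: absolute difference scaled by the column range.
            total += np.abs(vals[:, None] - vals[None, :]) / span
    return total / (len(cat_cols) + len(num_cols))


dist = gower_matrix(users, cat_cols, num_cols)

# Average-linkage hierarchical clustering on the condensed distance matrix,
# cut so that at most two clusters remain.
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The pairwise matrix grows quadratically with the number of users, so on a large data set this would need to be run on a sample or replaced by a method designed for mixed data at scale.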