Solved – Correlation using Logistic Regression and Pearson

logistic, pearson-r, r, regression

I am sorry, I am a beginner in statistical analysis. I have a project using R to analyze the correlation between dependent variables and independent variables.

In this case I have two dependent variables (1. Extrovert, 2. Introvert).
For the independent variables, I have data from the call log (how long they talk each day, how many calls they make each day) and the SMS log (the length of the SMS body text each day, how many SMS they sent/received each day).

I am confused about how to do this. Can anyone give me some good references?
I also have some questions:

  1. My independent variables are numeric but my dependent variable is categorical. Is it possible to apply logistic regression and Pearson correlation in this case?
  2. Or can someone suggest a better solution, such as another method for this problem?

Example data from dput():

structure(list(sumcallin = c(462L, 998L, 335L, 179L, 34L, 0L, 
0L, 0L, 0L, 0L), caountcallin = c(7L, 5L, 8L, 5L, 1L, 1L, 0L, 
1L, 1L, 1L), sumcallout = c(1068L, 81L, 519L, 393L, 342L, 0L, 
583L, 1902L, 358L, 1017L), countcallout = c(15L, 3L, 10L, 5L, 
6L, 0L, 3L, 3L, 3L, 3L), sumreceived = c(322L, 75L, 20L, 35L, 
8L, 35L, 135L, 103L, 471L, 173L), countreceived = c(15L, 4L, 
2L, 3L, 1L, 2L, 7L, 3L, 18L, 5L), sumsent = c(171L, 31L, 25L, 
23L, 8L, 55L, 87L, 9L, 400L, 258L), countsent = c(10L, 4L, 1L, 
3L, 1L, 3L, 4L, 1L, 13L, 8L), personality = structure(c(2L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("extro", "intro"), class = "factor")), .Names = c("sumcallin", 
"caountcallin", "sumcallout", "countcallout", "sumreceived", 
"countreceived", "sumsent", "countsent", "personality"), row.names = c(1L, 
2L, 3L, 4L, 5L, 37L, 38L, 39L, 40L, 41L), class = "data.frame")

Thank you for your help.

Best Answer

There's a difference between predicting a variable and measuring correlation.

Logistic regression is a predictor, more specifically a binary classifier. "Classifier" means that it tries to assign a class to every observation. "Binary" means that there are exactly 2 classes. Moreover, logistic regression produces the probability with which each observation belongs to each class.

If you want to predict extroversion/introversion, there are 2 options for you:

  1. Treat each of them as a class and give a binary answer. This is simple: each person is assigned either the "extrovert" or the "introvert" label.
  2. Use the predicted probability as a soft, fuzzy-style score. Logistic regression gives you a number between 0 and 1 that represents how strongly a person belongs to the specified class. E.g. if you code introversion as 0 and extroversion as 1 and logistic regression returns 0.7, the model estimates a 70% probability that the person is an extrovert, which you can loosely read as "70% extrovert and 30% introvert". This option is good for capturing things like ambiversion.

Logistic regression works with both continuous and categorical predictors (the latter encoded as dummy variables), so you can run it directly on your dataset.
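As a minimal sketch of both options in R, the code below fits a logistic regression with `glm()` on two columns copied from the question's `dput()` data. The choice of `sumsent` as the only predictor, the `is_extro` helper column, and the 0.5 cutoff are assumptions made for illustration. With only 10 rows the fit is not meaningful, and some other columns (e.g. `sumcallin`) separate the two groups perfectly in this sample, which makes the estimates unstable.

```r
# Subset of the question's data: the sumsent column and the personality label.
d <- data.frame(
  sumsent = c(171, 31, 25, 23, 8, 55, 87, 9, 400, 258),
  personality = factor(c(rep("intro", 5), rep("extro", 5)),
                       levels = c("extro", "intro"))
)

# Code the outcome as in the answer: introvert = 0, extrovert = 1.
d$is_extro <- as.integer(d$personality == "extro")

# Logistic regression is a GLM with a binomial family (logit link by default).
fit <- glm(is_extro ~ sumsent, data = d, family = binomial)

# Option 2: estimated probability of being an extrovert, between 0 and 1.
p_extro <- predict(fit, type = "response")

# Option 1: turn the probability into a hard label (0.5 is an assumed cutoff).
label <- ifelse(p_extro > 0.5, "extro", "intro")

print(data.frame(d, p_extro = round(p_extro, 2), label))
```

Passing the factor `personality` directly as the response would also work, but then `glm()` treats the first level ("extro") as failure and models the probability of "intro", so the explicit 0/1 column keeps the direction unambiguous.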

Pearson, on the other hand, measures correlation. Correlation is simply normalized covariance, and covariance measures how 2 random variables vary together, that is, how a change in one variable is related to a change in the other.
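To make the "normalized" part concrete, here is a small sketch in R using two numeric columns copied from the question's data (the choice of `sumsent` and `countsent` is only for illustration). It divides the sample covariance by the product of the two standard deviations and compares the result with the built-in `cor()`.

```r
sumsent   <- c(171, 31, 25, 23, 8, 55, 87, 9, 400, 258)
countsent <- c(10, 4, 1, 3, 1, 3, 4, 1, 13, 8)

# Covariance: how the two variables vary together (depends on their units).
cv <- cov(sumsent, countsent)

# Dividing by both standard deviations rescales it to the range [-1, 1].
r_manual <- cv / (sd(sumsent) * sd(countsent))

# cor() uses method = "pearson" by default and should give the same value.
r_builtin <- cor(sumsent, countsent)

print(c(manual = r_manual, builtin = r_builtin))
```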

Strictly speaking, Pearson correlation cannot deal with categorical variables (mostly because categorical variables don't have a notion of mean, which Pearson is based on). However, a variable with only 2 categories can be coded as 0 and 1 and treated as continuous, which lets you calculate a kind of correlation (this special case is known as the point-biserial correlation). This is clearly a hack, but it should work for simple exploratory analysis.
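A hedged sketch of this 0/1 coding in R: the two-level factor is recoded to numbers and passed to `cor()` together with one numeric column from the question's data. The `is_extro` name and the choice of `sumsent` are assumptions for illustration, and with 10 rows the value is only a rough exploratory number.

```r
sumsent <- c(171, 31, 25, 23, 8, 55, 87, 9, 400, 258)
personality <- factor(c(rep("intro", 5), rep("extro", 5)))

# Recode the two categories as numbers: introvert = 0, extrovert = 1.
is_extro <- as.numeric(personality == "extro")

# Pearson correlation between the 0/1 variable and a numeric variable.
r <- cor(is_extro, sumsent)
print(r)

# The sign follows the difference in group means: positive here because
# extroverts have the larger mean sumsent in this sample.
print(tapply(sumsent, personality, mean))
```

If a significance test is wanted as well, `cor.test(is_extro, sumsent)` accepts the same two vectors.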
