I just found out that machine learning also has logistic regression as one of its methods. Can someone please tell me the differences between logistic regression in statistics and in machine learning? I've seen lecture slides on logistic regression from a machine learning course, but I can't see how they differ from the coverage of logistic regression in a statistics course.
Is there no need to check for multicollinearity when logistic regression is used in machine learning?
I ask because I ran a dataset through R's glm function with a binomial logit link, and then ran the same dataset through Apache Mahout's trainlogistic, but the resulting coefficients are different.
This is the command I use in R:
w1.glm <- glm(anw ~ cs, data = w1, family = "binomial")
This is the result of summary(w1.glm):
glm(formula = anw ~ cs, family = "binomial", data = w1)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5400   0.1073   0.1924   1.0047   1.0047

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.42077    0.02588   16.26   <2e-16 ***
cs           1.89342    0.06427   29.46   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 11762.5 on 10660 degrees of freedom
Residual deviance: 9250.3 on 10659 degrees of freedom
And this is the command I use in Mahout:
/usr/local/mahout/bin/mahout trainlogistic --input w1.csv --output ./model --target anw --categories 2 --predictors cs --types numeric --features 20 --passes 100 --rate 50
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.8-job.jar
20
anw ~
-19.553*cs + -7.512*Intercept Term
cs -19.55265
Intercept Term -7.51155
0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 -19.552646543 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 -7.511546797 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
13/11/01 02:04:47 INFO driver.MahoutDriver: Program took 22118 ms (Minutes: 0.3686333333333333)
Edited: Added the reason I asked the question in the title. Added the commands used to execute glm in R and trainlogistic in Mahout.
Best Answer
Logistic regression refers to the same model in both fields. Mahout, however, seems to do some things by default that make its implementation a little more than plain logistic regression. First, Mahout seems to be regularizing the coefficients. If it's doing this by default, I would also expect it to be standardizing (centering and scaling) the inputs. Passing it a value of lambda=0 should prevent regularization, but you still have to make sure that the inputs are not being standardized.
If you want to fit a regularized GLM in R, check out the glmnet package.
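To make the point about regularization and standardization concrete, here is a minimal Python sketch on synthetic data (not the asker's w1 dataset). It fits the same one-predictor logistic model three ways: plain maximum likelihood, with an L2 (ridge) penalty on the slope, and on a standardized predictor. The helper names `sigmoid` and `fit_logistic`, the penalty value, and the simulated coefficients are all assumptions chosen for illustration; this is not Mahout's fitting procedure, only a demonstration that the same model reports different coefficients once a penalty or input scaling is involved.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, l2=0.0, iters=50):
    """Newton's method for y ~ intercept + slope * x.

    l2 is a ridge penalty on the slope only (hypothetical parameter name);
    l2=0 gives the plain maximum-likelihood fit.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. intercept
            g1 += (y - p) * x      # gradient w.r.t. slope
            h00 += w               # entries of the negative Hessian
            h01 += w * x
            h11 += w * x * x
        g1 -= l2 * b1              # penalty contribution to the gradient
        h11 += l2                  # penalty contribution to the Hessian
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * d = g by hand.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
    return b0, b1

# Synthetic data: true intercept 0.4 and slope 0.8 are made-up values.
random.seed(0)
xs = [random.gauss(1.0, 2.0) for _ in range(2000)]
ys = [1 if random.random() < sigmoid(0.4 + 0.8 * x) else 0 for x in xs]

plain = fit_logistic(xs, ys)             # no penalty, raw input
ridge = fit_logistic(xs, ys, l2=200.0)   # penalized slope, raw input

# Standardize the predictor (center and scale), then refit without penalty.
mean = sum(xs) / len(xs)
sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
zs = [(x - mean) / sd for x in xs]
scaled = fit_logistic(zs, ys)

print("plain        (intercept, slope):", plain)
print("ridge        (intercept, slope):", ridge)
print("standardized (intercept, slope):", scaled)
print("standardized slope mapped back to raw units:", scaled[1] / sd)
```

The penalized slope comes out smaller in magnitude than the unpenalized one, and the standardized fit reports a slope multiplied by the predictor's standard deviation, even though it describes the same fitted probabilities. Either effect alone is enough to make two tools disagree on the reported coefficients for the same data.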