Multinomial logit: mlogit vs statsmodels

Tags: mlogit, python, r, statsmodels

I am comparing mlogit in R with statsmodels in Python and cannot get them to produce the same result. I'm wondering whether the difference comes from the libraries themselves or from something I am specifying incorrectly. Any help would be appreciated.

I am using the "TravelMode" dataset to test the two.
In R:

> library("mlogit")
> library("AER")
> data("TravelMode", package="AER")
> write.csv(TravelMode, "travelmode.csv")
> TM <- mlogit.data(TravelMode, choice = "choice", shape = "long", 
                    chid.var = "individual", alt.var = "mode", drop.index = TRUE)
> TMlogit = mlogit(mFormula(choice ~ vcost), TM)
> summary(TMlogit)
Call:
mlogit(formula = mFormula(choice ~ vcost), data = TM, method = "nr", 
    print.level = 0)

Frequencies of alternatives:
    air   train     bus     car 
0.27619 0.30000 0.14286 0.28095 

nr method
4 iterations, 0h:0m:0s 
g'(-H)^-1g = 0.000482
successive function values within tolerance limits 

Coefficients :
                    Estimate Std. Error t-value  Pr(>|t|)    
train:(intercept) -0.3885180  0.2622157 -1.4817 0.1384272    
bus:(intercept)   -1.3712065  0.3599380 -3.8096 0.0001392 ***
car:(intercept)   -0.8711172  0.3979705 -2.1889 0.0286042 *  
vcost             -0.0138883  0.0055318 -2.5106 0.0120514 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood: -280.54
McFadden R^2:  0.011351 
Likelihood ratio test : chisq = 6.4418 (p.value = 0.011147)

In statsmodels:

> import pandas as pd
> import statsmodels.formula.api as smf
> TM = pd.read_csv('travelmode.csv')
> TM = pd.concat([TM, pd.get_dummies(TM['mode'])], axis=1)
> TMlogit = smf.mnlogit('choice ~ train + bus + car + vcost -1', TM)
> TMlogit_fit = TMlogit.fit()
Optimization terminated successfully.
         Current function value: 0.550273
         Iterations 6
> TMlogit_fit.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
                          MNLogit Regression Results                          
==============================================================================
Dep. Variable:                      y   No. Observations:                  840
Model:                        MNLogit   Df Residuals:                      836
Method:                           MLE   Df Model:                            3
Date:                Thu, 17 Mar 2016   Pseudo R-squ.:                 0.02145
Time:                        15:04:48   Log-Likelihood:                -462.23
converged:                       True   LL-Null:                       -472.36
                                        LLR p-value:                 0.0001497
=================================================================================
y=choice[yes]       coef    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
train            -0.3249      0.172     -1.891      0.059        -0.662     0.012
bus              -1.4468      0.205     -7.070      0.000        -1.848    -1.046
car              -0.7247      0.157     -4.603      0.000        -1.033    -0.416
vcost            -0.0105      0.002     -6.282      0.000        -0.014    -0.007
=================================================================================
"""

I would expect the coefficient estimates from the two models to be closer to each other. Any help would be appreciated.

Best Answer

I'm the creator of pylogit. Thanks for using my package!

To answer your question, the differences in estimation results come from differences in the way choice data is represented in statsmodels versus mlogit. The TravelMode dataset is natively in long format (i.e., as you wrote it to CSV): it has one row per alternative per observation. However, statsmodels assumes one's data is in wide format, with one row per observation. Thus there are 4 times as many "observations" in the statsmodels model (840) as there are in your mlogit model (210). Try calling length(unique(TravelMode$individual)).
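The same row-versus-choice-situation check can be done on the Python side. The sketch below uses a tiny made-up long-format frame (not the real TravelMode data) that only borrows the `individual`, `mode`, and `choice` column names; on the CSV written from R, the same two calls should report 840 rows and 210 individuals.

```python
import pandas as pd

# Hypothetical toy data in long format: one row per (individual, mode) pair.
toy = pd.DataFrame({
    "individual": [1, 1, 1, 1, 2, 2, 2, 2],
    "mode": ["air", "train", "bus", "car"] * 2,
    "choice": ["no", "yes", "no", "no", "no", "no", "no", "yes"],
})

# Number of rows: what a row-wise model treats as independent observations.
n_rows = len(toy)

# Number of distinct decision makers: the actual number of choice situations.
n_situations = toy["individual"].nunique()

# In well-formed long data, each individual has exactly one chosen row.
chosen_per_individual = (toy["choice"] == "yes").groupby(toy["individual"]).sum()

print(n_rows, n_situations)
print(chosen_per_individual.tolist())
```

If the two counts differ by a factor equal to the number of alternatives, the data is in long format and a row-wise model will overcount observations.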

Moreover, the variables from pd.get_dummies(TM['mode']) in your statsmodels model do not represent alternative-specific constants; they are simply the alternative identifiers for each row of the long-format data. This is in contrast to the intercepts that were estimated in the mlogit model.
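To make the distinction concrete, here is a minimal sketch of the kind of model mlogit fits for `choice ~ vcost`: a conditional logit with one intercept per non-base alternative and a single generic cost coefficient, where the likelihood has one term per choice situation rather than one per row. It is written directly in NumPy/SciPy on simulated data. The data, the "true" parameter values, and the use of BFGS are all assumptions for illustration (mlogit itself reports a Newton-Raphson fit), so the numbers will not match the TravelMode output above.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_obs, n_alt = 500, 4  # synthetic sizes, not the TravelMode dimensions

# Synthetic alternative-varying cost, one row per choice situation (wide layout).
cost = rng.uniform(0.0, 1.0, size=(n_obs, n_alt))

# Assumed "true" parameters, used only to simulate choices.
true_asc = np.array([0.0, -0.4, -1.4, -0.9])  # first alternative is the base
true_beta = -1.4

# Random-utility simulation: Gumbel errors give multinomial logit choices.
utility = true_asc + true_beta * cost + rng.gumbel(size=(n_obs, n_alt))
chosen = utility.argmax(axis=1)


def neg_loglik(params):
    """Negative log-likelihood with one term per choice situation."""
    # Alternative-specific constants; the base alternative is fixed at 0.
    asc = np.concatenate(([0.0], params[: n_alt - 1]))
    v = asc + params[-1] * cost
    # Subtract the row maximum for numerical stability before exponentiating.
    v = v - v.max(axis=1, keepdims=True)
    log_p = v - np.log(np.exp(v).sum(axis=1, keepdims=True))
    return -log_p[np.arange(n_obs), chosen].sum()


# Three intercepts plus one cost coefficient, started at zero.
result = minimize(neg_loglik, x0=np.zeros(n_alt), method="BFGS")
print(result.x)
```

The key point is in `neg_loglik`: the softmax is taken across the alternatives within each choice situation, so the intercepts shift the utility of an alternative relative to the base, which is not what a dummy for "this row is the train row" does in a row-wise logit.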
