I am doing a comparison between mlogit in R and statsmodels in Python and have had trouble getting them to produce the same result. I'm wondering whether the difference comes from the libraries themselves or from my specifying something incorrectly. Any help would be appreciated.
I am using the "TravelMode" dataset to test the two.
In R:
> library("mlogit")
> library("AER")
> data("TravelMode", package="AER")
> write.csv(TravelMode, "travelmode.csv")
> TM <- mlogit.data(TravelMode, choice = "choice", shape = "long",
chid.var = "individual", alt.var = "mode", drop.index = TRUE)
> TMlogit = mlogit(mFormula(choice ~ vcost), TM)
> summary(TMlogit)
Call:
mlogit(formula = mFormula(choice ~ vcost), data = TM, method = "nr",
print.level = 0)
Frequencies of alternatives:
air train bus car
0.27619 0.30000 0.14286 0.28095
nr method
4 iterations, 0h:0m:0s
g'(-H)^-1g = 0.000482
successive function values within tolerance limits
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
train:(intercept) -0.3885180 0.2622157 -1.4817 0.1384272
bus:(intercept) -1.3712065 0.3599380 -3.8096 0.0001392 ***
car:(intercept) -0.8711172 0.3979705 -2.1889 0.0286042 *
vcost -0.0138883 0.0055318 -2.5106 0.0120514 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Log-Likelihood: -280.54
McFadden R^2: 0.011351
Likelihood ratio test : chisq = 6.4418 (p.value = 0.011147)
In statsmodels:
> import pandas as pd
> import statsmodels.formula.api as smf
> TM = pd.read_csv('travelmode.csv')
> TM = pd.concat([TM, pd.get_dummies(TM['mode'])], axis=1)
> TMlogit = smf.mnlogit('choice ~ train + bus + car + vcost -1', TM)
> TMlogit_fit = TMlogit.fit()
Optimization terminated successfully.
Current function value: 0.550273
Iterations 6
> TMlogit_fit.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
MNLogit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 840
Model: MNLogit Df Residuals: 836
Method: MLE Df Model: 3
Date: Thu, 17 Mar 2016 Pseudo R-squ.: 0.02145
Time: 15:04:48 Log-Likelihood: -462.23
converged: True LL-Null: -472.36
LLR p-value: 0.0001497
=================================================================================
y=choice[yes] coef std err z P>|z| [95.0% Conf. Int.]
---------------------------------------------------------------------------------
train -0.3249 0.172 -1.891 0.059 -0.662 0.012
bus -1.4468 0.205 -7.070 0.000 -1.848 -1.046
car -0.7247 0.157 -4.603 0.000 -1.033 -0.416
vcost -0.0105 0.002 -6.282 0.000 -0.014 -0.007
=================================================================================
"""
I would have expected the coefficient values from the two models to be closer to each other.
Best Answer
I'm the creator of pylogit. Thanks for using my package! To answer your question, the differences in the estimation results come from differences in the way choice data is represented in statsmodels versus mlogit. The TravelMode dataset is natively in long format (i.e., as you wrote it to csv): it has one row per alternative per observation. statsmodels, however, assumes the data is in wide format, with one row per observation. There are therefore four times as many "observations" in the statsmodels model (840) as in your mlogit model. Try calling length(unique(TravelMode$individual)).

Moreover, the variables in your statsmodels model (pd.get_dummies(TM['mode'])) do not represent alternative-specific constants; they are simply the alternative identifiers for each row of the long-format data. This is in contrast with the intercepts that were estimated in the mlogit model.
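To make the long-vs-wide distinction concrete, here is a minimal pandas sketch using a hypothetical two-person, three-alternative dataset (the column names mirror TravelMode, but the individuals and numbers are made up for illustration):

```python
import pandas as pd

# Hypothetical long-format choice data, as in TravelMode:
# one row per (individual, alternative) pair.
long_df = pd.DataFrame({
    "individual": [1, 1, 1, 2, 2, 2],
    "mode": ["air", "train", "bus"] * 2,
    "choice": ["no", "yes", "no", "yes", "no", "no"],
    "vcost": [70, 30, 20, 65, 35, 25],
})

# mlogit sees 2 choice situations; a row-per-observation model sees 6 rows.
n_individuals = long_df["individual"].nunique()  # 2 choice situations
n_rows = len(long_df)                            # 6 rows

# Pivoting to wide format gives one row per individual, with one
# cost column per alternative -- the shape a row-per-observation
# estimator actually expects.
wide_df = long_df.pivot(index="individual", columns="mode", values="vcost")
wide_df["chosen"] = (
    long_df[long_df["choice"] == "yes"].set_index("individual")["mode"]
)
print(wide_df)
```

With the real data, the same check shows 210 individuals but 840 long-format rows, which is exactly the "No. Observations: 840" reported in the statsmodels summary above.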