How do I go about doing a multinomial logistic regression when I have 70 million observations? Is it feasible? It seems that R is out of the question due to memory constraints?
Solved – Multinomial logistic regression for big data
large data, logistic, multinomial-distribution, regression
Related Solutions
With a multinomial logit model you impose the constraint that all the predicted probabilities add up to 1. When you use separate binary logit models you can no longer impose that constraint; they are estimated as separate models, after all. That is the main difference between the two approaches.
As you can see in the example below (in Stata, as that is the program I know best), the estimates from the two approaches tend to be similar but not the same. I would be especially careful about extrapolating predicted probabilities.
// some data preparation
. sysuse nlsw88, clear
(NLSW, 1988 extract)
.
. gen byte occat = cond(occupation < 3 , 1, ///
> cond(inlist(occupation, 5, 6, 8, 13), 2, 3)) ///
> if !missing(occupation)
(9 missing values generated)
. label variable occat "occupation in categories"
. label define occat 1 "high" ///
> 2 "middle" ///
> 3 "low"
. label value occat occat
.
. gen byte middle = (occat == 2) if occat !=1 & !missing(occat)
(590 missing values generated)
. gen byte high = (occat == 1) if occat !=2 & !missing(occat)
(781 missing values generated)
// a multinomial logit model
. mlogit occat i.race i.collgrad , base(3) nolog
Multinomial logistic regression Number of obs = 2237
LR chi2(6) = 218.82
Prob > chi2 = 0.0000
Log likelihood = -2315.9312 Pseudo R2 = 0.0451
-------------------------------------------------------------------------------
occat | Coef. Std. Err. z P>|z| [95% Conf. Interval]
--------------+----------------------------------------------------------------
high |
race |
black | -.4005801 .1421777 -2.82 0.005 -.6792433 -.121917
other | .4588831 .4962591 0.92 0.355 -.5137668 1.431533
|
collgrad |
college grad | 1.495019 .1341625 11.14 0.000 1.232065 1.757972
_cons | -.7010308 .0705042 -9.94 0.000 -.8392165 -.5628451
--------------+----------------------------------------------------------------
middle |
race |
black | .6728568 .1106792 6.08 0.000 .4559296 .889784
other | .2678372 .509735 0.53 0.599 -.7312251 1.266899
|
collgrad |
college grad | .976244 .1334458 7.32 0.000 .714695 1.237793
_cons | -.517313 .0662238 -7.81 0.000 -.6471092 -.3875168
--------------+----------------------------------------------------------------
low | (base outcome)
-------------------------------------------------------------------------------
// separate logits:
. logit high i.race i.collgrad , nolog
Logistic regression Number of obs = 1465
LR chi2(3) = 154.21
Prob > chi2 = 0.0000
Log likelihood = -906.79453 Pseudo R2 = 0.0784
-------------------------------------------------------------------------------
high | Coef. Std. Err. z P>|z| [95% Conf. Interval]
--------------+----------------------------------------------------------------
race |
black | -.5309439 .1463507 -3.63 0.000 -.817786 -.2441017
other | .2670161 .5116686 0.52 0.602 -.735836 1.269868
|
collgrad |
college grad | 1.525834 .1347081 11.33 0.000 1.261811 1.789857
_cons | -.6808361 .0694323 -9.81 0.000 -.816921 -.5447512
-------------------------------------------------------------------------------
. logit middle i.race i.collgrad , nolog
Logistic regression Number of obs = 1656
LR chi2(3) = 90.13
Prob > chi2 = 0.0000
Log likelihood = -1098.9988 Pseudo R2 = 0.0394
-------------------------------------------------------------------------------
middle | Coef. Std. Err. z P>|z| [95% Conf. Interval]
--------------+----------------------------------------------------------------
race |
black | .6942945 .1114418 6.23 0.000 .4758725 .9127164
other | .3492788 .5125802 0.68 0.496 -.6553598 1.353918
|
collgrad |
college grad | .9979952 .1341664 7.44 0.000 .7350339 1.260957
_cons | -.5287625 .0669093 -7.90 0.000 -.6599023 -.3976226
-------------------------------------------------------------------------------
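To make "similar but not the same" concrete, here is a small sketch that turns the intercepts (`_cons`) reported above into predicted probabilities for the reference group (white, not a college graduate). It rests on an assumption that is not part of the Stata session: because each binary logit above compares one category against "low" only, its intercept is treated as a log-odds against the same base, and the odds are normalized after the fact. The helper name `probs_from_log_odds` is made up for this illustration.

```python
import math

# Intercepts copied from the Stata output above; they describe the reference
# group (race = white, collgrad = not a college graduate). "low" is the base.
mlogit_cons = {"high": -0.7010308, "middle": -0.5173130}
logit_cons = {"high": -0.6808361, "middle": -0.5287625}


def probs_from_log_odds(log_odds_vs_base, base="low"):
    """Turn log-odds against a base category into probabilities that sum to 1."""
    odds = {cat: math.exp(v) for cat, v in log_odds_vs_base.items()}
    odds[base] = 1.0  # the base category has log-odds 0 by construction
    total = sum(odds.values())
    return {cat: o / total for cat, o in odds.items()}


p_mlogit = probs_from_log_odds(mlogit_cons)
# Assumption: the separate logits are recombined by normalizing their odds.
p_separate = probs_from_log_odds(logit_cons)

for cat in ("high", "middle", "low"):
    print(f"{cat:>6}: mlogit={p_mlogit[cat]:.4f}  separate={p_separate[cat]:.4f}")
```

In the multinomial model the sum-to-one constraint is part of the estimation itself; for the separate logits it only holds here because the sketch normalizes the odds afterwards.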
You can also fit one multinomial logistic model directly rather than fitting three one-vs-rest binary regressions.
To do so, code each categorical response $y_i$ as a vector of three $0$s and one $1$ whose position indicates the category, and let $\pi_i$ be the vector of probabilities associated with $y_i$. You can then directly minimize the cross entropy: $$H = -\sum_i \sum_{j = 1}^{4} y_{ij} \log(\pi_{ij})$$ (this is also the negative log-likelihood of the model).
The parameter of your multinomial logistic regression is a matrix $\Gamma$ with $4 - 1 = 3$ rows (because one category is the reference category) and $p$ columns, where $p$ is the number of features you have (or $p + 1$ columns if you add an intercept). Each column corresponds to a feature. So to assess the importance of the $j$-th feature you can, for instance, run a test (e.g. a likelihood ratio test or a Wald-type test) of $\mathcal{H}_0 : \Gamma_{,j} = 0$, where $\Gamma_{,j}$ denotes the $j$-th column of $\Gamma$. The $p$-value you get tells you how statistically significant the feature is.
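As a minimal sketch of this setup, the snippet below builds the probability vector $\pi_i$ from a $3 \times (p + 1)$ matrix $\Gamma$ and evaluates the categorical cross entropy $-\sum_i \sum_j y_{ij} \log(\pi_{ij})$, which is the multinomial negative log-likelihood. The function names, the toy data, and the values in `gamma` are all made up for illustration; placing the reference category last is also just a convention chosen here.

```python
import math


def softmax_with_reference(gamma, x):
    """Category probabilities for one observation.

    gamma: list of rows, one per non-reference category (3 rows for 4 categories).
    x: feature vector (with a leading 1.0 if the model has an intercept).
    The reference category is placed last and has its linear score fixed at 0.
    """
    scores = [sum(g * v for g, v in zip(row, x)) for row in gamma] + [0.0]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(gamma, X, y):
    """Negative log-likelihood; y[i] is the 0-based index of the observed category."""
    return -sum(math.log(softmax_with_reference(gamma, x)[k]) for x, k in zip(X, y))


# Hypothetical toy data: intercept plus one feature, 4 categories (index 3 = reference).
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.0]]
y = [0, 3, 1, 2]
gamma = [[0.1, 0.4], [-0.3, 0.8], [0.2, -0.5]]  # illustrative values, not estimates

print(cross_entropy(gamma, X, y))
```

Fitting the model means minimizing `cross_entropy` over `gamma` with any numerical optimizer; with all coefficients at zero, every category gets probability 1/4.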
In this case, the likelihood ratio test amounts to taking twice the increase in cross entropy you get by removing a feature and comparing it to a $\chi^2_k$ distribution, where $k$ is the number of coefficients removed along with the feature (3 for a single column of $\Gamma$, one per non-reference category).
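The arithmetic of that test can be sketched as follows. The two cross-entropy values are hypothetical, and `k=3` assumes a single scalar feature in a 4-category model (one coefficient per non-reference category). `chi2_sf` is a small helper written for this example so that only the standard library is needed; it handles integer degrees of freedom.

```python
import math


def chi2_sf(x, k):
    """Upper tail P(X > x) of a chi-square distribution with integer k >= 1 df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if k % 2 == 0:
        a, q = 1.0, math.exp(-half)  # closed form for k = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))  # closed form for k = 1
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while 2 * a < k:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q


def likelihood_ratio_test(h_full, h_reduced, k):
    """h_full and h_reduced are summed cross entropies of two nested models."""
    stat = 2.0 * (h_reduced - h_full)
    return stat, chi2_sf(stat, k)


# Hypothetical cross entropies with and without the feature under test.
stat, p_value = likelihood_ratio_test(h_full=1250.0, h_reduced=1256.5, k=3)
print(f"LR statistic = {stat:.2f}, p-value = {p_value:.4f}")
```

The cross entropies must be sums over observations (not averages) for the statistic to be on the $\chi^2$ scale.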
I hope this helps.
Best Answer
(1) This doesn't seem like a multinomial regression question, but rather a "how to use R with a large dataset" question. There is nothing intrinsic to multinomial regression that limits the number of observations.
(2) I would use a commercial package.
(3) Many others have used R successfully. I think this question has already been addressed a number of times, for example here:
https://stackoverflow.com/questions/11055502/recommended-package-for-very-large-dataset-processing-and-machine-learning-in-r
https://stackoverflow.com/questions/13186077/work-in-r-with-very-large-data-set
...
http://www.r-bloggers.com/handling-large-datasets-in-r/
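One way to see why the number of observations is not an intrinsic limit: the negative log-likelihood of a multinomial logit and its gradient are sums over observations, so they can be accumulated chunk by chunk without holding the full dataset in memory. The sketch below is not taken from the linked threads; `toy_rows` is a made-up stand-in for reading a large file row by row, and all other names are illustrative.

```python
import math
from itertools import islice


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nll_and_gradient(rows, weights):
    """Negative log-likelihood and gradient for rows of (features, class index).

    weights has one row per non-reference class; the last class is the reference.
    """
    n_free, n_feat = len(weights), len(weights[0])
    nll = 0.0
    grad = [[0.0] * n_feat for _ in range(n_free)]
    for x, k in rows:
        probs = softmax([sum(w * v for w, v in zip(row, x)) for row in weights] + [0.0])
        nll -= math.log(probs[k])
        for j in range(n_free):
            err = probs[j] - (1.0 if j == k else 0.0)
            for f in range(n_feat):
                grad[j][f] += err * x[f]
    return nll, grad


def chunked(iterable, size):
    """Yield lists of at most `size` items, so only one chunk is held at a time."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def streaming_nll_and_gradient(row_source, weights, chunk_size=1000):
    """Accumulate the same two quantities one chunk at a time."""
    total_nll = 0.0
    total_grad = [[0.0] * len(weights[0]) for _ in weights]
    for chunk in chunked(row_source, chunk_size):
        nll, grad = nll_and_gradient(chunk, weights)
        total_nll += nll
        for j, g_row in enumerate(grad):
            for f, g in enumerate(g_row):
                total_grad[j][f] += g
    return total_nll, total_grad


def toy_rows(n):
    """Hypothetical data source: intercept plus one feature, 3 classes."""
    for i in range(n):
        yield [1.0, (i % 7) / 7.0], i % 3


weights = [[0.2, -0.1], [0.0, 0.3]]  # illustrative values, not estimates
print(streaming_nll_and_gradient(toy_rows(50), weights, chunk_size=8))
```

An optimizer that only needs the objective and its gradient can call `streaming_nll_and_gradient` at every iteration, at the cost of one pass over the data per call. If all predictors are categorical, collapsing the data to one frequency-weighted row per covariate pattern is another option worth considering.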