Solved – Multinomial logistic regression for big data

Tags: large data, logistic, multinomial-distribution, regression

How do I go about doing a multinomial logistic regression when I have 70 million observations? Is it feasible? It seems that R is out of the question due to memory constraints?

Best Answer

(1) This doesn't seem like a multinomial regression question, but rather a "how to use R with a large dataset" question. Nothing intrinsic to multinomial regression restricts the number of observations.
(2) I would use a commercial package.
(3) Many others have used R successfully. I think this question has already been addressed here a number of times:
https://stackoverflow.com/questions/11055502/recommended-package-for-very-large-dataset-processing-and-machine-learning-in-r
https://stackoverflow.com/questions/13186077/work-in-r-with-very-large-data-set
...
http://www.r-bloggers.com/handling-large-datasets-in-r/
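
To illustrate point (1), here is a minimal sketch (in Python, standard library only) of fitting a multinomial logistic regression one row at a time with stochastic gradient descent, so memory use depends on the number of coefficients rather than the number of observations. The `stream_rows` generator is a hypothetical stand-in for reading a large file in chunks, the data are simulated, and the learning rate is an arbitrary choice. This is not what the packages linked above do internally, just one way the idea can work.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def stream_rows(n_rows, seed=0):
    """Hypothetical data source: yields (features, label) one row at a time.
    In practice this would read a file or database cursor in chunks."""
    rng = random.Random(seed)
    centers = [(-2.0, 0.0), (2.0, 0.0), (0.0, 3.0)]  # one simulated cluster per class
    for _ in range(n_rows):
        label = rng.randrange(3)
        cx, cy = centers[label]
        # The leading 1.0 plays the role of the intercept.
        yield [1.0, cx + rng.gauss(0, 1), cy + rng.gauss(0, 1)], label

def sgd_multinomial(rows, n_classes, n_features, lr=0.05):
    """One pass of stochastic gradient descent on the multinomial log-loss.
    Only the coefficient matrix is kept in memory, never the full dataset."""
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for x, label in rows:
        probs = softmax([sum(w * v for w, v in zip(row, x)) for row in weights])
        for k in range(n_classes):
            # Gradient of the log-loss with respect to the score of class k.
            err = probs[k] - (1.0 if k == label else 0.0)
            for j in range(n_features):
                weights[k][j] -= lr * err * x[j]
    return weights

def predict(weights, x):
    """Return the index of the class with the largest linear score."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return scores.index(max(scores))

weights = sgd_multinomial(stream_rows(20000, seed=1), n_classes=3, n_features=3)
test_rows = list(stream_rows(2000, seed=2))
accuracy = sum(predict(weights, x) == y for x, y in test_rows) / len(test_rows)
print(f"held-out accuracy: {accuracy:.3f}")
```

For simplicity the sketch keeps one coefficient row per class instead of fixing a reference category, so the coefficients are not identified the way `mlogit`-style output is, although the predicted probabilities are still well defined.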

Related Solutions

Logistic Regression vs Multinomial Regression – Differences and Use Cases Explained

With a multinomial logit model you impose the constraint that all the predicted probabilities add up to 1. When you use separate binary logit models you can no longer impose that constraint, since each one is estimated on its own. That is the main difference between the two approaches.

As you can see in the example below (in Stata, as that is the program I know best), the models tend to be similar but not the same. I would be especially careful about extrapolating predicted probabilities.

// some data preparation
. sysuse nlsw88, clear                                                               
(NLSW, 1988 extract)                                                                 

.                                                                                    
. gen byte occat = cond(occupation < 3                 , 1,      ///                 
>                  cond(inlist(occupation, 5, 6, 8, 13), 2, 3))  ///                 
>                  if !missing(occupation)                                           
(9 missing values generated)                                                         

. label variable occat "occupation in categories"                                    

. label define occat 1 "high"   ///                                                  
>                    2 "middle" ///                                                  
>                    3 "low"                                                         

. label value occat occat                                                            

.                                                                                    
. gen byte middle = (occat == 2) if occat !=1 & !missing(occat)                      
(590 missing values generated)                                                       

. gen byte high   = (occat == 1) if occat !=2 & !missing(occat)                      
(781 missing values generated)                                                       


// a multinomial logit model
. mlogit occat i.race i.collgrad , base(3) nolog                                     

Multinomial logistic regression                   Number of obs   =       2237       
                                                  LR chi2(6)      =     218.82       
                                                  Prob > chi2     =     0.0000       
Log likelihood = -2315.9312                       Pseudo R2       =     0.0451       

-------------------------------------------------------------------------------      
        occat |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]      
--------------+----------------------------------------------------------------      
high          |                                                                      
         race |                                                                      
       black  |  -.4005801   .1421777    -2.82   0.005    -.6792433    -.121917      
       other  |   .4588831   .4962591     0.92   0.355    -.5137668    1.431533      
              |                                                                      
     collgrad |                                                                      
college grad  |   1.495019   .1341625    11.14   0.000     1.232065    1.757972      
        _cons |  -.7010308   .0705042    -9.94   0.000    -.8392165   -.5628451      
--------------+----------------------------------------------------------------      
middle        |                                                                      
         race |                                                                      
       black  |   .6728568   .1106792     6.08   0.000     .4559296     .889784      
       other  |   .2678372    .509735     0.53   0.599    -.7312251    1.266899      
              |                                                                      
     collgrad |                                                                      
college grad  |    .976244   .1334458     7.32   0.000      .714695    1.237793      
        _cons |   -.517313   .0662238    -7.81   0.000    -.6471092   -.3875168      
--------------+----------------------------------------------------------------      
low           |  (base outcome)                                                      
-------------------------------------------------------------------------------      

// separate logits:
. logit high   i.race i.collgrad , nolog                                             

Logistic regression                               Number of obs   =       1465       
                                                  LR chi2(3)      =     154.21       
                                                  Prob > chi2     =     0.0000       
Log likelihood = -906.79453                       Pseudo R2       =     0.0784       

-------------------------------------------------------------------------------      
         high |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]      
--------------+----------------------------------------------------------------      
         race |                                                                      
       black  |  -.5309439   .1463507    -3.63   0.000     -.817786   -.2441017      
       other  |   .2670161   .5116686     0.52   0.602     -.735836    1.269868      
              |                                                                      
     collgrad |                                                                      
college grad  |   1.525834   .1347081    11.33   0.000     1.261811    1.789857      
        _cons |  -.6808361   .0694323    -9.81   0.000     -.816921   -.5447512      
-------------------------------------------------------------------------------      

. logit middle i.race i.collgrad , nolog                                             

Logistic regression                               Number of obs   =       1656       
                                                  LR chi2(3)      =      90.13       
                                                  Prob > chi2     =     0.0000       
Log likelihood = -1098.9988                       Pseudo R2       =     0.0394       

-------------------------------------------------------------------------------      
       middle |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]      
--------------+----------------------------------------------------------------      
         race |                                                                      
       black  |   .6942945   .1114418     6.23   0.000     .4758725    .9127164      
       other  |   .3492788   .5125802     0.68   0.496    -.6553598    1.353918      
              |                                                                      
     collgrad |                                                                      
college grad  |   .9979952   .1341664     7.44   0.000     .7350339    1.260957      
        _cons |  -.5287625   .0669093    -7.90   0.000    -.6599023   -.3976226      
-------------------------------------------------------------------------------
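
To make the comparison concrete, here is a small sketch (in Python, not part of the original Stata session) that plugs the intercepts reported above into the two kinds of model. It covers only the baseline group, i.e. white respondents who are not college graduates. Because each binary logit is fitted on two of the three categories, its prediction is a conditional probability, so the sketch compares it with the matching conditional probability implied by the multinomial model.

```python
import math

def softmax_with_base(etas):
    """Probabilities from linear predictors, with the base outcome's predictor fixed at 0."""
    exps = [math.exp(e) for e in etas] + [1.0]  # last entry is the base outcome
    total = sum(exps)
    return [e / total for e in exps]

def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

# Intercepts (_cons) copied from the output above.
mlogit_cons = {"high": -0.7010308, "middle": -0.517313}
logit_cons = {"high": -0.6808361, "middle": -0.5287625}

p_high, p_middle, p_low = softmax_with_base([mlogit_cons["high"], mlogit_cons["middle"]])
print(f"mlogit: high={p_high:.4f} middle={p_middle:.4f} low={p_low:.4f} "
      f"sum={p_high + p_middle + p_low:.4f}")

# Compare like with like: P(high | high or low) and P(middle | middle or low).
for name, p in (("high", p_high), ("middle", p_middle)):
    from_mlogit = p / (p + p_low)
    from_logit = inv_logit(logit_cons[name])
    print(f"P({name} | {name} or low): mlogit={from_mlogit:.4f} logit={from_logit:.4f}")
```

The multinomial probabilities sum to 1 by construction, while the two approaches give close but not identical conditional probabilities, in line with the coefficient tables.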

Solved – Feature Importance for Multinomial Logistic Regression

You can also fit one multinomial logistic model directly rather than fitting three one-vs-rest binary regressions.

To do so, code each categorical response $y_i$ as a vector of three $0$s and one $1$ whose position indicates the category, and let $\pi_i$ be the vector of probabilities associated with $y_i$. You can then directly minimize the cross entropy: $$H = -\sum_i \sum_{j = 1}^{4} y_{ij} \log(\pi_{ij})$$ (this is also the negative log-likelihood of the model).
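
As a small illustration, the multinomial negative log-likelihood $-\sum_i \sum_j y_{ij} \log(\pi_{ij})$ can be computed as follows. This is a sketch with two made-up observations and made-up probabilities; only the observed category of each row contributes to the sum.

```python
import math

def cross_entropy(y_onehot, probs):
    """Multinomial negative log-likelihood: -sum_i sum_j y_ij * log(pi_ij)."""
    return -sum(
        y * math.log(p)
        for y_row, p_row in zip(y_onehot, probs)
        for y, p in zip(y_row, p_row)
        if y  # skip the zero entries of the one-hot vector
    )

# Two made-up observations with four categories each.
y = [[0, 0, 1, 0],
     [1, 0, 0, 0]]
pi = [[0.1, 0.2, 0.6, 0.1],
      [0.7, 0.1, 0.1, 0.1]]

H = cross_entropy(y, pi)
print(H)  # equals -(log(0.6) + log(0.7))
```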

The parameter of your multinomial logistic regression is a matrix $\Gamma$ with $4 - 1 = 3$ rows (because one category is the reference category) and $p$ columns, where $p$ is the number of features (or $p + 1$ columns if you add an intercept). Each column corresponds to a feature. So to assess the importance of the $j$-th feature you can, for instance, run a test (e.g. a likelihood ratio test or a Wald-type test) of $\mathcal{H}_0 : \Gamma_{\cdot,j} = 0$, where $\Gamma_{\cdot,j}$ denotes the $j$-th column of $\Gamma$. The $p$-value you get tells you how significant the feature is.

In this case, the likelihood ratio test amounts to taking twice the increase in cross entropy you get by removing a feature and comparing it to a $\chi^2_k$ distribution, where $k$ is the number of parameters removed with that feature (3 for a single column of $\Gamma$, one per non-reference category).
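
The arithmetic can be checked against the Stata `mlogit` output shown earlier on this page, which reports the likelihood ratio test for removing all features at once rather than a single one. In that model there are 3 feature columns and 2 non-base outcomes, hence 6 degrees of freedom. The sketch below (Python, standard library only) recovers the intercept-only log likelihood and the pseudo R2 from the reported numbers. The closed-form tail probability used here is valid only for an even number of degrees of freedom, which is a limitation of this sketch and not of the test.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square distribution, exact for even df only."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even number of degrees of freedom")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total

# Numbers copied from the mlogit output earlier on this page.
ll_full = -2315.9312   # log likelihood of the fitted model
lr_stat = 218.82       # reported LR chi2(6)

n_feature_columns = 3  # black, other, college grad
n_nonbase_outcomes = 2 # high and middle (low is the base outcome)
df = n_feature_columns * n_nonbase_outcomes

# LR = 2 * (ll_full - ll_null), so the intercept-only log likelihood is:
ll_null = ll_full - lr_stat / 2.0
pseudo_r2 = 1.0 - ll_full / ll_null
p_value = chi2_sf_even_df(lr_stat, df)

print(f"df={df} ll_null={ll_null:.4f} pseudo_r2={pseudo_r2:.4f} p={p_value:.3g}")
```

To test a single feature instead, you would refit the model without that feature and use the two fitted log likelihoods in the same way, with the degrees of freedom equal to the number of coefficients dropped.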

I hope this helps.
