Solved – Problem building multinomial logit model formula on huge data in R

Tags: logistic, multinomial-distribution, r

I am attempting to build a Multinomial Logit model with dummy variables of the following form:

  • The dependent variable takes 9 discrete values (choices coded 0-8).
  • Variable 1: a categorical predictor expanded into 965 dummy columns.
  • Variable 2: a categorical predictor expanded into 805 dummy columns.

The data set I am using has the dummy columns pre-created, so it's a table of 72,381 rows and 1,770 columns.

The first 965 columns are the dummies for Variable 1; the next 805 columns are the dummies for Variable 2.
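For concreteness, the expansion looks like this (a toy example with two small factors, not my actual variables):

f1 <- factor(c("a", "b", "a"))
f2 <- factor(c("x", "x", "y"))
## with ~ 0 + ..., the first factor keeps all its levels and the
## second drops its first level (R's default treatment contrasts)
model.matrix(~ 0 + f1 + f2)
#   f1a f1b f2y
# 1   1   0   0
# 2   0   1   0
# 3   1   0   1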

I'm on a Sun Grid Machine at my university, so memory won't be an issue…

I have been able to generate the factors and build the mlogit data with the following code:

mldata <- mlogit.data(mydata, varying = NULL, choice = "pitch_type_1", shape = "wide")

My mlogit data looks like this:

"dependent_var","A variable","B Var","chid","alt"
FALSE,"110","19",1,"0"
FALSE,"110","19",1,"1"
FALSE,"110","19",1,"2"
FALSE,"110","19",1,"3"
FALSE,"110","19",1,"4"
TRUE,"110","19",1,"5"
FALSE,"110","19",1,"6"
FALSE,"110","19",1,"7"
FALSE,"110","19",1,"8"
FALSE,"110","19",2,"0"
FALSE,"110","19",2,"1"
FALSE,"110","19",2,"2"
FALSE,"110","19",2,"3"
FALSE,"110","19",2,"4"
FALSE,"110","19",2,"5"
TRUE,"110","19",2,"6"
FALSE,"110","19",2,"7"
FALSE,"110","19",2,"8"
TRUE,"110","561",3,"0"

...

The mldata contains 651,431 rows.

If I try to run the full data set, I get the following error:

> mlogit.model <- mlogit(dependent_var ~ 0 | A + B, data = mldata, reflevel = "0")
Error in model.matrix.default(formula, data) :
allocMatrix: too many elements specified
Calls: mlogit ... model.matrix.mFormula -> model.matrix -> model.matrix.default
Execution halted

With smaller datasets (an mldata of only 595 rows), mlogit works fine and generates the expected regression output.

Is there a problem with mlogit and huge datasets?

I suppose this is perhaps not the best way to model this kind of data, but I am trying to replicate a previous analysis that was run on a similar amount of comparable data.

Best Answer

Well, you are just exhausting RAM on your machine. Generally, you have four options:

  1. Get a bigger computer (a rather bad idea, since it is nearly impossible to fit more than a few hundred GB of RAM into a single node).
  2. Limit your problem (fewer dummy levels or fewer rows).
  3. Look for an HPC implementation of multinomial logit, probably outside R: one that uses sparse matrices and can be parallelized across multiple nodes (see the sketch after this list).
  4. Switch to some more scalable algorithm.
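As one sketch of the sparse route without leaving R (my illustration, not something from the original analysis; var1 and var2 are stand-ins for your two real factor columns, whose names I don't know):

library(Matrix)
library(glmnet)

## Build the ~1,770 dummy columns as a sparse matrix instead of the
## dense one that model.matrix would try to allocate.
X <- sparse.model.matrix(~ 0 + var1 + var2, data = mydata)

## glmnet fits multinomial models directly on sparse matrices.
## lambda = 0 approximates the unpenalized MLE; a small ridge penalty
## is often more stable with this many dummy levels.
fit <- glmnet(X, y = factor(mydata$pitch_type_1),
              family = "multinomial", lambda = 0)

The point is that the dummies are never materialized densely, so the allocation that kills mlogit never happens.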

Since you say this problem was solved once before on similar data, option 3 is probably the way to go.

EDIT: I see that the problem is in model.matrix.default; this seems quite common, because the formula-interpretation machinery in R (those statements with ~) is not written with memory efficiency in mind. If there is a way to run your model without using formulas, try it.
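Some back-of-the-envelope arithmetic (my addition; the 8-columns-per-dummy expansion is a guess at what the formula machinery produces for 8 non-reference alternatives, not verified against the package source):

rows <- 651431           # long-format rows
cols <- 1770 * 8         # dummies x 8 non-reference alternatives
rows * cols              # ~9.2e9 elements
.Machine$integer.max     # 2147483647 -- more than a matrix can hold
rows * cols * 8 / 2^30   # ~69 GiB for a single dense double matrix

Even before RAM runs out, the element count alone exceeds what an R matrix can hold, which matches the "allocMatrix: too many elements specified" error you see.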
