I am attempting to build a Multinomial Logit model with dummy variables of the following form:
- The dependent variable is a discrete choice among nine alternatives, coded 0-8.
- Variable 1: a factor pre-expanded into 965 dummy columns
- Variable 2: a factor pre-expanded into 805 dummy columns
The data set I am using has the dummy columns pre-created, so it's a table of 72,381 rows and 1770 columns.
The first 965 columns represent the dummy columns for Variable 1; the next 805 columns represent the dummy columns for Variable 2.
I'm on a Sun Grid Machine at my university, so memory won't be an issue…
I have been able to generate the factors and build the mlogit data with this code:
mldata <- mlogit.data(mydata, varying = NULL, choice = "pitch_type_1", shape = "wide")
My mlogit data looks like:
"dependent_var","A variable","B Var","chid","alt"
FALSE,"110","19",1,"0"
FALSE,"110","19",1,"1"
FALSE,"110","19",1,"2"
FALSE,"110","19",1,"3"
FALSE,"110","19",1,"4"
TRUE,"110","19",1,"5"
FALSE,"110","19",1,"6"
FALSE,"110","19",1,"7"
FALSE,"110","19",1,"8"
FALSE,"110","19",2,"0"
FALSE,"110","19",2,"1"
FALSE,"110","19",2,"2"
FALSE,"110","19",2,"3"
FALSE,"110","19",2,"4"
FALSE,"110","19",2,"5"
TRUE,"110","19",2,"6"
FALSE,"110","19",2,"7"
FALSE,"110","19",2,"8"
TRUE,"110","561",3,"0"
...
The mldata contains 651,431 rows.
If I try to run this full data set I get the following error:
> mlogit.model<- mlogit(dependent_var~0|A+B, data = mldata, reflevel="0")
Error in model.matrix.default(formula, data) :
allocMatrix: too many elements specified
Calls: mlogit ... model.matrix.mFormula -> model.matrix -> model.matrix.default
Execution halted
With smaller datasets (an mldata of only 595 rows), mlogit works fine and generates the expected regression output. Is there a problem with mlogit and huge datasets?
I suppose this is perhaps not the best way to assess this kind of data, but I am trying to replicate a previous analysis that was completed on a similar amount of similar data.
Best Answer
Well, you are just exhausting RAM on your machine. Generally, you have four options:
Since you say this problem was solved once before, option 3 is probably the way to go.
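To see why RAM runs out, here is a back-of-envelope sketch. The exact column count depends on how mlogit expands the two factors across alternatives, so treat these figures as rough assumptions, not exact values:

```r
n_rows    <- 651431                 # long-format rows reported above
n_dummies <- (965 - 1) + (805 - 1)  # dummy columns after dropping reference levels
n_alts    <- 9                      # choices coded 0-8

# mlogit fits a separate coefficient per non-reference alternative for
# individual-specific variables, which multiplies the column count:
n_cols <- n_dummies * (n_alts - 1)

n_elements <- n_rows * n_cols
n_elements > .Machine$integer.max   # more cells than a dense R matrix may hold
n_elements * 8 / 2^30               # rough size in GiB as a dense double matrix
```

That is on the order of nine billion cells, close to 70 GiB as dense doubles, and it also exceeds the 2^31 - 1 element limit that produces the allocMatrix error.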
EDIT: I see that the failure happens in model.matrix.default; this is quite common, because R's formula-interpretation machinery (the statements with ~) is not written with memory efficiency in mind. If there is a way to fit your model without using a formula, try it.
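One way to sidestep the dense model.matrix.default allocation is a sparse design matrix, which stores only the non-zero cells (two 1s per row here instead of roughly 1,770 mostly-zero cells). A minimal sketch with Matrix::sparse.model.matrix on toy data standing in for the two factor variables; note this is a workaround under assumptions, since mlogit itself expects dense input, so you would have to estimate the model with something that accepts sparse matrices (e.g. glmnet's multinomial family):

```r
library(Matrix)  # ships with the standard R distribution

# Toy stand-in for the original wide table's two factor variables
d <- data.frame(A = factor(c("110", "110", "561")),
                B = factor(c("19",  "19",  "561")))

# Same design as model.matrix(~ 0 + A + B, d), but stored sparsely:
# only the non-zero entries are kept in memory
X <- sparse.model.matrix(~ 0 + A + B, data = d)
class(X)  # a sparse "dgCMatrix", not a dense base matrix
```

Memory for X grows with the number of non-zero entries, not with nrow * ncol, which is the quantity that blows up in the dense case.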