Solved – Problem with building mlogit model (with no alternative specific variables)

Tags: logistic, logit, multinomial-distribution, r, regression

I am confused about how to use the mlogit package to build a multinomial logit model. The only variables in my data are individual-specific variables, to use the terminology of the mFormula() description in the package documentation.

Here is a minimal example showing the steps I take to build the model:

# load data 
library(RCurl)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

url <- getURL("https://dl.dropboxusercontent.com/u/73455291/StackOverflow/student_state_data.csv")
student.data <- read.csv(text = url, sep=",", header=TRUE)
for(name in names(student.data)) student.data[, name] <- as.factor(student.data[, name])

# build model 
length(levels(student.data[, "result_state"]))   # [1] 3

library(mlogit)
student.data.m <- mlogit.data(student.data, shape = "wide", choice = "result_state")
model.m <- mlogit(result_state ~ 0 | var_1 + var_2 + var_3 + var_4 + var_5 + var_6 | 0, 
              data = student.data.m[-c(1:3), ])

# predict 
predict(model.m, newdata = student.data.m[c(1:3), ])

I receive the following error:

> predict(model.m, newdata = student.data.m[c(1:3), ])
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
contrasts can be applied only to factors with 2 or more levels

What am I doing wrong? Here is some console output that gives a glimpse of the data:

> head(student.data)
  student_id var_1 var_2 var_3 var_4 var_5 var_6 result_state
1     216524     0     0    10     3  <NA>     1      state_3
2     245787     0     0    11     6  <NA>     0      state_3
3     120747     0     1     9     3  <NA>     1      state_3
4     130874     0     0     5     3  <NA>     0      state_1
5     156898     0     0     7     3  <NA>     0      state_3
6     241517     0     0     5     3  <NA>     1      state_1
> head(student.data.m)
          student_id var_1 var_2 var_3 var_4 var_5 var_6 result_state chid     alt
1.state_1     216524     0     0    10     3  <NA>     1        FALSE    1 state_1
1.state_2     216524     0     0    10     3  <NA>     1        FALSE    1 state_2
1.state_3     216524     0     0    10     3  <NA>     1         TRUE    1 state_3
2.state_1     245787     0     0    11     6  <NA>     0        FALSE    2 state_1
2.state_2     245787     0     0    11     6  <NA>     0        FALSE    2 state_2
2.state_3     245787     0     0    11     6  <NA>     0         TRUE    2 state_3

> sapply(names(student.data), function(name) length(levels(student.data[, name])))
  student_id        var_1        var_2        var_3        var_4        var_5        var_6 result_state 
        1000            2            2            8            8           12            2            3

Best Answer

The <NA>s in var_5 are killing you. Student 30135 has a complete row of data, so this works:

> new.data <- student.data.m[student.data.m$student_id == 30135, ]
> predict(model.m, newdata = new.data)
state_1      state_2      state_3 
3.980402e-01 5.812384e-09 6.019598e-01

Consider using the mice package (on CRAN) for imputation of those missing values, or dropping var_5 if it's not that theoretically important.
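The two simpler remedies (dropping incomplete rows, or dropping the affected predictor) can be sketched in base R. This is only an illustration under stated assumptions: `toy` and its values are made up and stand in for student.data, and imputation with mice is not shown.

```r
# Toy stand-in for student.data; all values here are invented for illustration.
toy <- data.frame(
  student_id   = 1:6,
  var_1        = factor(c(0, 0, 1, 0, 1, 0)),
  var_5        = factor(c(NA, NA, "a", "b", NA, "a")),
  result_state = factor(c("state_3", "state_3", "state_1",
                          "state_1", "state_3", "state_2"))
)

# Count missing values per column to see which predictors are affected.
na.counts <- colSums(is.na(toy))
print(na.counts)

# Option 1: keep only the rows that have no missing values.
toy.complete <- toy[complete.cases(toy), ]

# Option 2: drop the affected predictor and keep every row.
toy.no.var5 <- toy[, setdiff(names(toy), "var_5")]
```

Running the same `colSums(is.na(...))` check on the rows passed as newdata is a quick way to see whether a prediction set contains incomplete rows before calling predict().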

Related Solutions

Solved – Problem building multinomial logit model formula on huge data in R

Well, you are simply exhausting the RAM on your machine. Generally, you have four options:

1. Get a bigger computer (rather a bad idea, since it is close to impossible to fit more than a few hundred GB in one node).
2. Reduce the size of your problem.
3. Look for an HPC implementation of multinomial logit, probably outside R: one that uses sparse matrices, parallelizes across multiple nodes, and so on.
4. Switch to some other, more scalable algorithm.

Since you say that the problem has been solved before, option 3 is probably the way to go.

EDIT: I saw that the problem is in model.matrix.default. This is quite common, since R's algorithm for interpreting formulas (the statements with ~) is not very memory-efficient. If there is a way to fit your model without using formulas, try it.
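To give a feel for why the dense design matrix is the bottleneck, here is a rough sketch comparing model.matrix with sparse.model.matrix from the Matrix package on a made-up factor with many levels. The data frame `toy` and its sizes are assumptions for illustration, and whether a sparse design matrix can be used at all depends on the fitting routine you choose.

```r
library(Matrix)  # recommended package, shipped with standard R installations

set.seed(1)
# Hypothetical data: one factor with up to 200 levels, 5000 observations.
toy <- data.frame(
  f = factor(sample(sprintf("level_%03d", 1:200), 5000, replace = TRUE))
)

# Dense design matrix: every dummy column stores a value for every row.
dense <- model.matrix(~ f, data = toy)

# Sparse design matrix: only the non-zero entries are stored.
sparse <- sparse.model.matrix(~ f, data = toy)

print(dim(dense))
print(object.size(dense), units = "Kb")
print(object.size(sparse), units = "Kb")
```

Both matrices have the same dimensions, but the dense one grows with rows times columns, while the sparse one grows roughly with the number of rows.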

Solved – Why does GBM predict different values for the same data

The factors, as always. It seems the model is not using the actual value of the factor, but rather something like its position among the factor levels.

I was able to reproduce your error with the OrchardSprays data:

library(gbm)
data(OrchardSprays)

model <- gbm(decrease ~ rowpos+colpos+treatment, data=OrchardSprays, n.trees=1000, distribution="gaussian", interaction.depth=3, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)

firstrow <- OrchardSprays[1,]
str(firstrow)

manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment="D")
str(manualFirstrow)

predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)
predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)

output:

> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 20.67818
> predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)
[1] 20.67818

The last two predictions agree because A has position 1 in the levels of OrchardSprays$treatment. Supplying the full set of levels when constructing the data frame does the trick:

manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment=factor("D",levels(OrchardSprays$treatment)))
str(manualFirstrow)

predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)

output:

> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 50.31276
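The level-position effect can be checked without gbm at all. The sketch below uses only base R and the built-in OrchardSprays data; the names `naive` and `aligned` are introduced here for illustration. It calls factor() directly because the way data.frame() treats a bare string depends on the R version (the stringsAsFactors default changed in R 4.0.0).

```r
data(OrchardSprays)
full.levels <- levels(OrchardSprays$treatment)  # "A" through "H"

# A factor built from the string alone knows only one level.
naive <- factor("D")

# A factor built with the training levels keeps D in its original position.
aligned <- factor("D", levels = full.levels)

print(as.integer(naive))    # position of "D" within its own single-level factor
print(as.integer(aligned))  # position of "D" within the training levels
```

If a model looks at the integer code rather than the label, `naive` is indistinguishable from treatment A, which matches the identical predictions seen above.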
