Solved – Problem with building mlogit model (with no alternative specific variables)

logistic, logit, multinomial-distribution, r, regression

I am having trouble using the mlogit package to build a multinomial logit model. My data contain only individual-specific variables, to use the terminology of the mFormula() description in the package documentation; there are no alternative-specific variables.

Here is a minimal example showing the steps I take to build the model:

# load data 
library(RCurl)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

url <- getURL("https://dl.dropboxusercontent.com/u/73455291/StackOverflow/student_state_data.csv")
student.data <- read.csv(text = url, sep=",", header=TRUE)
for(name in names(student.data)) student.data[, name] <- as.factor(student.data[, name])

# build model 
length(levels(student.data[, "result_state"]))   # [1] 3

library(mlogit)
student.data.m <- mlogit.data(student.data, shape = "wide", choice = "result_state")
model.m <- mlogit(result_state ~ 0 | var_1 + var_2 + var_3 + var_4 + var_5 + var_6 | 0, 
              data = student.data.m[-c(1:3), ])

# predict 
predict(model.m, newdata = student.data.m[c(1:3), ])

I receive the following error:

> predict(model.m, newdata = student.data.m[c(1:3), ])
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
contrasts can be applied only to factors with 2 or more levels

What am I doing wrong? Here is some console output showing what the data look like:

> head(student.data)
  student_id var_1 var_2 var_3 var_4 var_5 var_6 result_state
1     216524     0     0    10     3  <NA>     1      state_3
2     245787     0     0    11     6  <NA>     0      state_3
3     120747     0     1     9     3  <NA>     1      state_3
4     130874     0     0     5     3  <NA>     0      state_1
5     156898     0     0     7     3  <NA>     0      state_3
6     241517     0     0     5     3  <NA>     1      state_1
> head(student.data.m)
          student_id var_1 var_2 var_3 var_4 var_5 var_6 result_state chid     alt
1.state_1     216524     0     0    10     3  <NA>     1        FALSE    1 state_1
1.state_2     216524     0     0    10     3  <NA>     1        FALSE    1 state_2
1.state_3     216524     0     0    10     3  <NA>     1         TRUE    1 state_3
2.state_1     245787     0     0    11     6  <NA>     0        FALSE    2 state_1
2.state_2     245787     0     0    11     6  <NA>     0        FALSE    2 state_2
2.state_3     245787     0     0    11     6  <NA>     0         TRUE    2 state_3

> sapply(names(student.data), function(name) length(levels(student.data[, name])))
  student_id        var_1        var_2        var_3        var_4        var_5        var_6 result_state 
        1000            2            2            8            8           12            2            3

Best Answer

The <NA> values in var_5 are the problem: the three rows you pass as newdata all have var_5 missing. Student 30135 has a complete row of data, so prediction works for that student:

> new.data <- student.data.m[student.data.m$student_id == 30135, ]
> predict(model.m, newdata = new.data)
     state_1      state_2      state_3 
3.980402e-01 5.812384e-09 6.019598e-01 
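
To find out which rows are safe to pass as newdata, it may help to count the missing values per column and list the complete rows first. The following is only a sketch in base R: the data frame `toy` and its values are invented stand-ins for student.data, not the real data.

```r
# Toy stand-in for student.data; the values are invented for illustration.
toy <- data.frame(
  student_id = c(1, 2, 3, 4, 5),
  var_5      = factor(c(NA, NA, NA, "2", "7")),
  var_6      = factor(c("1", "0", "1", "0", "1"))
)

# Number of missing values in each column
na.per.column <- colSums(is.na(toy))
print(na.per.column)

# TRUE for rows that have no missing value in any column
complete.rows <- complete.cases(toy)

# IDs of the students whose rows are complete
print(toy$student_id[complete.rows])
```

Running the same two checks on the real student.data should show how many rows are affected by var_5 and which student IDs can be used for prediction as is.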

Consider using the mice package (on CRAN) for imputation of those missing values, or dropping var_5 if it's not that theoretically important.
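
If you go the non-imputation route, here is a rough sketch of the two simplest options in base R: dropping the var_5 column entirely, or keeping only the rows without missing values. Again, `toy` is a made-up stand-in for student.data, and either step would be done before calling mlogit.data().

```r
# Toy stand-in for student.data; the values are invented for illustration.
toy <- data.frame(
  var_1        = factor(c("0", "1", "0", "1")),
  var_5        = factor(c(NA, "3", NA, "4")),
  result_state = factor(c("state_1", "state_3", "state_2", "state_1"))
)

# Option 1: drop var_5 and keep every row
toy.no.var5 <- toy[, setdiff(names(toy), "var_5"), drop = FALSE]

# Option 2: keep var_5 but only the rows with no missing values
toy.complete <- toy[complete.cases(toy), , drop = FALSE]

print(dim(toy.no.var5))   # rows, columns after dropping var_5
print(dim(toy.complete))  # rows, columns after dropping incomplete rows
```

With option 1, var_5 also has to be removed from the model formula. With option 2 you lose every student whose var_5 is missing, which may be a large share of the sample, so check how many rows remain before fitting.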