Multinomial Distribution – Fitting a Multinomial GLM for Very Large Datasets

categorical datageneralized linear modelmultinomial logitmultinomial-distributionnnet

I have compositional data where for two groups, where each is represented by two ages, there are 100 possible categories for which I observed counts:

set.seed(1)
df <- data.frame(group = c(rep("g1",200),rep("g2",200)),
                 age = c(rep("a1",100),rep("a2",100),rep("a1",100),rep("a2",100)),
                 category = rep(paste0("c",1:100),4),
                 n = c(rmultinom(1,7000,pgamma(shape=0.8,rate=0.1,q=seq(0.01,1,0.01))),
                       rmultinom(1,5000,pgamma(shape=0.8,rate=0.3,q=seq(0.01,1,0.01))),
                       rmultinom(1,1800,pgamma(shape=0.5,rate=0.1,q=seq(0.01,1,0.01))),
                       rmultinom(1,1200,pgamma(shape=0.9,rate=0.1,q=seq(0.01,1,0.01)))),
                 stringsAsFactors = F)

I want to fit a regression model to the multinomial counts as a response and estimate the interaction effect of the category * group, while controlling for age.

So far, I'm trying to use a multicategorical glm (with a binomial(link = 'logit')), to a data.frame where I transform the df$n (total counts) to a binomial (binary) form:

binomial.df <- do.call(rbind,lapply(unique(df$group),function(g){
  do.call(rbind,lapply(unique(dplyr::filter(df,group == g)$age),function(a){
    do.call(rbind,lapply(unique(dplyr::filter(df,group == g)$category),function(t){
      sum.non.category <- sum(dplyr::filter(df,group == g & age == a & category != t)$n)
      sum.category <- sum(dplyr::filter(df,group == g & age == a & category == t)$n)
      data.frame(group = g,age = a,category = t,assigned.category = c(rep(0,sum.non.category),rep(1,sum.category)))
    })) 
  }))
}))

binomial.df$group <- factor(binomial.df$group, levels = c("g1","g2"))
binomial.df$age <- factor(binomial.df$age, levels = c("a1","a2"))
binomial.df$category <- factor(binomial.df$category, levels = paste0("c",1:100))

mm.fit <- glm(assigned.category ~ category * group + age,data = binomial.df, family = binomial(link = 'logit'))

Clearly for this size of data the glm call will run for days or even longer, so I'm looking for a more tractable way.

Any idea?

BTW, I tried using nnet's multinom first:

df$group <- factor(df$group, levels = c("g1","g2"))
df$age <- factor(df$age, levels = c("a1","a2"))
df$category <- factor(df$category, levels = paste0("c",1:100))
mm.fit <- nnet::multinom(n ~ category * group + age, data=df)

But I get:

Error in nnet.default(X, Y, w, mask = mask, size = 0, skip = TRUE, softmax = TRUE,  : 
  too many (22220) weights

Best Answer

biglm::bigglm fits the model in under five minutes on my computer

mm.fit1 <- bigglm(assigned.category ~ category * group + age,data = binomial.df, family = binomial(link = 'logit'),tolerance=1e-5, maxit=20, chunksize=50000)
> summary(mm.fit1)
Large data regression model: bigglm(assigned.category ~ category * group + age, data = binomial.df, 
    family = binomial(link = "logit"), tolerance = 1e-05, maxit = 20, 
    chunksize = 50000)
Sample size =  1500000 
                        Coef    (95%     CI)     SE      p
(Intercept)          -8.0060 -9.0063 -7.0058 0.5001 0.0000
categoryc2            0.6935 -0.5315  1.9185 0.6125 0.2576
categoryc3            1.1794  0.0356  2.3232 0.5719 0.0392
categoryc4            1.5052  0.3994  2.6111 0.5529 0.0065
categoryc5            1.4480  0.3363  2.5597 0.5559 0.0092
categoryc6            2.1126  1.0534  3.1718 0.5296 0.0001
categoryc7            2.1999  1.1455  3.2543 0.5272 0.0000
categoryc8            2.2274  1.1744  3.2804 0.5265 0.0000
categoryc9            2.1426  1.0851  3.2001 0.5288 0.0001
categoryc10           2.0171  0.9522  3.0820 0.5324 0.0002
    [snip]
categoryc91:groupg2   0.0738 -2.1856  2.3331 1.1297 0.9479
categoryc92:groupg2  -0.2208 -2.4864  2.0449 1.1328 0.8455
categoryc93:groupg2  -0.0685 -2.3308  2.1939 1.1312 0.9517
categoryc94:groupg2   0.1083 -2.1485  2.3651 1.1284 0.9235
categoryc95:groupg2  -0.2480 -2.5135  2.0175 1.1328 0.8267
categoryc96:groupg2   0.0391 -2.2188  2.2971 1.1290 0.9723
categoryc97:groupg2  -0.1088 -2.3670  2.1494 1.1291 0.9232
categoryc98:groupg2   0.2224 -2.0352  2.4801 1.1288 0.8438
categoryc99:groupg2  -0.0700 -2.3329  2.1929 1.1315 0.9507
categoryc100:groupg2 -0.3352 -2.5989  1.9285 1.1319 0.7671

I get essentially the same coefficients at 10 iterations, but with a message saying the fit hasn't converged (note that the 1e-5 tolerance is $10^{-5}$ standard errors, so fairly tight).

Related Question