I have compositional data where for two groups, where each is represented by two ages, there are 100 possible categories for which I observed counts:
set.seed(1)
df <- data.frame(group = c(rep("g1",200),rep("g2",200)),
age = c(rep("a1",100),rep("a2",100),rep("a1",100),rep("a2",100)),
category = rep(paste0("c",1:100),4),
n = c(rmultinom(1,7000,pgamma(shape=0.8,rate=0.1,q=seq(0.01,1,0.01))),
rmultinom(1,5000,pgamma(shape=0.8,rate=0.3,q=seq(0.01,1,0.01))),
rmultinom(1,1800,pgamma(shape=0.5,rate=0.1,q=seq(0.01,1,0.01))),
rmultinom(1,1200,pgamma(shape=0.9,rate=0.1,q=seq(0.01,1,0.01)))),
stringsAsFactors = F)
I want to fit a regression model to the multinomial counts as a response and estimate the interaction effect
of the category * group
, while controlling for age
.
So far, I'm trying to use a multicategorical
glm
(with a binomial(link = 'logit')
), to a data.frame
where I transform the df$n
(total counts) to a binomial
(binary
) form:
binomial.df <- do.call(rbind,lapply(unique(df$group),function(g){
do.call(rbind,lapply(unique(dplyr::filter(df,group == g)$age),function(a){
do.call(rbind,lapply(unique(dplyr::filter(df,group == g)$category),function(t){
sum.non.category <- sum(dplyr::filter(df,group == g & age == a & category != t)$n)
sum.category <- sum(dplyr::filter(df,group == g & age == a & category == t)$n)
data.frame(group = g,age = a,category = t,assigned.category = c(rep(0,sum.non.category),rep(1,sum.category)))
}))
}))
}))
binomial.df$group <- factor(binomial.df$group, levels = c("g1","g2"))
binomial.df$age <- factor(binomial.df$age, levels = c("a1","a2"))
binomial.df$category <- factor(binomial.df$category, levels = paste0("c",1:100))
mm.fit <- glm(assigned.category ~ category * group + age,data = binomial.df, family = binomial(link = 'logit'))
Clearly for this size of data the glm
call will run for days or even longer, so I'm looking for a more tractable way.
Any idea?
BTW, I tried using nnet
's multinom
first:
df$group <- factor(df$group, levels = c("g1","g2"))
df$age <- factor(df$age, levels = c("a1","a2"))
df$category <- factor(df$category, levels = paste0("c",1:100))
mm.fit <- nnet::multinom(n ~ category * group + age, data=df)
But I get:
Error in nnet.default(X, Y, w, mask = mask, size = 0, skip = TRUE, softmax = TRUE, :
too many (22220) weights
Best Answer
biglm::bigglm
fits the model in under five minutes on my computerI get essentially the same coefficients at 10 iterations, but with a message saying the fit hasn't converged (note that the
1e-5
tolerance is $10^{-5}$ standard errors, so fairly tight).