Solved – glmnet: How to make sense of multinomial parameterization

Tags: categorical-data, glmnet, multinomial-distribution

I have the following problem: I want to predict a categorical response variable from one (or more) categorical predictor variables using glmnet().

However, I cannot make sense of the output glmnet gives me.

Ok, first let's generate two related categorical variables:

Generate Data

p <- 2 #number variables
mu <- rep(0,p)
sigma <- matrix(rep(0,p^2), ncol=p)
sigma[1,2] <- .8 #some relationship ..
diag(sigma) <- 1
sigma <- pmax(sigma, t(sigma))
n <- 100
set.seed(1)
library(MASS)
dat <- mvrnorm(n, mu, sigma)
#discretize
k <- 3 # number of categories
d <- apply(dat, 2, function(x) {
  q <- quantile(x, probs=seq(0,1, 1/k))[-c(1, k+1)]
  out <- numeric(length(x))
  for(i in 1:(k-1))
  {  out[x<q[k-i]] <- i } 
  return(out)
})
d <- data.frame(apply(d, 2, as.factor))
d[,2] <- relevel(d[,2], ref="0")
d[,1] <- relevel(d[,1], ref="0")
colnames(d) <- c("X1", "X2")

We get:

> table(d)
   X2
X1   0  1  2
  0 22 11  1
  1  9 14 10
  2  3  8 22

Prediction: multinom()

Then let's predict X1 from X2 using multinom() from the nnet package:

library(nnet)
mod1 <- multinom(X1~X2, data=d)
mod1

which gives us:

Call:
multinom(formula = X1 ~ X2, data = d)

Coefficients:
  (Intercept)      X21      X22
1  -0.8938246 1.134993 3.196476
2  -1.9924124 1.673949 5.083518

Manual check

Now let's check whether we can reproduce that manually:

tb <- table(d)
log(tb[2,1] / tb[1,1]) #intercept category1
[1] -0.8938179
log(tb[3,1] / tb[1,1]) #intercept category2
[1] -1.99243
log((tb[1,1]*tb[2,2]) / (tb[1,2]*tb[2,1])) #logodds-ratio cat X1 0vs1 in X2 0vs1
[1] 1.13498
#same for the three remaining log odds ratios
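#spelled out (sketch; each should match the corresponding multinom() coefficient above)
log((tb[1,1]*tb[2,3]) / (tb[1,3]*tb[2,1])) #logodds-ratio X1 0vs1 in X2 0vs2: ~3.196 (X22, cat 1)
log((tb[1,1]*tb[3,2]) / (tb[1,2]*tb[3,1])) #logodds-ratio X1 0vs2 in X2 0vs1: ~1.674 (X21, cat 2)
log((tb[1,1]*tb[3,3]) / (tb[1,3]*tb[3,1])) #logodds-ratio X1 0vs2 in X2 0vs2: ~5.084 (X22, cat 2)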

We reproduce the same numbers, good!

Prediction: glmnet()

Now let's do the same with glmnet:

library(glmnet)
y <- d[,1]
X <- model.matrix(X1~X2, data=d)[,-1]
mod2 <- glmnet(X, y, family="multinomial", lambda=c(0))
coef(mod2, s=0) #meaning of coefficients unclear!
$`0`
3 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept)  0.9620216
X21         -1.1349130
X22         -3.1958293   

$`1`
3 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept) 0.06825755
X21         .         
X22         .         

$`2`
3 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept) -1.0302792
X21          0.5388814
X22          1.8870363

Note that I set s=0, so there is no regularisation, and the parameters should contain exactly the same information as the parameters of the multinom() fit.

Still, we get very different parameters. This is due to the different parameterization glmnet uses; see e.g.:

http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html (heading: Multinomial models)
or the corresponding paper:
http://www.jstatsoft.org/v33/i01/paper (heading: 4. Regularized multinomial regression)

But no matter exactly how one parameterizes the model, one should get the same $P(Y=k | X)$, the probability of category $k$ conditional on $X$.
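As far as I understand the links above, multinom() uses the baseline-category parameterization (here with category 0 as reference, whose coefficients are fixed at zero),

$$P(Y=k \mid x) = \frac{\exp(\beta_{0k} + x^\top \beta_k)}{1 + \sum_{l \neq 0} \exp(\beta_{0l} + x^\top \beta_l)},$$

whereas glmnet uses the symmetric form

$$P(Y=k \mid x) = \frac{\exp(\beta_{0k} + x^\top \beta_k)}{\sum_{l} \exp(\beta_{0l} + x^\top \beta_l)},$$

in which a coefficient vector is estimated for every category; this form is only identified up to adding a constant vector to all categories' coefficients, an indeterminacy which the penalty (for $\lambda > 0$) resolves.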

Conditional probabilities: multinom()

So first I calculate these probabilities from multinom():

p.fit <- predict(mod1, type="probs")
head(d)
head(p.fit)
ccp <- matrix(0,3,3)
ccp[,3] <- p.fit[1,]
ccp[,2] <- p.fit[2,]
ccp[,1] <- p.fit[4,]
ccp
           [,1]      [,2]       [,3]
[1,] 0.64705896 0.3333332 0.03030114
[2,] 0.26470416 0.4242450 0.30303140
[3,] 0.08823688 0.2424218 0.66666746
colSums(ccp) #sum to 1, ok; sorry for the awful code ...
[1] 1 1 1

As we have a saturated model here, these should be the same as what we can calculate directly from the data:

emp <- table(d)/100
cemp <- apply(emp, 2, function(x) {
  x / sum(x)
})
cemp 
   X2
             0         1          2
  0 0.64705882 0.3333333 0.03030303
  1 0.26470588 0.4242424 0.30303030
  2 0.08823529 0.2424242 0.66666667

which is indeed the case.

Conditional probabilities: glmnet()

Now the same from glmnet:

c1 <- coef(mod2, s=0)
c <- matrix(rapply(c1, function(x) as.matrix(x), how="unlist"), 3, 3, byrow=TRUE)

ccp2 <- matrix(0,3,3)
config <- rbind(c(0,0), c(1,0), c(0,1))

for(l in 1:3) #loop through categories
{
  denom <- numeric(3)
  for(i in 1:3) # loop through possible predictor combinations
  { 
    x1 <- config[i, 1]
    x2 <- config[i, 2]
    denom[i] <- exp(c[l,1] + x1 * c[l,2]  + x2 * c[l,3])
  }
  ccp2[l,1] <- denom[1] / sum(denom)
  ccp2[l,2] <- denom[2] / sum(denom)
  ccp2[l,3] <- denom[3] / sum(denom)
}
ccp2
          [,1]      [,2]       [,3]
[1,] 0.7340082 0.2359470 0.03004484
[2,] 0.3333333 0.3333333 0.33333333
[3,] 0.1073668 0.1840361 0.70859708
colSums(ccp2)
[1] 1.1747083 0.7533165 1.0719753

The cell conditional probabilities are somewhat related, but different. Also, they do not sum to one.

So we have two problems here:

a) the conditional probabilities do not sum to 1, and

b) the parameters do not describe what we see in the data: e.g. in row 2 there are differences across the columns, but glmnet estimates both slope coefficients (though not the intercept) as zero.

I used a linear regression problem and compared glm and glmnet with s=0 to make sure that s=0 means zero regularisation (the solutions were almost identical).
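For reference, a minimal sketch of that sanity check (my reconstruction; the data here are purely illustrative):

set.seed(2)
x.lin <- matrix(rnorm(200), ncol=2)
y.lin <- drop(1 + x.lin %*% c(2, -1) + rnorm(100))
coef(glm(y.lin ~ x.lin))                                 #unpenalised fit
coef(glmnet(x.lin, y.lin, family="gaussian", lambda=0))  #should be nearly identical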

Any help & ideas would be highly appreciated!

Best Answer

Regarding the parameters from multinom and glmnet, I found this answer helpful: Can I use glm algorithms to do a multinomial logistic regression?

especially, "Yes, with a Poisson GLM (log linear model) you can fit multinomial models. Hence multinomial logistic or log linear Poisson models are equivalent."

So I'll show the reparametrization of the glmnet coefficients to multinom coefficients.

n.subj <- 1000
x1 <- rnorm(n.subj)
x2 <- rnorm(n.subj)
prob <- matrix(c(rep(1, n.subj), exp(3 + 2*x1 + x2), exp(-1 + x1 - 3*x2)), ncol=3)
prob <- sweep(prob, 1, apply(prob, 1, sum), "/")

y <- c()
for (i in 1:n.subj)
  y[i] <- sample(3, 1, replace = TRUE, prob = prob[i,])

multinom(y~x1+x2)

x <- cbind(x1,x2); y2 <- factor(y)
fit <- glmnet(x, y2, family="multinomial", lambda=0, type.multinomial="grouped")
cf <- coef(fit)

cf[[2]]@x - cf[[1]]@x   # coefficients for category 2 (relative to category 1, the multinom reference)
cf[[3]]@x - cf[[1]]@x   # coefficients for category 3 (relative to category 1)
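As a check on problem (a) from the question, a sketch (refitting multinom() here, since the call above was not stored): both models should give essentially the same predicted probabilities, because the softmax is invariant to shifting every category's coefficients by the same constant.

m1 <- multinom(y ~ x1 + x2)                        #refit and store
p.multinom <- predict(m1, type="probs")
p.glmnet <- drop(predict(fit, newx=x, type="response", s=0))
max(abs(p.glmnet - p.multinom))                    #should be very small (different optimizers)
head(rowSums(p.glmnet))                            #each row sums to 1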

Hope this helps. But I don't think I understand the equivalence of the Poisson generalized linear (log-linear) model and the multinomial logistic model inside and out.

Tell me if there's a good, readable, and "easily" understandable source.