Alright, you can hit me now - after a couple of days of thinking I figured out the answer myself - but in case someone runs into the same problem, please read on:
The answer is very (!) simple. Since this method is called proportional odds logistic regression, the coefficients are of course the same for every level of the dependent variable, and you get two thresholds for three DV levels. (The thresholds are the negative intercepts, hence the minus signs below.)
You just do this:
# Level 1: P(Y = 1) = P(Y <= 1) = 1 - exp(eta - zeta_1) / (1 + exp(eta - zeta_1))
log_pred_probs1 <- 1 - (exp(-logit_model$zeta[1] +
  logit_model$coefficients[1] * IV_1 + logit_model$coefficients[2] * IV_2) /
  (1 + exp(-logit_model$zeta[1] + logit_model$coefficients[1] * IV_1 +
  logit_model$coefficients[2] * IV_2)))

# Level 2: P(Y = 2) = P(Y <= 2) - P(Y <= 1)
log_pred_probs2 <- 1 - (exp(-logit_model$zeta[2] +
  logit_model$coefficients[1] * IV_1 + logit_model$coefficients[2] * IV_2) /
  (1 + exp(-logit_model$zeta[2] + logit_model$coefficients[1] * IV_1 +
  logit_model$coefficients[2] * IV_2))) - log_pred_probs1
Notice the two thresholds (zeta[1] and zeta[2]) AND the subtraction of the first level's probability!
and finally
log_pred_probs3 <- 1 - log_pred_probs2 - log_pred_probs1
It's as easy as that. Cheers!
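For anyone who wants to sanity-check the arithmetic outside R, here is a minimal sketch of the same cumulative-logit computation in Python/NumPy. The threshold and coefficient values below are made up purely for illustration; in practice they would come from the fitted model (the zeta and coefficients shown above):

    import numpy as np

    # Hypothetical fitted values: two thresholds (zeta) for three DV levels,
    # one coefficient per predictor, and a single observation (IV_1, IV_2).
    zeta = np.array([-0.5, 1.2])
    beta = np.array([0.8, -0.3])
    x = np.array([1.5, 2.0])

    eta = x @ beta                          # linear predictor
    cum = 1 / (1 + np.exp(-(zeta - eta)))   # P(Y <= 1), P(Y <= 2)

    p1 = cum[0]                 # P(Y = 1)
    p2 = cum[1] - cum[0]        # P(Y = 2), note the subtraction
    p3 = 1 - cum[1]             # P(Y = 3)
    print(p1, p2, p3, p1 + p2 + p3)  # the three probabilities sum to 1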
As the probabilities of the classes must sum to one, we can either define n-1 independent coefficient vectors, or n coefficient vectors that are linked by the equation \sum_c P(y=c) = 1. The two parametrizations are equivalent.
See also the Wikipedia article Multinomial logistic regression, section "As a log-linear model".
For a class c, we have a probability P(y=c) = e^{b_c \cdot X} / Z, with Z a normalization that accounts for the constraint \sum_c P(y=c) = 1.
These probabilities are the expected probabilities of a class given the coefficients. In scikit-learn they can be computed with predict_proba.
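As a sanity check, here is a short sketch (with made-up data) showing that, for the multinomial (softmax) model that recent scikit-learn versions fit by default on multiclass problems, predict_proba is exactly the normalized e^{b_c \cdot X} / Z built from coef_ and intercept_:

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.linear_model import LogisticRegression

    # Hypothetical 3-class toy problem with 2 features.
    X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

    # Recent scikit-learn fits the multinomial (softmax) model for multiclass y;
    # older versions defaulted to one-vs-rest, where this check would not hold.
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Scores b_c . x + I_c for every class, then the softmax normalization Z.
    scores = X @ clf.coef_.T + clf.intercept_
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

    print(np.allclose(probs, clf.predict_proba(X)))  # True for the multinomial model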
To get better insight into the coefficients, consider the left plot in this example: http://scikit-learn.org/dev/_images/plot_logistic_multinomial_001.png
In this example there are 3 classes a, b, c and 2 features x0, x1. The class label is denoted y.
After fitting a multinomial logistic regression, each class has a coefficient vector C with 2 components (one per feature): (C_a0, C_a1), (C_b0, C_b1), (C_c0, C_c1). There is also an intercept (aka bias) I for each class, which is always one-dimensional: I_a, I_b, I_c.
The dashed lines represent the hyperplanes defined by C and I: for example, for class a, the hyperplane is defined by the equation x0 * C_a0 + x1 * C_a1 + I_a = 0. This is the hyperplane where P(y=a) = e^{x0 * C_a0 + x1 * C_a1 + I_a} / Z = 1 / Z.
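A quick numeric sketch of that last statement, using made-up coefficients (not the ones from the plot): for any point lying on class a's hyperplane, the unnormalized score is e^0 = 1, so P(y=a) = 1/Z.

    import numpy as np

    # Hypothetical coefficient vectors and intercepts for classes a, b, c (2 features).
    C = np.array([[ 1.0, -0.5],      # (C_a0, C_a1)
                  [-0.3,  0.8],      # (C_b0, C_b1)
                  [-0.7, -0.3]])     # (C_c0, C_c1)
    I = np.array([0.2, -0.1, -0.1])  # I_a, I_b, I_c

    # Pick a point on class a's hyperplane: x0 * C_a0 + x1 * C_a1 + I_a = 0.
    x0 = 1.0
    x1 = -(C[0, 0] * x0 + I[0]) / C[0, 1]
    x = np.array([x0, x1])

    scores = C @ x + I
    Z = np.exp(scores).sum()
    print(np.exp(scores[0]) / Z, 1 / Z)  # both values are equal: P(y=a) = 1/Z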
If C_a0 is positive, when x0 increases P(y=a) increases. If C_a0 is negative, when x0 increases P(y=a) decreases.
However, this is not the decision boundary. The decision boundary between classes a and b is defined by the equation P(y=a) = P(y=b), which is e^{x0 * C_a0 + x1 * C_a1 + I_a} = e^{x0 * C_b0 + x1 * C_b1 + I_b}, or again x0 * C_a0 + x1 * C_a1 + I_a = x0 * C_b0 + x1 * C_b1 + I_b.
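Here is a small sketch of that equation, again with made-up values (the names C_a, C_b, I_a, I_b are just hypothetical placeholders): it picks a point on the a/b boundary and checks that the two unnormalized scores, and hence the two probabilities, are equal.

    import numpy as np

    # Hypothetical coefficient vectors and intercepts for classes a and b (2 features).
    C_a, I_a = np.array([1.0, -0.5]), 0.2
    C_b, I_b = np.array([-0.3, 0.8]), -0.1

    # Boundary: x . C_a + I_a = x . C_b + I_b  <=>  x . (C_a - C_b) + (I_a - I_b) = 0
    x0 = 2.0
    x1 = -((C_a[0] - C_b[0]) * x0 + (I_a - I_b)) / (C_a[1] - C_b[1])
    x = np.array([x0, x1])

    # Equal scores mean equal e^{score}, hence P(y=a) = P(y=b) after dividing by Z.
    print(x @ C_a + I_a, x @ C_b + I_b)  # the two scores are equal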
This boundary hyperplane is visible in the plot as the border between the background colors. If C_a0 - C_b0 is positive, when x0 increases P(y=a) / P(y=b) increases, since P(y=a) / P(y=b) = e^{x0 * (C_a0 - C_b0) + x1 * (C_a1 - C_b1) + (I_a - I_b)}. If C_a0 - C_b0 is negative, when x0 increases P(y=a) / P(y=b) decreases.
Best Answer
The library creates $c$ neurons for $c$ classes when $c>2$, which yields $c\times f$ coefficients and $c$ biases, where $f$ is the number of features. So it's like a collection of logistic regressions, or like a single-layer neural network.
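For instance, in scikit-learn (the answer above doesn't name the library, so this is just one concrete illustration with made-up data), the fitted coefficient and bias arrays have exactly those $c\times f$ and $c$ shapes:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Hypothetical data with c = 4 classes and f = 3 features.
    X, y = make_classification(n_samples=400, n_features=3, n_informative=3,
                               n_redundant=0, n_classes=4, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X, y)

    print(clf.coef_.shape)       # (4, 3): c x f coefficients
    print(clf.intercept_.shape)  # (4,):   c biases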