Solved – predict() – multinomial logistic regression

multinomial-distribution, r

I come to you today because I face a huge problem that I cannot explain.

I have run a multinomial logistic regression (using the mlogit package) on behavioral data. I prepared the data by running

    mlogit <- mlogit.data(Merge, choice = "Choice", shape = "long", alt.var = "Comp", 
                          drop.index = TRUE)

on my Merge data, which gives me the following:

                Date     Time ActivityX ActivityY Temp Behavior Valley Age Month Year kid Individual Choice
    1.F   01/05/2012 00:00:00        80        58   10        F  Fuorn   8     5 2012   Y         26   TRUE
    1.R   01/05/2012 00:00:00        80        58   10        F  Fuorn   8     5 2012   Y         26  FALSE
    1.M   01/05/2012 00:00:00        80        58   10        F  Fuorn   8     5 2012   Y         26  FALSE
    1.RUN 01/05/2012 00:00:00        80        58   10        F  Fuorn   8     5 2012   Y         26  FALSE
    2.F   01/05/2012 00:05:00        90        76   10        F  Fuorn   8     5 2012   Y         26   TRUE
    2.R   01/05/2012 00:05:00        90        76   10        F  Fuorn   8     5 2012   Y         26  FALSE
    2.M   01/05/2012 00:05:00        90        76   10        F  Fuorn   8     5 2012   Y         26  FALSE
    2.RUN 01/05/2012 00:05:00        90        76   10        F  Fuorn   8     5 2012   Y         26  FALSE
    3.F   01/05/2012 00:10:00        51        47   10        M  Fuorn   8     5 2012   Y         26  FALSE
    3.R   01/05/2012 00:10:00        51        47   10        M  Fuorn   8     5 2012   Y         26  FALSE
    3.M   01/05/2012 00:10:00        51        47   10        M  Fuorn   8     5 2012   Y         26   TRUE
    3.RUN 01/05/2012 00:10:00        51        47   10        M  Fuorn   8     5 2012   Y         26  FALSE
    4.F   01/05/2012 00:15:00         0         0   10        R  Fuorn   8     5 2012   Y         26  FALSE
    4.R   01/05/2012 00:15:00         0         0   10        R  Fuorn   8     5 2012   Y         26   TRUE
    4.M   01/05/2012 00:15:00         0         0   10        R  Fuorn   8     5 2012   Y         26  FALSE
    4.RUN 01/05/2012 00:15:00         0         0   10        R  Fuorn   8     5 2012   Y         26  FALSE
    5.F   01/05/2012 00:20:00         0         0    9        R  Fuorn   8     5 2012   Y         26  FALSE
    5.R   01/05/2012 00:20:00         0         0    9        R  Fuorn   8     5 2012   Y         26   TRUE
    5.M   01/05/2012 00:20:00         0         0    9        R  Fuorn   8     5 2012   Y         26  FALSE
    5.RUN 01/05/2012 00:20:00         0         0    9        R  Fuorn   8     5 2012   Y         26  FALSE

Then I ran my regression:

    m1 <- mlogit(Choice ~ 1 | Temp + Valley + Age + kid + Month, mlogit)

and it gave me significant results:

                          Estimate  Std. Error  t-value  Pr(>|t|)    
    M:(intercept)      -4.2153e-01  5.7533e-02  -7.3268 2.358e-13 ***
    R:(intercept)       6.2325e-01  3.4958e-02  17.8284 < 2.2e-16 ***
    RUN:(intercept)    -1.2275e+01  4.0526e-01 -30.2895 < 2.2e-16 ***
    M:Temp              1.5371e-02  9.8680e-04  15.5764 < 2.2e-16 ***
    R:Temp             -3.9871e-02  6.7926e-04 -58.6975 < 2.2e-16 ***
    RUN:Temp           -4.4532e-02  6.8696e-03  -6.4825 9.023e-11 ***
    M:ValleyTrupchun   -3.6154e-01  1.6362e-02 -22.0968 < 2.2e-16 ***
    R:ValleyTrupchun   -4.0186e-02  9.7968e-03  -4.1020 4.096e-05 ***
    RUN:ValleyTrupchun  1.2895e+00  8.5357e-02  15.1066 < 2.2e-16 ***
    M:Age              -1.1026e-02  2.6902e-03  -4.0985 4.158e-05 ***
    R:Age               1.9465e-02  1.6479e-03  11.8119 < 2.2e-16 ***
    RUN:Age             5.5473e-02  1.6661e-02   3.3294 0.0008703 ***
    M:kidY              6.0686e-02  2.2638e-02   2.6807 0.0073460 ** 
    R:kidY             -4.1638e-01  1.2391e-02 -33.6024 < 2.2e-16 ***
    RUN:kidY            6.2311e-01  1.0410e-01   5.9854 2.158e-09 ***
    M:Month            -2.0466e-01  8.4448e-03 -24.2346 < 2.2e-16 ***
    R:Month             2.4148e-02  5.2317e-03   4.6157 3.917e-06 ***
    RUN:Month           9.8715e-01  5.6209e-02  17.5622 < 2.2e-16 ***

Those results were in line with what I expected from the literature, so I was quite happy.
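Since these estimates are log-odds relative to the reference behavior F, exponentiating them gives odds ratios, which are often easier to read. A minimal sketch, assuming the fitted model `m1` from above:

    # mlogit coefficients are log-odds relative to the reference
    # alternative (F here). exp() turns them into odds ratios, e.g.
    # exp(1.9465e-02) for R:Age means the odds of resting vs feeding
    # are multiplied by roughly 1.02 for each extra year of age.
    odds_ratios <- exp(coef(m1))
    round(odds_ratios, 3)

This only rescales the coefficients; it does not change which effects are positive or negative.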

My next step was to plot my results, and this is where I ran into trouble.

First of all, when I plot my original data and compare it with the results of my regression, I find some huge differences. For example, I plot the percentage of time spent in each behavior (M for moving, F for feeding, R for resting and RUN for running; in my regression F is the reference) as a function of age. The raw data show that the older an individual gets, the more it rests and the more it moves, but the estimates from my regression say that older individuals should rest more and move less. To summarize, my graph of the original data shows the opposite of what I got from the regression, at least for moving.

I don't know whether this is normal, in the sense that I am not sure my original data can be compared directly with the output of my regression: the coefficients describe how the odds of choosing a behavior over the reference change each time a variable grows by one unit, not the raw proportions themselves.
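To make that comparison concrete, the observed behavior shares can be computed directly from the long-format data. A sketch, assuming (as in the listing above) that `Choice` flags the chosen alternative and `Behavior` records it:

    # One row per choice situation: keep only the chosen alternative.
    chosen <- subset(Merge, Choice == TRUE)

    # Share of each behavior at each age (each row sums to 1).
    obs_shares <- prop.table(table(chosen$Age, chosen$Behavior), margin = 1)
    round(obs_shares, 2)

These shares are on the probability scale, so they are the right thing to plot against predicted probabilities, rather than against the raw coefficients.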

So I wanted to use the predict() function, but I don't know how to do that. I was hoping to get some help here.
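For the estimation sample itself, a sketch of how one might recover per-alternative probabilities from the fitted model (assuming `m1` and the long-format `Merge` data above, and that the row order of the fitted values matches the order of the choice situations):

    # With outcome = FALSE, fitted() returns one row per observation
    # and one column per behavior (F, M, R, RUN), instead of only the
    # probability of the chosen alternative.
    p_hat <- fitted(m1, outcome = FALSE)

    # Age at each choice situation, taken from the chosen rows.
    ages <- subset(Merge, Choice == TRUE)$Age

    # Average predicted probability of each behavior at each age.
    mean_p <- aggregate(as.data.frame(p_hat), by = list(Age = ages), FUN = mean)

    # Plot the age profiles of the four behaviors.
    matplot(mean_p$Age, mean_p[, -1], type = "l", lty = 1,
            xlab = "Age", ylab = "Mean predicted probability")
    legend("topright", legend = colnames(p_hat), lty = 1, col = 1:4)

For predictions at new covariate values, predict(m1, newdata = nd) should work, where nd is built with mlogit.data() in the same long format as the estimation data.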

Best Answer

When the estimates do not have the expected sign, the usual suspect is multicollinearity. Below is a passage from the Wikipedia page on multicollinearity.

The usual interpretation of a regression coefficient is that it provides an estimate of the effect of a one-unit change in an independent variable, $X_1$, holding the other variables constant. If $X_1$ is highly correlated with another independent variable, $X_2$, in the given data set, then we have a set of observations for which $X_1$ and $X_2$ have a particular linear stochastic relationship. We don't have a set of observations for which all changes in $X_1$ are independent of changes in $X_2$, so we have an imprecise estimate of the effect of independent changes in $X_1$.

You can find more about the adverse effects of multicollinearity, and strategies to fight it, by reading this question and that question.
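In this data set, Temp and Month are natural suspects, since temperature tracks the season, and Age may move together with Month or Year if the same individuals are followed over time. A quick sketch of how one might check, using the original Merge data:

    # Pairwise correlations among the numeric covariates of the model;
    # values close to +1 or -1 signal the collinearity described above.
    cor(Merge[, c("Temp", "Age", "Month")])

    # A more formal check: variance inflation factors from an auxiliary
    # linear model (requires the car package).
    library(car)
    vif(lm(Temp ~ Age + Month + Valley + kid, data = Merge))

If Temp and Month turn out to be strongly correlated, dropping one of them, or refitting the model with and without each and comparing the signs of the remaining coefficients, is a simple way to see how much the collinearity is driving the unexpected estimates.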