Solved – Fit a multinomial model using proportions instead of raw data

Tags: asymptotics, model, modeling, multinomial-distribution, regression

I want to fit an unordered multinomial model to my data. The dependent variable has 3 categories (success, partial success, fail), and I have one binary predictor and two continuous predictors (raw measures of length and speed). I have thousands of observations from a hunting video game in which players may kill the prey, hurt it, or fail, and the prey varies in speed and in the size of its vital body parts. However, it would be most helpful for me to express the predictors as ratios (length / total length, or speed / maximum speed) rather than as raw values. I had understood that using proportions is not good practice for linear regression, but I could not find any guidance for an unordered multinomial model like mine.

All I need to know is whether I can use ratios instead of raw values as predictors in an unordered multinomial model, and why.

Here is a reproducible example:

Data <- data.frame(
  X = sample(1:100),
  D = sample(1:100),
  Y = sample(c("yes", "no"), 10, replace = TRUE),
  Z = sample(c("body", "tail", "fail"), 10, replace = TRUE))


require(nnet)

test <- multinom(Z ~ Y + X + D + X:Y + D:X + D:Y, data = Data)
summary(test)
z <- summary(test)$coefficients / summary(test)$standard.errors

relativize <- function(x) x / max(x)
Data$X <- relativize(Data$X)
Data$D <- relativize(Data$D)
test1 <- multinom(Z ~ Y + X + D + X:Y + D:X + D:Y, data = Data)
z1 <- summary(test1)$coefficients / summary(test1)$standard.errors

z   # Wald z values for unscaled data
z1  # Wald z values for scaled data
(1 - pnorm(abs(z), 0, 1)) * 2   # two-sided z-test p values, unscaled
(1 - pnorm(abs(z1), 0, 1)) * 2  # two-sided z-test p values, scaled
AIC(test, test1)
confint(test)
confint(test1)

As you can see, I get nearly identical AIC values, but the effects and significance of the terms in the model change hugely! This behavior does not come from my specific dataset; it can be reproduced with any other. Which model is the correct one?

In the original dataset, when using raw variables I get several confidence intervals that do not pass through zero (corresponding to the significant terms in the model). Some of these intervals are very narrow and come very close to zero; the most extreme example is (0.001, 7.16E-03). When using scaled data, all of those narrow, near-zero intervals become non-significant (i.e. they pass through zero). The question remains: which one is correct? I was tempted to think that when coefficients are so close to zero, the change in outcome odds produced by the associated terms probably has minor biological importance. However, I am unsure whether that impression simply comes from the units of my model's terms (i.e. number of pixels).

Best Answer

Even after the additional information in the comments (now included in the EDIT), the question is not entirely clear. At first I read it as if by "proportions" you meant the count response vector expressed as percentages. But you write "and two other continuous predictors (raw measures for length and speed). However, for me it would be most helpful to use proportions (length / total length, or speed / maximum speed) rather than using raw data. I know that using proportions is not good practice for linear regressions, but ...", so perhaps you are really asking whether you can express length and speed as ratios (a better word here than proportions) relative to the maximum observed in the data. If that reading is correct, there is no problem in using them as predictors. You say you have heard that doing so is bad practice in linear regression; I think that must be a misunderstanding. Using ratios as response variables is often a bad idea, but there is no problem in using them as predictors.

Moreover, I read you as dividing, say, $\text{length}$ by $\text{length}_\text{max}$, where the maximum is taken over the complete sample. That is effectively just a linear transformation of the predictor, and it will only change the model by multiplying the corresponding coefficient by a constant (for both linear and multinomial regression), which is certainly not a problem. The problems with using a ratio as a response variable arise when both numerator and denominator vary over the sample. Please comment if I have understood you correctly, and please edit your OP to clarify.
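To see why dividing a predictor by a sample constant is harmless, here is a minimal sketch with a linear model (the variable names and numbers are illustrative, not from the question): the slope is multiplied by the constant, and the fitted model is unchanged.

```r
# Dividing a predictor by its sample maximum (a constant) only rescales
# the corresponding coefficient; fitted values are identical.
set.seed(1)
x <- runif(50, 10, 200)                 # raw lengths, e.g. in pixels
y <- 2 + 0.03 * x + rnorm(50, sd = 0.5)

fit_raw    <- lm(y ~ x)
fit_scaled <- lm(y ~ I(x / max(x)))

# The scaled-model slope equals the raw slope times max(x) ...
coef(fit_scaled)[2] / coef(fit_raw)[2]
# ... and the fitted values (hence the model itself) are unchanged:
all.equal(fitted(fit_raw), fitted(fit_scaled))
```

The same invariance holds per outcome category in a multinomial model, since each linear predictor transforms in exactly the same way.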

EDIT

This is to answer what you say in the last comment, but first a look at the R code above. Note that:

> with(Data, matrix(Z,10,10))
      [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]   [,10] 
 [1,] "body" "body" "body" "body" "body" "body" "body" "body" "body" "body"
 [2,] "tail" "tail" "tail" "tail" "tail" "tail" "tail" "tail" "tail" "tail"
 [3,] "tail" "tail" "tail" "tail" "tail" "tail" "tail" "tail" "tail" "tail"
 [4,] "fail" "fail" "fail" "fail" "fail" "fail" "fail" "fail" "fail" "fail"
 [5,] "fail" "fail" "fail" "fail" "fail" "fail" "fail" "fail" "fail" "fail"
 [6,] "tail" "tail" "tail" "tail" "tail" "tail" "tail" "tail" "tail" "tail"
 [7,] "body" "body" "body" "body" "body" "body" "body" "body" "body" "body"
 [8,] "fail" "fail" "fail" "fail" "fail" "fail" "fail" "fail" "fail" "fail"
 [9,] "fail" "fail" "fail" "fail" "fail" "fail" "fail" "fail" "fail" "fail"
[10,] "fail" "fail" "fail" "fail" "fail" "fail" "fail" "fail" "fail" "fail"
> with(Data, matrix(Y,10,10))
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10]
 [1,] "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes"
 [2,] "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes"
 [3,] "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no" 
 [4,] "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes"
 [5,] "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes"
 [6,] "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no" 
 [7,] "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no" 
 [8,] "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no" 
 [9,] "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes"
[10,] "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes"

Note that all the rows are identical: you have effectively drawn an independent sample of size 10 and then repeated it ten times, because data.frame recycles the length-10 vectors Y and Z to match the length-100 columns. You should really explain why. That gives data far removed from the independence assumption behind the multinomial model you are fitting. Also, for the variables X and D you are only randomly permuting 1:100, not sampling from it. Why?
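The recycling is easy to demonstrate in isolation (a stripped-down sketch of the question's data construction):

```r
# data.frame() silently recycles a length-10 vector to match a
# length-100 column, so the "100 observations" of Y are really
# 10 values repeated ten times.
set.seed(42)
d <- data.frame(
  X = sample(1:100),                              # a permutation, not a sample
  Y = sample(c("yes", "no"), 10, replace = TRUE)  # length 10, recycled 10x
)
nrow(d)                          # 100 rows
all(d$Y == rep(d$Y[1:10], 10))   # TRUE: the first 10 values just repeat
```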

Then you fit the multinomial model with the multinom function from the nnet package. That uses a neural-net fitting algorithm with random starts, which does not (in principle) guarantee an identical solution when called multiple times. For the multinomial likelihood it seems to work well in practice, but there are numerical issues; note this from the help page:

Details

multinom calls nnet. The variables on the rhs of the formula should be roughly scaled to [0,1] or the fit will be slow or may not converge at all.

So the differences between your two model fits, before and after scaling (which are small), are entirely due to numerical problems. Since predictor variables between 0 and 1 are preferred, you should probably trust the fit after scaling more.
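Dividing by the maximum, as in the question's relativize, already achieves this; full min-max scaling is an alternative sketch (the helper name rescale01 is mine, not from any package):

```r
# Rescale all numeric predictors to [0, 1] before calling multinom(),
# as the nnet help page recommends.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

d <- data.frame(X = sample(1:100), D = sample(1:100))
d[] <- lapply(d, function(col) if (is.numeric(col)) rescale01(col) else col)
range(d$X)   # 0 to 1
```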

There are newer, and maybe better, implementations of multinomial regression in R today; see https://cran.r-project.org/web/packages/mnlogit/vignettes/mnlogit.pdf

EDIT

This is a (partial) answer to the question in the comments, after adding confint to the code. (Note that confint on a multinom object gives standard Wald confidence intervals, with variances obtained by a call to vcov on the model object; it does not use likelihood profiling.) See my code below, but first note that all your confidence intervals contain zero, so you should be very careful interpreting a model in which no coefficients are significant!

> ci  <-  confint(test)
> ci1  <-  confint(test1)
> ci1
, , fail

                2.5 %   97.5 %
(Intercept) -1.875869 4.369025
Yyes        -3.483587 2.842708
X           -6.335916 4.391166
D           -6.013250 2.857671
Yyes:X      -2.644593 6.113461
X:D         -6.624659 7.545904
Yyes:D      -2.506965 5.674141

, , tail

                2.5 %    97.5 %
(Intercept) -1.637342  4.819822
Yyes        -4.808154  1.815249
X           -5.621161  5.660411
D           -7.719668  2.117075
Yyes:X      -3.403196  5.970511
X:D         -5.958546 10.839393
Yyes:D      -4.945901  4.090463


To locate where the differences are, do

> ci1 - ci
, , fail

                    2.5 %        97.5 %
(Intercept)  9.018376e-05  8.355338e-05
Yyes        -5.969440e-05 -6.214081e-05
X           -9.626431e-01 -9.626696e-01
D           -1.562012e+00 -1.562010e+00
Yyes:X       1.717088e+00  1.717102e+00
X:D          4.605740e-01  4.605791e-01
Yyes:D       1.567759e+00  1.567741e+00

, , tail

                    2.5 %        97.5 %
(Intercept) -2.712526e-05 -4.407901e-05
Yyes         3.544498e-06 -7.020956e-06
X            1.951815e-02  1.933554e-02
D           -2.773252e+00 -2.773311e+00
Yyes:X       1.270888e+00  1.270759e+00
X:D          2.440282e+00  2.440077e+00
Yyes:D      -4.233814e-01 -4.235042e-01

Observe that the noticeable differences appear from the row for D onwards. But even where the differences are largest, both confidence intervals contain zero, so the interpretation is the same. You should be careful interpreting non-significant coefficients! The only caveat is that these asymptotic confidence intervals may be imprecise; to check, one could construct profile-likelihood confidence intervals in this case as well. That is for later (or for you ...)
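For completeness, the Wald intervals that confint returns can be reproduced by hand from coef and vcov; a minimal sketch on simulated data (the variable names and model here are illustrative, not the question's):

```r
# confint() on a multinom fit builds Wald intervals from the coefficient
# estimates and the estimated covariance matrix: estimate +/- z_{0.975} * SE.
library(nnet)
set.seed(7)
d <- data.frame(
  x = runif(200),
  z = factor(sample(c("body", "tail", "fail"), 200, replace = TRUE)))
fit <- multinom(z ~ x, data = d, trace = FALSE)

est <- as.vector(t(coef(fit)))   # coefficients in vcov()'s ordering
se  <- sqrt(diag(vcov(fit)))
manual <- cbind(est - qnorm(0.975) * se,
                est + qnorm(0.975) * se)

ci <- confint(fit)   # 3-d array: terms x bounds x (non-reference) outcomes
# manual rows 1:2 match ci[, , 1]; rows 3:4 match ci[, , 2]
```

No profiling is involved, which is why these intervals can be imprecise far from the asymptotic regime.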
