Solved – Unusual high Odds Ratio Values in multinomial logistic regression

logisticmlogitr

I was following the procedure in a statistics textbook to run a multinomial logistic regresion using mlogit. However, the Odds Ratios calculated seemed too high for some of the variables (>1000). Can someone take a look at this and check wether I am doing everything correctly? The data can be downloaded from here. I prepared the data with the following commands:

#read in the data
test<-read.csv(file="LogisticRegressionSample.csv",sep=",")
#trasnform data into the correct form for mlogit
mlogitData<-mlogit.data(test,choice="Outcome",shape="wide")
#build model
MLogitFit<-mlogit(Outcome~1|V1+V2+V3+V4+V5+V6+V7+V8,reflevel=3,data=mlogitData)
#summary of the model
summary(MLogitFit)
#OddsRatios
data.frame(exp(MLogitFit$coefficients))
# confidence Interval of the odds Ratios
exp(confint(MLogitFit))

The summary of mlogit gives me:

    Call:
mlogit(formula = Outcome ~ 1 | V1 + V2 + V3 + V4 + V5 + V6 + 
    V7 + V8, data = mlogitData, reflevel = 3, method = "nr", 
    print.level = 0)

Frequencies of alternatives:
      Z       A       B 
0.43333 0.25556 0.31111 

nr method
7 iterations, 0h:0m:0s 
g'(-H)^-1g = 1.56E-06 
successive function values within tolerance limits 

Coefficients :
               Estimate Std. Error t-value  Pr(>|t|)    
A:(intercept)  -6.74640    5.97451 -1.1292 0.2588147    
B:(intercept)  -7.12401    4.50350 -1.5819 0.1136759    
A:V1            3.65979    3.90808  0.9365 0.3490331    
B:V1            4.24363    3.25687  1.3030 0.1925822    
A:V2          -15.11554    6.92901 -2.1815 0.0291475 *  
B:V2           -4.88778    3.65249 -1.3382 0.1808302    
A:V3            1.71465    6.57907  0.2606 0.7943839    
B:V3            2.94335    3.96557  0.7422 0.4579497    
A:V4           -1.70660    1.58849 -1.0744 0.2826633    
B:V4           -1.67210    1.17575 -1.4222 0.1549820    
A:V5            1.18494    1.60760  0.7371 0.4610682    
B:V5            1.03084    1.25573  0.8209 0.4116971    
A:V6            8.28902    2.51631  3.2941 0.0009873 ***
B:V6            3.44578    1.91844  1.7961 0.0724727 .  
A:V7           -1.34395    2.67943 -0.5016 0.6159612    
B:V7            1.04803    1.95147  0.5370 0.5912343    
A:V8           -7.46263    4.12978 -1.8070 0.0707577 .  
B:V8            0.21861    2.13596  0.1023 0.9184810    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood: -64.636
McFadden R^2:  0.33149 
Likelihood ratio test : chisq = 64.1 (p.value = 1.0515e-07)

Running data.frame(exp(MLogitFit$coefficients)) to calculate the odds ratios gives:

              exp.MLogitFit.coefficients.
A:(intercept)                1.175103e-03
B:(intercept)                8.055280e-04
A:V1                         3.885310e+01
B:V1                         6.966040e+01
A:V2                         2.725226e-07
B:V2                         7.538147e-03
A:V3                         5.554743e+00
B:V3                         1.897938e+01
A:V4                         1.814819e-01
B:V4                         1.878524e-01
A:V5                         3.270504e+00
B:V5                         2.803423e+00
A:V6                         3.979917e+03
B:V6                         3.136764e+01
A:V7                         2.608125e-01
B:V7                         2.852036e+00
A:V8                         5.741439e-04
B:V8                         1.244345e+00

I obtained the confidence interavls with: exp(confint(MLogitFit)):

                     2.5 %       97.5 %
A:(intercept) 9.650816e-09 1.430830e+02
B:(intercept) 1.182216e-07 5.488637e+00
A:V1          1.831725e-02 8.241213e+04
B:V1          1.176881e-01 4.123248e+04
A:V2          3.446800e-13 2.154711e-01
B:V2          5.864847e-06 9.688857e+00
A:V3          1.394913e-05 2.211978e+06
B:V3          7.994348e-03 4.505896e+04
A:V4          8.066986e-03 4.082774e+00
B:V4          1.875058e-02 1.881996e+00
A:V5          1.400307e-01 7.638467e+01
B:V5          2.392271e-01 3.285238e+01
A:V6          2.870699e+01 5.517731e+05
B:V6          7.303065e-01 1.347282e+03
A:V7          1.366460e-03 4.978060e+01
B:V7          6.223884e-02 1.306918e+02
A:V8          1.752860e-07 1.880591e+00
B:V8          1.891518e-02 8.185990e+01

The predicted Probabilities are as following:

fitted(MLogitFit, outcome=FALSE)
                 Z            A          B
 [1,] 0.2790108926 3.880184e-01 0.33297074
 [2,] 0.5191458618 2.900625e-01 0.19079169
 [3,] 0.7263001933 1.633014e-02 0.25736966
 [4,] 0.8386056883 3.700203e-03 0.15769411
 [5,] 0.8050365007 7.487290e-03 0.18747621
 [6,] 0.7855655154 3.860347e-02 0.17583101
 [7,] 0.7878404896 7.992930e-03 0.20416658
 [8,] 0.8386056883 3.700203e-03 0.15769411
 [9,] 0.7878404896 7.992930e-03 0.20416658
[10,] 0.4363708036 2.827104e-01 0.28091885
[11,] 0.6126060746 3.320075e-02 0.35419317
[12,] 0.0274357267 8.418204e-01 0.13074390
[13,] 0.1438998597 5.869087e-01 0.26919146
[14,] 0.1850027820 2.105586e-01 0.60443858
[15,] 0.8427092407 5.933393e-03 0.15135737
[16,] 0.1537160539 4.929905e-01 0.35329341
[17,] 0.0434283140 6.358897e-01 0.32068201
[18,] 0.1868202029 1.141679e-01 0.69901186
[19,] 0.3064594418 1.156597e-01 0.57788084
[20,] 0.5737141160 6.734724e-02 0.35893865
[21,] 0.5841338911 1.374758e-01 0.27839031
[22,] 0.0866451414 4.019366e-01 0.51141821
[23,] 0.2794060013 9.964607e-02 0.62094793
[24,] 0.0252343516 7.343045e-01 0.24046118
[25,] 0.1314775919 4.602643e-01 0.40825811
[26,] 0.0274357267 8.418204e-01 0.13074390
[27,] 0.1303195991 6.649645e-01 0.20471586
[28,] 0.2818251202 4.896734e-01 0.22850146
[29,] 0.0063990341 8.874618e-01 0.10613917
[30,] 0.0002408527 9.742025e-01 0.02555668
[31,] 0.0523052465 7.073015e-01 0.24039322
[32,] 0.3287956423 2.756959e-01 0.39550841
[33,] 0.0419093705 7.521689e-01 0.20592173
[34,] 0.0523052465 7.073015e-01 0.24039322
[35,] 0.3287956423 2.756959e-01 0.39550841
[36,] 0.0100998700 7.475180e-01 0.24238212
[37,] 0.1609808596 2.268570e-01 0.61216212
[38,] 0.0119603037 8.065964e-01 0.18144331
[39,] 0.0697132279 4.549378e-01 0.47534896
[40,] 0.5756435353 6.315652e-02 0.36119994
[41,] 0.4689676672 6.796615e-02 0.46306619
[42,] 0.2652679745 6.358962e-02 0.67114240
[43,] 0.7870195702 2.038999e-03 0.21094143
[44,] 0.6438437943 9.222002e-03 0.34693420
[45,] 0.7462282258 5.881047e-04 0.25318367
[46,] 0.3532662528 2.193975e-01 0.42733620
[47,] 0.9563852795 4.133754e-05 0.04357338
[48,] 0.9079031419 2.786314e-03 0.08931054
[49,] 0.0220230156 8.017508e-01 0.17622619
[50,] 0.2268852285 1.745210e-01 0.59859376
[51,] 0.2268852285 1.745210e-01 0.59859376
[52,] 0.0751929214 6.261548e-01 0.29865225
[53,] 0.9426667411 4.520877e-06 0.05732874
[54,] 0.0212631471 6.729961e-01 0.30574075
[55,] 0.0212631471 6.729961e-01 0.30574075
[56,] 0.9218535421 1.166953e-02 0.06647693
[57,] 0.6374868816 3.856300e-02 0.32395012
[58,] 0.2920703240 2.410709e-01 0.46685876
[59,] 0.7047942848 1.728601e-02 0.27791970
[60,] 0.1850395244 5.297673e-01 0.28519316
[61,] 0.4402296785 8.870861e-03 0.55089946
[62,] 0.6781988218 3.852569e-04 0.32141592
[63,] 0.9889453179 4.036588e-05 0.01101432
[64,] 0.1618635354 8.011851e-02 0.75801796
[65,] 0.3008372801 9.835522e-02 0.60080750
[66,] 0.0740319347 4.284039e-01 0.49756417
[67,] 0.5529727485 1.768537e-01 0.27017351
[68,] 0.7824740564 5.001713e-03 0.21252423
[69,] 0.5343045050 5.865850e-02 0.40703700
[70,] 0.4564647083 1.733995e-01 0.37013579
[71,] 0.4711837972 8.449081e-03 0.52036712
[72,] 0.9154349308 2.364316e-02 0.06092191
[73,] 0.1858643216 2.217595e-01 0.59237621
[74,] 0.3770813535 9.943397e-02 0.52348468
[75,] 0.8124141650 3.243679e-04 0.18726147
[76,] 0.3195206223 2.932236e-01 0.38725578
[77,] 0.8615871019 5.063299e-04 0.13790657
[78,] 0.8615871019 5.063299e-04 0.13790657
[79,] 0.8254986241 2.059378e-03 0.17244200
[80,] 0.1208591778 4.615235e-01 0.41761730
[81,] 0.0035765650 9.093754e-01 0.08704806
[82,] 0.7583239965 3.544345e-02 0.20623255
[83,] 0.8141948591 5.016280e-03 0.18078886
[84,] 0.1204323818 2.545405e-01 0.62502710
[85,] 0.9594950290 3.694056e-05 0.04046803
[86,] 0.6858228916 1.691396e-01 0.14503752
[87,] 0.8254986241 2.059378e-03 0.17244200
[88,] 0.8254986241 2.059378e-03 0.17244200
[89,] 0.2463233530 2.793410e-01 0.47433568
[90,] 0.5674338104 1.448538e-02 0.41808081

To assess multicolinearity I calculated the VIF statistic but using the a glm model of the same dataset.

fullmod<-glm(as.factor(Outcome)~.,data=test,family=binomial())
vif(fullmod)
      V1       V2       V3       V4       V5       V6       V7       V8 
1.789116 1.822252 2.216444 1.320244 1.821820 1.439183 1.512865 1.121805

Best Answer

I don't think you have anything to worry about. V2 and V6 are very strong predictors, but those huge odds ratios are for a 1 unit change in the value of the predictor. However, the range of the predictor in both cases is << 1. Multiply the predictor variables by 10, or 100 (dividing the coefficients by 10 or 100), and suddenly it doesn't look problematic. Check the discussion Kjetil linked to above. Check the standard errors of those coefficients -- they are entirely reasonable and smaller than the coefficients. When you get complete separation in a binomial GLM those standard errors usually blow out to values much larger than the coefficients as well (Hauck-Donner effect, see pg 197 in Venables and Ripley MASS 4th edition). Finally, looking over the predicted probabilities all the numbers are reasonable; there are some very small probabilities for Alternative A for a couple of cases, but they are still larger than zero.
If it were me, I would next consider backwards elimination to see if some of the non-significant predictors can be removed.

Related Solutions

Solved – Calculating risk ratio using odds ratio from logistic regression coefficient

Zhang 1998 originally presented a method for calculating CIs for risk ratios suggesting you could use the lower and upper bounds of the CI for the odds ratio.

This method does not work, it is biased and generally produces anticonservative (too tight) estimates of the risk ratio 95% CI. This is because of the correlation between the intercept term and the slope term as you correctly allude to. If the odds ratio tends towards its lower value in the CI, the intercept term increases to account for a higher overall prevalence in those with a 0 exposure level and conversely for a higher value in the CI. Each of these respectively lead to lower and higher bounds for the CI.

To answer your question outright, you need a knowledge of the baseline prevalence of the outcome to obtain correct confidence intervals. Data from case-control studies would rely on other data to inform this.

Alternately, you can use the delta method if you have the full covariance structure for the parameter estimates. An equivalent parametrization for the OR to RR transformation (having binary exposure and a single predictor) is:

$$RR = \frac{1 + \exp(-\beta_0)}{1+\exp(-\beta_0-\beta_1)}$$

And using multivariate delta method, and the central limit theorem which states that $\sqrt{n} \left( [\hat{\beta}_0, \hat{\beta}_1] - [\beta_0, \beta_1]\right) \rightarrow_D \mathcal{N} \left(0, \mathcal{I}^{-1}(\beta)\right)$, you can obtain the variance of the approximate normal distribution of the $RR$.

Note, notationally this only works for binary exposure and univariate logistic regression. There are some simple R tricks that make use of the delta method and marginal standardization for continuous covariates and other adjustment variables. But for brevity I'll not discuss that here.

However, there are several ways to compute relative risks and its standard error directly from models in R. Two examples of this below:

x <- sample(0:1, 100, replace=T)
y <- rbinom(100, 1, x*.2+.2)
glm(y ~ x, family=binomial(link=log))
library(survival)
coxph(Surv(time=rep(1,100), event=y) ~ x)

http://research.labiomed.org/Biostat/Education/Case%20Studies%202005/Session4/ZhangYu.pdf

Solved – Improving Logistic Regression model’s summary output

As these data are based on employee records, you presumably have data on the time to quitting (length of employment), not just the fact of having quit. If so, this would be better modeled with survival analysis. Predicting the length of employment would seem to be of considerable value to the company.

Then the dependent variable is continuous, with those who haven't quit yet treated as "censored" observations. (We all do, eventually, end up leaving employment.)

Whether you model this as logistic or survival, you should carefully limit the number of variables under consideration or use a penalized method like LASSO or elastic net. The rule of thumb to avoid overfitting if you are not using a penalized method is to consider no more than one variable per 15 events. That would be the number who quit or otherwise left employment for survival analysis, or the smaller of those who quit/didn't quit for logistic (which, the more I think on it, seems less and less useful here). And in terms of the number of variables, each categorical variable counts as one less than the total number of categories (that's how many columns it contributes to the model matrix).

To make this concrete, say that 600 out of the 1252 cases represented people who left employment with the company. If you intend to do standard survival analysis, this rule of thumb means that you should enter no more than about 600/15=40 variables (columns of a model matrix) into your analysis, not the full model matrix with 224 columns. If only 300 people in your data set left employment, only 20 variables should be considered in standard survival analysis. The particular variables might best be selected based on your knowledge of the subject matter, or multiple correlated predictors might be combined into single predictors. If you need to evaluate more predictors than warranted by this rule of thumb you should use a penalized method.

Best Answer

Related Solutions

Solved – Calculating risk ratio using odds ratio from logistic regression coefficient

Solved – Improving Logistic Regression model’s summary output

Related Question