Solved – Multinomial logistic regression: NaNs produced in R and no significant variables

I have a data frame with 1200 observations and 30 variables, and I am trying to fit a multinomial logistic regression with multinom() to explain the voting intentions of Tunisian citizens. My dependent variable has 10 levels.
When I ran multinom(), I got this warning:

Warning messages: 1: In sqrt(diag(vc)) : NaNs produced

So I reduced the number of predictor variables to 13 and the levels of my dependent variable to only 3. The warning message no longer appears, but when I calculate the p-values, most of my predictor variables are not significant.

    > str(k)
    'data.frame':   1081 obs. of  19 variables:
     $ URBRUR    : Factor w/ 2 levels "Rural","Urban": 2 2 2 2 2 2 2 2 2 2 ...
     $ REGION    : Factor w/ 24 levels "Ariana","Beja",..: 23 23 23 23 23 23 23 23 23 23 ...
     $ classe_age: Factor w/ 5 levels "60 ans et plus",..: 3 5 1 1 3 1 5 4 1 2 ...
     $ Q3A       : Factor w/ 5 levels "Fairly bad","Fairly good",..: 2 1 1 4 4 4 2 4 1 3 ...
     $ Q3B       : Factor w/ 5 levels "Fairly bad","Fairly good",..: 2 1 1 3 1 4 2 4 1 3 ...
     $ Q7        : Factor w/ 2 levels "Going in the right direction",..: 1 2 2 2 2 2 2 2 2 1 ...
     $ Q14       : Factor w/ 4 levels "Not at all interested",..: 4 3 3 2 3 3 3 3 3 4 ...
     $ Q27       : Factor w/ 9 levels "Did not vote for some other reason",..: 6 6 6 6 6 3 6 6 6 1 ...
     $ Q46A      : num  9 5 8 0 3 3 4 5 0 3 ...
     $ Q63PT1    : Factor w/ 8 levels " Services gouvernementaux",..: 5 5 4 4 4 4 5 4 4 5 ...
     $ Q89A      : Factor w/ 9 levels "Non","Oui, autre",..: 7 1 1 8 5 1 1 1 1 1 ...
     $ Q96       : Factor w/ 3 levels "No (looking)",..: 3 2 2 2 1 2 2 3 2 1 ...
     $ Q96_ARB   : Factor w/ 9 levels "Agriculteur exploitant",..: 2 6 4 4 1 6 7 4 6 6 ...
     $ Q97       : Factor w/ 4 levels "Aucune éducation formelle ",..: 1 3 1 4 4 3 4 3 1 4 ...
     $ Q98B      : Factor w/ 4 levels "Not at all important",..: 4 4 4 4 3 4 4 4 4 4 ...
    # Multinomial logistic regression
    library(nnet)
    k$out <- relevel(k$Q99, ref = "Nahdha")
    fit <- multinom(out ~ URBRUR + REGION + classe_age + Q3A + Q3B + Q7 + Q14 + Q27 +
                      Q46A + Q63PT1 + Q96 + Q96_ARB + Q97 + Q98B,
                    data = k, maxit = 3000)

    s <- summary(fit)
    s
    # Two-sided Wald p-values from the z statistics
    z <- s$coefficients / s$standard.errors
    p <- 2 * pnorm(abs(z), lower.tail = FALSE)
    p

This is part of the R output:

                 (Intercept) URBRUR[T.Urban] REGION[T.Beja] REGION[T.Ben Arous]
    CPR            0.0000000       0.8006384     0.50724591           0.3490626
    Nahdha         0.6480962       0.9298628     0.09299337           0.2426325
    Nidaa Tounes   0.1547996       0.1210917     0.01340229           0.5486973
                 REGION[T.Bizerte] REGION[T.Gabes] REGION[T.Gafsa]
    CPR                  0.6667980      0.86525482      0.01971166
    Nahdha               0.2933951      0.03008731      0.05240173
    Nidaa Tounes         0.5154798      0.51222561      0.03301253
                 REGION[T.Jendouba] REGION[T.Kairouan] REGION[T.Kasserine]
    CPR                  0.21477728          0.4552543          0.53160327
    Nahdha               0.01548534          0.9322695          0.22102722
    Nidaa Tounes         0.06993081          0.7833111          0.09259959
                 REGION[T.Kebili] REGION[T.Le Kef] REGION[T.Mahdia]
    CPR                0.49607138        0.0000000        0.3084810
    Nahdha             0.09437504        0.6338189        0.1629434
    Nidaa Tounes       0.17968658        0.1360486        0.1955159

I'm sorry if this is a complicated question, but I would like an explanation for this issue.

    > table(k$out)

    Ne pas voter       Nahdha Nidaa Tounes
             307          292          266

Best Answer

The problem here is that you have many more predictors than you think you have. Each predictor factor with $k$ levels counts as $k-1$ variables in the model, so REGION alone counts as 23, and most of the others count as several too. With so many predictor variables, it is unlikely that any single level will add much predictive power over and above all the rest. Even with 1200 people, you are trying to fit a model for which you do not have sufficient data.
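To make the count concrete, here is a rough sketch that tallies the coefficients implied by the model formula in the question. The level counts are copied from the `str(k)` output, the number of outcome levels (3) and the class counts are taken from the question text and `table(k$out)`, and the variable names in the snippet itself (`levels_per_factor`, `cols_per_equation`, and so on) are made up for illustration.

```r
# Levels of each factor predictor in the formula, as listed by str(k)
levels_per_factor <- c(
  URBRUR = 2, REGION = 24, classe_age = 5, Q3A = 5, Q3B = 5, Q7 = 2,
  Q14 = 4, Q27 = 9, Q63PT1 = 8, Q96 = 3, Q96_ARB = 9, Q97 = 4, Q98B = 4
)
n_numeric <- 1                                    # Q46A is the only numeric predictor

# A factor with k levels contributes k - 1 dummy columns
dummy_cols <- sum(levels_per_factor - 1)          # 71
cols_per_equation <- 1 + n_numeric + dummy_cols   # intercept + numeric + dummies = 73

# Assumption: 3 outcome levels, so 2 equations against the reference level
n_outcome_levels <- 3
total_coef <- cols_per_equation * (n_outcome_levels - 1)   # 146

# Observations with a recorded outcome, from table(k$out)
n_used <- 307 + 292 + 266                         # 865
obs_per_coef <- n_used / total_coef               # about 5.9

print(c(total_coef = total_coef, obs_per_coef = round(obs_per_coef, 1)))
```

Under these assumptions the model estimates 146 coefficients from roughly 865 usable observations, about six observations per coefficient, which is consistent with wide standard errors and few significant terms.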

The issue in your previous post was what is called separation. There you had more levels for your outcome variable, and some of them were (I suppose) quite infrequent. That meant that some combination of your predictor variables could predict with certainty which way a person voted. You have now got past that problem by using fewer categories.
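One simple way to look for this kind of problem is to cross-tabulate the outcome against each factor predictor and look for empty cells. The sketch below uses a small made-up data frame (`toy`, with hypothetical levels `A`, `B`, `C` and `r1` to `r3`), not the survey data. Empty cells are only one common symptom of separation, so treat this as a first check rather than a full diagnosis.

```r
# Hypothetical toy data: outcome "C" is rare and occurs only in region "r3"
toy <- data.frame(
  out    = factor(c(rep("A", 6), rep("B", 6), rep("C", 2))),
  region = factor(c("r1", "r2", "r3", "r1", "r2", "r3",
                    "r1", "r2", "r3", "r1", "r2", "r3",
                    "r3", "r3"))
)

# Cross-tabulate predictor levels (rows) against outcome levels (columns)
tab <- table(toy$region, toy$out)
print(tab)

# Row/column positions of empty cells: predictor levels in which some
# outcome never occurs, so the matching coefficient has no finite estimate
zero_cells <- which(tab == 0, arr.ind = TRUE)
print(zero_cells)
```

In this toy example, regions `r1` and `r2` contain no observations with outcome `C`. Merging rare outcome levels, as was done in the question, removes such empty cells.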
