[Math] Binary Logistic Regression Model Processing

binary, mathematical modeling, MATLAB, regression, statistics

Thanks for showing interest and wanting to help out.

My aim is to develop a model that predicts, as accurately as possible, how entities in a population will either cooperate or defect, as a % of the total population. For this purpose I have 70 predictor variables; not all of them may be significant, though some are. There could be a degree of multicollinearity among these variables. There are other variables that could potentially affect the outcome, but they are currently unknown. I have approximately 300 datapoints.

So far, I have used the glmfit function in MATLAB to create a binary logistic regression model for all predictor variables.

Now, my statistics expertise is limited at best (I'm sorry about that), and I struggle to choose how to proceed at this point. I would very much appreciate it if you could help me with the following questions in MATLAB:

  1. How do I best assess the accuracy of the model?
  2. Would it be better to reduce the number of predictor variables to
    improve the accuracy of the model? If so, how should I best do this?
  3. How do I check whether multicollinearity is significant? If so, what
    actions should I take to improve the model?
  4. What outputs/plots should I produce to demonstrate the above?
  5. Finally, is there a better way of doing things?

I would very much appreciate your help. Sorry if some of this seems basic – I assure you I have read up on this, but I find myself unable to make an informed decision as to how I should proceed to obtain optimum results.

Thank you very much for your time.

EDIT:
For example, would it be a good idea to look at the individual p-values for all the predictors, eliminate those that fail a chosen significance level (say 0.05), rebuild the model with the predictors that pass the test, and then see whether a better deviance (D) is obtained? How would I be able to judge whether the model is suitable, even if the deviance is better? Is there a better way of doing this? I just don't understand the maths behind these statistics well enough to choose an effective strategy.

EDIT 2:
Thanks to Zhiyong Wang, I have managed to do a LASSO on my data to discriminate predictor variables… I'm now down to 14. However, some of the p-values are still very high, and I'm not quite sure how I should continue to process my model. Please find below my diagnosis:

Estimated Coefficients:
                     Estimate            SE       tStat        pValue
    (Intercept)       -9.3957       0.45246     -20.766    8.7732e-96
    x2              0.0032055     0.0043646     0.73443       0.46269
    x3             -0.0095759     0.0022003      -4.352     1.349e-05
    x4              0.0023242    0.00090184      2.5772     0.0099602
    x5              0.0033171      0.001955      1.6968      0.089743
    x7              0.0017115    0.00090373      1.8938      0.058253
    x9              0.0031377     0.0013612      2.3051      0.021161
    x11            0.00024809     0.0013823     0.17947       0.85757
    x16             0.0014808     0.0021081     0.70244       0.48241
    x22            -0.0017803     0.0014742     -1.2077       0.22718
    x26            -0.0025935     0.0045821    -0.56601       0.57139
    x35            -0.0077807      0.014286    -0.54464         0.586
    x37            -0.0086488     0.0079046     -1.0942       0.27389
    x45            -0.0038264     0.0019328     -1.9797      0.047742
    x52            0.00032738     0.0043498    0.075264          0.94


126 observations, 111 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 319, p-value = 1.51e-59

How do I best proceed from there? Thank you very much.

Best Answer

1. How do I best assess the accuracy of the model?

Besides hypothesis-test quantities such as p-values, you can compute the precision and recall of the predicted classes (see Wikipedia), provided your response is categorical.
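As a minimal sketch of the precision and recall computation, the snippet below uses base MATLAB only. The label and probability vectors are made-up illustration data, and the 0.5 cut-off is an assumption rather than a recommendation; in practice the probabilities would come from the fitted model (for example via glmval), ideally on held-out or cross-validated observations rather than the training data.

```matlab
% Hypothetical data: true labels (1 = cooperate, 0 = defect) and
% predicted probabilities from a logistic model.
yTrue = [1 0 1 1 0 0 1 0 1 0]';
pHat  = [0.9 0.2 0.7 0.4 0.6 0.1 0.8 0.3 0.55 0.05]';

threshold = 0.5;                  % assumed decision cut-off
yPred = pHat >= threshold;        % predicted class labels

tp = sum(yPred == 1 & yTrue == 1);   % true positives
fp = sum(yPred == 1 & yTrue == 0);   % false positives
fn = sum(yPred == 0 & yTrue == 1);   % false negatives
tn = sum(yPred == 0 & yTrue == 0);   % true negatives

precision = tp / (tp + fp);       % share of predicted positives that are correct
recall    = tp / (tp + fn);       % share of actual positives that are found
accuracy  = (tp + tn) / numel(yTrue);

fprintf('precision = %.2f, recall = %.2f, accuracy = %.2f\n', ...
        precision, recall, accuracy);
```

Changing the threshold trades precision against recall, so it is worth reporting both numbers for the cut-off you actually intend to use.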

2. Would it be better to reduce the number of predictor variables to improve the accuracy of the model? If so, how should I best do this?

You can improve the accuracy while also reducing the number of predictors by adding the L1 norm of the regression weights as a penalty to the objective function. This method is called LASSO. It introduces an extra parameter that you need to tune to find a balance between the sparsity of your model, in terms of the number of predictor variables, and its accuracy.
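A hedged sketch of LASSO for a binary response follows. It assumes the Statistics and Machine Learning Toolbox is available and uses its lassoglm function with 10-fold cross-validation to choose the penalty weight. The data are synthetic, and the sizes, seed, and variable names are illustrative assumptions, not the asker's data.

```matlab
rng(1);                                  % reproducible synthetic data
n = 300; p = 20;                         % assumed sample size and predictor count
X = randn(n, p);
trueBeta = [2; -1.5; 1; zeros(p-3, 1)];  % only the first three predictors matter
prob = 1 ./ (1 + exp(-(X * trueBeta)));  % logistic link
y = double(rand(n, 1) < prob);           % binary response (0/1)

% L1-penalised logistic regression, lambda chosen by 10-fold cross-validation.
[B, FitInfo] = lassoglm(X, y, 'binomial', 'CV', 10);

idx = FitInfo.Index1SE;                  % sparsest fit within one SE of the minimum deviance
selected = find(B(:, idx) ~= 0);         % indices of the predictors that were kept
coef = [FitInfo.Intercept(idx); B(:, idx)];   % intercept followed by slopes

fprintf('Predictors kept: %s\n', mat2str(selected'));
lassoPlot(B, FitInfo, 'PlotType', 'CV'); % cross-validated deviance against lambda
```

The tuning parameter mentioned above is lambda: larger values zero out more coefficients. Index1SE picks a deliberately sparse model, while FitInfo.IndexMinDeviance picks the one with the lowest cross-validated deviance.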

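3. How do I check whether multicollinearity is significant? If so, what actions should I take to improve the model?

You can add interaction terms, like $x_1x_2$, to the set of predictor variables, where $x_1,x_2$ are your original predictor variables. Note that interaction terms capture joint effects of predictors; they do not by themselves reveal multicollinearity. To check for it, inspect the pairwise correlations or the variance inflation factors (VIF) of the predictors. If some predictors are strongly collinear, you can drop or combine them, or rely on a penalized fit such as LASSO.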
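One common diagnostic for multicollinearity is the variance inflation factor (VIF), which can be read off the diagonal of the inverse of the predictors' correlation matrix. The sketch below uses base MATLAB on synthetic data in which one predictor is deliberately built to be nearly collinear with another. The variable names and the noise level are assumptions for illustration.

```matlab
rng(2);                           % reproducible synthetic data
n  = 300;
x1 = randn(n, 1);
x2 = randn(n, 1);                 % independent of the others
x3 = x1 + 0.1 * randn(n, 1);      % nearly collinear with x1 by construction
X  = [x1 x2 x3];

R   = corrcoef(X);                % pairwise correlation matrix of the predictors
vif = diag(inv(R));               % VIF of predictor j is the j-th diagonal entry

disp(R);
disp(vif');
```

A VIF near 1 indicates little collinearity. Values above roughly 5 to 10 are often treated as a warning sign, although that threshold is a rule of thumb and not a formal test.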

4. What outputs/plots should I produce to demonstrate the above?

You can plot a ROC curve; see Wikipedia for details.
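A minimal sketch of a ROC plot is shown below, assuming the Statistics and Machine Learning Toolbox is available for perfcurve. The labels and scores are made-up illustration values; in practice the scores would be the predicted probabilities from the fitted logistic model, preferably on data not used for fitting.

```matlab
% Hypothetical data: true labels and predicted probabilities of class 1.
yTrue = [1 0 1 1 0 0 1 0 1 0]';
pHat  = [0.9 0.2 0.7 0.4 0.6 0.1 0.8 0.3 0.55 0.05]';

% False positive rate, true positive rate, and area under the curve,
% treating label 1 as the positive class.
[fpr, tpr, ~, auc] = perfcurve(yTrue, pHat, 1);

plot(fpr, tpr, '-o'); hold on;
plot([0 1], [0 1], '--');         % reference line for a random classifier
xlabel('False positive rate');
ylabel('True positive rate');
title(sprintf('ROC curve (AUC = %.2f)', auc));
hold off;
```

The area under the curve summarizes the plot in one number: 0.5 corresponds to random guessing and 1 to perfect separation of the two classes.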

5. Finally, is there a better way of doing things?

I think it depends on your specific problem.
