Thanks for showing interest and wanting to help out.
My aim is to develop a model that predicts, as accurately as possible, how entities in a population will either cooperate or defect, expressed as a percentage of the total population. For this purpose I have 70 predictor variables; however, not all of them may be significant (some certainly are). There could be a degree of multicollinearity among these variables. Other variables could potentially affect the outcome, but they are currently unknown. I have approximately 300 data points.
So far, I have used the glmfit function in MATLAB to create a binary logistic regression model with all predictor variables.
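For reference, here is a minimal sketch of what I have done so far (X and y are placeholder names for my predictor matrix and 0/1 response vector; the real data is loaded elsewhere):

```matlab
% X: n-by-70 matrix of predictors, y: n-by-1 vector of 0/1 outcomes
[b, dev, stats] = glmfit(X, y, 'binomial', 'link', 'logit');

% fitglm gives the same fit but returns a richer model object
mdl = fitglm(X, y, 'Distribution', 'binomial');
disp(mdl.Coefficients)   % Estimate, SE, tStat, pValue per predictor
```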
Now, my statistics expertise is limited at best (I'm sorry about that), and I am struggling to choose how to proceed at this point. I would very much appreciate your help with the following questions in MATLAB:
- How do I best assess the accuracy of the model?
- Would it be better to reduce the number of predictor variables to improve the accuracy of the model? If so, how should I best do this?
- How do I check whether multicollinearity is significant? If so, what actions should I take to improve the model?
- What outputs/plots should I produce to demonstrate the above?
- Finally, is there a better way of doing things?
I would very much appreciate your help. Sorry if some of this seems basic – I assure you I have read up on this, but I find myself unable to make an informed decision as to how I should proceed to obtain optimum results.
Thank you very much for your time.
EDIT:
For example, would it be a good idea to look at the individual p-values for all the predictors, eliminate those that fail a chosen significance level (say 0.05), reconstruct the model with the predictors that pass, and then see whether a better deviance (D) is obtained? And even if the deviance improves, how would I judge whether the model is suitable? Is there a better way of doing this? I just don't understand the mathematics behind these statistics well enough to choose an effective strategy.
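To make the idea concrete, this is roughly what I had in mind (a sketch, assuming X and y are my predictor matrix and response; I am not sure this is statistically sound, hence my question):

```matlab
mdl = fitglm(X, y, 'Distribution', 'binomial');

% keep predictors whose individual p-value passes 0.05
% (skip the intercept row, hence 2:end)
keep = mdl.Coefficients.pValue(2:end) < 0.05;
mdlReduced = fitglm(X(:, keep), y, 'Distribution', 'binomial');

% dropping terms always raises the deviance, so the comparison
% has to be judged somehow, e.g. via AIC
fprintf('full:    dev = %.1f, AIC = %.1f\n', mdl.Deviance, mdl.ModelCriterion.AIC);
fprintf('reduced: dev = %.1f, AIC = %.1f\n', mdlReduced.Deviance, mdlReduced.ModelCriterion.AIC);
```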
EDIT 2:
Thanks to Zhiyong Wang, I have managed to run a LASSO on my data to select predictor variables… I'm now down to 14. However, some of the p-values are still very high, and I'm not quite sure how I should continue to refine my model. Please find my diagnostics below:
Estimated Coefficients:

                 Estimate          SE        tStat      pValue
(Intercept)       -9.3957     0.45246      -20.766  8.7732e-96
x2              0.0032055   0.0043646      0.73443     0.46269
x3             -0.0095759   0.0022003       -4.352   1.349e-05
x4              0.0023242  0.00090184       2.5772   0.0099602
x5              0.0033171    0.001955       1.6968    0.089743
x7              0.0017115  0.00090373       1.8938    0.058253
x9              0.0031377   0.0013612       2.3051    0.021161
x11            0.00024809   0.0013823      0.17947     0.85757
x16             0.0014808   0.0021081      0.70244     0.48241
x22            -0.0017803   0.0014742      -1.2077     0.22718
x26            -0.0025935   0.0045821     -0.56601     0.57139
x35            -0.0077807    0.014286     -0.54464       0.586
x37            -0.0086488   0.0079046      -1.0942     0.27389
x45            -0.0038264   0.0019328      -1.9797    0.047742
x52            0.00032738   0.0043498     0.075264        0.94
126 observations, 111 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 319, p-value = 1.51e-59
How do I best proceed from there? Thank you very much.
Best Answer
Besides hypothesis-testing quantities such as the p-value, you can compute the precision and recall (see the Wikipedia article) if your response value is categorical.
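A minimal sketch, assuming mdl is your fitted model and 0.5 is used as the classification threshold (the threshold is a choice on your part, not a given):

```matlab
p = predict(mdl, X);                 % fitted probabilities
yhat = double(p >= 0.5);             % threshold choice is up to you

C = confusionmat(y, yhat);           % rows: true class, cols: predicted
tp = C(2,2); fp = C(1,2); fn = C(2,1);
precision = tp / (tp + fp);          % of predicted positives, fraction correct
recall    = tp / (tp + fn);          % of true positives, fraction found
```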
You can improve the accuracy while reducing the number of predictors by adding the L1 norm of the weights of the linear model to the objective function. This method is called the LASSO. It introduces an extra parameter that you need to tune to find a balance between the sparsity of your model (in terms of the number of predictor variables) and its accuracy.
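In MATLAB this is lassoglm; a sketch using 10-fold cross-validation to tune the penalty parameter (the "one standard error" rule below is one common choice, not the only one):

```matlab
% cross-validated LASSO for logistic regression
[B, FitInfo] = lassoglm(X, y, 'binomial', 'CV', 10);

% sparsest model within one SE of the minimum cross-validated deviance
idx = FitInfo.Index1SE;
coef = B(:, idx);                    % many entries are exactly zero
selected = find(coef ~= 0);          % surviving predictor columns
intercept = FitInfo.Intercept(idx);

lassoPlot(B, FitInfo, 'PlotType', 'CV');   % deviance vs. penalty
```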
If multicollinearity is significant, one action you can take to improve the model is to add interaction terms, like $x_1x_2$, to the set of predictor variables, where $x_1, x_2$ are among your original predictor variables.
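Sketches for both points: variance inflation factors are a standard multicollinearity check (my suggestion, not something you have computed yet), and an interaction term can be appended as an extra column before refitting:

```matlab
% variance inflation factors: VIF > ~10 is a common rule of thumb
% for problematic collinearity
R = corrcoef(X);
vif = diag(inv(R))';

% interaction term x1*x2 added by hand as an extra column
Xint = [X, X(:,1) .* X(:,2)];
mdlInt = fitglm(Xint, y, 'Distribution', 'binomial');
```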
For plots, you can try an ROC curve; see the Wikipedia article for details.
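With the Statistics Toolbox, perfcurve does this; a sketch assuming mdl is the fitted logistic model:

```matlab
scores = predict(mdl, X);                       % fitted probabilities
[fpr, tpr, ~, AUC] = perfcurve(y, scores, 1);   % positive class = 1
plot(fpr, tpr);
xlabel('False positive rate'); ylabel('True positive rate');
title(sprintf('ROC curve, AUC = %.3f', AUC));
```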
As for whether there is a better way of doing things, I think that depends on your specific problem.