Solved – Logistic Regression: Does the model selection process make sense?

Tags: feature-selection, logistic, multiple-regression, predictive-models, regression

This is kind of a broad question and so I am okay with broad or general answers. In fact, each of these could be their own individual questions, but I think it makes sense to ask them all. Even if you have answers to just one or two, I am happy with that. Basically I have made a model and approach and even have some results, but I just want to make sure that it's correct and that there are no gaps in the process. So, here goes:

Previous Criminal Activity as Predictor of Future Criminal Activity

(Note that this is an academic project and is not going to impact any real persons)

For simplicity, say that in my training set I have 100 individuals and 10 of them are convicted criminals. So my output variable (Y) has 90 zeroes (not convicted) and 10 ones (convicted). In my test set I have 10 individuals and one convicted criminal among them.

  1. I have features of their behaviors and demographics and I want to figure out which features make someone more likely to commit a crime. But I also want to break them into ranks, or buckets, so that the investigators can know who to target. For example, among the 90 that are not convicted, I only have enough money to pay investigators to research twenty of them. So how do I use the output to tell them which twenty are the riskiest?

  2. So I put my training set into logistic regression with various features (some continuous and some categorical). For example, say the state that they live in, so I would have 49 dummy variables for the 50 states (one state serving as the reference level). If I calculate VIFs for these and some states have high VIFs and others don't, does it mean that there is multicollinearity among them even though they are categorical? And does it make sense to pick and choose which dummy variables are to be removed? For example, does it make sense to proceed with 39 of the 50 states since I found that 11 had multicollinearity?

  3. After that I do stepwise model selection. Let's say I get five out of thirty features with significant p-values. So is it correct to assume that those features make someone most likely to perform criminal activity? And the coefficients describe how much impact (after transforming them back to linear estimates, of course)? Similar to question (2), does it make sense to drop parts of a categorical feature during this process? Or if you drop one level, do you have to drop them all?

  4. Once I do this, how do I use my output to "predict" which of the 90 are riskiest? My output would be 1's and 0's in order to make an ROC curve, right? How do I convert that to some sort of predicted probability between 0 and 1? So basically I would like to make five buckets, like 0-20%, 21-40%, 41-60%, 61-80%, and 81-100%, based on predicted probabilities, and I will tell the investigators to focus on the 81-100% bucket first, and then if they have time, go to 61-80%, and so on.

  5. In logistic regression, what is the difference between the training, validation, and test set? I am used to using a training and test set, but not sure about a validation set. Is that used to calculate the ROC curve?

  6. Say that in my data I have too few 1's and a lot of 0's and I am getting really poor ROC curves (close to 0.5). Is there a sampling approach or other type of fix that I can perform to remedy this?

I hope that's not too broad and any guidance would be helpful. Thank you and please feel free to ask me for any clarifications!

Best Answer

  1. The beauty of logistic regression is that it outputs probabilities. So just sort the subjects by their predicted probability of offending and pick the 20 greatest.
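     To make this concrete, here is a minimal sketch in Python with scikit-learn. The data here is synthetic and the feature/label setup is purely illustrative, not the asker's actual dataset:

     ```python
     # Hypothetical sketch: rank subjects by predicted probability of offending
     # and take the 20 riskiest. X and y are synthetic stand-ins.
     import numpy as np
     from sklearn.linear_model import LogisticRegression

     rng = np.random.default_rng(0)
     X = rng.normal(size=(100, 4))                            # 100 subjects, 4 features
     y = (X[:, 0] + rng.normal(size=100) > 1.5).astype(int)   # imbalanced labels

     model = LogisticRegression().fit(X, y)
     probs = model.predict_proba(X)[:, 1]   # P(convicted) for each subject

     # Indices of the 20 subjects with the highest predicted risk, riskiest first
     top20 = np.argsort(probs)[::-1][:20]
     ```

     The key point is that `predict_proba` gives you the probabilities directly; no thresholding or bucketing is needed just to rank subjects.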

  2. I don't know what you're asking. I think you have the vocabulary mixed up here. Multicollinearity isn't something you do; it's a condition of a dataset. A categorical feature comprises levels, not variables. What exactly do you mean by "VIF"?

  3. To put it bluntly, stepwise model selection is an obsolete method. There are much better ways to do variable selection, such as the lasso. You should also be careful not to assume you need variable selection in the first place: it always carries the risk of throwing away useful information, and there are modeling techniques that can handle a lot of uninformative features, such as random forests (and lasso-regularized logistic regression, for that matter).
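     A hedged sketch of the lasso alternative, on synthetic data where only two of thirty features carry signal. The setup is illustrative; your penalty choice and solver may differ:

     ```python
     # L1-regularized (lasso) logistic regression as an alternative to stepwise
     # selection. Coefficients shrunk exactly to zero are effectively "deselected".
     import numpy as np
     from sklearn.linear_model import LogisticRegressionCV

     rng = np.random.default_rng(1)
     X = rng.normal(size=(200, 30))                               # 30 candidate features
     y = (X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(int)

     # Cross-validation picks the penalty strength; liblinear supports L1
     model = LogisticRegressionCV(penalty="l1", solver="liblinear", cv=5).fit(X, y)

     # Features whose coefficients survived the shrinkage
     selected = np.flatnonzero(model.coef_[0])
     ```

     Unlike stepwise selection, the penalty is tuned by cross-validation rather than by repeated significance tests, which avoids the inflated p-values stepwise procedures produce.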

  4. As I mentioned earlier, logistic regression produces probabilities, not 0s and 1s. Making an ROC curve just means coercing the probabilities to 0 and 1 with a varying threshold. But it seems more to the point to just give the investigators the complete list of subjects sorted by their probability of offending.
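     A short sketch of that threshold-sweeping idea, again on synthetic data; `roc_curve` does the sweep for you:

     ```python
     # An ROC curve is built by sweeping a threshold over the predicted
     # probabilities: each threshold coerces them to 0/1 and yields one
     # (false-positive rate, true-positive rate) point.
     import numpy as np
     from sklearn.linear_model import LogisticRegression
     from sklearn.metrics import roc_auc_score, roc_curve

     rng = np.random.default_rng(2)
     X = rng.normal(size=(100, 3))
     y = (X[:, 0] + rng.normal(size=100) > 1).astype(int)

     probs = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

     fpr, tpr, thresholds = roc_curve(y, probs)  # one point per threshold
     auc = roc_auc_score(y, probs)               # area under that curve
     ```

     Note the curve is computed from the probabilities themselves; you never have to hand the model hard 0/1 outputs.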

  5. No, the idea of a validation set is to let you tune some aspect of the model (such as, in the case of the lasso, the penalty size). Then you can examine the model's predictions on the held-out test set without any optimistic bias from the tuning procedure overfitting to the data it was tuned on.
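     A minimal sketch of a three-way split, following the conventional naming where the training set fits the model, the validation set tunes it, and the test set gives the final unbiased check. The 60/20/20 proportions are just illustrative:

     ```python
     # Three-way split: train (fit), validation (tune), test (final evaluation).
     import numpy as np
     from sklearn.model_selection import train_test_split

     rng = np.random.default_rng(3)
     X = rng.normal(size=(100, 5))
     y = rng.integers(0, 2, size=100)

     # First carve off 20% as the untouched test set...
     X_trainval, X_test, y_trainval, y_test = train_test_split(
         X, y, test_size=0.2, random_state=0)
     # ...then split the remainder into training and validation sets
     X_train, X_val, y_train, y_val = train_test_split(
         X_trainval, y_trainval, test_size=0.25, random_state=0)  # 60/20/20 overall
     ```

     In practice cross-validation (as `LogisticRegressionCV` does internally) often replaces a single fixed validation set, especially with as few as 100 subjects.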

  6. That depends entirely on how you're getting your data. Just don't throw away data you already have to balance your classes. That would be counterproductive. Edit: Oh, when you wrote "sampling" here, you probably meant "resampling". No, don't do that.
