Logistic Regression – Evaluating Accuracy on Skewed Data

Tags: classification, logistic, r, regression

I've been having an argument with a friend of mine, and it's very possible I'm wrong here.

We are performing binary logistic regression on a dataset with 10,000 observations, classifying each action as "good" or "bad". There are two independent variables (x1, x2) and a class variable (y, with values "good" or "bad"). In this dataset, 7,500 observations are classified as "bad" and 2,500 as "good". This is because there are several different ways for a user to perform a "bad" action, but only one way to perform a "good" action.

We are doing our analysis in R using the glm() function.

We create training data by randomly sampling 7,500 observations from the dataset, and use the remaining 2,500 observations as test data. We then build a model using binary logistic regression on the training data and evaluate it on the test data. The accuracy of our model is 75%.
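For concreteness, here is a minimal R sketch of that workflow, assuming the data sit in a data frame called `dat` with columns `x1`, `x2`, and a factor `y` whose levels are "bad" and "good" (those names are illustrative, not taken from your code):

```r
# Hypothetical sketch of the described workflow; `dat`, x1, x2, y are assumed names.
set.seed(1)
train_idx <- sample(nrow(dat), 7500)   # random 7,500-observation training sample
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]             # remaining 2,500 observations as test data

fit <- glm(y ~ x1 + x2, data = train, family = binomial)

# Predicted probability of the second factor level ("good"), thresholded at 0.5
pred <- ifelse(predict(fit, newdata = test, type = "response") > 0.5, "good", "bad")
mean(pred == test$y)                   # test-set accuracy
```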

Can we say our model is better than guessing?

He says that this model is no better than guessing. Even though the accuracy is better than 50%, because the original data had a 75% prevalence of "bad" observations, we would need our model to predict with better than 75% accuracy in order to say it performs better than random guessing.

I disagree…but I can't defend my point with anything other than "that doesn't seem right". Can someone shed some light on the correct interpretation, and the reason for it?

Best Answer

You are right. I suspect your friend's conceptual mistake is that he visualizes the $75\%$-$25\%$ "random guessing" as an "agnostic draw from the bag" and compares this with the "model accuracy". But this is not what "model accuracy" measures. "Model accuracy" quantifies a two-step procedure: first you predict, and then you draw to see whether your prediction was correct. Without the model, you would have only the unconditional probabilities to go on (accepting that the sample proportions are close estimates of them). So, as you clarified in the comments, you would flip a $75\%$-$25\%$ coin to make your prediction, and the "accuracy" you should expect on average in this situation would be

$$\Pr [\{\text{"coin lands bad"}\},\{\text{"draw gives bad"}\}] + \Pr [\{\text{"coin lands good"}\},\{\text{"draw gives good"}\}]$$

These are independent events so the above joint probabilities split: $$\Pr [\{\text{"coin lands bad"}\}]\cdot\Pr [\{\text{"draw gives bad"}\}] \\+ \Pr [\{\text{"coin lands good"}\}]\cdot\Pr [\{\text{"draw gives good"}\}]$$

$$ = (0.75)^2 + (0.25)^2 = 0.5625 + 0.0625 = 0.625$$
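If you want to convince yourself of that figure, a quick simulation in R reproduces it (the probabilities below are the ones from your description, nothing else is assumed):

```r
# Simulate the "flip a 75%-25% coin, then draw an outcome" guessing strategy.
set.seed(1)
n     <- 1e6
guess <- sample(c("bad", "good"), n, replace = TRUE, prob = c(0.75, 0.25))
truth <- sample(c("bad", "good"), n, replace = TRUE, prob = c(0.75, 0.25))
mean(guess == truth)   # expected accuracy of random guessing, approximately 0.625
```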

You are telling us that, using the model, you have

$$\Pr [\{\text{"model says bad"}\},\{\text{"observation is bad"}\}] + \Pr [\{\text{"model says good"}\},\{\text{"observation is good"}\}] = 0.75$$

So your model improved your chances of predicting correctly. Note that the model does not "draw" from the bag of $Y$'s but from the bag of $X$'s. The model does not use the fact that you have a $75\%$ probability of drawing $X$'s that are related to a "bad" $Y$; it uses the numerical values of the $X$'s.

P.S.
Greg Snow's answer correctly points out that your model does not perform better than the prediction strategy "always guess bad". But that strategy cannot be considered "random". Still, the remark is a valid criticism of the practical usefulness of your model in general (and not only relative to "random guessing").
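Continuing the hypothetical sketch from the question (the objects `pred` and `test` are the assumed names used there), the comparison looks like this:

```r
# Compare the model with the "always guess bad" baseline on the same test set.
model_accuracy <- mean(pred == test$y)        # reported as roughly 0.75
always_bad_acc <- mean(test$y == "bad")       # also roughly 0.75 if the test set mirrors the data
c(model = model_accuracy, always_bad = always_bad_acc)
```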