Solved – How to measure test set error with logistic regression

logisticregression

In measuring the performance of a model, I divide my data into two sets, a training set and a test set, fit my model to the training set, and then try to predict the outcomes of the test set. For binary classification, I expect to classify my results into 0's and 1's. However, the output of a logistic regression is a probability. So if I predict a probability of 51% and classify it as a 1 because it is > 0.5, but that prediction turns out to be wrong 49% of the time, isn't my model right about the probability rather than being 49% wrong?
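
For concreteness, here is roughly what I'm doing (a minimal sketch assuming scikit-learn; the dataset and settings below are made up just for illustration):

```python
# Sketch of the train/test workflow described above (synthetic data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# Predicted probabilities for class 1, then hard 0/1 labels via the 0.5 threshold
p_test = model.predict_proba(X_test)[:, 1]
y_pred = (p_test >= 0.5).astype(int)

# Test-set error rate (empirical 0-1 loss)
test_error = np.mean(y_pred != y_test)
print(f"test error rate: {test_error:.3f}")
```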

Would it be a better measure to check if my error rate is close to the expected error rate for the model?

Best Answer

There is no standard way to define goodness-of-fit; it depends on your application and the problem you are trying to solve. In classification, for example, you may define goodness-of-fit in terms of the 0-1 loss.

For a logistic regression you can compute the likelihood function. I would use McFadden's pseudo-$R^2$, which is defined as:

$$ R^2 = 1 - \frac{\operatorname{L}(\theta)}{\operatorname{L}(\mathbf{0})} $$

$\operatorname{L}$ is the log-likelihood function, $\theta$ is the parameter vector of the fitted model, and $\mathbf{0}$ denotes the zero vector (i.e. you compare the log-likelihood of your model against that of a model with all coefficients set to 0).
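
A minimal sketch of this quantity, following the definition above: with every coefficient (including the intercept) set to zero, the logistic model predicts $p = 0.5$ for each observation, so $\operatorname{L}(\mathbf{0}) = n \log 0.5$. (Note that many implementations, e.g. `prsquared` in statsmodels, instead use an intercept-only null model as the baseline.) The data and settings below are illustrative only:

```python
# McFadden pseudo-R^2 as defined above (zero-vector baseline), synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
# Large C ~ no regularization, so the fit is close to the maximum-likelihood one.
model = LogisticRegression(C=1e6).fit(X, y)
p_hat = model.predict_proba(X)[:, 1]

def log_likelihood(y, p):
    """Bernoulli log-likelihood of labels y under predicted probabilities p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

ll_model = log_likelihood(y, p_hat)   # L(theta) for the fitted model
ll_zero = len(y) * np.log(0.5)        # L(0): p = 0.5 for every observation

mcfadden_r2 = 1 - ll_model / ll_zero
print(f"McFadden pseudo-R^2: {mcfadden_r2:.3f}")
```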

Moreover, given the conditional probability $\mu(x) = P(Y = 1 \mid X = x)$, define the loss of a classifier $g$ as $L(g) = P(g(X) \neq Y)$, i.e. its misclassification probability (this $L$ is the classifier's risk, not the log-likelihood above).

The Bayes decision rule:

$$ g^*(x) = \begin{cases} 1 & \mbox{if } \mu(x) \geq 0.5 \\ 0 & \mbox{if } \mu(x) < 0.5 \end{cases} $$

is the rule that minimizes $L(g)$. There is nothing wrong with classifying an observation as 1 when your logistic regression outputs a probability $\geq 0.5$, as long as you have this loss function in mind.
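
As a quick illustration (a simulation sketch, with a made-up data-generating process): if you know the true $\mu(x)$ and draw labels from it, thresholding $\mu(x)$ at 0.5 gives the smallest misclassification rate among fixed thresholds, up to simulation noise.

```python
# Simulation: thresholding the true mu(x) at 0.5 (the Bayes rule) minimizes
# the empirical 0-1 loss. The data-generating process here is made up.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

x = rng.normal(size=n)
mu = 1 / (1 + np.exp(-(0.5 + 2 * x)))   # true conditional probability mu(x)
y = rng.binomial(1, mu)                 # labels drawn from mu(x)

for t in [0.3, 0.4, 0.5, 0.6, 0.7]:
    g = (mu >= t).astype(int)           # classifier: threshold mu(x) at t
    loss = np.mean(g != y)              # empirical L(g) = P(g(X) != Y)
    print(f"threshold {t:.1f}: error rate {loss:.4f}")
# The 0.5 threshold attains (approximately) the smallest error rate.
```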