Predicting with Bayesian models, and especially with BUGS, is very easy: just set the response in the test set to NA. You then also need to specify initial values for the response; set those to NA for the training data and to a reasonable value for the test data.
BUGS will then sample from the posterior predictive distribution for the response values you set to NA. Note that these distributions incorporate the uncertainty about the regression coefficients. You can take the median of these samples if you want point estimates, but the standard deviation of the samples is also quite informative.
Here is a rather minimal example:
model
{
  for (i in 1:N)
  {
    y[i] ~ dnorm(mu, 1)    # likelihood; dnorm is parameterized as (mean, precision)
  }
  mu ~ dunif(-1000, 1000)  # vague prior for the common mean
}
#data
list(N=10, y = c(-1,0,1,-1,0.5,-0.5,2,-1.5, NA, NA))
#inits
list(mu = 0, y = c(NA,NA,NA,NA,NA,NA,NA,NA,0,0))
You can then get posterior predictive distributions for $y_9$ and $y_{10}$. This example does not contain predictors, but the approach also works with them; note that you would not set the predictors to NA, they simply remain unchanged for the test cases.
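If you prefer to drive this from Python, here is a rough analogue of the same trick, a minimal sketch assuming PyMC is available: masked entries in the observed vector play the role of NA, and PyMC samples them from the posterior predictive automatically.

# Minimal sketch, assuming PyMC is installed; mirrors the BUGS model above
import numpy as np
import pymc as pm

# np.nan marks the two test cases, just like NA in the BUGS data
y_obs = np.ma.masked_invalid([-1, 0, 1, -1, 0.5, -0.5, 2, -1.5,
                              np.nan, np.nan])

with pm.Model():
    mu = pm.Uniform("mu", -1000, 1000)
    # sigma = 1 corresponds to dnorm(mu, 1), which has precision 1
    pm.Normal("y", mu=mu, sigma=1, observed=y_obs)
    trace = pm.sample()

# The masked entries are imputed automatically and show up as their own
# variable in the trace (named e.g. y_unobserved or y_missing, depending
# on the PyMC version): these are the posterior predictive draws for
# y[9] and y[10].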
Edit after comment:
You can also do this differently and separate the test and training data in the model above. That would look like this:
model
{
  for (i in 1:N.train)
  {
    y.train[i] ~ dnorm(mu, 1)  # training data: informs the posterior of mu
  }
  for (i in 1:N.test)
  {
    y.test[i] ~ dnorm(mu, 1)   # test data: unobserved, sampled from the posterior predictive
  }
  mu ~ dunif(-1000, 1000)      # vague prior for the common mean
}
#data
list(N.train=8, N.test = 2, y.train = c(-1,0,1,-1,0.5,-0.5,2,-1.5))
#inits
list(mu = 0, y.test = c(0,0))
This might look somewhat easier, but note that you will also need to split every predictor in the model (my example has none), so you might end up with vectors like sex.train and sex.test. Personally, I prefer the first way because it is more terse.
And yes, I think this is a reasonable starting point. While some kinds of overfitting will show up in a Bayesian model as very high standard deviations for the coefficients, you still impose a model structure that might not fit the data well, and this can also lead to poor predictions. You should also consider (for example) a full cross-validation, where you repeat this step with different splits of the original data.
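If it helps to see the mechanics of repeated splitting, here is a minimal sketch of a 5-fold cross-validation in Python, using the same mean-only model as my example (simulated data for illustration):

# Minimal sketch: 5-fold cross-validation for the mean-only model
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
y = rng.normal(loc=0.2, size=100)

errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(y):
    mu_hat = y[train_idx].mean()                          # fit on the training fold
    errors.append(np.mean((y[test_idx] - mu_hat) ** 2))   # test-fold error

print("CV estimate of the squared prediction error:", np.mean(errors))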
I do not think you need to split the data if you are interested in the significance of a coefficient rather than in prediction. Cross-validation is used to judge the prediction error outside the sample used to estimate the model. Typically, the objective is to tune some parameter that is not being estimated from the data.
For example, if you were interested in prediction, I would advise you to use regularized logistic regression. This is similar to logistic regression, except that the coefficients (as a whole) are biased towards 0. The amount of bias is determined by a penalty parameter that is typically fine-tuned via cross-validation: the idea is to choose the penalty parameter that minimizes the out-of-sample error (which is measured via cross-validation). When building a predictive model, it is acceptable (and desirable) to introduce some bias into the coefficients if that bias causes a much larger drop in the variance of the prediction, hence resulting in a better model for predictive purposes.
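Here is a minimal sketch of that tuning step with scikit-learn (the data is simulated purely for illustration):

# Minimal sketch: L2-regularized logistic regression whose penalty is
# tuned by 5-fold cross-validation (simulated data for illustration)
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (rng.random(300) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

# Cs is a grid of inverse penalty strengths; CV picks the value that
# minimizes the out-of-sample log loss
clf = LogisticRegressionCV(Cs=10, cv=5, penalty="l2",
                           scoring="neg_log_loss").fit(X, y)
print("chosen inverse penalty:", clf.C_[0])
print("coefficients (shrunk towards 0):", clf.coef_)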
What you are trying to do is inference: you want an unbiased estimate of a coefficient (presumably to judge the effect that changing one variable has on another). The best way to obtain this is to have a well-specified model and a sample as large as possible. Hence, I would not split the sample. If you are interested in sampling variation, you should try a bootstrap or a jackknife procedure instead.
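A bootstrap of a coefficient is straightforward; here is a minimal sketch, assuming statsmodels is available (again with simulated data):

# Minimal bootstrap sketch for the sampling variation of a logistic
# regression coefficient (simulated data for illustration)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = (rng.random(n) < 1 / (1 + np.exp(-(0.5 + x)))).astype(int)
X = sm.add_constant(x)

boot_slopes = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)              # resample rows with replacement
    res = sm.Logit(y[idx], X[idx]).fit(disp=0)    # refit on the bootstrap sample
    boot_slopes.append(res.params[1])             # keep the slope coefficient

print("bootstrap sd of the slope:", np.std(boot_slopes))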
EDIT:
Short version: you want an unbiased model. Cross-validation can help you find a good predictive model, and predictive models are often biased. Hence, I do not think cross-validation is helpful in this situation.
Best Answer
There is no standard way to define goodness of fit; it depends on your application and on the problem you are trying to solve. In classification, for example, you may define goodness of fit via the 0-1 loss.
For a logistic regression, you can compute the likelihood function. I would use McFadden's pseudo-$R^2$, which is defined as:
$$ R^2 = 1 - \frac{\operatorname{L}(\theta)}{\operatorname{L}(\mathbf{0})} $$
$\operatorname{L}$ is the log-likelihood function, $\theta$ is the parameter vector of the model, and $\mathbf{0}$ denotes a zero vector (i.e. you compare the log-likelihood of your model against that of a model with all coefficients set to 0).
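Here is a minimal sketch of that computation in Python (simulated data; a very large C is used to make scikit-learn's fit essentially unpenalized):

# Minimal sketch: McFadden's pseudo-R^2 as defined above
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (rng.random(200) < 1 / (1 + np.exp(-(X @ np.array([1.5, -1.0]))))).astype(int)

fit = LogisticRegression(C=1e10).fit(X, y)   # huge C: essentially no penalty
p = fit.predict_proba(X)[:, 1]               # fitted P(Y = 1 | X = x)

ll_theta = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))  # L(theta)
ll_zero = len(y) * np.log(0.5)               # L(0): all coefficients 0 => p = 0.5
print("McFadden pseudo-R^2:", 1 - ll_theta / ll_zero)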
Moreover, given the conditional probability $\mu(x) = P(Y = 1 \mid X = x)$, define the loss of a classifier $g$ as $L(g) = P(g(X) \neq Y)$.
The Bayes decision rule:
$$ g^*(x) = \begin{cases} 1 & \mbox{if } \mu(x) \geq 0.5 \\ 0 & \mbox{if } \mu(x) < 0.5 \end{cases} $$
is the rule that minimizes $L(g)$. There is nothing wrong with classifying as 1 when your logistic regression outputs a probability $\geq 0.5$, as long as you have the loss function above in mind.
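Continuing the pseudo-$R^2$ sketch above (reusing p and y from there), the plug-in version of this rule and its empirical 0-1 loss are one line each:

# Plug-in Bayes rule: classify as 1 exactly when the estimated
# P(Y = 1 | X = x) is at least 0.5
g = (p >= 0.5).astype(int)
print("empirical 0-1 loss:", np.mean(g != y))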