Solved – How to evaluate fit of a logistic regression

goodness-of-fit, logistic, predictive-models, regression

I have a set of data points, which exhibit a solid linear correlation $r\approx 0.9$. I am basically plotting population in certain areas against the number of occurrences of a certain phenomenon (so in other words, I think the number of people should predict the number of occurrences of this phenomenon).

I don't like the linear regression, though, because a linear model can take on negative or arbitrarily large values. In my case the values need to be positive, and there is an upper bound (which I don't know, but logically the values can only get so large). Unfortunately, one of the inputs for which I need a predicted output is far larger than the inputs I used to fit the regression.

This made me choose a logistic model instead, $$y=\frac{A}{1+Be^{-Cx}}$$ because I am familiar with its shape (always positive, and approaching a limiting value) from basic differential equations, and it seems to be what I want for this situation. Also the data has a slight "point of inflection", so the curvature of the logistic regression (visually to me) is even a little better fit for the data points than the line was.

So finally my question(s):

  1. I keep reading on the internet that logistic models are meant for probabilities, which take values between $0$ and $1$. I would like, in addition to having a high $r$, to try and reason why this type of model "makes sense" for my situation: am I going in completely the wrong direction here?

  2. How can I measure the "goodness of fit" of my curve to the data points? I have read Wikipedia, but it is too vague for me to understand. An explicit formula (one computable on a hand calculator) would be helpful.

I have done my research, but absolutely nothing I have found on the internet has been accessible to me, since I know virtually nothing about stats. An introductory/easy explanation would be so nice… My math background is far stronger than my stats background, so use all the math you want, but assume I'm a beginner at stats. I want it to be somewhat rigorous too: if I could plug a number into a t-test, for example, to get an "objective" value, that would be good.

Best Answer

Standard univariate logistic regression of $y$ on $x$ finds the coefficients $\alpha$, $\beta$ that best fit your training data $\{(x_i, y_i), i \in [1, N]\}$ in the following equation:

(model 1): $y_i = \left(1 + \exp(-(\alpha + \beta x_i))\right)^{-1}$

Note that the fit will be bad if the $y$ in your data are not in $(0,1)$, so you'll have to transform your data if you want to use logistic regression. One option might be to transform $y$ into a proportion (number of occurrences of the "phenomenon" divided by the population of the corresponding area?).
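To make this concrete, here is a minimal sketch in Python (NumPy/SciPy); the numbers are made up purely for illustration, and I fit the curve by least squares with `curve_fit` rather than the maximum likelihood a stats package would use:

```python
import numpy as np
from scipy.optimize import curve_fit

# Made-up data: x is population (say, in tens of thousands) and
# y is the proportion occurrences / population, so y lies in (0, 1)
x = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
y = np.array([0.19, 0.26, 0.39, 0.49, 0.61, 0.74, 0.81, 0.89, 0.93, 0.95])

def model1(x, alpha, beta):
    # model 1: y_i = (1 + exp(-(alpha + beta * x_i)))^(-1)
    return 1.0 / (1.0 + np.exp(-(alpha + beta * x)))

# Least-squares fit of the curve (a simplification: classical logistic
# regression estimates alpha and beta by maximum likelihood instead)
(alpha_hat, beta_hat), _ = curve_fit(model1, x, y, p0=[0.0, 0.1])
print(alpha_hat, beta_hat)   # roughly -2 and 0.5 for these numbers
```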

Also, the fact that "one of the inputs for which I need a predicted output is far larger than the inputs which I used to make a regression" is a problem, because you will be extrapolating the results of the model to unknown regions of the data. A value of $x_i$ much higher than the ones in the training sample will probably give you a $\hat y_i$ very close to 1.
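For instance, with hypothetical estimates $\hat \alpha = -2$ and $\hat \beta = 0.5$, an input of $x = 30$ already gives $\hat y = (1 + e^{-13})^{-1} \approx 0.999998$: essentially the upper bound, no matter how much larger $x$ gets.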

Directly assessing prediction error

Once the model is fitted and you have your estimated parameters $\hat \alpha$ and $\hat \beta$, you get a predicted output value $\hat y_i$ for each observation $x_i$: $\hat y_i = \left(1 + \exp(-(\hat \alpha + \hat \beta x_i))\right)^{-1}$. You can easily assess goodness of fit on your graphic calculator using the observed $y_i$ and their corresponding predicted values $\hat y_i$:

  • either by plotting one against the other (if the fit were perfect, this would give you a straight line, the identity line, because then $y=\hat y$)

  • or by computing an error measure, for instance the root mean squared error: $\mathrm{rmse} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat y_i)^2}$. This gives the average distance between the observed outcomes $y_i$ and the model-predicted outcomes $\hat y_i$ (the lower the $\mathrm{rmse}$, the better the fit). It is not a standardized score like $R^2$, but it is easy to compute and interpret; a short numeric sketch follows this list.
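A numeric sketch of the second bullet, again with made-up observed and predicted values:

```python
import numpy as np

# Made-up observed proportions and the model's predicted values
y_obs  = np.array([0.19, 0.26, 0.39, 0.49, 0.61, 0.74])
y_pred = np.array([0.18, 0.27, 0.38, 0.50, 0.62, 0.73])

# Root mean squared error: the lower, the better the fit
rmse = np.sqrt(np.mean((y_obs - y_pred) ** 2))
print(rmse)   # 0.01 here, i.e. predictions are off by about 0.01 on average
```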

Now, to assess the predictive power of the model, it is best to compare $y_i$ and $\hat y_i$ on a validation dataset, i.e. data that were not used in the fit (e.g. by withholding a portion of the data during training; see cross-validation for more info).
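A minimal hold-out validation, reusing the made-up data and `model1` from the fitting sketch above:

```python
import numpy as np
from scipy.optimize import curve_fit

# Same made-up data and model as in the fitting sketch above
x = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
y = np.array([0.19, 0.26, 0.39, 0.49, 0.61, 0.74, 0.81, 0.89, 0.93, 0.95])

def model1(x, alpha, beta):
    return 1.0 / (1.0 + np.exp(-(alpha + beta * x)))

rng = np.random.default_rng(0)     # seeded so the split is reproducible
idx = rng.permutation(len(x))
train, valid = idx[:8], idx[8:]    # withhold 2 of the 10 points for validation

# Fit on the training part only, then measure rmse on the held-out part
(a_hat, b_hat), _ = curve_fit(model1, x[train], y[train], p0=[0.0, 0.1])
rmse_valid = np.sqrt(np.mean((y[valid] - model1(x[valid], a_hat, b_hat)) ** 2))
print(rmse_valid)
```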

Pseudo-R²

The usual $R^2$ of linear regression does not apply to logistic regression, for which several alternative measures exist. In all variants, $R^2$ is a real value between 0 and 1, and the closer to 1 the better the model.

One of them uses the likelihood ratio, and is defined as follows:

$R^2_L = 1 - \frac{L_1}{L_0}$, where $L_1$ and $L_0$ are the log-likelihoods of (respectively) model 1 (see above) and the following model 0, an intercept-only logistic regression that does not depend on $x$:

(model 0): $y_i = \left(1 + \exp(-\alpha)\right)^{-1}$

For any logistic regression model with $y \in \{0,1\}$ the log-likelihood is computed from the observed $y$ and the predicted $\hat y$, using the following formula (but I'm not sure it applies for continuous $y \in [0,1]$):

$L = \sum_{i=1}^{N} \left( y_i \ln(\hat y_i) + (1 - y_i) \ln(1 - \hat y_i) \right)$
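As a sketch with made-up $0/1$ outcomes: `y_hat_1` below stands for hypothetical predictions from model 1, and the intercept-only model 0 predicts the sample mean of $y$ for every observation:

```python
import numpy as np

def log_likelihood(y, y_hat):
    # L = sum of y*ln(y_hat) + (1 - y)*ln(1 - y_hat) over all observations
    return np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y       = np.array([0, 0, 1, 1, 1])                # made-up 0/1 outcomes
y_hat_1 = np.array([0.1, 0.3, 0.6, 0.8, 0.9])      # hypothetical model-1 fit
y_hat_0 = np.full(len(y), y.mean())                # model 0: constant = mean(y)

L1 = log_likelihood(y, y_hat_1)
L0 = log_likelihood(y, y_hat_0)
print(1 - L1 / L0)   # R^2_L, about 0.61 for these numbers
```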

Another pseudo-R² is based on the linear correlation of $y$ and $\hat y$, which is easily computed on any graphic calculator with stat functions:

$R^2_{\mathrm{cor}} = \left( \widehat{\mathrm{cor}}(y, \hat y) \right)^2$
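On a calculator this is just the square of the usual correlation coefficient between the observed and predicted columns; in Python, with the same made-up values as in the rmse sketch:

```python
import numpy as np

y_obs  = np.array([0.19, 0.26, 0.39, 0.49, 0.61, 0.74])
y_pred = np.array([0.18, 0.27, 0.38, 0.50, 0.62, 0.73])

# Squared Pearson correlation between observed and predicted values
R2_cor = np.corrcoef(y_obs, y_pred)[0, 1] ** 2
print(R2_cor)   # close to 1, since these predictions track the observations
```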