Solved – How to evaluate fit of a logistic regression

goodness-of-fit, logistic, predictive-models, regression

I have a set of data points, which exhibit a solid linear correlation $r\approx 0.9$. I am basically plotting population in certain areas against the number of occurrences of a certain phenomenon (so in other words, I think the number of people should predict the number of occurrences of this phenomenon).

I don't like the linear regression, though, because a linear model can take on negative or arbitrarily large values. In my case the values need to be positive, and there is an upper bound (which I don't know, but logically the values can only get so large). Unfortunately, one of the inputs for which I need a predicted output is far larger than the inputs I used to fit the regression.

This made me choose a logistic model instead, $$y=\frac{A}{1+Be^{-Cx}}$$ because I am familiar with its shape (always positive, and approaching a limiting value) from basic differential equations, and it seems to be what I want for this situation. Also the data has a slight "point of inflection", so the curvature of the logistic regression (visually to me) is even a little better fit for the data points than the line was.

So finally my question(s):

  1. I keep reading on the internet that logistic models are meant for probabilities, which take values between $0$ and $1$. I would like, in addition to having a high $r$, to try and reason why this type of model "makes sense" for my situation: am I going in completely the wrong direction here?

  2. How can I measure the "goodness of fit" of my curve to the data points? I have read Wikipedia, but it is too vague for me to understand. An explicit formula (one computable on a hand calculator) would be helpful.

I have done my research, but absolutely nothing I have found on the internet has been accessible to me, since I know virtually nothing about stats. An introductory/easy explanation would be so nice… My math background is far stronger than my stats background, so use all the math you want, but assume I'm a beginner at stats. I want it to be somewhat rigorous too: if I could plug a number into a t-test, for example, to get an "objective" value, that would be good.

Best Answer

Standard univariate logistic regression of $y$ on $x$ finds the coefficients $\alpha$, $\beta$ that best fit your training data $\{(x_i, y_i), i \in [1, N]\}$ in the following equation:

(model 1): $y_i = \left(1 + \exp(-(\alpha + \beta x_i))\right)^{-1}$

Note that the fit will be bad if the $y$ in your data are not in $(0,1)$, so you'll have to transform your data if you want to use logistic regression. One option might be to transform $y$ into a proportion (number of occurrences of the "phenomenon" divided by the population of the corresponding area?).
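To make this concrete, here is a minimal sketch in Python (NumPy/SciPy); the numbers are made up purely for illustration, and I fit the curve by least squares with `curve_fit` rather than the maximum likelihood a stats package would use:

```python
import numpy as np
from scipy.optimize import curve_fit

# Made-up data: x is population (say, in tens of thousands) and
# y is the proportion occurrences / population, so y lies in (0, 1)
x = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
y = np.array([0.19, 0.26, 0.39, 0.49, 0.61, 0.74, 0.81, 0.89, 0.93, 0.95])

def model1(x, alpha, beta):
    # model 1: y_i = (1 + exp(-(alpha + beta * x_i)))^(-1)
    return 1.0 / (1.0 + np.exp(-(alpha + beta * x)))

# Least-squares fit of the curve (a simplification: classical logistic
# regression estimates alpha and beta by maximum likelihood instead)
(alpha_hat, beta_hat), _ = curve_fit(model1, x, y, p0=[0.0, 0.1])
print(alpha_hat, beta_hat)   # roughly -2 and 0.5 for these numbers
```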

Also, the fact that "one of the inputs for which I need a predicted output is far larger than the inputs which I used to make a regression" is a problem, because you will be extrapolating the results of the model to unknown regions of the data. A value of $x_i$ much higher than the ones in the training sample will probably give you a $\hat y_i$ very close to 1.
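For instance, with hypothetical estimates $\hat \alpha = -2$ and $\hat \beta = 0.5$, an input of $x = 30$ already gives $\hat y = (1 + e^{-13})^{-1} \approx 0.999998$: essentially the upper bound, no matter how much larger $x$ gets.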

Directly assessing prediction error

Once the model is fitted and you have your estimated parameters $\hat \alpha$ and $\hat \beta$, you get a predicted output value $\hat y_i$ for each observation $x_i$: $\hat y_i = \left(1 + \exp(-(\hat \alpha + \hat \beta x_i))\right)^{-1}$. You can easily assess goodness of fit on your graphic calculator using the observed $y_i$ and their corresponding predicted values $\hat y_i$:

  • either by plotting one against the other (if the fit were perfect, this would give you a straight line, the identity line, because then $y=\hat y$)

  • or by computing an error measure, for instance the root mean squared error: $\mathrm{rmse} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat y_i)^2}$. This gives the average distance between the observed outcomes $y_i$ and the model-predicted outcomes $\hat y_i$ (the lower the $\mathrm{rmse}$, the better the fit). It is not a standardized score like $R^2$, but it is easy to compute and interpret; a short numeric sketch follows this list.
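A numeric sketch of the second bullet, again with made-up observed and predicted values:

```python
import numpy as np

# Made-up observed proportions and the model's predicted values
y_obs  = np.array([0.19, 0.26, 0.39, 0.49, 0.61, 0.74])
y_pred = np.array([0.18, 0.27, 0.38, 0.50, 0.62, 0.73])

# Root mean squared error: the lower, the better the fit
rmse = np.sqrt(np.mean((y_obs - y_pred) ** 2))
print(rmse)   # 0.01 here, i.e. predictions are off by about 0.01 on average
```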

Now, to assess the predictive power of the model, it is best to compare $y_i$ and $\hat y_i$ on a validation dataset, i.e. data that were not used in the fit (e.g. by withholding a portion of the data during training; see cross-validation for more info).
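A minimal hold-out validation, reusing the made-up data and `model1` from the fitting sketch above:

```python
import numpy as np
from scipy.optimize import curve_fit

# Same made-up data and model as in the fitting sketch above
x = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
y = np.array([0.19, 0.26, 0.39, 0.49, 0.61, 0.74, 0.81, 0.89, 0.93, 0.95])

def model1(x, alpha, beta):
    return 1.0 / (1.0 + np.exp(-(alpha + beta * x)))

rng = np.random.default_rng(0)     # seeded so the split is reproducible
idx = rng.permutation(len(x))
train, valid = idx[:8], idx[8:]    # withhold 2 of the 10 points for validation

# Fit on the training part only, then measure rmse on the held-out part
(a_hat, b_hat), _ = curve_fit(model1, x[train], y[train], p0=[0.0, 0.1])
rmse_valid = np.sqrt(np.mean((y[valid] - model1(x[valid], a_hat, b_hat)) ** 2))
print(rmse_valid)
```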

Pseudo-R²

The usual $R^2$ of linear regression does not apply to logistic regression, for which several alternative measures exist. In all variants, $R^2$ is a real value between 0 and 1, and the closer to 1 the better the model.

One of them uses the likelihood ratio, and is defined as follows:

$R^2_L = 1 - \frac{L_1}{L_0}$, where $L_1$ and $L_0$ are the log-likelihoods of (respectively) model 1 (see above) and the following model 0, an intercept-only logistic regression that does not depend on $x$:

(model 0): $y_i = \left(1 + \exp(-\alpha)\right)^{-1}$

For any logistic regression model with $y \in \{0,1\}$ the log-likelihood is computed from the observed $y$ and the predicted $\hat y$, using the following formula (but I'm not sure it applies for continuous $y \in [0,1]$):

$L = \sum_{i=1}^{N} \left( y_i \ln(\hat y_i) + (1 - y_i) \ln(1 - \hat y_i) \right)$
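As a sketch with made-up $0/1$ outcomes: `y_hat_1` below stands for hypothetical predictions from model 1, and the intercept-only model 0 predicts the sample mean of $y$ for every observation:

```python
import numpy as np

def log_likelihood(y, y_hat):
    # L = sum of y*ln(y_hat) + (1 - y)*ln(1 - y_hat) over all observations
    return np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y       = np.array([0, 0, 1, 1, 1])                # made-up 0/1 outcomes
y_hat_1 = np.array([0.1, 0.3, 0.6, 0.8, 0.9])      # hypothetical model-1 fit
y_hat_0 = np.full(len(y), y.mean())                # model 0: constant = mean(y)

L1 = log_likelihood(y, y_hat_1)
L0 = log_likelihood(y, y_hat_0)
print(1 - L1 / L0)   # R^2_L, about 0.61 for these numbers
```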

Another pseudo-R² is based on the linear correlation of $y$ and $\hat y$, which is easily computed on any graphic calculator with stat functions:

$R^2_{\mathrm{cor}} = \left( \widehat{\mathrm{cor}}(y, \hat y) \right)^2$
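On a calculator this is just the square of the usual correlation coefficient between the observed and predicted columns; in Python, with the same made-up values as in the rmse sketch:

```python
import numpy as np

y_obs  = np.array([0.19, 0.26, 0.39, 0.49, 0.61, 0.74])
y_pred = np.array([0.18, 0.27, 0.38, 0.50, 0.62, 0.73])

# Squared Pearson correlation between observed and predicted values
R2_cor = np.corrcoef(y_obs, y_pred)[0, 1] ** 2
print(R2_cor)   # close to 1, since these predictions track the observations
```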