Logistic Regression – Applying Logistic Regression for Bounds Different from 0 and 1

logistic

I have some data; it's a proportion $y$ of some stuff relative to everything, so it's bounded between 0 and 1 by definition. The proportion changes over time. Besides fairly high variance there is a step-like change about the middle of the time period; the step isn't very large, but it's there, and it happens pretty fast relative to the whole time period. So I have an S-curve, and I want to (and was told to) do logistic regression.

But: (1) this is not monotone, not even close, because of the high variance. (2) it does not go from 0 to 1. Rather, it goes from about 0.2 to about 0.8, if you look at the means. So it would seem that the right thing to do will be to fit something like $a\phi(x)+b$, where $\phi$ is the usual logistic S-curve, so we have 4 parameters altogether.

What bothers me is that I've never seen an example of logistic regression used like that (I'll admit to not seeing too many). It's not that I'm not sure how to implement this — I'm pretty confident I can figure that out, although specific pointers will be appreciated — but I am afraid that the assumption that the data is bounded by the limits of the S-curve is important for all the estimates afterward, like goodness of fit, significance of the step, confidence intervals, etc.

So:

Are my fears justified?
Can you point me to a relevant example in the literature? Not necessarily statistical literature, if some biologists (or whoever) use it, that's fine. Or even just someone mentioning this possibility.
If there are indeed problems with this approach, what are the alternatives?

Update: Well, I was so ignorant that I could not even ask the questions properly. I was looking for multinomial (aka polytomous) regression.

Best Answer

To begin with, I think we have to distinguish between logistic regression and (generalized) logistic function. Though the latter may be viewed as a separate case of a the former taking the time as the only explanatory variable. It is then straightforward to see that fitted process will go by $S$ shaped path that goes to its upper (or, probably, lower) limit when $t \rightarrow \infty$. Therefore moving along the $S$ curve is the influence of over covariates (not time) that are changing up and down with time (in consumption structures these are income, tastes, prices, etc.). So there could be jumps or whatever, because nobody restricts the linear regression to go only up or only down to the $S$ curve's bounds.

Since you are working with structures $0$ and $1$ are natural limits. You can never be sure that any other bounds won't be hit higher or lower in the future, when your conclusions are based only on historical data analysis, and arguing that the process never did so is not an appropriate reasoning. Therefore your fears are not justified, logistic regression (but not fitting the logistic curve! that comes as a solution of deterministic differential equation!) will work just fine here. Pay attention to the fact that there could be several categories that some up to 1, so you need multinomial logit model to fit the structure in this case.

Among the alternatives there could be any model that could be applied to discrete choice. Commonly used candidates are probit and logit models. Even if you think that there is no decision in my model actually all structures in the world are the result of decision processes solved either by humans, nature or the aliens ^_^.

Related Solutions

Logistic Regression – Logistic Regression and Inflection Point

As touched upon by @scortchi the reviewer was operating under the false impression that it is not possible to model nonlinear effects of predictors on the logit scale in the context of logistic regression. The original model was quick to assume linearity of all predictors. By relaxing the linearity assumption, using for example restricted cubic splines (natural splines), the entire shape of the curve is flexible and inflection point is no longer an issue. Had there been a single predictor and had it been expanded using a regression spline, one could say that the logistic model makes only the assumptions of smoothness and independence of observations.

R Logistic Regression – Connecting Cumulative Distribution Function (CDF) with Logistic Regression

There are several issues here. The biggest is that the thinking behind the code for generating the blue points in the second plot is confused. I can't make any sense of what the logic is supposed to be. The $x$ values are the same as your data, but the $y$ values are the number of wins up to that point on the $x$ axis divided by the total number of wins. This really makes no sense. You aren't even dividing by the total number of games, which means you don't have the cumulative probability at that point (although that still wouldn't make sense). This also makes you wonder why the last blue point isn't at $1.0$. Actually it is, but the last point is at $x=55$, which is excluded by your argument xlim=c(-10,50).

At any rate, if you want to get a non-parametric estimate of what the underlying function might be, you could plot a lowess line:

plot(ravenWinNum~ravenScore, data=ravensData)
lines(with(ravensData, lowess(ravenWinNum~ravenScore)))

enter image description here

What is clear from this figure, is that there is a curvilinear relationship between the score and the probability of winning (at least in these data, we can debate whether that makes any theoretical sense). If you thought about it, it was clear from the plot of the raw data in that there are increasing numbers of wins at both the low end and high end of $x$; the losses are in the center. A very simple way to fit a curvilinear relationship is to add a squared term:

lr.Ravens2 <- glm(ravenWinNum~ravenScore+I(ravenScore^2), data=ravensData, 
                  family=binomial)

We can test these models as a whole by assessing their deviance against the null deviance and comparing the result to a chi-squared distribution with the number of degrees of freedom consumed. We do this because we want to think of both the linear term and the squared term together as a unit. This method of testing the model as a whole is analogous to the global $F$-test that comes with a multiple regression model.

# here is the test of the model with the squared term as a whole
1-pchisq(24.435-16.875, df=2)
[1] 0.02282269
# here is the test of your model
1-pchisq(24.435-20.895, df=1)
[1] 0.05990546
# this is the test of the improvement due to adding the squared term
1-pchisq(20.895-16.875, df=1)
[1] 0.04496371

I wouldn't make too much of the arbitrary threshold of significance at $.05$, but we see that the initial model is not significant by conventional standards, or at least is slightly less significant. (Given that you have so few data, I would think a higher $\alpha$ is fine.) The model that includes the squared term as well is more significant, even though it uses up another of your precious few degrees of freedom. The AIC is lower as well: the first model has $AIC=24.895$, whereas the second model has $AIC=22.875$. The last of the tests listed shows that the addition of the squared term yields a significant improvement in fit.

The implication of the fact that there is a curvilinear relationship between $x$ and $y$ is that the function that maps the conditional probability that $y_i=1$ given $x_i$ cannot be equated to a cumulative probability distribution that takes $x$ as its quantiles.

For completeness, here is the same plot with the model's predicted probabilities overlaid:

enter image description here

We see that the predicted probabilities mirror the lowess line reasonably well. Of course, they don't match perfectly, but there is no reason to believe that the lowess line is exactly correct; it just provides a non-parametric approximation of whatever the underlying function is.

Having addressed the specific example in your question, it is worth addressing the assumption underlying the question more broadly. What is the relationship between logistic regression and a cumulative distribution function (CDF)?

There can be a connection between binomial regression and a CDF. For example, when the probit link is used, the conditional probability of 'success' ($\pi_i(Y_i=1|X=x_i)$) is transformed via the inverse of the normal CDF. (For more about that, it may help to read my answer here: difference between logit and probit models.) But there are a couple of points to be made here. First, as @FrankHarrell and @Glen_b note, binomial (e.g., logistic) regression is about modeling conditional probabilities at each point.

Second, although the fitted function can look sort of like a CDF when the conditions are just right, it will never actually be a function (scaled, shifted, etc.) of the CDF of your $X$ data. Your own data serve as an example of the function not even looking like a CDF because you have a polynomial relationship. But it is true even when the function does look like a CDF. The easiest way to understand this is that your last (highest $x$ valued) point will always be 100% of the way through your data, but the fitted function will never reach $\hat p_i=1.0$, it can only approach that value as $X$ (the variable, not your data) approaches infinity. As a result, you should not think of a function of the CDF of your $x$ data as the values the logistic regression is supposed to approximate.

Best Answer

Related Solutions

Logistic Regression – Logistic Regression and Inflection Point

R Logistic Regression – Connecting Cumulative Distribution Function (CDF) with Logistic Regression

Related Question