A bit of context on logistic regression. When dealing with logistic regression, you want to compute the probability $p$ of a binomial variable ("YES" / "NO", "heart attack" / "no heart attack", etc.) by stating that this probability depends on a certain number of variables, say
$$x_1,\dots,x_p$$
through
$$\log\left(\frac{p}{1-p}\right):=\beta_0+\beta_1 x_1+\dots+\beta_p x_p.$$
This is a choice (we could have used other models, like the probit) motivated by the fact that $p=p(\beta,x)$ is a probability (i.e. is a real number between 0 and 1) and the logit transformation
$$\operatorname{logit}(p):=\log\left(\frac{p}{1-p}\right)$$
is invertible, with inverse
$$p(\beta,x)=\frac{1}{1+\exp(-\beta_0-\beta_1 x_1-\dots-\beta_p x_p)}.$$
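As a quick numerical sanity check (a sketch of my own, not part of any library), one can verify that the logit and the expression above really are mutual inverses:

```python
import numpy as np

def logit(p):
    """logit(p) = log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

def inv_logit(t):
    """Inverse of logit: here t plays the role of b0 + b1*x1 + ... + bp*xp."""
    return 1.0 / (1.0 + np.exp(-t))

p = np.array([0.1, 0.5, 0.9])
print(inv_logit(logit(p)))  # recovers [0.1, 0.5, 0.9] up to rounding
```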
Given a set of $n$ measurements $(y_{1},x_{1,1},\dots,x_{1,p})$, $\dots$,
$(y_{n},x_{n,1},\dots,x_{n,p})$, with each $y_{i}$ either equal to $0$ (no event) or $1$ (event), we want to estimate the parameters $(\beta_0,\beta_1,\dots,\beta_p)$, i.e. the missing piece in our regression scheme.
This task is usually performed by maximum likelihood, which leads to a numerical procedure, the Newton–Raphson algorithm, for the parameter estimation.
The iterative algorithm provides an answer in most cases; it can diverge for different reasons, however. One particularly interesting cause is called "complete separation" in one or more variables. You will surely meet this topic in applications.
Try to perform a logistic regression on the following simple data vectors (in this order! Please do not shuffle the components...)
$$y=(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1)$$
$$x=(0,0,0,0,0.1,0.2,0.3,0.4,0.5,0.6,0,0,0,0.9,1)$$
...what do you get, as an answer?
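To actually try the exercise, here is a minimal Newton–Raphson sketch for one predictor (my own implementation with made-up tolerances, not a library routine); run it on the vectors above and inspect the output yourself.

```python
import numpy as np

def fit_logistic(x, y, n_iter=100, tol=1e-10):
    """Fit logit p = beta0 + beta1 * x by Newton-Raphson."""
    X = np.column_stack([np.ones_like(x), x])        # design matrix [1, x]
    beta = np.zeros(2)                               # start at (0, 0)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # fitted probabilities
        grad = X.T @ (y - p)                         # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # information matrix
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:               # converged
            break
    return beta

y = np.array([0,0,0,0,0,0,0,0,0,0,1,1,1,1,1], dtype=float)
x = np.array([0,0,0,0,0.1,0.2,0.3,0.4,0.5,0.6,0,0,0,0.9,1], dtype=float)
print(fit_logistic(x, y))
```

Note that $x=0$ occurs in both classes here, so the maximum likelihood estimate exists; compare with what happens when the classes are fully separated.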
Logistic regression can also be motivated theoretically by the principle of maximum entropy: if we model the binomial variable "YES" / "NO" ("heart attack" / "no heart attack") in the presence of certain constraints, it is possible to show that the probability distribution for that variable which maximizes the (Shannon) entropy is the logistic distribution. In this sense logistic regression, which is so common in applications, plays a special role.
The given data are represented graphically in the next figure:
First, we will look at how the logistic function fits the given data: $$(x_1,y_1),(x_2,y_2),...,(x_i,y_i),...,(x_n,y_n)$$
$$y_i\simeq\frac{1}{1+e^{k(x_i-x_{\text{m}})}} \tag 1$$
$$x_i\simeq x_{\text{m}}+\frac{1}{k}\ln\left(\frac{1}{y_i}-1\right)$$
We compute $z_1, z_2, \dots, z_i, \dots, z_n$ with:
$$z_i=\ln\left(\frac{1}{y_i}-1\right)$$
Then, the points $(x_i,z_i)$ are plotted in the next figure (in BLUE and RED respectively for the two given examples).
If function $(1)$ were a perfect fit, the points would lie on a straight line. We observe a non-negligible deviation for small and large values of $x$.
This leads us to add a corrective term, which has to be an odd function (antisymmetric about its center). The simplest one has the form $\alpha(x-c)^3$, where $\alpha$ is a small coefficient to be determined. The value of $c$ is read directly from the given data as the point where $y(c)=0.5$. Don't confuse $c$ with $x_{\text{m}}$ above, even if they are of the same order of magnitude.
For the first data set: $c=-7.819$; for the second: $c=7.048$ (no adjustment is necessary).
We see on the figure that with this kind of corrective term, the points (plotted in BLACK) can become nicely aligned.
In fact, the proposed function is :
$$y(x)\simeq\frac{1}{1+e^{k(x-x_{\text{m}})+\alpha(x-c)^3}} \tag 2$$
where there are three parameters to adjust : $k$ , $x_{\text{m}}$ and $\alpha$.
What is more, computing those three parameters is very easy: it amounts to a simple linear regression (no iterative calculation, no initial guess needed).
Consider the data :
$$(x_1,z_1),(x_2,z_2),...,(x_i,z_i),...,(x_n,z_n)$$
and the linear relationship (with the value of $c$ known from above):
$$z=kx+\beta+\alpha (x-c)^3$$
where $\beta=-kx_{\text{m}} \quad\to\quad x_{\text{m}}=-\frac{\beta}{k} $
A standard linear regression for $k$, $\beta$, $\alpha$ yields the desired parameters of equation (2).
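The whole linearized procedure can be sketched as follows. The data here are synthetic (generated from model (2) itself, with made-up values of $k$, $x_{\text{m}}$, $\alpha$ and $c$), just to check that the plain least-squares step recovers the parameters:

```python
import numpy as np

# Hypothetical "true" parameters, used only to generate test data:
k_true, xm_true, alpha_true, c = 1.2, 5.0, 0.004, 5.0
x = np.linspace(1.0, 9.0, 30)
y = 1.0 / (1.0 + np.exp(k_true * (x - xm_true) + alpha_true * (x - c) ** 3))

# Linearizing transform:  z = ln(1/y - 1) = k*x + beta + alpha*(x - c)^3
z = np.log(1.0 / y - 1.0)
A = np.column_stack([x, np.ones_like(x), (x - c) ** 3])
k, beta, alpha = np.linalg.lstsq(A, z, rcond=None)[0]
x_m = -beta / k                       # recover x_m from beta = -k * x_m
print(k, x_m, alpha)                  # recovers 1.2, 5.0, 0.004
```

Since there is no noise in this synthetic data, the fit is exact up to floating-point rounding; with real measurements one would of course see residual scatter.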
The result is shown in the next figure:
Of course, there is no need to keep all the digits given by the computer. Three or four significant digits are largely sufficient.
Note: The value of $c$ is not critical. It comes from the $50^{\text{th}}$ point in the given data, but one can take any other point nearby. For example, for the first data set, instead of $-7.819$ one can take $-7$ or $-8$ without a significant change in the final fitting.
Note: In the regression computation, the points $(x_0,y_0=0)$ and $(x_{100},y_{100}=1)$ are excluded, since $z=\ln(1/y-1)$ is undefined for $y$ exactly $0$ or $1$ at a finite value of $x$.
Best Answer
$\newcommand{\logit}{\operatorname{logit}}$
The popular confusion between probability and odds seems to be in play here. A probability $p$ is always in the interval $[0,1]$. The odds in favor of an event or a statement is the number $p/(1-p)$, where $p$ is the probability. The odds is in the interval $[0,\infty]$ (closed brackets at both ends), and is more than $1$ when the probability is more than $1/2$. What is commonly called "$3$-to-$1$ odds" would mean $p/(1-p)=3$, so that $p=3/4$. The probability is $3/4$; the odds is $3$.
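The conversion in both directions is one line each way; this tiny sketch just restates the "3-to-1 odds" example above in code:

```python
def prob_to_odds(p):
    """Odds in favor: p / (1 - p)."""
    return p / (1.0 - p)

def odds_to_prob(odds):
    """Invert: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

print(prob_to_odds(0.75))  # 3.0  -> "3-to-1 odds"
print(odds_to_prob(3.0))   # 0.75
```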
In logistic regression one has a real-valued predictor variable $x$ observed in $n$ cases, thus a vector $(x_1,\ldots,x_n)$, and a $\{0,1\}$-valued response variable $y$ observed in the same $n$ cases, $(y_1,\ldots,y_n)$. One estimates a function $$ \logit p(x) = ax+b, $$ where $\logit p = \log\dfrac{p}{1-p}$, so that $p$ must be between $0$ and $1$. The function $p(x)$ is supposed to be an estimated probability that $y=1$ given the value of $x$. The values of $a$ and $b$ determine the function $p$, and they are estimated based on the observed values $(x_1,\ldots,x_n)$ and $(y_1,\ldots,y_n)$, using maximum likelihood. The likelihood function is $$ L(a,b) = \left(\prod_{i\ :\ y_i=1} p(x_i)\right)\left(\prod_{i\ :\ y_i=0} \bigl(1-p(x_i)\bigr)\right). $$ The values $\hat a$ and $\hat b$ of $a$ and $b$ that maximize this are the estimates. They are found by iterative numerical methods.
If all $y$ values corresponding to $x<\text{cutoff}$ are $0$ and all $y$ values corresponding to $x>\text{cutoff}$ are $1$, then one does get $p(x)=1\text{ or }0$ according as $x$ is larger or smaller than the cutoff, and that means the fitted $a$ diverges to $+\infty$: the likelihood has no finite maximizer (this is the "complete separation" mentioned in the question). For things like age and baldness (if baldness can really be considered binary), one would get a finite number for $a$, which would be positive if baldness is more frequent among older people in the observed dataset; $a$ would be $0$ if baldness is uncorrelated with age, and negative if baldness is more frequent among younger people.
But, except in the trivial case where $a=0$, so $p$ is constant, the function $p$ will always satisfy either $$ p(x)\to1\text{ as }x\to+\infty\text{ and }p(x)\to0\text{ as }x\to-\infty $$ or
$$ p(x)\to1\text{ as }x\to-\infty\text{ and }p(x)\to0\text{ as }x\to+\infty. $$