[Math] Question about logistic regression

regression

A logistic regression is meant for a binary/categorical variable. Sort of like age vs baldness.

1) So, does the "S-curve" regression equation output give the odds of having that condition for a given x-value (e.g., age), since the values go from 0 to 1 on the Y-axis?

(Thinking to myself...) If the data models this behavior strongly (age vs. ability to vote), then there will be a very sharp cutoff at 18, and I guess it will be pretty accurate, yielding 0% and 100% for almost all ages. With more ambiguity, I guess the curve will not go from 0 to 1 on the y-axis, but more like 25% to 75%, for example?

2) Anyone have a good example of a binary data set that exhibits this binary nature?

3) How does one do Logistic Regression in Excel 2010?
I only see Layout -> Trendline -> Logarithmic

Best Answer

$\newcommand{\logit}{\operatorname{logit}}$

The popular confusion between probability and odds seems to be in play here. A probability $p$ is always in the interval $[0,1]$. The odds in favor of an event or a statement is the number $p/(1-p)$, where $p$ is the probability. The odds is in the interval $[0,\infty]$ (closed brackets at both ends), and is more than $1$ when the probability is more than $1/2$. What is commonly called "$3$-to-$1$ odds" would mean $p/(1-p)=3$, so that $p=3/4$. The probability is $3/4$; the odds is $3$.
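As a small illustration of the distinction, the conversion between probability and odds can be sketched in Python. The function names here are made up for this example; only the formula $p/(1-p)$ and the "$3$-to-$1$" case come from the text above.

```python
def odds_from_probability(p):
    """Odds in favor of an event with probability p, i.e. p / (1 - p).

    p = 1 corresponds to infinite odds, so this sketch only accepts 0 <= p < 1.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


def probability_from_odds(odds):
    """Inverse conversion: odds / (1 + odds), for finite odds >= 0."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


# "3-to-1 odds" corresponds to probability 3/4, and vice versa.
print(odds_from_probability(0.75))  # 3.0
print(probability_from_odds(3.0))   # 0.75
```

The S-curve of a logistic regression plots the probability, not the odds; the odds would be obtained by passing each value on the curve through `odds_from_probability`.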

In logistic regression one has a real-valued predictor variable $x$ observed in $n$ cases, thus a vector $(x_1,\ldots,x_n)$, and a $\{0,1\}$-valued response variable $y$ observed in the same $n$ cases, $(y_1,\ldots,y_n)$. One estimates a function $$ \logit p(x) = ax+b, $$ where $\logit p = \log\dfrac{p}{1-p}$, so that $p$ must be between $0$ and $1$. The function $p(x)$ is supposed to be an estimated probability that $y=1$ given the value of $x$. The values of $a$ and $b$ determine the function $p$, and they are estimated based on the observed values $(x_1,\ldots,x_n)$ and $(y_1,\ldots,y_n)$, using maximum likelihood. The likelihood function is $$ L(a,b) = \left(\prod_{i\ :\ y_i=1} p(x_i)\right)\left(\prod_{i\ :\ y_i=0} (1-p(x_i))\right). $$ The values $\hat a$ and $\hat b$ of $a$ and $b$ that maximize this are the estimates. They are found by iterative numerical methods.
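One possible iterative method is Newton's method applied to $\log L(a,b)$. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the age/baldness data are invented for the example, and real software may use a different algorithm or stopping rule.

```python
import math


def sigmoid(t):
    """Inverse of logit: returns p with logit(p) = t, written to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(xs, ys, iterations=25):
    """Maximize log L(a, b) by Newton's method, starting from a = b = 0."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        ps = [sigmoid(a * x + b) for x in xs]
        # Gradient of log L with respect to a and b.
        g_a = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        g_b = sum(y - p for y, p in zip(ys, ps))
        # Entries of the negated Hessian, with weights p(1 - p).
        ws = [p * (1.0 - p) for p in ps]
        h_aa = sum(w * x * x for w, x in zip(ws, xs))
        h_ab = sum(w * x for w, x in zip(ws, xs))
        h_bb = sum(ws)
        det = h_aa * h_bb - h_ab * h_ab
        if det < 1e-12:
            break  # nearly singular, e.g. when the data are perfectly separated
        # Newton step: solve the 2x2 linear system explicitly.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


# Invented data: x = age, y = 1 if bald. The two groups overlap,
# so the maximum likelihood estimates are finite.
xs = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a_hat, b_hat = fit_logistic(xs, ys)
print("a_hat =", a_hat, "b_hat =", b_hat)
print("estimated p(50) =", sigmoid(a_hat * 50 + b_hat))
```

At the maximum the gradient vanishes, so $\sum_i (y_i - p(x_i)) = 0$ and $\sum_i (y_i - p(x_i))x_i = 0$; checking these two sums is a quick way to confirm that the iteration has converged.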

If all $y$ values corresponding to $x<\text{cutoff}$ are $0$ and all $y$ values corresponding to $x>\text{cutoff}$ are $1$, then one does get $p(x)=1\text{ or }0$ according as $x$ is larger or smaller than the cutoff, and that means $a=+\infty$. For things like age and baldness (if baldness can really be considered binary) one would get a finite number for $a$, which would be positive if baldness is more frequent for older people in the observed dataset. And $a$ would be $0$ if baldness is uncorrelated with age, and negative if baldness is more frequent among younger people.
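The perfectly separated case can be checked numerically. The following sketch uses invented voting-age data and a hypothetical cutoff of 17.5 (any value between 17 and 18 would do); it evaluates the log-likelihood along the line $b = -a\cdot\text{cutoff}$ and shows that it keeps increasing toward $0$ as $a$ grows, so no finite $a$ attains the maximum.

```python
import math


def log_sigmoid(t):
    """log of 1 / (1 + e^(-t)), computed in a numerically stable way."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))


def log_likelihood(a, b, xs, ys):
    """log L(a, b) for logit p(x) = a*x + b, using 1 - p(x) = sigmoid(-(a*x + b))."""
    total = 0.0
    for x, y in zip(xs, ys):
        t = a * x + b
        total += log_sigmoid(t) if y == 1 else log_sigmoid(-t)
    return total


# Invented data: y = 1 exactly when age >= 18 (perfect separation).
ages = [14, 15, 16, 17, 18, 19, 20, 21]
can_vote = [0, 0, 0, 0, 1, 1, 1, 1]
cutoff = 17.5  # assumed midpoint between the last 0 and the first 1

slopes = [1.0, 4.0, 16.0, 64.0]
values = [log_likelihood(a, -a * cutoff, ages, can_vote) for a in slopes]
for a, value in zip(slopes, values):
    print("a =", a, " log L =", value)
```

Each larger slope gives a log-likelihood closer to $0$ (that is, a likelihood closer to $1$), which is the numerical counterpart of $a=+\infty$.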

But, except in the trivial case where $a=0$, so $p$ is constant, the function $p$ will always satisfy either $$ p(x)\to1\text{ as }x\to+\infty\text{ and }p(x)\to0\text{ as }x\to-\infty $$ or

$$ p(x)\to1\text{ as }x\to-\infty\text{ and }p(x)\to0\text{ as }x\to+\infty. $$
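One way to see this is to solve $\logit p(x) = ax+b$ for $p(x)$:

$$ \log\frac{p(x)}{1-p(x)} = ax+b \quad\Longrightarrow\quad \frac{p(x)}{1-p(x)} = e^{ax+b} \quad\Longrightarrow\quad p(x) = \frac{e^{ax+b}}{1+e^{ax+b}} = \frac{1}{1+e^{-(ax+b)}}. $$

If $a>0$, then $e^{-(ax+b)}\to0$ as $x\to+\infty$ and $e^{-(ax+b)}\to+\infty$ as $x\to-\infty$, which gives the first pair of limits; if $a<0$ the roles are reversed, which gives the second. So for finite $a\ne0$ the fitted curve always runs the full range from $0$ to $1$; weaker dependence on $x$ shows up as a smaller $|a|$, that is, a more gradual transition, not as a curve confined to something like $25\%$ to $75\%$.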
