Solved – Local polynomial (linear) regression of binary data — logit transformation

binary-data, logit, nonparametric, regression

I got a bit confused about how to fit a local polynomial to binary outcomes when I would rather approximate the underlying index (inside a link function) instead. (Basically for the same reason one would estimate a logit or a probit instead of a linear probability model.) Naturally, I would like to plug the outcome into the inverse of the link function, but 0 and 1 will give me $-\infty$ and $+\infty$ under common link functions.

To clarify, my problem is to flexibly predict $y(x)$, where $x \in [-1,1]$ but $y \in \{0,1\}$. I was not thinking about splines, as I am more interested in a smooth prediction of the index $f(x)$ when the prediction is $\hat{y}=g(f(x))$ with a link function $g$. I do not have a prior on the order of the splines or the location of the breakpoints, so I would be interested in general advice on the appropriate approach.

As a shorthand, I substituted $\frac{1}{N}$ for 0 and $\frac{N-1}{N}$ for 1, and generated the index from these. (Using a logit link, so I was smoothing/predicting $f(x)=\Lambda^{-1}\!\left(\frac{1}{N}\right)$ if $y(x)=0$, for example.)

But I am uneasy about this. I coded up leave-one-out cross-validation to pick the bandwidth for smoothing the shorthand $f$, and even used the delta method to get back to confidence intervals for the original binary variable from my smoothing of the index. But that might be fake sophistication if I am approaching the prediction wrongly in the first place.
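
In code, what I did amounts to roughly the following (a simplified R sketch; `loess` stands in for my own local linear smoother, and the bandwidth choice is omitted):

```r
# Shorthand described above: replace 0 by 1/N and 1 by (N-1)/N, map to the
# index scale with the logit, smooth the index, then map back to probabilities.
N <- length(y)
y_clamped   <- ifelse(y == 1, (N - 1) / N, 1 / N)
f_shorthand <- qlogis(y_clamped)               # logit of the clamped outcomes
fit   <- loess(f_shorthand ~ x, degree = 1)    # local linear smoothing of the index
p_hat <- plogis(predict(fit))                  # back to the probability scale
```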

Thanks for any thoughts on this!

Best Answer

Take a look at McCullagh and Nelder (1989), Generalized Linear Models, 2nd ed., Section 2.5 (pp. 40-43), on iteratively reweighted least squares.

Let $y$ be the 0/1 outcome and let $\eta = g(\mu)$ be the link function. You never calculate $g(y)$ directly, but work with an adjusted dependent variable $$z = \hat{\eta}_0 + (y-\hat{\mu}_0) \left(\frac{d\eta}{d\mu}\right)_0$$ where $\hat{\eta}_0$ is the current estimate of the linear predictor, $X\hat{\beta}_0$, and $\hat{\mu}_0 = g^{-1}(\hat{\eta}_0)$. So that avoids the problem with $g(0)$ and $g(1)$ being $\pm \infty$.

For the logit link, $\eta = \ln[\mu / (1-\mu)]$, you'll find that $d\eta/d\mu = 1/[\mu(1-\mu)]$ and so you would have $$z = \hat{\eta}_0 + \frac{y-\hat{\mu}_0}{\hat{\mu}_0 (1 - \hat{\mu}_0)}$$

You further calculate weights $$w_0^{-1} = \left(\frac{d\eta}{d\mu}\right)^2_0 v_0$$ where $v_0 = V(\mu_0)$ comes from the mean/variance relationship, which for the binary case is $V(\mu) = \mu(1-\mu)$. For the logit link, since $d\eta/d\mu = 1/[\mu(1-\mu)]$, you end up with weights $w_0 = \mu_0(1-\mu_0)$.
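
In code, the logit-link case reduces to two lines; the helper below is just a transcription of the two formulas above (the function name is mine):

```r
# Adjusted dependent variable z and IRLS weight w for the logit link,
# given the current fitted probabilities mu0 and linear predictor eta0.
irls_working <- function(y, mu0, eta0) {
  z <- eta0 + (y - mu0) / (mu0 * (1 - mu0))  # z = eta0 + (y - mu0) * (d eta / d mu)
  w <- mu0 * (1 - mu0)                       # 1/w = (d eta / d mu)^2 * V(mu)
  list(z = z, w = w)
}
```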

A key concern is the starting values. You might look at the R source code to see what it does. I have a note saying to start with $\tilde{\mu} = 1/4$ if $y = 0$ and $\tilde{\mu} = 3/4$ if $y=1$, but I didn't record a source.

To spell out the iterative algorithm a bit more, focusing on the logit link (an R sketch follows the two lists below):

At the start you do the following:

  • Start with initial "fitted" values, say $\hat{\mu}^{(0)}_i = 1/4$ or $3/4$ according to whether $y_i = 0$ or $1$
  • Calculate $\hat{\eta}^{(0)}_i = \ln[\hat{\mu}^{(0)}_i/(1-\hat{\mu}^{(0)}_i)]$
  • Calculate $z^{(0)}_i = \hat{\eta}^{(0)}_i + [y_i-\hat{\mu}^{(0)}_i]/[\hat{\mu}^{(0)}_i (1 - \hat{\mu}^{(0)}_i)]$
  • Calculate the weights $w^{(0)}_i = \hat{\mu}^{(0)}_i (1-\hat{\mu}^{(0)}_i)$
  • Regress the $z^{(0)}_i$ on $X$ using weights $w^{(0)}_i$, to get initial estimates $\hat{\beta}^{(0)}$

Then, at each iteration, you do the following:

  • Calculate $\hat{\eta}^{(s)}_i = X \hat{\beta}^{(s-1)}$
  • Calculate $\hat{\mu}^{(s)}_i = \exp(\hat{\eta}^{(s)}_i)/[1+\exp(\hat{\eta}^{(s)}_i)]$
  • Calculate $z^{(s)}_i = \hat{\eta}^{(s)}_i + [y_i-\hat{\mu}^{(s)}_i]/[\hat{\mu}^{(s)}_i (1 - \hat{\mu}^{(s)}_i)]$
  • Calculate the weights $w^{(s)}_i = \hat{\mu}^{(s)}_i (1-\hat{\mu}^{(s)}_i)$
  • Regress the $z^{(s)}_i$ on $X$ using weights $w^{(s)}_i$, to get revised estimates $\hat{\beta}^{(s)}$
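
Here is a minimal R sketch of that loop for ordinary logistic regression (it assumes `X` already contains an intercept column, and the convergence check is deliberately crude):

```r
# Iteratively reweighted least squares for logistic regression.
# X: n x p design matrix (including an intercept column); y: 0/1 outcomes.
irls_logit <- function(X, y, maxit = 25, tol = 1e-8) {
  mu   <- ifelse(y == 1, 3/4, 1/4)                  # starting fitted values
  eta  <- log(mu / (1 - mu))                        # eta = logit(mu)
  beta <- rep(0, ncol(X))
  for (s in seq_len(maxit)) {
    z <- eta + (y - mu) / (mu * (1 - mu))           # adjusted dependent variable
    w <- mu * (1 - mu)                              # IRLS weights
    beta_new <- coef(lm(z ~ X - 1, weights = w))    # weighted least-squares step
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
    eta  <- drop(X %*% beta)
    mu   <- pmin(pmax(plogis(eta), 1e-8), 1 - 1e-8) # inverse logit, kept off 0 and 1
  }
  beta
}
```

The result should agree, up to the convergence tolerance, with `coef(glm(y ~ X - 1, family = binomial))`.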

This is all just for regular logistic regression. For the local logistic regression version, there is some discussion in Chapter 4 of Loader (1999) Local regression and likelihood (but frankly, I didn't really follow it).

A Google search for "local logistic regression IRLS" revealed these notes from Patrick Breheny, which say (pg 8):

The weight given to an observation $i$ in a given iteration of the IRLS algorithm is then a product of the weight coming from the quadratic approximation to the likelihood and the weight coming from the kernel ($w_i = w_{1i} w_{2i}$)
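
To make that concrete, here is a rough R sketch of a local-linear logistic fit at a single evaluation point `x0`, with the kernel weight multiplying the IRLS weight in the weighted least-squares step; the tricube kernel, the fixed iteration count, and the function name are my own choices, not taken from the notes.

```r
# Local-linear logistic regression at a single point x0 via kernel-weighted IRLS.
# x, y: the data; h: bandwidth. Returns the estimated probability at x0.
local_logit <- function(x, y, x0, h, maxit = 25) {
  u    <- (x - x0) / h
  keep <- abs(u) < 1                                 # observations inside the window
  w2   <- (1 - abs(u[keep])^3)^3                     # kernel weights (tricube)
  X    <- cbind(1, x[keep] - x0)                     # local-linear design, centred at x0
  yk   <- y[keep]
  mu   <- ifelse(yk == 1, 3/4, 1/4)                  # starting fitted values
  eta  <- log(mu / (1 - mu))
  for (s in seq_len(maxit)) {
    z    <- eta + (yk - mu) / (mu * (1 - mu))        # adjusted dependent variable
    w1   <- mu * (1 - mu)                            # IRLS weights
    beta <- coef(lm(z ~ X - 1, weights = w1 * w2))   # combined weights w_i = w1_i * w2_i
    eta  <- drop(X %*% beta)
    mu   <- pmin(pmax(plogis(eta), 1e-8), 1 - 1e-8)  # kept off exactly 0 and 1
  }
  plogis(beta[[1]])                                  # intercept = index at x0, mapped back
}
```

Sweeping `x0` over a grid of values then traces out the smooth curve, and the bandwidth `h` can still be chosen by cross-validation as in the question.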