Solved – Local polynomial (linear) regression of binary data — logit transformation

binary-data, logit, nonparametric, regression

I got a bit confused about how to fit a local polynomial to binary outcomes when I would rather approximate the underlying index (inside a link function) instead. (Basically for the same reason one would estimate a logit or a probit instead of a linear probability model.) Naturally, I would like to plug the outcome into the inverse of the link function, but 0 and 1 will give me $-\infty$ and $+\infty$ under common link functions.

To clarify, my problem is to flexibly predict $y(x)$, where $x \in [-1,1]$ but $y \in \{0,1\}$. I was not thinking about splines, as I am more interested in a smooth prediction of the index $f(x)$ when the prediction is $\hat{y}=g(f(x))$ with a link function $g$. I do not have a prior on the order of the splines or the location of the breakpoints, so I would be interested in general advice on the appropriate approach.

As a shorthand, I substituted $\frac{1}{N}$ for 0 and $\frac{N-1}{N}$ for 1, and generated the index from these. (Using a logit link, so I was smoothing/predicting $f(x)=\Lambda^{-1}\!\left(\frac{1}{N}\right)$ if $y(x)=0$, for example.)

But I am uneasy about this. I coded up leave-one-out cross-validation to pick the bandwidth for smoothing the shorthand $f$, and even used the delta method to get back to confidence intervals for the original binary variable from my smoothing of the index. But that might be fake sophistication if I am approaching the prediction wrongly in the first place.
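
In code, what I did amounts to roughly the following (a simplified R sketch; `loess` stands in for my own local linear smoother, and the bandwidth choice is omitted):

```r
# Shorthand described above: replace 0 by 1/N and 1 by (N-1)/N, map to the
# index scale with the logit, smooth the index, then map back to probabilities.
N <- length(y)
y_clamped   <- ifelse(y == 1, (N - 1) / N, 1 / N)
f_shorthand <- qlogis(y_clamped)               # logit of the clamped outcomes
fit   <- loess(f_shorthand ~ x, degree = 1)    # local linear smoothing of the index
p_hat <- plogis(predict(fit))                  # back to the probability scale
```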

Thanks for any thoughts on this!

Best Answer

Take a look at McCullagh and Nelder (1989), Generalized Linear Models, 2nd ed., Section 2.5 (pp. 40-43), on iteratively reweighted least squares.

Let $y$ be the 0/1 outcome and let $\eta = g(\mu)$ be the link function. You never calculate $g(y)$ directly, but work with an adjusted dependent variable $$z = \hat{\eta}_0 + (y-\hat{\mu}_0) \left(\frac{d\eta}{d\mu}\right)_0$$ where $\hat{\eta}_0$ is the current estimate of the linear predictor, $X\hat{\beta}_0$, and $\hat{\mu}_0 = g^{-1}(\hat{\eta}_0)$. So that avoids the problem with $g(0)$ and $g(1)$ being $\pm \infty$.

For the logit link, $\eta = \ln[\mu / (1-\mu)]$, you'll find that $d\eta/d\mu = 1/[\mu(1-\mu)]$ and so you would have $$z = \hat{\eta}_0 + \frac{y-\hat{\mu}_0}{\hat{\mu}_0 (1 - \hat{\mu}_0)}$$

You further calculate weights $$w_0^{-1} = \left(\frac{d\eta}{d\mu}\right)^2_0 v_0$$ where $v_0 = V(\mu_0)$ comes from the mean/variance relationship, which for the binary case is $V(\mu) = \mu(1-\mu)$. For the logit link, since $d\eta/d\mu = 1/[\mu(1-\mu)]$, you end up with weights $w_0 = \mu_0(1-\mu_0)$.
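
In code, the logit-link case reduces to two lines; the helper below is just a transcription of the two formulas above (the function name is mine):

```r
# Adjusted dependent variable z and IRLS weight w for the logit link,
# given the current fitted probabilities mu0 and linear predictor eta0.
irls_working <- function(y, mu0, eta0) {
  z <- eta0 + (y - mu0) / (mu0 * (1 - mu0))  # z = eta0 + (y - mu0) * (d eta / d mu)
  w <- mu0 * (1 - mu0)                       # 1/w = (d eta / d mu)^2 * V(mu)
  list(z = z, w = w)
}
```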

A key concern is the starting values. You might look at the R source code to see what it does. I have a note saying to start with $\tilde{\mu} = 1/4$ if $y = 0$ and $\tilde{\mu} = 3/4$ if $y=1$, but I didn't record a source.

To spell out the iterative algorithm a bit more, focusing on the logit link (an R sketch follows the two lists below):

At the start you do the following:

  • Start with initial "fitted" values, say $\hat{\mu}^{(0)}_i = 1/4$ or $3/4$ according to whether $y_i = 0$ or $1$
  • Calculate $\hat{\eta}^{(0)}_i = \ln[\hat{\mu}^{(0)}_i/(1-\hat{\mu}^{(0)}_i)]$
  • Calculate $z^{(0)}_i = \hat{\eta}^{(0)}_i + [y_i-\hat{\mu}^{(0)}_i]/[\hat{\mu}^{(0)}_i (1 - \hat{\mu}^{(0)}_i)]$
  • Calculate the weights $w^{(0)}_i = \hat{\mu}^{(0)}_i (1-\hat{\mu}^{(0)}_i)$
  • Regress the $z^{(0)}_i$ on $X$ using weights $w^{(0)}_i$, to get initial estimates $\hat{\beta}^{(0)}$

Then, at each iteration, you do the following:

  • Calculate $\hat{\eta}^{(s)}_i = X \hat{\beta}^{(s-1)}$
  • Calculate $\hat{\mu}^{(s)}_i = \exp(\hat{\eta}^{(s)}_i)/[1+\exp(\hat{\eta}^{(s)}_i)]$
  • Calculate $z^{(s)}_i = \hat{\eta}^{(s)}_i + [y_i-\hat{\mu}^{(s)}_i]/[\hat{\mu}^{(s)}_i (1 - \hat{\mu}^{(s)}_i)]$
  • Calculate the weights $w^{(s)}_i = \hat{\mu}^{(s)}_i (1-\hat{\mu}^{(s)}_i)$
  • Regress the $z^{(s)}_i$ on $X$ using weights $w^{(s)}_i$, to get revised estimates $\hat{\beta}^{(s)}$
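
Here is a minimal R sketch of that loop for ordinary logistic regression (it assumes `X` already contains an intercept column, and the convergence check is deliberately crude):

```r
# Iteratively reweighted least squares for logistic regression.
# X: n x p design matrix (including an intercept column); y: 0/1 outcomes.
irls_logit <- function(X, y, maxit = 25, tol = 1e-8) {
  mu   <- ifelse(y == 1, 3/4, 1/4)                  # starting fitted values
  eta  <- log(mu / (1 - mu))                        # eta = logit(mu)
  beta <- rep(0, ncol(X))
  for (s in seq_len(maxit)) {
    z <- eta + (y - mu) / (mu * (1 - mu))           # adjusted dependent variable
    w <- mu * (1 - mu)                              # IRLS weights
    beta_new <- coef(lm(z ~ X - 1, weights = w))    # weighted least-squares step
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
    eta  <- drop(X %*% beta)
    mu   <- pmin(pmax(plogis(eta), 1e-8), 1 - 1e-8) # inverse logit, kept off 0 and 1
  }
  beta
}
```

The result should agree, up to the convergence tolerance, with `coef(glm(y ~ X - 1, family = binomial))`.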

This is all just for regular logistic regression. For the local logistic regression version, there is some discussion in Chapter 4 of Loader (1999) Local regression and likelihood (but frankly, I didn't really follow it).

A Google search for "local logistic regression IRLS" revealed these notes from Patrick Breheny, which say (pg 8):

The weight given to an observation $i$ in a given iteration of the IRLS algorithm is then a product of the weight coming from the quadratic approximation to the likelihood and the weight coming from the kernel ($w_i = w_{1i} w_{2i}$)
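
To make that concrete, here is a rough R sketch of a local-linear logistic fit at a single evaluation point `x0`, with the kernel weight multiplying the IRLS weight in the weighted least-squares step; the tricube kernel, the fixed iteration count, and the function name are my own choices, not taken from the notes.

```r
# Local-linear logistic regression at a single point x0 via kernel-weighted IRLS.
# x, y: the data; h: bandwidth. Returns the estimated probability at x0.
local_logit <- function(x, y, x0, h, maxit = 25) {
  u    <- (x - x0) / h
  keep <- abs(u) < 1                                 # observations inside the window
  w2   <- (1 - abs(u[keep])^3)^3                     # kernel weights (tricube)
  X    <- cbind(1, x[keep] - x0)                     # local-linear design, centred at x0
  yk   <- y[keep]
  mu   <- ifelse(yk == 1, 3/4, 1/4)                  # starting fitted values
  eta  <- log(mu / (1 - mu))
  for (s in seq_len(maxit)) {
    z    <- eta + (yk - mu) / (mu * (1 - mu))        # adjusted dependent variable
    w1   <- mu * (1 - mu)                            # IRLS weights
    beta <- coef(lm(z ~ X - 1, weights = w1 * w2))   # combined weights w_i = w1_i * w2_i
    eta  <- drop(X %*% beta)
    mu   <- pmin(pmax(plogis(eta), 1e-8), 1 - 1e-8)  # kept off exactly 0 and 1
  }
  plogis(beta[[1]])                                  # intercept = index at x0, mapped back
}
```

Sweeping `x0` over a grid of values then traces out the smooth curve, and the bandwidth `h` can still be chosen by cross-validation as in the question.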