I have been telling students that you cannot log-transform a 0/1 independent variable. My reason: the log of 0 is undefined. Am I wrong?
Solved – Log transformation of binary explanatory variable in regression
Tags: binary-data, dataset, instrumental-variables
Related Solutions
Take a look at McCullagh and Nelder (1989), Generalized Linear Models, 2nd ed., Section 2.5 (pp. 40–43), on iteratively reweighted least squares.
Let $y$ be the 0/1 outcome and let $\eta = g(\mu)$ be the link function. You never calculate $g(y)$ directly, but work with an adjusted dependent variable $$z = \hat{\eta}_0 + (y-\hat{\mu}_0) \left(\frac{d\eta}{d\mu}\right)_0$$ where $\hat{\eta}_0$ is the current estimate of the linear predictor, $X\hat{\beta}_0$, and $\hat{\mu}_0 = g^{-1}(\hat{\eta}_0)$. So that avoids the problem with $g(0)$ and $g(1)$ being $\pm \infty$.
For the logit link, $\eta = \ln[\mu / (1-\mu)]$, you'll find that $d\eta/d\mu = 1/[\mu(1-\mu)]$ and so you would have $$z = \hat{\eta}_0 + \frac{y-\hat{\mu}_0}{\hat{\mu}_0 (1 - \hat{\mu}_0)}$$
You further calculate weights $$w_0^{-1} = \left(\frac{d\eta}{d\mu}\right)^2_0 v_0$$ where $v_0 = V(\mu_0)$ comes from the mean/variance relationship, which for the binary case would be $V(\mu) = \mu(1-\mu)$. For the logit link, since $d\eta/d\mu = 1/[\mu(1-\mu)]$, you end up with weights $w_0 = \mu_0(1-\mu_0)$.
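To make that last step explicit, substitute the logit-link derivative and the binomial variance into the weight formula (just algebra, nothing new): $$w_0 = \left[\left(\frac{d\eta}{d\mu}\right)^2_0 v_0\right]^{-1} = \left[\frac{\mu_0(1-\mu_0)}{[\mu_0(1-\mu_0)]^2}\right]^{-1} = \mu_0(1-\mu_0)$$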
A key concern is the choice of starting values. You might look at the R source code to see what it does. I wrote down in a notebook to start with $\tilde{\mu} = 1/4$ if $y = 0$ and $\tilde{\mu} = 3/4$ if $y=1$, but I didn't record a source.
To spell out the iterative algorithm a bit more, focusing on the logit link (a runnable sketch follows the steps):
At the start you do the following:
- Start with initial "fitted" values, say $\hat{\mu}^{(0)}_i = $ 1/4 or 3/4 according to whether $y_i = $ 0 or 1
- Calculate $\hat{\eta}^{(0)}_i = \ln[\hat{\mu}^{(0)}_i/(1-\hat{\mu}^{(0)}_i)]$
- Calculate $z^{(0)}_i = \hat{\eta}^{(0)}_i + [y_i-\hat{\mu}^{(0)}_i]/[\hat{\mu}^{(0)}_i (1 - \hat{\mu}^{(0)}_i)]$
- Calculate the weights $w^{(0)}_i = \hat{\mu}^{(0)}_i (1-\hat{\mu}^{(0)}_i)$
- Regress the $z^{(0)}_i$ on $X$ using weights $w^{(0)}_i$, to get initial estimates $\hat{\beta}^{(0)}$
Then, at each iteration, you do the following:
- Calculate $\hat{\eta}^{(s)}_i = X \hat{\beta}^{(s-1)}$
- Calculate $\hat{\mu}^{(s)}_i = \exp(\hat{\eta}^{(s)}_i)/[1+\exp(\hat{\eta}^{(s)}_i)]$
- Calculate $z^{(s)}_i = \hat{\eta}^{(s)}_i + [y_i-\hat{\mu}^{(s)}_i]/[\hat{\mu}^{(s)}_i (1 - \hat{\mu}^{(s)}_i)]$
- Calculate the weights $w^{(s)}_i = \hat{\mu}^{(s)}_i (1-\hat{\mu}^{(s)}_i)$
- Regress the $z^{(s)}_i$ on $X$ using weights $w^{(s)}_i$, to get revised estimates $\hat{\beta}^{(s)}$
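For concreteness, here is a minimal runnable sketch of those steps in plain NumPy (the function name, tolerance, and convergence rule are my own illustrative choices, not from McCullagh and Nelder):

```python
import numpy as np

def irls_logit(X, y, tol=1e-8, max_iter=25):
    """Logistic regression via IRLS, following the steps listed above.

    X: (n, p) design matrix (include a column of ones for an intercept).
    y: (n,) array of 0/1 outcomes.
    """
    # Starting values: mu = 1/4 or 3/4 according to whether y = 0 or 1
    mu = np.where(y == 1, 0.75, 0.25)
    eta = np.log(mu / (1 - mu))
    beta = np.zeros(X.shape[1])

    for _ in range(max_iter):
        # Adjusted dependent variable and weights for the logit link
        z = eta + (y - mu) / (mu * (1 - mu))
        w = mu * (1 - mu)
        # Weighted least squares: solve (X' W X) beta = X' W z
        beta_new = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
    return beta
```

On simulated data this converges in a handful of iterations, and you can check the result against an off-the-shelf implementation such as statsmodels' `sm.Logit(y, X).fit()`.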
This is all just for regular logistic regression. For the local logistic regression version, there is some discussion in Chapter 4 of Loader (1999) Local regression and likelihood (but frankly, I didn't really follow it).
A Google search for "local logistic regression IRLS" revealed these notes from Patrick Breheny, which say (p. 8):
The weight given to an observation $i$ in a given iteration of the IRLS algorithm is then a product of the weight coming from the quadratic approximation to the likelihood and the weight coming from the kernel ($w_i = w_{1i} w_{2i}$)
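Translated into the same notation as the sketch above, the combined weight might look like this (the tricube kernel and the bandwidth are my illustrative assumptions; Breheny's notes only specify that the two weights are multiplied):

```python
import numpy as np

def combined_weights(x, x0, mu, bandwidth):
    """w_i = w1_i * w2_i, as in Breheny's notes.

    w1: IRLS weight from the quadratic approximation, mu * (1 - mu)
    w2: kernel weight for the distance of x_i from the fitting point x0
        (tricube kernel here; the kernel and bandwidth are illustrative)
    """
    w1 = mu * (1 - mu)
    u = np.abs(x - x0) / bandwidth
    w2 = np.where(u < 1, (1 - u**3) ** 3, 0.0)
    return w1 * w2
```

Inside the IRLS loop you would then use these combined weights in the weighted least-squares step, refitting separately at each target point $x_0$.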
Probit coefficients don't have a straightforward interpretation like the ones in logit (see this question). Use marginal effects (in Stata, the margins command) to obtain the change in $Pr(y)$ for a one-unit increase in $x$ when interpreting the coefficients of a probit.
If your predictor were not log-transformed, you could stop here, as you would have the change in $Pr(y)$ for a one-unit increase in $x$. Log-transformed variables require one more step, because a one-unit increase in $\log(x)$ means multiplying $x$ by the base of the logarithm (in this case $e$).
In this case, the marginal effects you would obtain for your log-transformed predictor would show the change in $Pr(y)$ for a 2.7182818285-fold change in $x$. Not very pretty. Your best bet is to use a logarithm with base 2 or base 10, so that you can interpret the marginal effects as the change in $Pr(y)$ for a two-fold or a ten-fold change in $x$. See this discussion on Statalist for another example.
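As a sketch of the base change (using Python's statsmodels as a stand-in for Stata's margins; the simulated data and coefficient values are made up for illustration):

```python
import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.lognormal(size=500)
# Simulate a probit model that is linear in ln(x)
y = (rng.random(500) < norm.cdf(-0.5 + 0.8 * np.log(x))).astype(int)

# Fit with log2(x) instead of ln(x): same model, coefficient rescaled by ln(2)
fit = sm.Probit(y, sm.add_constant(np.log2(x))).fit(disp=0)

# Average marginal effect: change in Pr(y = 1) per unit of log2(x),
# i.e. per doubling of x
print(fit.get_margeff().summary())
```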
Best Answer
You're right, and not just because log zero is not defined.
Any one-to-one transformation of $0$ and $1$ to $a$ and $b$ is just a linear rescaling. This holds for any rule or function, even a nonlinear one (say $\log(x + c)$), because all that matters is where it sends $0$ and $1$. Think of it geometrically: any transformation that keeps the two values distinct defines just two points in the plane (original value on one axis, transformed value on the other), and two points determine a line, so on the data the transformation is indistinguishable from a linear one.
So it could not possibly do anything to improve whatever was thought to be a problem.
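A quick numerical check of this point (a minimal sketch; the simulated data and the choice of $\log(x + 1)$ as the transformation are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d = rng.integers(0, 2, size=100).astype(float)  # a 0/1 predictor
y = 1.0 + 2.0 * d + rng.normal(size=100)

def fitted(x, y):
    """OLS fitted values from a regression of y on an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

# log(d + 1) sends {0, 1} to {0, log 2}: just a linear rescaling,
# so the fitted values (and hence the fit) are unchanged
print(np.allclose(fitted(d, y), fitted(np.log(d + 1.0), y)))  # True
```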
For example, contrary to a surprisingly common myth, there are no strict assumptions about the marginal distributions of predictors (which is not to say that, say, 665 zeros and 1 one for a predictor is not a situation needing care and attention).
(0, 1) predictors are fine and convenient because they lead to clean parameterisations and explanations of changes in level and slope.