Solved – Iteratively Reweighted Least Squares (Logistic Regression)

likelihood, logistic, optimization, r, regression

I'm trying to obtain the parameter estimates in a logistic regression using the IRLS (Iteratively Reweighted Least Squares) algorithm.

I'm following these clear and simple reference slides: (Logistic Regression)

And also this question, which contains all the mathematical details and code: Why using Newton's method for logistic regression optimization is called iterative re-weighted least squares?

I'm trying to obtain the estimates using matrix notation directly, rather than the lm function, as stated in the question mentioned above:

$$
b^{(m+1)} = b^{(m)} + (X^T W_{(m)} X)^{-1}X^T W_{(m)} z_{(m)}
$$
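For reference, this update is just the Newton–Raphson step rewritten as weighted least squares (a standard derivation, sketched here with the same symbols):

$$
\nabla \ell(b) = X^T(y-\mu), \qquad -\nabla^2 \ell(b) = X^T W X, \qquad W = \mathrm{diag}\big(\mu_i(1-\mu_i)\big),
$$

so the Newton step is

$$
b^{(m+1)} = b^{(m)} + (X^T W_{(m)} X)^{-1} X^T (y - \mu_{(m)}) = b^{(m)} + (X^T W_{(m)} X)^{-1} X^T W_{(m)} z_{(m)},
$$

with $z_{(m)} = W_{(m)}^{-1}(y - \mu_{(m)})$.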

  • the linear predictor (in this code there is no intercept) is $\eta_i = \sum_{j=1}^{2}\beta_jx_{ij}=\beta_1x_{i1}+\beta_2x_{i2}$

  • As stated in the first link above, $W$ is a diagonal matrix; for the canonical (logit) link each diagonal element equals $h'(\eta_i)$, the derivative of the inverse link (i.e. of the fitted value) with respect to the linear predictor

  • the working residual is $z_i =\frac{y_i - E[y_i]}{h'(\eta_i)}$, where $h'(\eta_i)$ is the derivative of the inverse link evaluated at $\eta_i$, equal to $h'(\eta) = \frac{1}{1+e^{-\eta}}\left(1-\frac{1}{1+e^{-\eta}}\right)$
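Note that for the logit link the "second derivative" used to build $W$ and the $h'(\eta)$ used to build $z$ are the same function $\mu(1-\mu)$, just written differently; a quick sanity check (a minimal sketch, reusing the same definitions as the code below):

```r
hc <- function(x) 1 / (1 + exp(-x))            # inverse logit, as in the code below
deriv2 <- function(x) exp(x) / (1 + exp(x))^2  # the same derivative, written differently
eta <- seq(-5, 5, by = 0.5)
# both expressions give mu * (1 - mu):
all.equal(deriv2(eta), hc(eta) * (1 - hc(eta)))
```

Since the two diagonals coincide, both loops below should produce identical iterates once they use the same $\eta$.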

In the code below we have

The variable p = 2 sets the number of parameters (no intercept is used in this example).

The variable n = 20 sets the number of observations.

The matrix-notation algorithm (the first part of the code is copied from the question linked above) does not converge when the problem gets larger: with p = 3 it never converges, and with p = 2 and n = 200 it never converges either.
Even when it does converge, the matrix form is much slower than the algorithm using the lm function.

By the way, all the quantities computed before the IRLS update itself (the estimation of the beta vector) are equal in both forms, and I added two lists to show that they are equal.
Here is the code:

#LOGISTIC REGRESSION Estimation (IRLS)
#LOGIT

set.seed(5)
p <- 2             ##for p > 3 the estimates do not converge
n <- 20
x <- matrix(rnorm(n * p), n, p)
betas <- runif(p, -2, 2)
hc <- function(x) 1 /(1 + exp(-x)) # inverse canonical link
p.true <- hc(x %*% betas)
y <- rbinom(n, 1, p.true)
tol <- 1e-8




#IRLS using the 'lm' function:
b.init = rep(1,p)
b.old <- b.init
change <- Inf


IRLS_canoni_ = list()


while(change > tol) {
  eta <- x %*% b.old  # linear predictor

  y.hat <- hc(eta)
  h.prime_eta <- y.hat * (1 - y.hat)     #first derivative
  z <- (y - y.hat) / h.prime_eta

  b.new <- b.old + lm(z ~ x - 1, weights = h.prime_eta)$coef  # WLS regression
  change <- sqrt(sum((b.new - b.old)^2))
  b.old <- b.new

  IRLS_canoni_$eta = cbind(IRLS_canoni_$eta,eta)
  IRLS_canoni_$y.hat = cbind(IRLS_canoni_$y.hat,y.hat)
  IRLS_canoni_$h.prime_eta = cbind(IRLS_canoni_$h.prime_eta, h.prime_eta)
  IRLS_canoni_$z = cbind(IRLS_canoni_$z, z)
  IRLS_canoni_$b.old = cbind(IRLS_canoni_$b.old, b.old)

  print(b.old)
  Sys.sleep(.1)

}

b.old


# my_IRLS_canonical(x, y, rep(1,p), hc)     # function defined in the linked question
glm(y ~ x - 1, family=binomial())$coef      #model with no intercept

glm1 = glm(y ~ x, family=binomial())







##Trying to obtain same results with matrix notation (IRLS):
deriv2 = function(x) exp(x)/(1+exp(x))^2     #second derivative


b.init = rep(1,p)
b.old1 <- b.init
change1 <- Inf


IRLS_matrix = list()

while(change1 > tol) {

  eta1 <- x %*% b.old1  # linear predictor
  y.hat1 <- hc(eta1)
  h.prime_eta1 <- y.hat1 * (1 - y.hat1)
  z1 <- (y - y.hat1) / h.prime_eta1


  Wdiag = deriv2(eta)
  W = matrix(0,n,n)
  diag(W) = Wdiag

  H = -(t(x)%*%(W)%*%x)      #not using it

  b.new1 = b.old1 + ((solve(t(x) %*% W %*% x)) %*% (t(x)%*%W%*%z1))
  change1 = sqrt(sum((b.new1 - b.old1)^2))
  b.old1 = b.new1

  IRLS_matrix$eta = cbind(IRLS_matrix$eta, eta1)
  IRLS_matrix$y.hat = cbind(IRLS_matrix$y.hat, y.hat1)
  IRLS_matrix$h.prime_eta = cbind(IRLS_matrix$h.prime_eta, h.prime_eta1)
  IRLS_matrix$z = cbind(IRLS_matrix$z, z1)
  IRLS_matrix$b.old = cbind(IRLS_matrix$b.old, b.old1)

  print(b.new1)
  Sys.sleep(.1)

}

b.new1

glm(y ~ x - 1, family=binomial())$coef    #model with no intercept


IRLS_canoni_$eta[,1] == IRLS_matrix$eta[,1]
IRLS_canoni_$y.hat[,1] == IRLS_matrix$y.hat[,1]
IRLS_canoni_$h.prime_eta[,1] == IRLS_matrix$h.prime_eta[,1]
IRLS_canoni_$z[,1] == IRLS_matrix$z[,1]

IRLS_canoni_$b.old[,1] == IRLS_matrix$b.old[,1]

So can anyone give it a try? It seems to work only with at most $2$ parameters and few observations.

Anyway, I think the algorithm is correct; if it weren't, it would never find the correct values, and that is not the case.

Why is this happening? Thank you.

Best Answer

I just found that inside the "matrix form" algorithm the weights are computed as Wdiag = deriv2(eta), but eta is the variable left over from the earlier lm-based loop, so inside this loop the weights never change.

So we have to change it to Wdiag = deriv2(eta1). With that fix, both algorithms work fine.
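Putting the fix together, here is a minimal self-contained sketch of the corrected matrix-form loop (same simulated data as in the question; the zero initialization anticipates the second fix discussed below):

```r
# Corrected matrix-form IRLS: a minimal sketch on the question's simulated data.
set.seed(5)
p <- 2; n <- 20
x <- matrix(rnorm(n * p), n, p)
betas <- runif(p, -2, 2)
hc <- function(x) 1 / (1 + exp(-x))            # inverse canonical link
y <- rbinom(n, 1, hc(x %*% betas))

deriv2 <- function(x) exp(x) / (1 + exp(x))^2  # equals mu * (1 - mu)
b.old1 <- rep(0, p)                            # zero start (second fix, below)
tol <- 1e-8
change1 <- Inf
while (change1 > tol) {
  eta1 <- x %*% b.old1                         # linear predictor
  y.hat1 <- hc(eta1)                           # fitted probabilities
  z1 <- (y - y.hat1) / (y.hat1 * (1 - y.hat1)) # working residuals
  W <- diag(as.vector(deriv2(eta1)), n, n)     # weights recomputed from eta1
  b.new1 <- b.old1 + solve(t(x) %*% W %*% x, t(x) %*% W %*% z1)
  change1 <- sqrt(sum((b.new1 - b.old1)^2))
  b.old1 <- b.new1
}
cbind(IRLS = as.vector(b.old1), glm = glm(y ~ x - 1, family = binomial())$coef)
```

The two columns of the final cbind should agree to well within the tolerance, since glm also fits by IRLS.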

There's another issue with this algorithm (also present in the original question linked above):

  • The starting values of the betas are all equal to $1$, as set by b.init = rep(1,p). With this start we can't estimate many parameters (e.g. with $8$ parameters the algorithm converges in neither form), and it only converges with few parameters and observations.

Solution: initialize the betas at $0$ instead: b.init = rep(0,p). With a zero start every $\eta_i = 0$, so every fitted value is $0.5$ and every weight is $0.25$, which keeps the first Newton step well behaved instead of starting from extreme fitted probabilities.

This reminds me of non-stationarity in time series: when a parameter is equal to or bigger than $1$, the process is said to be explosive. Since IRLS is a sort of process that evolves over iterations, I think b.init = rep(1,p) leads to a non-convergent path. This is just an idea, by the way, but the code works fine.