If the two classes are linearly separable, iteratively reweighted least squares (IRLS) breaks down. In that scenario any hyperplane that separates the two classes is a solution, and there are infinitely many of them. IRLS is meant to find a maximum likelihood solution, and maximum likelihood has no mechanism to favor one of these solutions over the others (e.g. no concept of maximum margin). Depending on the initialization, IRLS heads toward one of these solutions and eventually breaks because of numerical problems (I don't know the details of IRLS, so this is an educated guess).
Another problem arises when the training data are linearly separable: every separating-hyperplane solution corresponds to a Heaviside (step) function, so all the predicted probabilities are either 0 or 1, and the logistic regression solution becomes a hard classifier rather than a probabilistic one.
To state this in mathematical notation: the Heaviside function is the limit of the sigmoid, $\lim_{\|\mathbf{w}\| \rightarrow \infty}\sigma(\mathbf{w}^T \mathbf{x} + b)$, where $\sigma$ is the sigmoid function and $(\mathbf{w}, b)$ determines the separating hyperplane. So, in theory, IRLS never stops and keeps moving toward a $\mathbf{w}$ of ever-increasing magnitude; in practice it breaks down because of numerical problems.
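A tiny R sketch (my own illustration, not from the original answer; the toy data and the ll_sep helper are made up) shows how, with separable data, scaling up a separating hyperplane keeps increasing the log-likelihood, so maximum likelihood has no finite optimum:
## toy 1-D data, perfectly separable at x = 0 (illustrative only)
x <- c(-2, -1, 1, 2)
y <- c( 0,  0, 1, 1)
## Bernoulli log-likelihood of the scaled separating solution s * (w, b)
ll_sep <- function(s, w = 1, b = 0) {
  p <- plogis(s * (w * x + b))        # sigmoid(s * (w x + b))
  sum(y * log(p) + (1 - y) * log(1 - p))
}
sapply(c(1, 2, 5, 10, 50), ll_sep)    # increases towards 0, its supremum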
The numDeriv package can indeed be used to compute the gradient and the Hessian (if needed). In both cases the argument y of the log-likelihood is passed through the dots mechanism, using an argument with a suitable name. For a vector-valued function, the jacobian function of the same package can be used similarly. You could also consider computing analytical derivatives rather than numerical ones.
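The snippets below assume that loglik, y and b already exist (presumably defined in the question). Purely for illustration, here is a minimal sketch of what they could look like, assuming a Gaussian AR(1) model consistent with mlogDens further below; the simulated data, starting values and bounds are my own assumptions:
## simulate an AR(1) series (illustrative data only)
set.seed(123)
T <- 200
y <- numeric(T)
for (t in 2:T) y[t] <- 0.5 + 0.8 * y[t - 1] + rnorm(1)
## log-likelihood with theta = (intercept, AR coefficient, sd)
loglik <- function(theta, y) {
  T <- length(y)
  sum(dnorm(y[2:T], mean = theta[1] + theta[2] * y[1:(T - 1)],
            sd = theta[3], log = TRUE))
}
## maximize the log-likelihood; b$par holds the estimates
b <- optim(par = c(0, 0.5, 1), fn = loglik, y = y,
           method = "L-BFGS-B", lower = c(-Inf, -Inf, 1e-6),
           control = list(fnscale = -1))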
library(numDeriv)
H <- hessian(func = loglik, x = b$par, y = y)
g <- grad(func = loglik, x = b$par, y = y)
We can also compute the Jacobian of a function that returns a vector of length $T-1$.
## negative log-density of each observation under a Gaussian AR(1) model,
## with theta = (intercept, AR coefficient, sd)
mlogDens <- function(theta, y) {
  T <- length(y)
  -dnorm(y[2:T], mean = theta[1] + theta[2] * y[1:(T - 1)],
         sd = theta[3], log = TRUE)
}
## a matrix with dim (T-1, 3)
G <- jacobian(func = mlogDens, x = b$par, y = y)
## a matrix with dim (9, T-1)
GG <- apply(G, MARGIN = 1, FUN = tcrossprod)
## a vector with length 9 representing a symmetric mat.
GGsum0 <- apply(GG, MARGIN = 1, FUN = sum)
## a symmetric mat.
GGsum <- matrix(GGsum0, nrow = 3, ncol = 3)
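As a quick check (my addition, not in the original answer), the same outer-product-of-gradients matrix can be obtained in one step, since summing the per-observation outer products is just a cross product of G with itself:
## equivalent, more direct computation: t(G) %*% G
GGsum2 <- crossprod(G)
all.equal(GGsum, GGsum2, check.attributes = FALSE)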
Best Answer
You are correct that, in general, IWLS, like other numerical optimization methods, can only guarantee convergence to a local maximum, if it converges at all. Here's a nice example where the starting value was outside the convergence domain of the algorithm used by glm() in R. However, it is worth noting that for GLMs with the canonical link the likelihood is concave, see here. Thus, if the algorithm converges, it will have converged to the global mode!
The last issue pointed out in the slide is a problem where the MLE for a parameter is at infinity. This can occur in logistic regression when there is complete separation. In that case you will get a warning message that the fitted probabilities are numerically 0 or 1. It is important to note that when this occurs the algorithm has not converged to the mode, so this has nothing to do with the algorithm being stuck in a local maximum.
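To see this concretely, here is a small made-up example of complete separation (mine, not from the original answer); glm() warns that fitted probabilities numerically 0 or 1 occurred and the slope estimate blows up:
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)   # perfectly separated at x = 3.5
fit <- glm(y ~ x, family = binomial)
coef(fit)                  # huge slope: the MLE is at infinity
fitted(fit)                # probabilities numerically 0 or 1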