Solved – Can one do GLM with LOESS transformed variables

generalized linear model, loess, nonparametric, r

I have a binary response variable and predictors that do not perform well in a GLM with a probit/logit link. Some of the predictors are also correlated with each other. I am considering transforming the predictors with something like the loess function in R. However, loess fits a local regression for a continuous dependent variable, and my dependent variable is binary.

How can this approach be extended to GLM probit/logit models? I might need a non-parametric transformation of the predictors before feeding them into the GLM. The problem is how to find that non-parametric transform.

Edit 1: Here is an example where loess is applied directly to the binary response and its fitted values are then fed into a logit GLM, so the procedure has two stages. The AUC jumps from 0.76 to 0.94. I would be glad to learn of any other ways to improve this nonlinear predictor.

# nonlinear transformation ------------------------------------------------
set.seed(102)
a  <- runif(1000)
d  <- ifelse((a-0.3)^2 > 0.03, 1, 0)
d[ sample.int(1000, 50)]  <- 1
d[ sample.int(1000, 50)]  <- 0

par(mfrow=c(2,2))


df  <- data.frame(a, d)

# Baseline: plain logit GLM on the raw predictor
glmmod <- glm(d ~ a, df, family=binomial(link = "logit"))
plot(a, glmmod$fitted.values)

# Stage 1: loess smooth of the binary response against a
lf  <- loess(d ~ a, df, model = TRUE, span = 1)
plot(a, d)
lines(a[order(a)], predict(lf)[order(a)])

# Stage 2: logit GLM on the loess fitted values
df2  <- data.frame(aT = predict(lf), d)
glmmod2 <- glm(d ~ aT, df2, family=binomial(link = "logit"))

plot(a, glmmod2$fitted.values)

library(ROCR)
# ROC curve and AUC for the two-stage model (blue)
pred <- prediction(glmmod2$fitted.values, d)
roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")
plot(roc.perf, col="blue")
auc.perf = performance(pred, measure = "auc")
auc.perf@y.values[[1]]

# ROC curve and AUC for the baseline GLM (red)
pred <- prediction(glmmod$fitted.values, d)
roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")
plot(roc.perf, add=TRUE, col="red")
auc.perf = performance(pred, measure = "auc")
auc.perf@y.values[[1]]


Best Answer

You don't use loess to transform variables.

You may be looking for generalized additive models (GAMs), which extend GLMs in the same way that additive models/nonparametric regression (including smoothing splines and local linear or local polynomial regression models) extend linear regression.

https://en.wikipedia.org/wiki/Generalized_additive_model

Example in R (picking your code up from df <- ..., using the gam package):

df  <- data.frame(a, d)
library(gam) #assuming you already have the package 
gammod <- gam(d ~ s(a,4), df, family=binomial(link = "logit")) #spline model
plot(a,d)
oa=order(a)
lines(a[oa],fitted(gammod)[oa],col=3)


gammod2 <- gam(d ~ lo(a,span=.5), df, family=binomial(link = "logit")) #loess-like 
plot(a,d)
lines(a[oa],fitted(gammod2)[oa],col=4)
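
Whether a smooth term really helps can be checked on held-out data rather than on the fitting sample. The following is a minimal sketch, not the method used above: it swaps the gam smooth for a natural cubic spline basis from the splines package (shipped with R) inside an ordinary glm, and it relies on a hypothetical helper auc() that computes the rank-based AUC. The 700/300 split and df = 4 are arbitrary assumptions.

```r
library(splines)  # ships with R; ns() builds a natural cubic spline basis

# Same simulated data as in the question
set.seed(102)
a <- runif(1000)
d <- ifelse((a - 0.3)^2 > 0.03, 1, 0)
d[sample.int(1000, 50)] <- 1
d[sample.int(1000, 50)] <- 0
df <- data.frame(a, d)

# Hypothetical helper (not from the original post):
# rank-based (Mann-Whitney) estimate of the AUC
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Assumed split: 700 rows for fitting, 300 held out
train <- sample.int(nrow(df), 700)
test  <- setdiff(seq_len(nrow(df)), train)

# Linear-in-a logit versus a logit with a spline in a (df = 4 is an assumption)
fit_lin <- glm(d ~ a,             data = df[train, ], family = binomial)
fit_ns  <- glm(d ~ ns(a, df = 4), data = df[train, ], family = binomial)

# Out-of-sample AUC for both models
auc_lin <- auc(predict(fit_lin, df[test, ], type = "response"), df$d[test])
auc_ns  <- auc(predict(fit_ns,  df[test, ], type = "response"), df$d[test])
c(linear = auc_lin, spline = auc_ns)
```

Because the smooth is estimated inside the binomial likelihood, there is no separate transformation stage, and the held-out AUC avoids the optimism of scoring a model on the data it was fitted to.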


Related Solutions

Solved – Are LOESS and GAM with one covariate the same

Not really a full answer, but too long for a comment: s sets up a spline, whereas loess does a local regression.

In the gam package (maybe mgcv too, not too familiar with that one) you can also feed a local regression, as in

library(gam)

set.seed(1234) 

# generate data
x <- sort(runif(100)) 
y <- sin(2*pi*x) + rnorm(length(x), sd=0.1)  # one noise draw per observation

gam.1 <- gam(y ~ lo(x))
base.r <- loess(y ~ x) 
summary(base.r$fitted - gam.1$fitted)
plot(base.r$fitted,gam.1$fitted)

That does not produce the same fitted values either, but maybe you can further play around with the settings of lo and loess.
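
Part of the gap probably comes from different defaults. As far as I recall, lo() in the gam package defaults to span = 0.5 and degree = 1, whereas loess() defaults to span = 0.75 and degree = 2; treat the lo() values as an assumption and check ?lo. The sketch below uses only base R to show how much these two settings alone move the fitted values.

```r
set.seed(1234)

# Same kind of data as above
x <- sort(runif(100))
y <- sin(2 * pi * x) + rnorm(length(x), sd = 0.1)

# loess() with its own defaults: span = 0.75, degree = 2
fit_default <- loess(y ~ x)

# loess() with the settings assumed to match lo()'s defaults
fit_lo_like <- loess(y ~ x, span = 0.5, degree = 1)

# Largest absolute difference between the two sets of fitted values
max_gap <- max(abs(fit_default$fitted - fit_lo_like$fitted))
max_gap
```

Comparing fit_lo_like$fitted with gam.1$fitted is then the fairer check, although the backfitting in gam may still leave small differences.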

Solved – Use loess regression with many zero values

A Loess confidence interval doesn't mean much unless the Loess parameters have been cross-validated (which usually is not the case). When you use Loess for exploration, as it was originally intended, understanding how to control it will help you guide your exploration and interpret its results better.

Consider this small study of a synthetic dataset which has only $0$ or $1$ as responses: it is an extreme example of your situation. The data, plotted as black points, are outcomes of Bernoulli$(p)$ variables ("coin flips") where $p$ varies in a damped sinusoidal manner with the horizontal coordinate $x$, as shown by the white reference curve in each panel. The panels vary only by the "span" of the Loess smooth, which determines how local each Loess estimate is: smaller spans produce estimates that are more localized; that is, they reflect the responses for the closest neighbors of each $x$ value much more than for distant neighbors. The smooth is shown in blue and its surrounding confidence band in dark gray.

The lefthand panel uses the default span of $0.75$. This causes the Loess estimate at each point to depend on most of the points in the plot: it is a heavy smooth for these data. In many cases the white plot lies outside the shaded confidence band, showing this confidence band may be misleading.

It is clear that only with the final span of $0.25$ does the smooth come at all close to the true values: here, the white graph is contained within the shaded gray area. Unfortunately, in practice we do not have access to any true underlying curve: that's precisely what we're trying to estimate.

All three of these smooths are perfectly valid, insofar as they are efforts to sketch out the overall trend in the response ("y") relative to the regressor ("x"). The heavy smooth at the left suggests the response rate is approximately stable (which, on average, it is). The lighter smooth at the right captures higher-frequency variation. In practice, it might not be apparent whether what it shows is "real" or is "noise."

In practice, we never accept just one default level of smoothing: we vary the amount of smoothing, exactly as illustrated here, in order to learn about the data at varying levels of local resolution. We might also vary the smoothing in order to create different kinds of visual descriptions of the data, guiding the viewer's eye to global trends (as at the left) or local behaviors (as at the right), as we see appropriate.

The best tool for "checking appropriateness" is to study the residuals of the smooth in the context of a particular analytical or visualization objective. Good books on Exploratory Data Analysis, such as John Tukey's EDA, provide a wealth of techniques for computing and analyzing smooths and their residuals.

If you would like to experiment, here is the R code that created these illustrations.

#
# Generate data.
#
n <- 2e2
x <- 1:n
p <- (sin(x/100 * 2*pi)^2 - 1/2)*exp(-x/n) + 1/2
set.seed(17)
y <- rbinom(n, 1, p)
df <- data.frame(x=x, y=y, p=p)
#
# Set up for drawing.
#
library(ggplot2)
spans <- c(0.75, 0.5, 0.25)
k <- length(spans)
viewports <- lapply(1:k, function(i) 
  grid::viewport(width=1/k, height=1, x=(i-1/2)/k, y=1/2))  # viewport is exported, so :: suffices
names(viewports) <- spans
#
# Create the plots.
#
g <- ggplot(df, aes(x, y)) + geom_point(aes(x,y), df, alpha=0.25) + 
  coord_cartesian(ylim=c(0,1))
for (i in 1:k) {
  print(g + geom_smooth(method="loess", span=spans[i])  +
    geom_line(aes(x,p), df, color="White", lwd=1) + 
    labs(title=paste("Span =", spans[i])),
    vp=viewports[[i]])
}
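
The remark about cross-validating the Loess parameters can be made concrete. This is a minimal sketch under stated assumptions: 10 folds, squared error as the loss, and surface = "direct" so that loess can predict held-out points lying outside the range of the training fold. None of these choices come from the original answer.

```r
# Same synthetic data as above
n <- 200
x <- 1:n
p <- (sin(x / 100 * 2 * pi)^2 - 1 / 2) * exp(-x / n) + 1 / 2
set.seed(17)
y <- rbinom(n, 1, p)
df <- data.frame(x = x, y = y)

spans <- c(0.75, 0.5, 0.25)
K <- 10                                         # assumed number of folds
fold <- sample(rep(seq_len(K), length.out = n)) # random fold label per row

# Mean squared prediction error on held-out folds, one value per span
cv_err <- sapply(spans, function(s) {
  sq_err <- numeric(n)
  for (k in seq_len(K)) {
    held <- fold == k
    fit <- loess(y ~ x, data = df[!held, ], span = s,
                 control = loess.control(surface = "direct"))
    sq_err[held] <- (predict(fit, newdata = df[held, ]) - df$y[held])^2
  }
  mean(sq_err)
})
names(cv_err) <- spans

best_span <- spans[which.min(cv_err)]
cv_err
best_span
```

With a binary response the errors are noisy, so the span that wins on one random fold assignment may not win on another; repeating the procedure with several seeds gives a better sense of how stable the choice is.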

References

John W. Tukey, Exploratory Data Analysis. Addison-Wesley, 1977.
