Solved – Getting to predicted values using cv.glmnet

glmnet, prediction, r

I'm a little confused by the predict function with a cv.glmnet object.

I'm running these two lines:

cvFit <- cv.glmnet(x = as.matrix(imputedTrainingData[,2:33]), y = imputedTrainingData[,1], family = "binomial", type.measure = "class" )

response<-predict(cvFit, as.matrix(imputedTestData[,2:33]), s= "lambda.min")

The y variable is a 2-level factor

Why does the predict statement return a numeric vector rather than the predicted class?
I thought for a moment that it might be the probability of being in one class or the other, but in my data the maximum of response is just above .35 and the minimum is -.42.

Thanks!

Best Answer

Note that calling predict on a cv.glmnet object, as you did, dispatches to the predict.cv.glmnet method. The help for this function is a bit counterintuitive, but you can pass arguments on to the predict.glmnet method, which does the actual predictions, via the ... argument.

Hence you probably want

response <- predict(cvFit, as.matrix(imputedTestData[,2:33]),
                    s = "lambda.min",
                    type = "class")

where type = "class" has the following meaning:

  Type ‘"class"’ applies only to
  ‘"binomial"’ or ‘"multinomial"’ models, and produces the
  class label corresponding to the maximum probability.

(from ?predict.glmnet)

What you were seeing were the predicted values on the scale of the linear predictor (the link scale), i.e. before the inverse of the logit function has been applied to yield the probability of class == 1. That is why some of your values are negative. This default is fairly typical in R, and just as typically the behaviour can be controlled via a type argument.
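
As a rough sketch of how the three scales relate, the snippet below fits cv.glmnet to simulated data (the data and the names x, y, link, prob and cls are made up for illustration and are not from the question) and requests each type in turn. It assumes the glmnet package is installed.

```r
library(glmnet)

# Hypothetical simulated data: 5 predictors, 2-level factor outcome
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 5), n, 5)
y <- factor(ifelse(x[, 1] + rnorm(n) > 0, "yes", "no"))

cvFit <- cv.glmnet(x, y, family = "binomial", type.measure = "class")

# Default scale: linear predictor (log-odds of the second factor level)
link <- predict(cvFit, newx = x, s = "lambda.min", type = "link")
# Probability of the second factor level
prob <- predict(cvFit, newx = x, s = "lambda.min", type = "response")
# Predicted class label
cls <- predict(cvFit, newx = x, s = "lambda.min", type = "class")

# The probability is the inverse logit of the linear predictor
all.equal(as.numeric(plogis(link)), as.numeric(prob))

# The class is the second level whenever the linear predictor is positive
all(cls == levels(y)[(link > 0) + 1])
```

Both checks should return TRUE: a positive linear predictor corresponds to a probability above 0.5 and therefore to the second factor level.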

Related Solutions

Solved – Unexpected residuals plot of mixed linear model using lmer (lme4 package) in R

Your residual structure is exactly what one would expect with this model specification, and it indicates an ill-specified model. What you are basically trying to do is fit a straight line through points that can only take the values 0 and 1 on the $y$-axis.

Let's look at a simple example with arbitrarily generated variables:

#-----------------------------------------------------------------------------
# Generate random data for logistic regression
#-----------------------------------------------------------------------------

set.seed(123)
x <- rnorm(1000)          
z <- 1 + 2*x
pr <- 1/(1+exp(-z))
y <- rbinom(1000,1, pr)

#-----------------------------------------------------------------------------
# Plot the data
#-----------------------------------------------------------------------------

par(bg="white", cex=1.2)
plot(y~x, las=1, ylim=c(-0.1, 1.3))

#-----------------------------------------------------------------------------
# Fit a linear regression (nonsensical) and plot the fit
#-----------------------------------------------------------------------------

linear.mod <- lm(y~x)
segments(-2.32146, 0, 1.24196, 1, col="steelblue", lwd=2)
segments(1.24196, 1, 100, 28.71447, col="red", lwd=2)
segments(-100, -27.41153, -2.32146, 0, col="red", lwd=2)
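
The endpoints passed to segments() above are hard-coded. As a sketch, assuming the same simulated data, they can instead be derived from the fitted coefficients: the line crosses 0 at $x = -b_0/b_1$ and 1 at $x = (1-b_0)/b_1$. The names b0, b1, x_at_0, x_at_1 and outside are introduced here for illustration only.

```r
# Same simulated data as above
set.seed(123)
x <- rnorm(1000)
z <- 1 + 2 * x
pr <- 1 / (1 + exp(-z))
y <- rbinom(1000, 1, pr)

linear.mod <- lm(y ~ x)

# Intercept and slope of the fitted straight line
b0 <- coef(linear.mod)[["(Intercept)"]]
b1 <- coef(linear.mod)[["x"]]

# x values at which the fitted line crosses 0 and 1
x_at_0 <- -b0 / b1
x_at_1 <- (1 - b0) / b1

# Share of observations whose fitted value is not a valid probability
outside <- mean(fitted(linear.mod) < 0 | fitted(linear.mod) > 1)
```

Any observation with x below x_at_0 or above x_at_1 receives a fitted value outside $[0,1]$.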

[Figure: straight-line fit through the binary data, with the parts outside $[0,1]$ drawn in red]

As you can see, a straight line is fitted through the data. One problem with this is that the line predicts outcomes outside the interval $[0,1]$ (illustrated by the red segments outside that interval). Let's have a look at the residuals:

#-----------------------------------------------------------------------------
# Add the residual lines
#-----------------------------------------------------------------------------

x.y0 <- sample(which(y==0), 50, replace=F)
x.y1 <- sample(which(y==1), 50, replace=F)

pre <- predict(linear.mod)

segments(x[x.y0], y[x.y0], x[x.y0], pre[x.y0], col="red", lwd=2)
points(x[x.y0], y[x.y0], pch=16, col="red", las=1)

segments(x[x.y1], y[x.y1], x[x.y1], pre[x.y1], col="blue", lwd=2)
points(x[x.y1], y[x.y1], pch=16, col="blue", las=1)

[Figure: residual lines for a random sample of points]

I randomly picked some values to show the pattern. The red and blue lines depict the residuals, which are the differences between the observed values (red and blue dots) and the values predicted by the line. The blue lines correspond to the residuals where $y=1$, whereas the red lines correspond to the residuals where $y=0$. Because the outcome can only be 0 or 1, the residuals are simply the distances between the regression line and either 0 or 1. The residuals take exactly the form that you see in your data:

#-----------------------------------------------------------------------------
# Plot the residuals
#-----------------------------------------------------------------------------

res.linear <- residuals(linear.mod, type="response")

par(bg="white", cex=1.2)
plot(predict(linear.mod)[y==0], res.linear[y==0], las=1,
     xlab="Fitted values", ylab = "Residuals",
     ylim = max(abs(res.linear))*c(-1,1), xlim=c(-0.4, 1.6), col="red")
points(predict(linear.mod)[y==1], res.linear[y==1], col="blue")
abline(h = 0, lty = 2)

[Figure: residuals vs. fitted values for the linear model]

The colors correspond to the residuals shown above: the blue dots are the residuals where $y=1$ and the red dots are the residuals where $y=0$. In ordinary linear regression, the residuals are assumed to be approximately normally distributed. But in this case the residuals can hardly be normal: the outcome is binary, so for any given fitted value the residual can take only two values.
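
The two bands in the residual plot can be checked directly. The following sketch, assuming the same simulated data as above (fit and res are names introduced here), confirms that every residual equals either minus the fitted value (when $y=0$) or one minus the fitted value (when $y=1$):

```r
# Same simulated data as above
set.seed(123)
x <- rnorm(1000)
z <- 1 + 2 * x
pr <- 1 / (1 + exp(-z))
y <- rbinom(1000, 1, pr)

linear.mod <- lm(y ~ x)
fit <- fitted(linear.mod)
res <- residuals(linear.mod)

# y = 0: residual = 0 - fitted, a line with slope -1 through the origin
all.equal(unname(res[y == 0]), unname(-fit[y == 0]))

# y = 1: residual = 1 - fitted, the same line shifted up by 1
all.equal(unname(res[y == 1]), unname(1 - fit[y == 1]))
```

Both comparisons should return TRUE, which is why the residuals-vs-fitted plot shows two parallel lines rather than a structureless cloud.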

We need a transformation that maps the probability, which is bounded within $[0,1]$, onto a variable that ranges over $(-\infty, \infty)$. One such transformation is the logit (this is not the only possibility: we could also use the probit or the complementary log-log function). Let's fit a logistic regression with a logit link and plot the binned residuals (explained on page 97 of Gelman and Hill (2007)). Plotting the raw residuals against the fitted values is generally not useful after logistic regression:

#-----------------------------------------------------------------------------
# Fit a logistic regression
#-----------------------------------------------------------------------------

glm.fit <- glm(y~x, family=binomial(link="logit"))

#-----------------------------------------------------------------------------
# Plot the binned residuals as recommended by Gelman and Hill (2007)
#-----------------------------------------------------------------------------

library(arm)
par(bg="white", cex=1.2, las=1)
# Use fitted probabilities and response residuals (observed minus expected)
binnedplot(fitted(glm.fit), resid(glm.fit, type="response"), cex.pts=1, col.int="black")

[Figure: binned residual plot for the logistic regression]

The residuals in logistic regression can be defined, as in linear regression, as observed minus expected values: $$ \text{residual}_{i}=y_{i}-\mathrm{E}(y_{i}|X_{i})=y_{i}-\text{logit}^{-1}(X_{i}\beta) $$ Because the data $y_{i}$ are discrete, so are the residuals. In the plot above, the residuals are binned by dividing the data into categories based on their fitted values, and the average residual is then plotted against the average fitted value for each category (bin). The lines indicate $\pm2$ standard-error bounds, within which we would expect about 95% of the binned residuals to fall if the model were true.
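
To make the binning concrete, here is a minimal base-R sketch of the same idea, not the arm implementation itself. It assumes the simulated data from above; the number of bins (n_bins = 30) and the names p_hat, res, bin and binned are arbitrary choices for illustration.

```r
# Same simulated data as above
set.seed(123)
x <- rnorm(1000)
z <- 1 + 2 * x
pr <- 1 / (1 + exp(-z))
y <- rbinom(1000, 1, pr)

glm.fit <- glm(y ~ x, family = binomial(link = "logit"))

p_hat <- fitted(glm.fit)   # logit^{-1}(X beta), the expected value of y
res <- y - p_hat           # observed minus expected

# Split the observations into bins of roughly equal size by fitted value
n_bins <- 30
bin <- cut(rank(p_hat, ties.method = "first"), breaks = n_bins, labels = FALSE)

# Per-bin average fitted value, average residual, and 2 standard errors
binned <- data.frame(
  fitted = as.numeric(tapply(p_hat, bin, mean)),
  resid  = as.numeric(tapply(res, bin, mean)),
  se2    = as.numeric(2 * tapply(res, bin, sd) / sqrt(tapply(res, bin, length)))
)

# Share of bins whose average residual lies within the +/- 2 SE bounds
inside <- mean(abs(binned$resid) <= binned$se2)

plot(binned$fitted, binned$resid, las = 1,
     xlab = "Average fitted value", ylab = "Average residual",
     ylim = range(c(binned$se2, -binned$se2, binned$resid)))
lines(binned$fitted, binned$se2)
lines(binned$fitted, -binned$se2)
abline(h = 0, lty = 2)
```

If the model is adequate, inside should be close to 0.95 and the points should show no systematic pattern.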

So the remedy for your immediate problem is to fit a mixed effects logistic regression by typing:

library(lme4)
model <- glmer(error~is_frisian*condition*person+(1|subject_id),
               data=output, family="binomial")

For a good introduction to mixed effects logistic regression in R, see here. For a good overview of diagnostics in linear and generalized linear models, see here.

Solved – How to interpret all zero coefficients in the results of cv.glmnet

The fine-tuning of the penalization factor of Elastic Net during the cross validation has resulted in a penalty that shrinks all coefficients to zero.

Without being mathematically exact, this seems to indicate that none of your features is very helpful. In this case Elastic Net will always predict the mean of the data it was trained on.
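
As an illustration, the sketch below uses pure-noise features (hypothetical data, not from the question) and inspects the fit at the largest penalty on the glmnet path, where all coefficients should be zero. It assumes the glmnet package is installed; the names s_max, beta and pred are introduced here.

```r
library(glmnet)

# Hypothetical data: the response is unrelated to the 10 features
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 10), n, 10)
y <- rnorm(n)

fit <- glmnet(x, y)        # gaussian family, lasso penalty by default
s_max <- fit$lambda[1]     # largest penalty on the fitted path

# Coefficients without the intercept, and predictions, at that penalty
beta <- as.matrix(coef(fit, s = s_max))[-1, 1]
pred <- predict(fit, newx = x, s = s_max)

# All slopes are zero, so only the intercept remains
all(beta == 0)

# Every prediction equals the training mean of y
all.equal(as.numeric(pred), rep(mean(y), n))
```

With only the intercept left, the features have no influence on the prediction at all.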

Your accuracy measure is problematic, because simply predicting the mean can already produce very high scores.

For example, for a standard normal distribution the mean absolute deviation from the mean is $\sqrt{2/\pi}$, or about 0.8. With a large sample the range is easily around 8, giving you an accuracy of about $1 - 0.8/8 = 0.9$.

See here:

> set.seed(123)
> x <- rnorm(1e5)
> 1-mean(abs(x-mean(x)))/diff(range(x))
0.9056073
