Solved – Why, when I scale the data set, glmnet gives error

glmnetmachine learningregression

I'm using glmnet for building the regression models.
My data are already log-transformed. when I scale my data set (zero mean, and SD=1), I get the following error:

Error in elnet(x, is.sparse, ix, jx, y, weights, offset, type.gaussian,  : 
  NA/NaN/Inf in foreign function call (arg 5)

But without scaling data, glmnet run perfectly.
would someone help me to understand why this happens?

Best Answer

You most likely have NaN in your scaled data due to scaling of some columns with constant values (dividing by SD of zero), check the scaled data with is.na().

Related Solutions

Solved – Regression models for log transformed data without multiplicative error

I illustrate five options to fit a model here. The assumption for all of them is that the relationship is actually $y = a \cdot x^b$ and we only need to decide on the appropriate error structure.

1.) First the OLS model $\ln{y} = a + b\cdot\ln{x}+\varepsilon$, i.e., a multiplicative error after back-transformation.

fit1 <- lm(log(y) ~ log(x), data = DF)

I would argue that this is actually an appropriate error model as you clearly have increasing scatter with increasing values.

2.) A non-linear model $y = \alpha\cdot x^b+\varepsilon$, i.e., an additive error.

fit2 <- nls(y ~ a * x^b, data = DF, start = list(a = exp(coef(fit1)[1]), b = coef(fit1)[2]))

3.) A Generalized Linear Model with Gaussian distribution and a log link function. We will see that this is actually the same model as 2 when we plot the result.

fit3 <- glm(y ~ log(x), data = DF, family = gaussian(link = "log"))

4.) A non-linear model as 2, but with a variance function $s^2(y) = \exp(2\cdot t \cdot y)$, which adds an additional parameter.

library(nlme)
fit4 <- gnls(y ~ a * x^b, params = list(a ~ 1, b ~ 1),
             data = DF, start = list(a = exp(coef(fit1)[1]), b = coef(fit1)[2]), 
             weights = varExp(form = ~ y))

5.) A GLM with a gamma distribution and a log link.

fit5 <- glm(y ~ log(x), data = DF, family = Gamma(link = "log"))

Now let's plot them:

plot(y ~ x, data = DF)
curve(exp(predict(fit1, newdata = data.frame(x = x))), col = "green", add = TRUE)
curve(predict(fit2, newdata = data.frame(x = x)), col = "black", add = TRUE)
curve(predict(fit3, newdata = data.frame(x = x), type = "response"), col = "red", add = TRUE, lty = 2)
curve(predict(fit4, newdata = data.frame(x = x)), col = "brown", add = TRUE)
curve(predict(fit5, newdata = data.frame(x = x), type = "response"), col = "cyan", add = TRUE)

legend("topleft", legend = c("OLS", "nls", "Gauss GLM", "weighted nls", "Gamma GLM"),
       col = c("green", "black", "red", "brown", "cyan"),
       lty = c(1, 1, 2, 1, 1))

I hope these fits persuade you that you actually should use a model that allows larger variance for larger values. Even the model where I fit a variance model agrees on that. If you use the non-linear model or Gaussian GLM you place undue weight on the larger values.

Finally, you should consider carefully if the assumed relationship is the correct one. Is it supported by scientific theory?

Solved – Why glmnet gives different coefficient estimates with same data

You might find this link helpful.

Essentially

c1 are coefficients for unstandardized data (in the original units of the data)
c2 are coefficients for standardized data (in z score units)
c3 are coefficients for standardized data (in original units of the data)
c4 is because of this

Best Answer

Related Solutions

Solved – Regression models for log transformed data without multiplicative error

Solved – Why glmnet gives different coefficient estimates with same data

Related Question