Solved – Estimate the sum of predicted variables by a linear model in R

least squarespredictionprediction intervalr

I have a dataset with 3000 sub-regions with data about their population by income range and their value spending in a commodity. I made a OLS model with log-log transformation using lm() function in R to predict the spending in another 300 sub-regions.

$$
\ln(Y+1) = \beta_0 + \beta_1\ln(X_1+1) + \beta_2\ln(X_2+1) + … + \epsilon
$$

Where $Y$ is the aggregated spending by the sub-region, and the $X$'s are the Population by income range.

In R:

myModel = lm(log(spending + 1) ~ log(pop_income1 + 1) + log(pop_income2) +
           log(pop_income3 + 1) + log(pop_income4 + 1), data=myOldData)

Then I use predict(myModel, myNewData, interval = "prediction").

But this resulted in the expected value of $\ln(Y_i+1)$ and its predictions intervals for every $i$ and I need the prediction interval and mean of $\sum\limits_{i=1}^n Y_i$, where $n$ is 300.

How can I do that with R?

Best Answer

There are two ways to produce predictions on the levels of a variable $Y_i$ for a linear regression that is fitted to the logarithms of the variable $$ \log Y_i = \boldsymbol{X}_i\boldsymbol{\beta} + \varepsilon_i $$ where $ \mathbb{E}(\varepsilon_i \mid \boldsymbol{X}_i) = 0$.

Option 1:

One option is $$ \begin{align} \widehat{Y}_i &= \overline{\exp(\widehat{\varepsilon})}\exp\left(\widehat{\log Y} _i\right) \end{align} $$ where $\overline{\exp(\widehat{\varepsilon})}$ is the sample mean of the exponentiated residuals from the log-normal fit.

A theoretical justification proceeds by noting that the log-normal regression implies that $$ \begin{align} Y_i &= \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta} + \varepsilon_i\right) \\ Y_i &= \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)\exp(\varepsilon_i) \end{align} $$

Under conditional indepedence of the errors and the covariates, we can write $$ \mathbb{E}(Y_i \mid \boldsymbol{X}_i) = \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)\mathbb{E}(\exp(\varepsilon_i)) $$

We will need estimates of $\exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)$ and $\mathbb{E}(\exp(\varepsilon_i))$ in order to construct estimates of $\mathbb{E}(Y_i \mid \boldsymbol{X}_i)$. A consistent estimator of $\mathbb{E}(\exp(\varepsilon_i))$ is $\overline{\exp(\widehat{\varepsilon})}$. Combining this information, we get that a theoretically justifiable estimator of $\mathbb{E}(Y_i \mid \boldsymbol{X}_i)$ is $$ \widehat{Y}_i = \exp\left(\boldsymbol{X}_i'\widehat{\boldsymbol{\beta}}\right)\overline{\exp(\widehat{\varepsilon})} $$

R implementation:

This can be easily coded in R.

data(longley)

# log-linear regression
lmLongley = lm(log(GNP) ~ GNP.deflator + Armed.Forces + Population, 
               data = longley)

# compute the predictions
predGNPLevel = exp(predict(lmLongley))*mean(exp(resid(lmLongley)))

# print the output
cbind(predGNPLevel, longley$GNP)

Option 2:

The second option follows if we assume that the errors in the log-linear regression follow a normal distribution $N(0, \sigma^2)$, then $\exp(\varepsilon_i)$ follows a log-normal distribution, with mean $\exp(\sigma^2/2)$. This implies that $$ \begin{align} \mathbb{E}(Y_i \mid \boldsymbol{X}_i) &= \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)\mathbb{E}(\exp(\varepsilon_i)) \\ &= \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)\exp(\sigma^2/2) \end{align} $$

Plugging in consistent estimates of $\sigma^2$ produces the predictions $$ \widehat{Y}_i = \exp\left(\boldsymbol{X}_i'\widehat{\boldsymbol{\beta}}\right)\exp(\widehat{\sigma^2}/2) $$

R implementation:

The R implementation is again straighforward:

predGNPLevel2 = exp(predict(lmLongley))*exp(var(resid(lmLongley))/2)

# print the output
cbind(predGNPLevel2, longley$GNP)

Related Solutions

Solved – Multiple regression and OLS. How to choose the best “non-linear” specification

I've become rather enamoured of late with generalized additive modelling to handle non-linearity. The gam() function from the mgcv package for R makes things very easy as it incorporates automated generalized cross-validation to avoid overfitting.

Solved – Does more variables mean tighter confidence intervals

do I ALWAYS get a tighter confidence interval if I include more variables in my model?

Yes, you do (EDIT: ...basically. Subject to some caveats. See comments below). Here's why: adding more variables reduces the SSE and thereby the variance of the model, on which your confidence and prediction intervals depend. This even happens (to a lesser extent) when the variables you are adding are completely independent of the response:

a=rnorm(100)
b=rnorm(100)
c=rnorm(100)
d=rnorm(100)
e=rnorm(100)

summary(lm(a~b))$sigma       # 0.9634881
summary(lm(a~b+c))$sigma     # 0.961776
summary(lm(a~b+c+d))$sigma   # 0.9640104 (Went up a smidgen)
summary(lm(a~b+c+d+e))$sigma # 0.9588491 (and down we go again...)

But this does not mean you have a better model. In fact, this is how overfitting happens.

Consider the following example: let's say we draw a sample from a quadratic function with noise added.

enter image description here

A first order model will fit this poorly and have very high bias.

enter image description here

A second order model fits well, which is not surprising since this is how the data was generated in the first place.

enter image description here

But let's say we don't know that's how the data was generated, so we fit increasingly higher order models.

enter image description here

As the complexity of the model increases, we're able to capture more of the fluctuations in the data, effectively fitting our model to the noise, to patterns that aren't really there. With enough complexity, we can build a model that will go through each point in our data nearly exactly.

enter image description here

As you can see, as the order of the model increases, so does the fit. We can see this quantitatively by plotting the training error:

enter image description here

But if we draw more points from our generating function, we will observe the test error diverges rapidly.

enter image description here

The moral of the story is to be wary of overfitting your model. Don't just rely on metrics like adjusted-R2, consider validating your model against held out data (a "test" set) or evaluating your model using techniques like cross validation.

For posterity, here's the code for this tutorial:

set.seed(123)
xv = seq(-5,15,length.out=1e4)
X=sample(xv,20)
gen=function(v){v^2 + 7*rnorm(length(v))}
Y=gen(X)
df = data.frame(x=X,y=Y)
plot(X,Y)
lines(xv,xv^2, col="blue") # true model
legend("topleft", "True Model", lty=1, col="blue")    

build_formula = function(N){ 
  paste('y~poly(x,',N,',raw=T)')
}

deg=c(1,2,10,20)
formulas = sapply(deg[2:4], build_formula)
formulas = c('y~x', formulas)
pred = lapply(formulas
              ,function(f){
                predict(
                  lm(f, data=df)
                  ,newdata=list(x=xv))
                            })
# Progressively add fit lines to the plot
lapply(1:length(pred), function(i){
  plot(df, main=paste(deg[i],"-Degree"))
  lapply(1:i,function(n){
    lines(xv,pred[[n]], col=n)
  })
  })

# let's actually generate models from poly 1:20 to calculate MSE
deg=seq(1,20)
formulas = sapply(deg, build_formula)
pred.train = lapply(formulas
                   ,function(f){
                     predict(
                       lm(f, data=df)
                       ,newdata=list(x=df$x))
                   })

pred.test = lapply(formulas
              ,function(f){
                predict(
                  lm(f, data=df)
                  ,newdata=list(x=xv))
              })

rmse.train = unlist(lapply(pred.train,function(P){
  regr.eval(df$y,P, stats="rmse")
}))    

yv=gen(xv)
rmse.test = unlist(lapply(pred.test,function(P){
  regr.eval(yv,P, stats="rmse")
}))    

plot(rmse.train, type='l', col='blue'
     , main="Training Error"
     ,xlab="Model Complexity")

plot(rmse.test, type='l', col='red'
     , main="Train vs. Test Error"
     ,xlab="Model Complexity")
lines(rmse.train, type='l', col='blue')
legend("topleft", c("Test","Train"), lty=c(1,1), col=c("red","blue"))

Best Answer

Option 1:

R implementation:

Option 2:

R implementation:

Related Solutions

Solved – Multiple regression and OLS. How to choose the best “non-linear” specification

Solved – Does more variables mean tighter confidence intervals

Related Question