Solved – Estimate the sum of predicted variables by a linear model in R

least-squares, prediction, prediction-interval, r

I have a dataset of 3000 sub-regions with data on their population by income range and their spending on a commodity. I fit an OLS model with a log-log transformation, using the lm() function in R, to predict spending in another 300 sub-regions.

$$
\ln(Y+1) = \beta_0 + \beta_1\ln(X_1+1) + \beta_2\ln(X_2+1) + … + \epsilon
$$

where $Y$ is the aggregated spending of a sub-region and the $X$'s are its populations by income range.

In R:

myModel = lm(log(spending + 1) ~ log(pop_income1 + 1) + log(pop_income2 + 1) +
           log(pop_income3 + 1) + log(pop_income4 + 1), data = myOldData)

Then I use predict(myModel, myNewData, interval = "prediction").

But this returns the expected value of $\ln(Y_i+1)$ and its prediction interval for every $i$, whereas I need the mean and prediction interval of $\sum\limits_{i=1}^n Y_i$, where $n = 300$.

How can I do that with R?

Best Answer

There are two ways to produce predictions on the levels of a variable $Y_i$ from a linear regression that is fitted to the logarithms of the variable $$ \log Y_i = \boldsymbol{X}_i'\boldsymbol{\beta} + \varepsilon_i $$ where $\mathbb{E}(\varepsilon_i \mid \boldsymbol{X}_i) = 0$.

Option 1:

One option is $$ \begin{align} \widehat{Y}_i &= \overline{\exp(\widehat{\varepsilon})}\exp\left(\widehat{\log Y}_i\right) \end{align} $$ where $\overline{\exp(\widehat{\varepsilon})}$ is the sample mean of the exponentiated residuals from the log-linear fit.

A theoretical justification proceeds by noting that the log-linear regression implies that $$ \begin{align} Y_i &= \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta} + \varepsilon_i\right) \\ &= \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)\exp(\varepsilon_i) \end{align} $$

Under independence of the errors and the covariates, we can write $$ \mathbb{E}(Y_i \mid \boldsymbol{X}_i) = \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)\mathbb{E}(\exp(\varepsilon_i)) $$

We need estimates of $\exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)$ and $\mathbb{E}(\exp(\varepsilon_i))$ in order to construct an estimate of $\mathbb{E}(Y_i \mid \boldsymbol{X}_i)$. A consistent estimator of $\mathbb{E}(\exp(\varepsilon_i))$ is $\overline{\exp(\widehat{\varepsilon})}$. Combining these, a theoretically justified estimator of $\mathbb{E}(Y_i \mid \boldsymbol{X}_i)$ (Duan's smearing estimator) is $$ \widehat{Y}_i = \exp\left(\boldsymbol{X}_i'\widehat{\boldsymbol{\beta}}\right)\overline{\exp(\widehat{\varepsilon})} $$

R implementation:

This can be easily coded in R.

data(longley)

# log-linear regression
lmLongley = lm(log(GNP) ~ GNP.deflator + Armed.Forces + Population, 
               data = longley)

# compute the predictions
predGNPLevel = exp(predict(lmLongley))*mean(exp(resid(lmLongley)))

# compare the level predictions with the observed GNP
cbind(predGNPLevel, longley$GNP)
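
Applied back to the question's model, the same idea gives a point estimate of the sum: by linearity of expectation, $\mathbb{E}\left(\sum_i Y_i\right) = \sum_i \mathbb{E}(Y_i)$, so the retransformed predictions can simply be summed. A minimal sketch, assuming the myModel and myNewData objects from the question (note the extra $-1$ to undo the $\log(Y+1)$ shift):

# smearing factor from the training residuals (Option 1)
smear = mean(exp(resid(myModel)))

# level predictions for the 300 new sub-regions, undoing log(Y + 1)
predLevel = exp(predict(myModel, myNewData)) * smear - 1

# point estimate of the sum
sum(predLevel)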

Option 2:

The second option follows if we assume that the errors in the log-linear regression are normally distributed, $\varepsilon_i \sim N(0, \sigma^2)$. Then $\exp(\varepsilon_i)$ follows a log-normal distribution with mean $\exp(\sigma^2/2)$, which implies that $$ \begin{align} \mathbb{E}(Y_i \mid \boldsymbol{X}_i) &= \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)\mathbb{E}(\exp(\varepsilon_i)) \\ &= \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)\exp(\sigma^2/2) \end{align} $$

Plugging in a consistent estimate of $\sigma^2$ produces the predictions $$ \widehat{Y}_i = \exp\left(\boldsymbol{X}_i'\widehat{\boldsymbol{\beta}}\right)\exp(\widehat{\sigma}^2/2) $$

R implementation:

The R implementation is again straightforward:

# note: var() uses denominator n - 1; summary(lmLongley)$sigma^2 (RSS/(n - p)) is an alternative
predGNPLevel2 = exp(predict(lmLongley))*exp(var(resid(lmLongley))/2)

# compare the level predictions with the observed GNP
cbind(predGNPLevel2, longley$GNP)
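
Neither option by itself gives the prediction interval for the sum that the question asks about, because the interval for a sum depends on the joint uncertainty across all 300 predictions. A simulation-based sketch under the normality assumption of Option 2, again assuming the myModel and myNewData objects from the question (and, for simplicity, ignoring the uncertainty in $\widehat{\sigma}$):

library(MASS)   # for mvrnorm

set.seed(1)
nSim  = 10000
Xnew  = model.matrix(delete.response(terms(myModel)), data = myNewData)
sigma = summary(myModel)$sigma

# draw coefficient vectors from their estimated sampling distribution
betaSim = mvrnorm(nSim, coef(myModel), vcov(myModel))

# for each draw, simulate the 300 responses on the level scale and sum them
sumDraws = apply(betaSim, 1, function(b) {
  mu  = Xnew %*% b
  eps = rnorm(nrow(Xnew), 0, sigma)
  sum(exp(mu + eps) - 1)   # undo log(Y + 1)
})

mean(sumDraws)                        # mean of the simulated sums
quantile(sumDraws, c(0.025, 0.975))   # approximate 95% prediction interval

Drawing the coefficients from $N\left(\widehat{\boldsymbol{\beta}}, \widehat{\mathrm{Var}}(\widehat{\boldsymbol{\beta}})\right)$ accounts for the correlation between the 300 predicted means, which the row-by-row intervals from predict() ignore.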