Regression – Predicting Y from Log Y as Dependent Variable

back-transformation, data transformation, dependent variable, lognormal distribution, regression

In the book Introductory Econometrics by Wooldridge, the chapter on predicting values of $\hat{y}$ (Chapter 6.4 in the 5th edition) states the following:

If the estimated model is:

$$\widehat{\log(y)} = \hat{\beta_0} + \hat{\beta_1}x_1 + \cdots + \hat{\beta_k}x_k$$

then

$$\hat{y} = \exp\left(\frac{\hat{\sigma}^2}{2}\right) \exp(\widehat{\log(y)})$$

where $\hat{\sigma}^2$ is the unbiased estimator of $\sigma^2$.
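In code, the recipe above amounts to something like the following (a minimal Python sketch using statsmodels, with made-up placeholder data; the variable names are not from the book):

```python
# Minimal sketch of the recipe above, using statsmodels; the data are placeholders.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1, x2 = rng.uniform(0, 1, 100), rng.uniform(0, 1, 100)   # placeholder regressors
y = np.exp(1 + 2 * x1 - x2 + rng.normal(0, 0.5, 100))     # placeholder response

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(np.log(y), X).fit()              # regress log(y) on the x's

log_y_hat = fit.predict(X)                    # fitted values of log(y)
sigma2_hat = fit.scale                        # unbiased estimate of sigma^2 (SSR / df_resid)

y_hat = np.exp(sigma2_hat / 2) * np.exp(log_y_hat)   # the corrected prediction
y_hat_naive = np.exp(log_y_hat)                      # versus the naive back-transform
```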

Can someone please explain why this is the case, and why we cannot simply take
$$\hat{y} = \exp(\widehat{\log(y)})\,?$$

Best Answer

The underlying model is

$$E[\log Y] = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

or, in terms of error terms $\varepsilon_i,$

$$\log Y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \varepsilon_i.\tag{*}$$

If we assume the conditional distribution of $\log Y$ is Normal, then the Ordinary Least Squares (OLS) estimate of $\log Y$ is also Normal, because the estimate is an affine linear combination of the errors. Suppose $\sigma^2$ is the true (but unknown) variance of that conditional distribution. Then

$$E[Y] = e^{\sigma^2/2} e^{E[\log Y]}.$$

(This is a readily-calculated property of Lognormal distributions: see Wikipedia, for instance.)
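A quick numerical check of this identity (a Python sketch with arbitrary values of $\mu$ and $\sigma$, not tied to anything in the question):

```python
# Numerical check of E[Y] = exp(sigma^2/2) * exp(E[log Y]) for a lognormal Y.
# mu and sigma are arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8                       # E[log Y] and sd of log Y
y = np.exp(rng.normal(mu, sigma, size=1_000_000))

print(y.mean())                            # empirical E[Y]
print(np.exp(sigma**2 / 2) * np.exp(mu))   # exp(sigma^2/2) * exp(E[log Y]) -- matches
print(np.exp(mu))                          # naive exp(E[log Y]) -- too small
```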

Wooldridge plugs the estimates of $\sigma^2$ and $E[\log Y]$ into this formula. As such, it can be viewed as a method of moments estimate of $E[Y].$

Although intuitively reasonable, this estimator is not necessarily the best or even a good one. For instance, it is biased: see https://stats.stackexchange.com/a/105734/919 for a discussion and a derivation of an unbiased version. Its main flaw is extreme sensitivity to the precision of the estimate $\hat \sigma^2:$ to use it reliably, you want either a great deal of data or for $\sigma^2$ to be very small.
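To make the comparison concrete, here is a small simulation sketch (Python; all parameter values are invented, and this is not code from the linked thread) contrasting the corrected plug-in estimator with the naive back-transform at a single point $x_0$:

```python
# Simulation sketch: naive exp(log-fit) vs. the exp(sigma_hat^2/2)-corrected plug-in
# estimator of E[Y | x]. All parameter values are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
n, beta0, beta1, sigma, reps = 200, 1.0, 0.5, 0.6, 2000
x0 = 2.0                                                        # point at which we predict
true_mean = np.exp(sigma**2 / 2) * np.exp(beta0 + beta1 * x0)   # E[Y | x = x0]

naive, corrected = [], []
for _ in range(reps):
    x = rng.uniform(0, 4, n)
    log_y = beta0 + beta1 * x + rng.normal(0, sigma, n)
    X = np.column_stack([np.ones(n), x])
    b, *_ = np.linalg.lstsq(X, log_y, rcond=None)   # OLS on the logs
    resid = log_y - X @ b
    s2 = resid @ resid / (n - 2)                    # unbiased estimate of sigma^2
    fit0 = b[0] + b[1] * x0                         # estimate of E[log Y | x = x0]
    naive.append(np.exp(fit0))
    corrected.append(np.exp(s2 / 2) * np.exp(fit0))

print("true E[Y | x0]     :", true_mean)
print("naive estimator    :", np.mean(naive))       # biased low: it targets the geometric mean
print("corrected estimator:", np.mean(corrected))   # much closer to E[Y | x0]
```

Shrinking $n$ or inflating $\sigma$ in this sketch illustrates the sensitivity mentioned above: as $\hat\sigma^2$ becomes noisier, the corrected estimator degrades quickly.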


In light of this, you may indeed consider using the estimate

$$\widehat Y = \exp\left(\widehat {E[\log Y]}\right).$$

This estimates the geometric mean of the conditional response (essentially by definition of geometric mean). In some applications it might be a better choice. After all, when you fit the logarithms of your data using OLS you were downweighting underestimates of $Y$ compared to overestimates, demonstrating you really don't want accurate estimates of $E[Y]$ itself. If you did, you would have fit the nonlinear least-squares model

$$E[Y] = \exp\left(\alpha_0+ \alpha_1 x_1 + \cdots + \alpha_k x_k\right) .$$

If you want to express the error terms $\delta_i$ explicitly, this is equivalent to

$$Y_i = e^{\alpha_0}\, \left(e^{x_{1i}}\right)^{\alpha_1}\,\cdots\,\left(e^{x_{ki}}\right)^{\alpha_k} + \delta_i.\tag{**}$$

It is instructive to compare this to the exponential of $(*)$ which asserts

$$Y_i = e^{\beta_0}\, \left(e^{x_{1i}}\right)^{\beta_1}\,\cdots\,\left(e^{x_{ki}}\right)^{\beta_k} \, e^{\varepsilon_i}.$$

Where $(*)$ posits multiplicative errors $\cdot e^{\varepsilon_i},$ $(**)$ posits additive errors $+\delta_i.$ That's the basic difference between the two models. (And, as a result, the values of the $\alpha_j$ will not equal the corresponding $\beta_j$ and their estimates will often differ, too.)
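A short sketch of that contrast (Python; the data-generating values are invented) fits both models to the same simulated data:

```python
# Fit model (*) (OLS on log Y, multiplicative errors) and model (**) (nonlinear least
# squares on Y, additive errors) to the same data. Parameter values are invented.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)
n, beta0, beta1, sigma = 500, 0.5, 0.8, 0.7
x = rng.uniform(0, 3, n)
y = np.exp(beta0 + beta1 * x + rng.normal(0, sigma, n))   # data with multiplicative errors

# Model (*): OLS on the logs
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

# Model (**): nonlinear least squares for E[Y] = exp(alpha0 + alpha1 * x)
alpha_hat, _ = curve_fit(lambda x, a0, a1: np.exp(a0 + a1 * x), x, y, p0=[0.0, 1.0])

print("log-OLS (beta):", beta_hat)    # close to (beta0, beta1)
print("NLS    (alpha):", alpha_hat)   # intercept differs: alpha0 is near beta0 + sigma^2/2
```

Under this particular data-generating process $E[Y\mid x] = \exp(\beta_0 + \sigma^2/2 + \beta_1 x)$, so the two intercepts differ by about $\sigma^2/2$ while the slopes agree; with other error structures the estimates can diverge more broadly.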
