The approach adopted by the forecast package is to fit a model to the transformed data and to back-transform both the point forecasts and the prediction intervals. This is not "wrong"; it is a legitimate choice. It preserves the coverage of the prediction intervals, and the back-transformed point forecast can be interpreted as the median of the forecast density (assuming the forecast density on the transformed scale is symmetric). For many purposes this is acceptable, but occasionally the mean forecast is required. For example, in hierarchical forecasting the forecasts need to be aggregated, and means aggregate but medians do not.
It is easy enough to derive the mean forecast using a Taylor series expansion. Suppose $f(x)$ represents the back-transformation function, $\mu$ is the mean on the transformed scale and $\sigma^2$ is the variance on the transformed scale. Then using the first three terms of a Taylor expansion around $\mu$, the mean on the original scale is given by
$$f(\mu) + \frac{1}{2}\sigma^2f''(\mu).$$
For a Box-Cox transformation,
$$f(x) = \left\{\begin{array}{ll}
(\lambda x+1)^{1/\lambda} & \text{if $\lambda\ne0$;}\\
e^x & \text{if $\lambda=0$.}\end{array}\right.
$$
So
$$f''(x) = \left\{\begin{array}{ll}
(1-\lambda)(\lambda x+1)^{1/\lambda-2} & \text{if $\lambda\ne0$;}\\
e^x & \text{if $\lambda=0$.}\end{array}\right.
$$
and the back-transformed mean is given by
$$\left\{\begin{array}{ll}
(\lambda \mu+1)^{1/\lambda}\left[1 + \frac{\sigma^2(1-\lambda)}{2(\lambda \mu+1)^{2}}\right] & \text{if $\lambda\ne0$;}\\
e^\mu\left[1 + \frac{\sigma^2}{2}\right] & \text{if $\lambda=0$.}\end{array}\right.
$$
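For the log transformation ($\lambda = 0$) the exact back-transformed mean is known in closed form, $e^{\mu + \sigma^2/2}$ (the lognormal mean), which makes it easy to check the quality of the second-order Taylor approximation. A minimal sketch in R:

```r
# Compare the exact lognormal mean with the Taylor approximation e^mu * (1 + sigma^2/2)
mu <- 2
sigma2 <- 0.1
exact  <- exp(mu + sigma2 / 2)        # exact mean of exp(X), X ~ N(mu, sigma2)
taylor <- exp(mu) * (1 + sigma2 / 2)  # second-order Taylor approximation
c(exact = exact, taylor = taylor, rel_error = abs(exact - taylor) / exact)
```

For small $\sigma^2$ the two agree closely; the approximation degrades as the forecast variance grows.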
Therefore, to adjust the back-transformed mean obtained in R, the following code can be used.

```r
library(forecast)

# lambda = 0 (log transformation)
fit <- auto.arima(AirPassengers, lambda = 0)
fc <- forecast(fit, h = 50, level = 95)
# forecast variance on the transformed scale, recovered from the 95% interval
fvar <- ((BoxCox(fc$upper, fit$lambda) - BoxCox(fc$lower, fit$lambda)) /
           qnorm(0.975) / 2)^2
plot(fc)
fc$mean <- fc$mean * (1 + 0.5 * fvar)   # e^mu * (1 + sigma^2/2)
lines(fc$mean, col = 'red')

# lambda != 0
fit <- auto.arima(AirPassengers, lambda = 0.2)
fc <- forecast(fit, h = 50, level = 95)
fvar <- ((BoxCox(fc$upper, fit$lambda) - BoxCox(fc$lower, fit$lambda)) /
           qnorm(0.975) / 2)^2
plot(fc)
# here (lambda*mu + 1) = fc$mean^lambda, so the adjustment factor becomes:
fc$mean <- fc$mean * (1 + 0.5 * fvar * (1 - fit$lambda) / (fc$mean)^(2 * fit$lambda))
lines(fc$mean, col = 'red')
```
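Note that recent versions of the forecast package can perform this bias adjustment for you via the biasadj argument (assuming your installed version implements it; see the help for forecast and InvBoxCox):

```r
library(forecast)
fit <- auto.arima(AirPassengers, lambda = 0)
# biasadj = TRUE returns (approximate) mean forecasts instead of median forecasts
fc <- forecast(fit, h = 50, level = 95, biasadj = TRUE)
```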
An extended version of this answer is given on my blog.
Bantis, L. E., Nakas, C. T., and Reiser, B. (2014), "Construction of confidence regions in the ROC space after the estimation of the optimal Youden index-based cut-off point", Biometrics, 70: 212–223. doi:10.1111/biom.12107 implements the method and gives the following formula:
We see that they use the transformed values when calculating the standard deviations (in the square brackets) but the untransformed values in the third row of the formula.
Again, I implemented this in R. This time my only quibble is that $-\frac{n_X}{2}$ and $-\frac{n_Y}{2}$ apparently should be replaced by $-n_X$ and $-n_Y$, which as I understand it are the numbers of observations in the two groups; otherwise the results are way off. The R code thus is:

```r
library(forecast)

profile_loglik <- function(lambda, x, y) {
  m <- length(x)
  n <- length(y)
  x_t <- BoxCox(x, lambda)  # Box-Cox transform (forecast::BoxCox)
  y_t <- BoxCox(y, lambda)
  -m * log(sd(x_t)) - n * log(sd(y_t)) +
    (lambda - 1) * (sum(log(x)) + sum(log(y)))
}

calc_lambda_hat <- function(x, y) {
  # minimise the negative profile log-likelihood over lambda
  optim_func <- function(lambda) -profile_loglik(lambda = lambda, x = x, y = y)
  optim(0.5, optim_func, method = "L-BFGS-B", lower = -4, upper = 10)
}
```
I tested `calc_lambda_hat` against concatenating both groups and then using `forecast::BoxCox.lambda(c(x, y), "loglik")`. The correlation of the returned values of $\lambda$ over 1000 repetitions in various scenarios with lognormal distributions is about 85%, so the implementation seems to be OK.
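As a quick sanity check (a sketch assuming the functions above are in scope and the forecast package is installed), `calc_lambda_hat` can be run on two simulated lognormal samples, for which the true $\lambda$ is 0:

```r
set.seed(1)
x <- rlnorm(200, meanlog = 0,   sdlog = 0.5)  # lognormal group 1
y <- rlnorm(200, meanlog = 0.3, sdlog = 0.5)  # lognormal group 2
res <- calc_lambda_hat(x, y)
res$par          # estimated lambda, should be near 0 for lognormal data
res$convergence  # 0 indicates successful convergence
```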
Best Answer
To ensure that the likelihoods for different values of $\lambda$ are comparable, we need the log likelihood based on the untransformed non-Gaussian data $y_1,\dots,y_n$, that is, $$ l(\lambda,\boldsymbol{\beta},\sigma^2)=\ln f(y_1,\dots,y_n).\tag{1} $$ According to the model, the pdf of the transformed data, say $f_{(\lambda)}$, is Gaussian, such that, $$ \ln f_{(\lambda)}(y_1^{(\lambda)},\dots,y_n^{(\lambda)})= -\frac{n}{2}\log(2\pi\sigma^{2})-\frac{1}{2\sigma^{2}}(\mathbf{y}^{(\lambda)}-\mathbf{X}\boldsymbol{\beta})^{T}(\mathbf{y}^{(\lambda)}-\mathbf{X}\boldsymbol{\beta}). \tag{2} $$ The relationship between the two pdfs is $$ f(y_1,\dots,y_n)=f_{(\lambda)}(y_1^{(\lambda)},\dots,y_n^{(\lambda)})|\mathbf{J}|,\tag{3} $$ where the diagonal elements of the Jacobian are $\partial y_i^{(\lambda)}/\partial y_i = y_i^{\lambda-1}$ and the off-diagonal elements $\partial y_i^{(\lambda)}/\partial y_j$ are all zero. Hence, the determinant $$|\mathbf{J}|=\prod_{i=1}^n y_i^{\lambda-1}.\tag{4} $$ Combining (1) to (4) leads to the log likelihood function $l$ in your question.
EDIT: Instead of including the additional term from the Jacobian (3) in the log likelihood (e.g. before maximising this with respect to $\lambda$), an alternative is to fit the model to $y_i'=y_i^{(\lambda)}/\tilde y^{\lambda-1}$ where $\tilde y=(\prod_{i=1}^n y_i)^{1/n}$ (the geometric mean of the original $y_i$'s). This has pdf \begin{align} f'(y_1',\dots,y_n')&=f_{(\lambda)}(y_1^{(\lambda)},\dots,y_n^{(\lambda)})(\tilde y^{\lambda-1})^n \\&=f_{(\lambda)}(y_1^{(\lambda)},\dots,y_n^{(\lambda)})\prod_{i=1}^n y_i^{\lambda-1} \\&=f(y_1,\dots,y_n). \end{align} The likelihood you obtain from fitting the model to this transformation is thus identical to the likelihood (1) that we seek without the inclusion of any extra correction term.
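The equivalence of the two approaches is easy to verify numerically. A minimal sketch for the intercept-only case (the helper names `bc` and `gauss_ll` are my own, and the Gaussian log-likelihood is evaluated at the MLEs):

```r
# Show that loglik(transformed data) + Jacobian term equals
# loglik(data rescaled by the geometric mean), intercept-only model
y <- c(1, 2, 3, 5, 8)
lambda <- 0.5
bc <- function(y, l) if (l == 0) log(y) else (y^l - 1) / l  # Box-Cox transform
yt <- bc(y, lambda)

# Gaussian log-likelihood of a sample at its MLEs (mean and biased variance)
gauss_ll <- function(z) {
  s2 <- mean((z - mean(z))^2)
  sum(dnorm(z, mean(z), sqrt(s2), log = TRUE))
}

l_jacobian <- gauss_ll(yt) + (lambda - 1) * sum(log(y))  # likelihood (1) via (3)-(4)
gm <- exp(mean(log(y)))                                  # geometric mean of y
l_rescaled <- gauss_ll(yt / gm^(lambda - 1))             # fit to the rescaled y'
all.equal(l_jacobian, l_rescaled)  # TRUE
```

Rescaling by the constant $\tilde y^{\lambda-1}$ shifts the Gaussian log-likelihood by exactly $n(\lambda-1)\ln\tilde y = (\lambda-1)\sum_i \ln y_i$, which is the Jacobian term, so the two quantities agree for any $\lambda$.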