The factors, as always: it seems the model is not using the actual value of the factor, but rather its position among the factor levels. I was able to reproduce your error with the OrchardSprays data set:
library(gbm)
data(OrchardSprays)
model <- gbm(decrease ~ rowpos + colpos + treatment, data = OrchardSprays,
             n.trees = 1000, distribution = "gaussian", interaction.depth = 3,
             bag.fraction = 0.5, train.fraction = 1.0, shrinkage = 0.1,
             keep.data = TRUE)
firstrow <- OrchardSprays[1,]
str(firstrow)
manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment="D")
str(manualFirstrow)
predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)
predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)
output:
> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 20.67818
> predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)
[1] 20.67818
Both manually built rows yield the same (wrong) prediction, since "A" has position 1 in the levels of OrchardSprays$treatment. Adding the levels to the data-frame declaration does the trick:
manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment=factor("D",levels(OrchardSprays$treatment)))
str(manualFirstrow)
predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)
output:
> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 50.31276
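The diagnosis can be verified directly from the integer codes that underlie R factors: a factor built from the single string "D" has "D" as its only level (code 1), while in the full level set of OrchardSprays$treatment the letter "D" sits at position 4.

```r
data(OrchardSprays)

# a factor built from "D" alone has just one level, so its integer code is 1
as.integer(factor("D"))
# [1] 1

# with the full level set "A".."H", "D" is the fourth level
as.integer(factor("D", levels = levels(OrchardSprays$treatment)))
# [1] 4
```

At prediction time the model presumably works with these integer codes, which is why the manually built row behaves like treatment "A".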
There are two ways to produce predictions on the level of a variable $Y_i$ from a linear
regression that is fitted to the logarithm of the variable,
$$
\log Y_i = \boldsymbol{X}_i'\boldsymbol{\beta} + \varepsilon_i
$$
where $ \mathbb{E}(\varepsilon_i \mid \boldsymbol{X}_i) = 0$.
Option 1:
One option is
$$
\begin{align}
\widehat{Y}_i &= \overline{\exp(\widehat{\varepsilon})}\,\exp\left(\widehat{\log Y}_i\right)
\end{align}
$$
where $\overline{\exp(\widehat{\varepsilon})}$ is the sample mean of the exponentiated residuals from the log-normal fit.
A theoretical justification proceeds by noting that the log-normal regression implies that
$$
\begin{align}
Y_i &= \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta} + \varepsilon_i\right) \\
Y_i &= \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)\exp(\varepsilon_i)
\end{align}
$$
Assuming the errors are independent of the covariates (so that $\mathbb{E}(\exp(\varepsilon_i) \mid \boldsymbol{X}_i) = \mathbb{E}(\exp(\varepsilon_i))$), we can write
$$
\mathbb{E}(Y_i \mid \boldsymbol{X}_i) = \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)\mathbb{E}(\exp(\varepsilon_i))
$$
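This factorization also explains why a correction factor is needed at all: by Jensen's inequality applied to the convex function $\exp$,
$$
\mathbb{E}(\exp(\varepsilon_i)) \geq \exp\left(\mathbb{E}(\varepsilon_i)\right) = \exp(0) = 1,
$$
so the naive back-transform $\exp\left(\boldsymbol{X}_i'\widehat{\boldsymbol{\beta}}\right)$ alone systematically underestimates $\mathbb{E}(Y_i \mid \boldsymbol{X}_i)$.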
We will need estimates of $\exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)$ and $\mathbb{E}(\exp(\varepsilon_i))$ in order to construct estimates of $\mathbb{E}(Y_i \mid \boldsymbol{X}_i)$.
A consistent estimator of $\mathbb{E}(\exp(\varepsilon_i))$ is the sample mean of the exponentiated residuals, $\overline{\exp(\widehat{\varepsilon})}$. Combining these pieces, a theoretically justified
estimator of $\mathbb{E}(Y_i \mid \boldsymbol{X}_i)$ is
$$
\widehat{Y}_i = \exp\left(\boldsymbol{X}_i'\widehat{\boldsymbol{\beta}}\right)\overline{\exp(\widehat{\varepsilon})}
$$
R implementation:
This is easily coded in R:
data(longley)
# log-linear regression
lmLongley <- lm(log(GNP) ~ GNP.deflator + Armed.Forces + Population,
                data = longley)
# predictions on the level of GNP (smearing correction)
predGNPLevel <- exp(predict(lmLongley)) * mean(exp(resid(lmLongley)))
# compare the predictions with the observed values
cbind(predGNPLevel, longley$GNP)
Option 2:
The second option assumes that the errors in the log-linear regression are normally distributed, $\varepsilon_i \sim N(0, \sigma^2)$. In that case $\exp(\varepsilon_i)$ follows a log-normal distribution with mean $\exp(\sigma^2/2)$, which implies that
$$
\begin{align}
\mathbb{E}(Y_i \mid \boldsymbol{X}_i) &= \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)\mathbb{E}(\exp(\varepsilon_i)) \\
&= \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)\exp(\sigma^2/2)
\end{align}
$$
Plugging in a consistent estimate of $\sigma^2$ produces the predictions
$$
\widehat{Y}_i = \exp\left(\boldsymbol{X}_i'\widehat{\boldsymbol{\beta}}\right)\exp(\widehat{\sigma^2}/2)
$$
R implementation:
The R implementation is again straightforward:
predGNPLevel2 <- exp(predict(lmLongley)) * exp(var(resid(lmLongley)) / 2)
# compare the predictions with the observed values
cbind(predGNPLevel2, longley$GNP)
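For comparison, the two correction factors can be computed side by side against the uncorrected back-transform; on this data set each factor comes out above 1, so both options scale the naive predictions up. This is only an illustrative sketch: the variable names naive, smear, and lnorm are not from the original answers.

```r
data(longley)
# log-linear regression, as in the answers above
lmLongley <- lm(log(GNP) ~ GNP.deflator + Armed.Forces + Population,
                data = longley)

naive <- exp(predict(lmLongley))          # no retransformation correction
smear <- mean(exp(resid(lmLongley)))      # Option 1: smearing factor
lnorm <- exp(var(resid(lmLongley)) / 2)   # Option 2: log-normal factor

# both correction factors exceed 1 (Jensen's inequality)
c(smearing = smear, lognormal = lnorm)

# side-by-side predictions against the observed GNP
cbind(naive, option1 = naive * smear, option2 = naive * lnorm, longley$GNP)
```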
Best Answer
You should investigate ARMAX/Dynamic Regression/Transfer Functions making sure that you deal with any Outliers/Level Shifts/Seasonal Pulses/Local Time Trends while testing for constancy of parameters and constancy of error variance. If you wish to post your data, do so and I will send you some results illustrating these ideas.