The factors, as always: it seems the model is not using the actual value of the factor, but rather its position among the factor levels. I was able to reproduce your error with the OrchardSprays data set:
library(gbm)
data(OrchardSprays)
model <- gbm(decrease ~ rowpos + colpos + treatment, data = OrchardSprays,
             n.trees = 1000, distribution = "gaussian", interaction.depth = 3,
             bag.fraction = 0.5, train.fraction = 1.0, shrinkage = 0.1,
             keep.data = TRUE)
firstrow <- OrchardSprays[1,]
str(firstrow)
manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment="D")
str(manualFirstrow)
predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)
predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)
output:
> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 20.67818
> predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)
[1] 20.67818
Both manually built rows yield the same (wrong) prediction, since "A" has position 1 in the levels of OrchardSprays$treatment. Adding the levels to the data-frame declaration does the trick:
manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment=factor("D",levels(OrchardSprays$treatment)))
str(manualFirstrow)
predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)
output:
> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 50.31276
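The diagnosis can be verified directly from the integer codes that underlie R factors: a factor built from the single string "D" has "D" as its only level (code 1), while in the full level set of OrchardSprays$treatment the letter "D" sits at position 4.

```r
data(OrchardSprays)

# a factor built from "D" alone has just one level, so its integer code is 1
as.integer(factor("D"))
# [1] 1

# with the full level set "A".."H", "D" is the fourth level
as.integer(factor("D", levels = levels(OrchardSprays$treatment)))
# [1] 4
```

At prediction time the model presumably works with these integer codes, which is why the manually built row behaves like treatment "A".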
There are two ways to produce predictions on the level of a variable $Y_i$ from a linear
regression that is fitted to the logarithm of the variable,
$$
\log Y_i = \boldsymbol{X}_i'\boldsymbol{\beta} + \varepsilon_i
$$
where $ \mathbb{E}(\varepsilon_i \mid \boldsymbol{X}_i) = 0$.
Option 1:
One option is
$$
\begin{align}
\widehat{Y}_i &= \overline{\exp(\widehat{\varepsilon})}\,\exp\left(\widehat{\log Y}_i\right)
\end{align}
$$
where $\overline{\exp(\widehat{\varepsilon})}$ is the sample mean of the exponentiated residuals from the log-normal fit.
A theoretical justification proceeds by noting that the log-normal regression implies that
$$
\begin{align}
Y_i &= \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta} + \varepsilon_i\right) \\
Y_i &= \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)\exp(\varepsilon_i)
\end{align}
$$
Assuming the errors are independent of the covariates (so that $\mathbb{E}(\exp(\varepsilon_i) \mid \boldsymbol{X}_i) = \mathbb{E}(\exp(\varepsilon_i))$), we can write
$$
\mathbb{E}(Y_i \mid \boldsymbol{X}_i) = \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)\mathbb{E}(\exp(\varepsilon_i))
$$
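This factorization also explains why a correction factor is needed at all: by Jensen's inequality applied to the convex function $\exp$,
$$
\mathbb{E}(\exp(\varepsilon_i)) \geq \exp\left(\mathbb{E}(\varepsilon_i)\right) = \exp(0) = 1,
$$
so the naive back-transform $\exp\left(\boldsymbol{X}_i'\widehat{\boldsymbol{\beta}}\right)$ alone systematically underestimates $\mathbb{E}(Y_i \mid \boldsymbol{X}_i)$.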
We will need estimates of $\exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)$ and $\mathbb{E}(\exp(\varepsilon_i))$ in order to construct estimates of $\mathbb{E}(Y_i \mid \boldsymbol{X}_i)$.
A consistent estimator of $\mathbb{E}(\exp(\varepsilon_i))$ is the sample mean of the exponentiated residuals, $\overline{\exp(\widehat{\varepsilon})}$. Combining these pieces, a theoretically justified
estimator of $\mathbb{E}(Y_i \mid \boldsymbol{X}_i)$ is
$$
\widehat{Y}_i = \exp\left(\boldsymbol{X}_i'\widehat{\boldsymbol{\beta}}\right)\overline{\exp(\widehat{\varepsilon})}
$$
R implementation:
This is easily coded in R:
data(longley)
# log-linear regression
lmLongley <- lm(log(GNP) ~ GNP.deflator + Armed.Forces + Population,
                data = longley)
# predictions on the level of GNP (smearing correction)
predGNPLevel <- exp(predict(lmLongley)) * mean(exp(resid(lmLongley)))
# compare the predictions with the observed values
cbind(predGNPLevel, longley$GNP)
Option 2:
The second option assumes that the errors in the log-linear regression are normally distributed, $\varepsilon_i \sim N(0, \sigma^2)$. In that case $\exp(\varepsilon_i)$ follows a log-normal distribution with mean $\exp(\sigma^2/2)$, which implies that
$$
\begin{align}
\mathbb{E}(Y_i \mid \boldsymbol{X}_i) &= \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)\mathbb{E}(\exp(\varepsilon_i)) \\
&= \exp\left(\boldsymbol{X}_i'\boldsymbol{\beta}\right)\exp(\sigma^2/2)
\end{align}
$$
Plugging in a consistent estimate of $\sigma^2$ produces the predictions
$$
\widehat{Y}_i = \exp\left(\boldsymbol{X}_i'\widehat{\boldsymbol{\beta}}\right)\exp(\widehat{\sigma^2}/2)
$$
R implementation:
The R implementation is again straightforward:
predGNPLevel2 <- exp(predict(lmLongley)) * exp(var(resid(lmLongley)) / 2)
# compare the predictions with the observed values
cbind(predGNPLevel2, longley$GNP)
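For comparison, the two correction factors can be computed side by side against the uncorrected back-transform; on this data set each factor comes out above 1, so both options scale the naive predictions up. This is only an illustrative sketch: the variable names naive, smear, and lnorm are not from the original answers.

```r
data(longley)
# log-linear regression, as in the answers above
lmLongley <- lm(log(GNP) ~ GNP.deflator + Armed.Forces + Population,
                data = longley)

naive <- exp(predict(lmLongley))          # no retransformation correction
smear <- mean(exp(resid(lmLongley)))      # Option 1: smearing factor
lnorm <- exp(var(resid(lmLongley)) / 2)   # Option 2: log-normal factor

# both correction factors exceed 1 (Jensen's inequality)
c(smearing = smear, lognormal = lnorm)

# side-by-side predictions against the observed GNP
cbind(naive, option1 = naive * smear, option2 = naive * lnorm, longley$GNP)
```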
Best Answer
You should investigate ARMAX/Dynamic Regression/Transfer Functions making sure that you deal with any Outliers/Level Shifts/Seasonal Pulses/Local Time Trends while testing for constancy of parameters and constancy of error variance. If you wish to post your data, do so and I will send you some results illustrating these ideas.