Regarding questions (1), (3), and (4): yes, there are many options for modelling multivariate time series, and this is absolutely something you can accomplish in R. You said that you don't have much experience with statistics, so I'm not sure how familiar you are with R (if at all), but one possible approach is the R package "dynlm":
## You'll need these packages: install any that are missing, then load them
for (pkg in c("dynlm", "zoo")) {
  if (!is.element(pkg, installed.packages()[, 1])) {
    install.packages(pkg, dependencies = TRUE)
  }
  library(pkg, character.only = TRUE)
}
## Generating some nonsense data for demonstration
## 104 dates, 1 week apart
d1 <- as.Date("01/01/2012", format = "%m/%d/%Y")
dSeq <- seq.Date(from = d1, by = "week", length.out = 104)
## Dependent variable
Y <- rnorm(104,50,10) + rnorm(104,10,1)*cos((1:104)/6)
## Independent variable for temperature
Temp <- rnorm(104,10,1) + cos((1:104)/12)
## Dummy variable for holidays (just picked a few off the calendar)
Holiday <- rep(0,104)
Holiday[c(3,3+52, 8,8+52, 22,22+52, 47,47+52, 52,52+52)] <- 1
Holiday <- ifelse(Holiday==0,"N","Y")
## Make a data.frame to hold variables
aDF <- data.frame(Date = dSeq,
                  Y = Y,
                  Temp = Temp,
                  Holiday = Holiday)
## Make a time series version of this with the "zoo" function
## for using dynamic linear model.
zDF <- aDF
zDF[,2] <- zoo(aDF[,2],aDF[,1])
zDF[,3] <- zoo(aDF[,3],aDF[,1])
zDF[,4] <- zoo(aDF[,4],aDF[,1])
## A possible DLM... type ?dynlm for details of the function
dlm1 <- dynlm(Y ~ L(Y,1) + L(Y,13) + Temp + Holiday, data=zDF)
## Model summary
summary(dlm1)
## Estimated coefficients:
coefficients(dlm1)
Like I said, this is just one of many possibilities for analyzing multivariate time series in R. To be honest, though, if you are "totally new to statistics" and not working on this particular project with someone who has experience with DLMs or similar models, I highly suggest reading through Forecasting: Principles and Practice by Rob Hyndman and George Athanasopoulos. It's a free online book written by two very knowledgeable econometricians, and a significant amount of the content is geared towards people with little or no formal background in statistics or forecasting methods. Here's a link: https://www.otexts.org/fpp. On a related note, if you are going to be working with time series data in R regularly, I suggest installing Hyndman's forecast package, which is phenomenally useful.
Additionally, your second question, about deciding which independent variables have the most significant impact on sales, is not something that can be answered succinctly. A typical modelling process involves many steps of diagnostic checking and goodness-of-fit evaluation, and the tools for these tasks vary greatly depending on which type of statistical model you are using. Unfortunately, if you are brand new to statistics, you will almost certainly have to invest a fair amount of time in understanding some important technical aspects of modelling, because there is much more to consider than, for example, the correlation of two variables. This is another reason I recommend reading Hyndman and Athanasopoulos' online book, as it addresses a wide variety of fundamental aspects of the forecasting process.
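To give a small taste of what diagnostic checking looks like, here is a sketch of two standard residual checks (an ACF plot and a Ljung-Box test), assuming a model dlm1 has been fitted as in the earlier code:

## Sketch of basic residual diagnostics; assumes dlm1 exists
res <- residuals(dlm1)
## ACF plot of residuals: spikes well outside the confidence bands
## suggest autocorrelation structure the model has missed
acf(res, main = "ACF of model residuals")
## Ljung-Box test: a small p-value indicates remaining autocorrelation
Box.test(res, lag = 12, type = "Ljung-Box")

If the residuals look like white noise, the model has captured most of the serial dependence; if not, you might consider adding lags or other predictors.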
I would consider the following two approaches: dimensionality reduction and sparse regression. The traditional methods for dimensionality reduction are projection-based and manifold learning methods, as mentioned in my answer here and the references therein, as well as latent variable modeling (LVM) methods, such as exploratory factor analysis (EFA) and confirmatory factor analysis (CFA).
Recently I've run across two R packages, ClustOfVar and clere, that essentially implement dimensionality reduction, but differently than traditional methods like PCA; in my view, they are closer in nature to the LVM approach. This alternative approach is referred to as variable clustering, and you can find more details in the packages' JSS vignettes: this paper and this paper (unpublished yet?), respectively.
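For contrast, here is a minimal sketch of the traditional PCA route using base R's prcomp; the data and variable names below are invented purely for illustration:

## Minimal PCA sketch with base R; the data here are made up
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100,
            dimnames = list(NULL, paste0("x", 1:5)))
## Center and scale, then extract principal components
pc <- prcomp(X, center = TRUE, scale. = TRUE)
## Proportion of variance explained by each component
summary(pc)
## Scores on the first two components could replace the original
## five variables in a downstream regression
head(pc$x[, 1:2])

Note that PCA only handles numeric variables directly; categorical predictors need something like MCA or the variable clustering methods mentioned above.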
Additionally, since high-dimensional data is frequently sparse, the second main approach is to use sparse regression. The R ecosystem offers many packages useful for sparse regression analysis, such as Matrix, SparseM, MatrixModels, glmnet and flare. For links and more relevant resources, please see my related answer on the DS SE site: https://datascience.stackexchange.com/a/918/2452.
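To make that concrete, here is a sketch of lasso regression with glmnet on simulated data (the data are invented for illustration); the lasso's L1 penalty shrinks many coefficients exactly to zero, which is what makes the fit sparse:

## Sparse (lasso) regression sketch with glmnet; data are simulated
library(glmnet)
set.seed(42)
n <- 200; p <- 50
x <- matrix(rnorm(n * p), nrow = n)
## Only the first three predictors truly matter
y <- 2 * x[, 1] - 1.5 * x[, 2] + x[, 3] + rnorm(n)
## alpha = 1 gives the lasso; cross-validation picks the penalty
cvfit <- cv.glmnet(x, y, alpha = 1)
## Coefficients at the CV-chosen lambda; most should be exactly zero
coef(cvfit, s = "lambda.min")

The nonzero coefficients in the output identify the predictors the lasso considers important, which connects back to the variable selection question.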
For an overview and more specific examples of support for categorical explanatory variables in PCA, MDS and MCA methods, as well as latent variable modeling approaches, please see this paper, these presentation slides (starting from slide 15), this paper, this paper and this paper.
Yes, xreg must be a numeric vector or matrix, so you will need to code the dummy variables yourself. Provided you have enough data in each category to estimate all the coefficients, it should be fine.
Minimize the AICc.
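As a sketch of how that dummy coding might look in practice (the data and factor levels below are invented; note that forecast's auto.arima minimizes the AICc by default):

## Sketch: hand-coding dummy variables for xreg; data are invented
library(forecast)
set.seed(7)
y <- ts(rnorm(120, 50, 5), frequency = 12)
## A categorical predictor with three levels
season.type <- factor(rep(c("low", "mid", "high"), length.out = 120))
## model.matrix builds 0/1 dummy columns; drop the intercept column
X <- model.matrix(~ season.type)[, -1]
## auto.arima accepts the numeric matrix via xreg and selects the
## ARIMA order by minimizing the AICc
fit <- auto.arima(y, xreg = X)
summary(fit)

With k factor levels you get k - 1 dummy columns, which is exactly what you want alongside the model's intercept.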