Solved – How to perform Time Series Analysis on daily data

predictive-models, r, time series

Sorry if my question is silly but I am extremely new to Data Science and Time series analysis.

I have TV program viewership data for the past year and want to forecast the next 2 weeks. The data is of the form:

Year, Date, Week_day, Channel, Program, start_time, end_time, length, avg_impressions

avg_impressions is the average of all impressions of the program, excluding the breaks. For example, if there were 10 impressions until the first break, 20 impressions from the first break until the second, and 30 impressions from the second break to the end, avg_impressions will be 20.

I want to do a time series analysis for the prediction. The data has 211,720 entries for 14 different channels. The data is from 10/10/2015 to 9/9/2016.

I am not sure how to convert `Impressions` into a time series object. I tried:

pts <- ts(train_agg$Impressions, start=c(2015, 10, 10), end=c(2016, 9, 9), frequency=357)

Is this format correct? I want to check for seasonality in the data. (I read that I could use tbats on the ts object for that.)

Can someone please explain how to specify start dates and end dates for the vector? I read the answers from here, here and here but I am still not sure if I am doing it right.

My R code so far:

train_data <- head(prog, 207595)
test_data <- tail(prog, 4126)
colnames(train_data)[12] <- "Impressions"
colnames(test_data)[12] <- "Impressions"
train_agg <- aggregate(Impressions~date_in_days, data=subset(train_data, Channel=="NBC" & Hour==19), mean)
test_agg <- aggregate(Impressions~date_in_days, data=subset(test_data, Channel=="NBC" & Hour==19), mean)

pts <- ts(train_agg$Impressions, start=c(2015,10,10), end=c(2016,9,9), frequency=357)
plot.ts(pts)

pts.msts <- msts(pts,seasonal.periods=c(7,357))
model <- tbats(pts.msts)
plot(forecast(model, h=7))
forecast(model, h=7)
accuracy(model)

Is this the right way? I am totally lost. Can someone please give me pointers as to what I should do?

Best Answer

For starters, your data series is too short. To estimate a seasonal pattern (e.g., monthly effects), you need about three full cycles of it, which for a yearly pattern means about three years of data. Can you get more data? If not, you will have to make a lot of assumptions, which can be dangerous. Please post your data, the country it comes from, and its start date.

You can solve this problem with regression. Consider using 11 monthly dummies, 6 day-of-the-week dummies, and holiday dummies. Not all of them may be significant, and not all effects may be constant over time (e.g., June starts out high and later becomes low). You need to look for outliers and build a dummy for each. You need to look for lead and lag effects around holidays. You also need to consider day-of-the-month effects, week-of-the-month effects, long weekends, the Friday before a Monday holiday, and the Monday after a Friday holiday. You might have a trend, or multiple trends. You might also have a change in the general volume, called a level shift.
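
A minimal sketch of such a dummy-variable regression in base R is given here. Everything in it is an assumption made for illustration: the simulated `impressions` values, the `holidays` list, and the helper name `make_features` are hypothetical, and outlier, lead/lag, and level-shift dummies are left out.

```r
# Hypothetical daily series standing in for the aggregated impressions.
set.seed(1)
dates <- seq(as.Date("2015-10-10"), as.Date("2016-09-09"), by = "day")
n <- length(dates)

# Assumed holiday list, for illustration only.
holidays <- as.Date(c("2015-11-26", "2015-12-25", "2016-01-01"))

# Hypothetical helper: calendar features for a vector of dates.
make_features <- function(d, trend) {
  data.frame(
    wday    = factor(format(d, "%u"), levels = as.character(1:7)),   # 1 = Monday
    mon     = factor(format(d, "%m"), levels = sprintf("%02d", 1:12)),
    holiday = as.integer(d %in% holidays),
    trend   = trend
  )
}

train <- make_features(dates, seq_len(n))
# Simulated response: weekend lift, holiday dip, random noise.
train$impressions <- 100 + 10 * (train$wday %in% c("6", "7")) -
  15 * train$holiday + rnorm(n, sd = 5)

# The factor terms expand to 6 weekday dummies and 11 month dummies.
fit <- lm(impressions ~ wday + mon + holiday + trend, data = train)

# Forecast the next 14 days from the same calendar features.
future_dates <- seq(max(dates) + 1, by = "day", length.out = 14)
future <- make_features(future_dates, n + seq_len(14))
pred <- predict(fit, newdata = future)
```

`summary(fit)` then shows which dummies look significant. With only one year of data, each month is observed once, so the month effects cannot be separated from one-off events; this is the danger the answer warns about.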

Related Solutions

Solved – Getting started with time series in R

It seems like you need the package xts. Create your time series using

install.packages('xts')
library(xts)
X = xts(coredata(DF[,2]), order.by=DF[,1])

Then you will be able to manipulate your data easily.

to.weekly(X)  
to.monthly(X)

Please note that you will then manipulate xts objects and not ts. But no worries, you can go back to ts whenever needed.

Time Series Analysis – Daily Analysis Using R

Your ACF and PACF indicate that you at least have weekly seasonality, which is shown by the peaks at lags 7, 14, 21 and so forth.

You may also have yearly seasonality, although it's not obvious from your time series.

Your best bet, given potentially multiple seasonalities, may be a tbats model, which explicitly models multiple types of seasonality. Load the forecast package:

library(forecast)

Your output from str(x) indicates that x does not yet carry information about potentially having multiple seasonalities. Look at ?tbats, and compare the output of str(taylor). Assign the seasonalities:

x.msts <- msts(x,seasonal.periods=c(7,365.25))

Now you can fit a tbats model. (Be patient, this may take a while.)

model <- tbats(x.msts)

Finally, you can forecast and plot:

plot(forecast(model,h=100))

You should not use arima() or auto.arima(), since these can only handle a single type of seasonality: either weekly or yearly. Don't ask me what auto.arima() would do on your data. It may pick one of the seasonalities, or it may disregard them altogether.
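
For daily data that should carry calendar time, the series can be given a start before it is passed to msts(). The sketch below uses base R's ts(), whose documentation describes `start` as a single number or a pair of numbers (a time unit and a 1-based position within it), so a year/month/day triple is not a supported form. The dates and values are placeholders, and treating the year as 365 observations is a simplifying assumption.

```r
# Placeholder daily values over an assumed date range.
dates <- seq(as.Date("2015-10-10"), as.Date("2016-09-09"), by = "day")
set.seed(3)
impressions <- 100 + rnorm(length(dates))

# Express the first date as (year, day of year).
start_year <- as.integer(format(dates[1], "%Y"))
start_doy  <- as.integer(format(dates[1], "%j"))

# Yearly time axis: frequency is the number of observations per year.
daily <- ts(impressions, start = c(start_year, start_doy), frequency = 365)

# Weekly seasonality only: frequency is the number of observations per week.
weekly <- ts(impressions, frequency = 7)
```

`end` can be omitted; ts() derives it from the length of the data.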

EDIT to answer additional questions from a comment:

How can I check whether the data has a yearly seasonality or not? Can I create another series of total number of events per month and use its ACF to decide this?

Calculating a model on monthly data might be a possibility. Then you could, e.g., compare AICs between models with and without seasonality.

However, I'd rather use a holdout sample to assess forecasting models. Hold out the last 100 data points. Fit a model with yearly and weekly seasonality to the rest of the data (like above), then fit one with only weekly seasonality, e.g., using auto.arima() on a ts with frequency=7. Forecast using both models into the holdout period. Check which one has a lower error, using MAE, MSE or whatever is most relevant to your loss function. If there is little difference between errors, go with the simpler model; otherwise, use the one with the lower error.
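
The holdout comparison can be sketched as follows. To keep the example free of package dependencies, two plain lm() regressions stand in for the tbats and auto.arima() fits; that substitution, the simulated data, and the single pair of yearly sine/cosine terms are all assumptions made for illustration.

```r
# Simulated daily series with weekly and yearly patterns (illustration only).
set.seed(42)
n <- 3 * 365
t <- seq_len(n)
dat <- data.frame(
  y    = 100 + 10 * sin(2 * pi * t / 7) + 20 * sin(2 * pi * t / 365.25) +
         rnorm(n, sd = 2),
  wday = factor(t %% 7),                 # position within the week
  s1   = sin(2 * pi * t / 365.25),       # yearly sine term
  c1   = cos(2 * pi * t / 365.25)        # yearly cosine term
)

# Hold out the last 100 observations.
h <- 100
train <- dat[1:(n - h), ]
test  <- dat[(n - h + 1):n, ]

# Candidate 1: weekly seasonality only. Candidate 2: weekly plus yearly.
fit_weekly <- lm(y ~ wday, data = train)
fit_both   <- lm(y ~ wday + s1 + c1, data = train)

# Mean absolute error on the holdout period.
mae <- function(actual, predicted) mean(abs(actual - predicted))
mae_weekly <- mae(test$y, predict(fit_weekly, newdata = test))
mae_both   <- mae(test$y, predict(fit_both, newdata = test))
```

If `mae_weekly` and `mae_both` are close, the simpler model is the one to keep.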

The proof of the pudding is in the eating, and the proof of the time series model is in the forecasting.

To improve matters, don't use a single holdout sample (which may be misleading, given the uptick at the end of your series), but use rolling origin forecasts, also known as "time series cross-validation". (I very much recommend that entire free online forecasting textbook.)
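
A rolling origin evaluation can be written as a short loop. In this sketch two simple benchmark forecasters (seasonal naive and overall mean) stand in for real models, and the function names, the 14-day horizon, and the 7-day step are assumptions chosen for illustration.

```r
# Simulated daily series with a weekly pattern (illustration only).
set.seed(7)
n <- 210
y <- 50 + 8 * sin(2 * pi * seq_len(n) / 7) + rnorm(n)

# Hypothetical forecasters: each takes a training vector and a horizon h.
snaive_fc <- function(train, h) rep(tail(train, 7), length.out = h)  # repeat last week
mean_fc   <- function(train, h) rep(mean(train), h)                  # overall mean

# Move the forecast origin forward, refit, and average the errors.
rolling_mae <- function(y, forecaster, initial = 140, h = 14, step = 7) {
  origins <- seq(initial, length(y) - h, by = step)
  errs <- sapply(origins, function(o) {
    fc <- forecaster(y[1:o], h)                 # uses data up to the origin only
    mean(abs(y[(o + 1):(o + h)] - fc))          # error over the next h points
  })
  mean(errs)
}

mae_snaive <- rolling_mae(y, snaive_fc)
mae_mean   <- rolling_mae(y, mean_fc)
```

Any model can be plugged in by wrapping its fit and forecast steps in a function with the same `(train, h)` signature.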

So Seasonal ARIMA models cannot usually handle multiple seasonalities? Is it a property of the model itself or is it just the way the functions in R are written?

Standard ARIMA models handle seasonality by seasonal differencing. For seasonal monthly data, you would not model the raw time series, but the time series of differences between March 2015 and March 2014, between February 2015 and February 2014 and so forth. (To get forecasts on the original scale, you'd of course need to undifference again.)
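
Seasonal differencing and its inverse can be illustrated on a small monthly series. The data below is simulated and purely hypothetical; diff() and diffinv() are base R functions.

```r
# Hypothetical monthly series, January 2014 to December 2016.
set.seed(10)
m <- 1:36
y <- ts(200 + 30 * sin(2 * pi * m / 12) + 0.5 * m + rnorm(36),
        start = c(2014, 1), frequency = 12)

# Seasonal difference: each value minus the value 12 months earlier.
d <- diff(y, lag = 12)

# The entry for March 2015 is March 2015 minus March 2014.
mar_diff   <- as.numeric(window(d, start = c(2015, 3), end = c(2015, 3)))
mar_manual <- as.numeric(y)[15] - as.numeric(y)[3]

# Undifference: rebuild the original scale from the first 12 values.
restored <- diffinv(as.numeric(d), lag = 12, xi = as.numeric(y)[1:12])
```

Here `restored` reproduces `y`, which is the undifferencing step needed to put forecasts of `d` back on the original scale.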

There is no immediately obvious way to extend this idea to multiple seasonalities.

Of course, you can do something using ARIMAX, e.g., by including monthly dummies to model the yearly seasonality, then model residuals using weekly seasonal ARIMA. If you want to do this in R, use ts(x,frequency=7), create a matrix of monthly dummies and feed that into the xreg parameter of auto.arima().
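
A sketch of this ARIMAX idea is given here. Base R's arima() with hand-picked orders stands in for auto.arima(), which would select the orders itself; the simulated data, the chosen orders, and the helper name `month_dummies` are assumptions made for illustration.

```r
set.seed(123)
dates <- seq(as.Date("2014-01-01"), by = "day", length.out = 730)

# Hypothetical helper: 11 month dummies, with January as the baseline.
month_dummies <- function(d) {
  mon <- factor(format(d, "%m"), levels = sprintf("%02d", 1:12))
  model.matrix(~ mon)[, -1, drop = FALSE]
}

# Simulated data: month effects plus noise with lag-1 and lag-7 autocorrelation.
month_effect <- 10 * sin(2 * pi * as.integer(format(dates, "%m")) / 12)
noise <- arima.sim(list(ar = c(0.3, 0, 0, 0, 0, 0, 0.5, -0.15)),
                   n = length(dates))
x <- ts(100 + month_effect + as.numeric(noise), frequency = 7)

# Weekly seasonal ARIMA errors, yearly seasonality through the dummies.
fit <- arima(x, order = c(1, 0, 0),
             seasonal = list(order = c(1, 0, 0), period = 7),
             xreg = month_dummies(dates))

# Forecasting needs the dummies for the future dates as well.
future <- seq(max(dates) + 1, by = "day", length.out = 14)
fc <- predict(fit, n.ahead = 14, newxreg = month_dummies(future))
```

`fc$pred` holds the point forecasts and `fc$se` their standard errors.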

I don't recall any publication that specifically extends ARIMA to multiple seasonalities, although I'm sure somebody has done something along the lines of my previous paragraph.
