Solved – Data mining techniques in R for advertising and sales data

data miningrtime series

I would like to apply one or more data mining techniques to a dataset, in order to see the effect advertising has on sales.

I am working from this dataset. It has 36 consecutive entries of monthly data for both sales and advertising – it's very small.

I exported the dataset to a ".csv". I deleted the date column, because I will use R's ts (time series object). The ".csv" now looks like this:

Advertising,Sales
12,15
20.5,16
21,18
..., ..., ...
23.4,17
16.4,1

The example coded below works. However, I had to split the matrix into two lists, because of the HoltWinters() function. I would prefer to analyse Advertising and Sales together at the latter stages. What other data mining techniques may be more beneficial?

data <- read.csv("./advertising_sales.csv", header=TRUE)
data_ts <- ts(data, start = c(2011,1), frequency = 12)
print(data_ts) # to check data has been correctly added

> Jan 2012    13        17.3    21
+ Feb 2012    14        25.3    29  
+ ...
+ Nov 2013    35        23.4    17
+ Dec 2013    36        16.4     1

plot(decompose(data_ts))
data_ts_ad <- data_ts[,1] #assign advertising as list, for HoltWinters
data_ts_sa <- data_ts[,2] # assign sales as list, for HoltWinters

#do HoltWinters for advertising
plot(HoltWinters(data_ts_ad))
data_ts_ad.hw <- HoltWinters(data_ts_ad)
predict(data_ts_ad.hw,n.ahead=9)

>           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
+ 2014 18.52852 25.47521 27.16683 36.41340 38.14678 33.04452 33.22488 32.12758
      Sep
+ 2014 32.58964

plot(data_ts_ad,xlim=c(2010,2014))
lines(predict(data_ts_ad.hw, n.ahead=24), col=2)

#do HoltWinters for sales
plot(HoltWinters(data_ts_sa))
data_ts_sa.hw <- HoltWinters(data_ts_sa)
predict(data_ts_sa.hw,n.ahead=9)

>          Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
+ 2014 11.05723 23.27877 50.06859 57.22696 61.50669 26.35195 62.26159 70.83347
      Sep
+ 2014 23.18957

plot(data_ts_sa,xlim=c(2010,2014))
lines(predict(data_ts_sa.hw, n.ahead=24), col=2)

I recently came across a book called R and data mining: Examples and Case Studies by Yanchang Zhao. It has excellent worked examples and this is where I have found inspiration. However, I can't get my small brain to think which techniques can be applied to this dataset.

I am new to R, so please try dumb-down your answers slightly.

EDIT: Output of data_ts is given below.

dput(data_ts)

structure(c(12, 20.5, 21, 15.5, 15.3, 23.5, 24.5, 21.3, 23.5, 
28, 24, 15.5, 17.3, 25.3, 25, 36.5, 36.5, 29.6, 30.5, 28, 26, 
21.5, 19.7, 19, 16, 20.7, 26.5, 30.6, 32.3, 29.5, 28.3, 31.3, 
32.2, 26.4, 23.4, 16.4, 15, 16, 18, 27, 21, 49, 21, 22, 28, 36, 
40, 3, 21, 29, 62, 65, 46, 44, 33, 62, 22, 12, 24, 3, 5, 14, 
36, 40, 49, 7, 52, 65, 17, 5, 17, 1), .Dim = c(36L, 2L), .Dimnames = list(
    NULL, c("Advertising", "Sales")), .Tsp = c(2006, 2008.91666666667, 
12), class = c("mts", "ts", "matrix"))

Best Answer

Given that you have a time series, with possible influences of trend and seasonality on sales, I recommend that you look for time series techniques that can handle causal effects such as advertising. This thread should be a good starting point, although your focus appears not to be forecasting.

Try something like this:

> library(forecast)
> model <- auto.arima(data_ts[,"Sales"],xreg=data_ts[,"Advertising"])

This will build an ARIMAX model for sales, with advertising as an external variable. You can then do summary(model) to see, e.g., parameter estimates.

> summary(model)
Series: data_ts[, "Sales"] 
ARIMA(0,0,0)(0,1,0)[12]                    

Coefficients:
      data_ts[, "Advertising"]
                        1.6445
s.e.                    0.6574

sigma^2 estimated as 575.3:  log likelihood=-51
AIC=106   AICc=106.57   BIC=108.35

Training set error measures:
                    ME     RMSE      MAE       MPE     MAPE      MASE
Training set -2.821585 13.84857 9.039446 -40.91741 64.68516 0.5506261
                    ACF1
Training set 0.003027406

We see that ARIMAX believes that each unit of advertising increases sales by 1.64. You can plot:

plot(data_ts[,"Sales"])
lines(data_ts[,"Advertising"],col="red")

If you have future values data_ts_ad_future for your advertising, you can forecast and plot point forecasts and prediction intervals:

set.seed(1)
data_ts_ad_future <- ts(sample(data_ts[,"Advertising"],12,replace=TRUE),
    start=c(2009,1),frequency=frequency(data_ts[,"Advertising"]))
fcst <- forecast(model,xreg=data_ts_ad_future)
plot(fcst)
lines(data_ts[,"Advertising"],col="red")
lines(data_ts_ad_future,col="red",lty=2)

time series

Related Question