Solved – Forecasting daily time series sales revenue with many zero entries

arima · crostons-method · forecasting · time-series · zero-inflation

I have been trying to forecast the sales revenue of different product groups (the displayed sales revenue is aggregated over all products in a group for each day, e.g. smartphones with different prices as one group), but haven't found the right approach yet. I am pretty much a novice in time series analysis/forecasting and use Python. I tried (S)ARIMA but didn't get meaningful results. I think the main issue is that my data is too sparse and I only have about a year and a half of data points. On top of that, the time series fluctuates heavily. Is my assumption correct that ARIMA has issues with data structured like mine (since it models the conditional mean of the series)?

The top image is one of the "nicer" time series and the bottom an "average" one.
I tried resampling to weekly data, and I think the forecast fit better, but I lose information by aggregating over the week. For the bottom example, even weekly data was not enough for ARIMA. I would rather use daily data if possible, since in the end I would like to add exogenous features like day of the week, weather, promotions etc.

Given the count-like nature of my data, what options do I have? I would really appreciate some pointers. My research led me to Croston's method. Does it fit my problem? Would I need to convert my revenue ($) into unit sales (#)? Is there a Python package for Croston's, or do I have to use R? Any help is much appreciated.

Sales Revenue Product Group 2

pastebin data

EDIT

To clarify what the different time series represent:
Both are different product groups (I have about 10-20). The one I am more interested in forecasting (since most of my data looks like that) is the bottom one.

I did some more preparation for the bottom time series.

from statsmodels.tsa.stattools import adfuller

series = dataframe["Sales Revenue2"]
result = adfuller(series.values)
print(result)

# (ADF statistic, p-value, lags used, number of observations,
#  critical values, icbest):
(-1.3334675205365911,
 0.6137252422831784,
 15,
 267,
 {'1%': -3.4550813975770827,
  '5%': -2.8724265892710914,
  '10%': -2.5725712007462582},
 3623.772858172633)

With a p-value of 0.61 I cannot reject the unit-root null hypothesis, so the time series is not stationary.

Now using decompose:

from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

# note: if the DatetimeIndex has no inferred frequency, pass period=7 explicitly
result = seasonal_decompose(dataframe["Sales Revenue2"], model='additive')
fig = result.plot()
plt.show()

decompose (second) time series

Honestly, I have not figured out the decomposition 100%. Does the picture show that I have weekly seasonality, since the pattern repeats every 7 days?

Next looking at the ACF and PACF

ACF original (second) time series

What does the pattern tell me?

PACF original (second) time series

And that?

I also tried differencing the time series:

def difference(dataset, interval=1):
    """Return the series of lagged differences x[t] - x[t - interval]."""
    diff = list()
    for i in range(interval, len(dataset)):
        diff.append(dataset[i] - dataset[i - interval])
    return diff
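For a pandas Series the same lagged differencing is available via the built-in `diff` method (a small illustrative sketch; the values below are made up, not my actual data):

```python
import pandas as pd

s = pd.Series([10, 0, 0, 12, 0, 15])   # illustrative intermittent series
d1 = s.diff()                          # first difference, x[t] - x[t-1]
d7 = s.diff(periods=3)                 # lag-3 difference (use periods=7 for weekly)
print(d1.dropna().tolist())            # [-10.0, 0.0, 12.0, -12.0, 15.0]
```

The leading `periods` entries are NaN, so drop them before feeding the result into ACF/PACF.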

The PACF looks very funky: negative values outside the confidence interval, increasing in magnitude from lag 41 to 48 down towards -1, then jumping to a positive value at lag 49 and slowly decreasing afterwards.
The ACF was inside the confidence interval apart from a negative value at lag 1.

My next step was to try auto.arima (in Python).

I tried m=7 for the weekly seasonality, and different start values.

What I don't get is why those were the orders that fit best. How does that compare to my first ACF/PACF – would I need to difference at lag 7 to see the real ACF/PACF at that order?

ARIMA(2,0,4)x(0,1,1)[7]
ARIMA(0,0,6)x(0,1,1)[7]

I also tried to fit without a seasonal component by setting m=1.

What irritated me the most is that the order (1,1,1), i.e. ARIMA(1,1,1)x(0,0,0)[1], got very close to the best model without seasonality, (12,1,1) (how does that order come about?). The AICs were 3743 compared to 3738.

I would really appreciate it if someone could guide me a bit through my results. Can I use ARIMA to forecast the data I have? The new results do not look too bad; previously I was mostly getting forecasts that hovered around the mean, like the last example, but I was using stepwise search then and have now explicitly tried higher orders. And what if I have even fewer non-zero values? How would I go about converting a time series to count data?

Thanks again!

(Since I don't have enough reputation I had to cut some graphs. The first time series is now gone since it doesn't really focus on my topic – I just wanted to show a comparison. I also had to cut the differenced ACF and PACF.)

Best Answer

Would I need to convert my revenue ($) into unit sales (#)?

Ideally, yes. Since this is not a unique product, how do I know whether $600 is one $600 unit or two $300 units?

Is there a Python package for Croston's, or do I have to use R?

Right now there aren't any good Python packages for Croston's and similar intermittent-demand methods. R offers better options.

Is my assumption correct that ARIMA has issues with data structured like mine?

Depends. ARIMA might work with the data in the top graph, but it won't work with the data in the bottom graph (or anything sparser than that).

Croston's method and its newer variants (TSB, etc.) are a better option. But you need to keep in mind that such methods don't produce a forecast the way ARIMA or ETS do: they forecast a rate of sale (or velocity), which can then be used to estimate average sales over a longer period of time.
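To make the "rate of sale" idea concrete, here is a minimal sketch of Croston's method (not a production implementation; the smoothing constant and the initialization to the first demand/interval are common textbook choices):

```python
import numpy as np

def croston(y, alpha=0.1):
    """Croston's method: forecast the demand rate (size / interval)."""
    y = np.asarray(y, dtype=float)
    nz = np.flatnonzero(y)            # time indices of non-zero demands
    if len(nz) == 0:
        return 0.0
    z = y[nz[0]]                      # smoothed demand size, init to first demand
    p = float(nz[0] + 1)              # smoothed inter-demand interval
    q = 1                             # periods since the last demand
    for t in range(nz[0] + 1, len(y)):
        if y[t] > 0:                  # update both smoothers only when demand occurs
            z = alpha * y[t] + (1 - alpha) * z
            p = alpha * q + (1 - alpha) * p
            q = 1
        else:
            q += 1
    return z / p                      # forecast rate per period

print(croston([0, 0, 5, 0, 0, 5, 0, 0, 5]))   # 5 units every 3 periods -> ~1.667
```

The output is a flat per-period rate, not a point forecast for each day: multiplying it by a horizon gives expected total demand over that horizon.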

You should also look at count-forecasting methods based on the Negative Binomial and Poisson distributions.