I've read that ARIMA assumes that the endogenous input array is evenly spaced. If that is the case, what is the point of the dates parameter in statsmodels.tsa.ARIMA(), which seems to be there to support irregularly spaced data? Also, what are the assumptions for the optional exogenous arrays: do they need to be spaced in exactly the same way as the endogenous array?
Solved – Does ARIMA assume evenly-spaced data in statsmodels
arima, statsmodels, time series, unevenly-spaced-time-series
Related Solutions
If the observations of a stochastic process are irregularly spaced the most natural way to model the observations is as discrete time observations from a continuous time process.
What is generally needed of a model specification is the joint distribution of the observations $X_{1}, \ldots, X_n$ observed at times $t_1 < t_2 < \ldots < t_n$, and this can, for instance, be broken down into the conditional distributions of $X_{i}$ given $X_{i-1}, \ldots, X_1$. If the process is a Markov process, this conditional distribution depends on $X_{i-1}$ (not on $X_{i-2}, \ldots, X_1$) and on $t_i$ and $t_{i-1}$. If the process is also time-homogeneous, the dependence on the time points is only through their difference $t_i - t_{i-1}$.
We see from this that if we have equidistant observations (with $t_i - t_{i-1} = 1$, say) from a time-homogeneous Markov process we only need to specify a single conditional probability distribution, $P^1$, to specify a model. Otherwise we need to specify a whole collection $P^{t_{i}-t_{i-1}}$ of conditional probability distributions indexed by the time differences of the observations to specify a model. The latter is, in fact, most easily done by specifying a family $P^t$ of continuous time conditional probability distributions.
A common way to obtain a continuous time model specification is through a stochastic differential equation (SDE) $$dX_t = a(X_t) dt + b(X_t) dB_t.$$ A good place to get started with doing statistics for SDE models is Simulation and Inference for Stochastic Differential Equations by Stefano Iacus. Many methods and results are described for equidistant observations, but this is typically just convenient for the presentation and not essential for the application. One main obstacle is that the SDE specification rarely allows for an explicit likelihood when you have discrete observations, but there are well-developed estimating equation alternatives.
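As an illustration of a family $P^t$ indexed by the time gap, here is a minimal sketch using the Ornstein-Uhlenbeck process $dX_t = \theta(\mu - X_t) dt + \sigma dB_t$, one of the few SDEs whose transition distribution is explicit (Gaussian). The choice of this process, the parameter names theta, mu, sigma and the helper functions simulate_ou and ou_loglik are assumptions made for the example, not something taken from the answer or from any library.

```python
import numpy as np

def simulate_ou(times, theta, mu, sigma, x0, rng):
    """Sample an Ornstein-Uhlenbeck path exactly at arbitrary (irregular) times."""
    x = np.empty(len(times))
    x[0] = x0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]           # only the gap matters
        decay = np.exp(-theta * dt)
        mean = mu + (x[i - 1] - mu) * decay    # conditional mean given X_{i-1}
        var = sigma**2 * (1.0 - decay**2) / (2.0 * theta)
        x[i] = rng.normal(mean, np.sqrt(var))
    return x

def ou_loglik(params, times, x):
    """Exact Gaussian log-likelihood of x[1:] given x[0] for irregular times."""
    theta, mu, sigma = params
    dt = np.diff(times)
    decay = np.exp(-theta * dt)
    mean = mu + (x[:-1] - mu) * decay
    var = sigma**2 * (1.0 - decay**2) / (2.0 * theta)
    resid = x[1:] - mean
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + resid**2 / var)

rng = np.random.default_rng(0)
times = np.cumsum(rng.exponential(scale=0.5, size=500))  # irregular observation times
x = simulate_ou(times, theta=1.0, mu=2.0, sigma=0.5, x0=2.0, rng=rng)
print(ou_loglik((1.0, 2.0, 0.5), times, x))
```

The log-likelihood could be passed to a numerical optimizer to estimate the parameters; nothing in it requires the gaps in times to be equal.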
If you want to go beyond Markov processes, stochastic volatility models, like (G)ARCH models, attempt to model a heterogeneous variance (volatility). One can also consider delay equations like $$dX_t = \int_0^t a(s)(X_t-X_s) ds + \sigma dB_t$$ which are continuous time analogs of AR$(p)$-processes.
I think it is fair to say that the common practice when dealing with observations at irregular time points is to build a continuous time stochastic model.
You have only 5 months' worth of data, which I assume is observed on a daily basis. Your cycle is monthly, so m should be 30. Also, your data looks seasonal, so seasonal should be set to True.
Don't try to overfit your data; simply use the defaults on your first run:
auto_arima(b, error_action='ignore', trace=1, seasonal=True, m=30)
Best Answer
Models written in terms of lagged variables (which ARIMA is) work only on equally spaced time periods. If the data were irregularly spaced and you still wanted to model them, you would instead have to specify a correlation function of the time separation (such as exponential correlation, Gaussian correlation, etc.), as you do in geostatistical models. The dates parameter does not change this: it only attaches dates to the observations, for purposes such as labeling and plotting, and does not affect how the model is estimated.
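In practice this means checking the spacing yourself before fitting. The following is a minimal pandas sketch, assuming a hypothetical daily series with some days missing; the data and the helper is_evenly_spaced are made up for illustration. It tests whether the gaps are equal and then places the series on a regular daily grid.

```python
import pandas as pd

# Hypothetical observations: daily readings with some days missing.
idx = pd.to_datetime(
    ["2020-01-01", "2020-01-02", "2020-01-04", "2020-01-05", "2020-01-08"]
)
y = pd.Series([1.0, 2.0, 4.0, 5.0, 8.0], index=idx)

def is_evenly_spaced(index):
    """Return True if all consecutive gaps in a DatetimeIndex are equal."""
    gaps = index.to_series().diff().dropna()
    return gaps.nunique() <= 1

print(is_evenly_spaced(y.index))

# Put the series on a regular daily grid; the missing days become NaN.
y_regular = y.asfreq("D")

# One possible (assumption-laden) way to fill the gaps: interpolate in time.
y_filled = y_regular.interpolate(method="time")
print(is_evenly_spaced(y_filled.index))
```

Whether to interpolate, aggregate to a coarser frequency, or leave the NaN values in place is a modeling decision. State-space implementations such as statsmodels' SARIMAX can treat NaN as missing observations, which is often preferable to inventing values.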
The exogenous arrays are usually spaced the same way as the endogenous array, because they are basically explanatory variables in a regression that explains the endogenous variable, so it usually would not make sense for them to be collected at different times. There are exceptions, such as when you collect the exogenous variable at a different frequency and match each value with the closest date for the endogenous variable (e.g. you have yearly GDP measurements that you are using to predict monthly local unemployment values), or when you use lagged exogenous variables (e.g. using precipitation from last year and the year before to predict tree growth this year, with the non-independent error terms modeled by the ARIMA model).
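Both exceptions amount to building an exogenous array on the same time grid as the endogenous one. Here is a minimal pandas sketch with made-up numbers; the series names and values are hypothetical, and forward-filling is used in place of a nearest-date match so that no future value is assigned to an earlier month.

```python
import pandas as pd

# Hypothetical data: monthly unemployment (endogenous), yearly GDP (exogenous).
months = pd.date_range("2019-01-01", periods=24, freq="MS")
unemployment = pd.Series(range(24), index=months, dtype=float, name="unemployment")
gdp = pd.Series(
    [100.0, 103.0],
    index=pd.to_datetime(["2019-01-01", "2020-01-01"]),
    name="gdp",
)

# Carry each yearly value forward onto the monthly grid of the endogenous series.
gdp_monthly = gdp.reindex(months, method="ffill")

# Lagged exogenous variable: the GDP value from 12 months earlier.
gdp_lag12 = gdp_monthly.shift(12).rename("gdp_lag12")

# Keep only the rows where every column is observed.
design = pd.concat([unemployment, gdp_monthly, gdp_lag12], axis=1).dropna()
print(design.head())
```

After this step the endogenous column and the exogenous columns of design share one evenly spaced monthly index, which is the form an ARIMA-with-regressors fit expects.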