You're on the right track in recognizing that ARIMA modeling is what you should be looking into.
I've seen ARIMA modeling applied to cases involving inventory stock, business sales, production levels of particular goods, and various other business-related time series. Without access to the data, I can only speculate that the data you're working with falls into the same sort of category.
Of course, ARIMA modeling is univariate, so any forecasts you produce will be forecasts for the time series under investigation. For example, if you model prices then you will derive forecasts for prices - not gross profits. That said, you can use price forecasts to build forecasts for profits, so choose carefully the data you want to work with if you have a choice among several time series.
It is common to see ARIMA models used as benchmarks, so even if you believe that more complex models (multiple-series and econometric models) may give you superior forecasts, ARIMA modeling is still a worthwhile pursuit: if you build a number of models you have something to compare them against, which also helps you decide whether or not the extra complexity is necessary.
The reason ARIMA models are good for benchmarking is that, if correctly built, ARIMA forecasts are optimal (smallest mean-squared forecast error) among forecasts from univariate, linear, fixed-coefficient models.
Analysis of your data may lead you to develop other models such as multivariate models, non-linear models or even time-varying parameter models, but starting with the simpler class of ARIMA models is a wise choice in itself, because ARIMA analysis can later complement econometric analysis. For a short discussion of this, see Zellner (1978).
Obviously, the classic text to consult for ARIMA modeling (and the closely related Transfer Function models) is Box & Jenkins (1970). A good alternative is Pankratz (1983) which is basically a shorter and simpler version of Box & Jenkins' work - all of the main points are retained in Pankratz's book too.
As already mentioned, ARIMA analysis involves looking at a single time series of past observations. At some stage, you may want to introduce other independent variables in addition to past observations of the dependent variable. This brings you into the territory of distributed lag models, which may or may not be autoregressive. Extending the framework once more, such models can be single-equation or multi-equation (vector equation) models.
One of the factors to consider when deciding between single and vector equations is whether or not there are possible lagged feedback effects among the various variables. These issues are addressed further in Pankratz (1991), which focuses on dynamic regression models.
Lastly, an excellent online time-series forecasting textbook is Rob Hyndman's Forecasting: principles and practice. Furthermore, if you are an R user (or would consider becoming one) then it would be worth your time to familiarize yourself with the R forecast package (again, thanks to Rob Hyndman).
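To give a concrete flavour of the package, here is a minimal sketch, assuming the forecast package is installed, that uses R's built-in AirPassengers series purely as stand-in data (your own series would replace it):

    # Fit an ARIMA model with automatic order selection, then forecast ahead.
    # AirPassengers is a built-in monthly series used only for illustration.
    library(forecast)

    fit <- auto.arima(AirPassengers)   # selects the (p, d, q)(P, D, Q) orders by AICc
    fc  <- forecast(fit, h = 12)       # point forecasts plus prediction intervals
    print(fit)
    plot(fc)

auto.arima automates the differencing and order-selection steps that the Box-Jenkins procedure performs by inspecting ACF/PACF plots, which is usually sufficient for a benchmark model.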
References:
Box, George and Jenkins, Gwilym (1970) Time series analysis: Forecasting and control, San Francisco: Holden-Day.
Hyndman, R.J. and Athanasopoulos, G. (2013) Forecasting: principles and practice. http://otexts.com/fpp/. Accessed on 17 June 2013.
Pankratz, Alan (1983) Forecasting with univariate Box–Jenkins models: concepts and cases, New York: John Wiley & Sons.
Pankratz, Alan (1991) Forecasting with Dynamic Regression Models, New York: John Wiley & Sons.
Zellner, Arnold (1978) "Folklore versus Fact in Forecasting with Econometric Methods," The Journal of Business, 51(4), 587-593.
There is an implied statement in your question that I want to clarify before giving an answer. You stated:
I have also tested with freq=1 (but it is obviously a mistake because they are not yearly values)
This implies either that a series' frequency is governed only by the number of observations per year, or that all (and only) yearly data have a frequency of 1. Both of these statements are false. Solar cycles have a period of about 11 years, so yearly solar-cycle data would have a frequency of about 11. On the other hand, something like server copy errors per copy attempt may be sampled more than once per second, but still have a frequency of 1, since copy errors are a mostly random process. When determining the frequency of a series, you need to think about what is causing the seasonality and not just rely on the sample rate (though that can be a good starting point). The answer below assumes you have correctly identified the seasonal period as 1 year.
As to your original question, there are a few ways you can deal with your data. The simplest would be to set the seasonality to the mean number of samples per year. Time series do not need to have an integer seasonality, so you can make such a series in R as below:

    ts(rnorm(100), frequency = 14.73)
Another option would be to add the missing days back into the time series and use a frequency of 365 (or, even better, 365.24). If you have the time points for each observation, you can use the zoo package in R to make an irregular time series and then fill in the missing values. You can make the series using:

    library(zoo)
    x.Date <- as.Date("2003-02-01") + sample(1000, 900) - 1
    x <- zoo(rnorm(900), x.Date)
    y <- ts(as.ts(x), frequency = 365.24)
The missing values can be filled using many methods, but one to consider is zoo::na.approx. From there the series can be decomposed as normal:

    decompose(na.approx(y))
A few final notes and options: 1) decompose is a useful method, but you may also want to consider the stl decomposition for your data, as even the decompose docs say that "stl provides a much more sophisticated decomposition." 2) You will always get an error if you try to use decompose or stl on a series with a frequency of 1. Both functions seek to separate the seasonal and trend components of the data, so if there is no seasonal component (i.e. frequency = 1), there is a problem. If instead you just want to separate trend from noise, you might consider using a moving average.
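As a sketch of that last point, a centered moving average in base R splits a frequency-1 series into trend and noise; the window width of 7 is an arbitrary choice for illustration:

    # Separate trend from noise in a non-seasonal (frequency = 1) series.
    set.seed(1)
    z <- ts(cumsum(rnorm(100)))                        # random walk: trend plus noise
    trend <- stats::filter(z, rep(1/7, 7), sides = 2)  # centered 7-point moving average
    noise <- z - trend                                 # remainder; NAs at both edges

Wider windows give smoother trend estimates at the cost of more missing values at the ends of the series.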
To answer your main question:
There is nothing equivalent to the HTS package in Python. The two closest things I know of are PyAF and htsprophet. However, they use different forecasting models than those used in HTS.
PyAF uses models from scikit-learn to do forecasting, which is unusual since the sklearn models aren't usually amenable to time series problems. htsprophet uses only the FB Prophet model.
By contrast HTS uses ARIMA and ETS, which are more standard forecasting methods (although FB Prophet is increasing in popularity).
From what you described in your post, though, I'm not entirely sure whether your problem is actually a time series problem, or whether it is hierarchical in nature the way hierarchy is understood in HTS.
Can you please clarify the details of what you are trying to do?