This isn't really my field, so some musings:
I will start with the concept of surprise. What does it mean to be surprised?
Usually, it means that something happened that was not expected to happen. So, surprise is a probabilistic concept and can be explicated as such (I. J. Good has written about that). See also Wikipedia and Bayesian Surprise.
Take the particular case of a yes/no situation: something can happen or not, and it happens with probability $p$. Say, if $p=0.9$ and it happens, you are not really surprised.
If $p=0.05$ and it happens, you are somewhat surprised. And if $p=0.0000001$ and it happens, you are really surprised. So, a natural measure of the "surprise value of an observed outcome" is some (anti)monotone function of the probability of what happened. It seems natural (and works well ...) to take the logarithm of the probability of what happened, and then throw in a minus sign to get a positive number. Also, by taking the logarithm we concentrate on the order of magnitude of the surprise, and, in practice, probabilities are often only known up to their order of magnitude, more or less.
So, we define
$$
\text{Surprise}(A) = -\log p(A)
$$
where $A$ is the observed outcome, and $p(A)$ is its probability.
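To put numbers on the examples above (plain base R arithmetic, nothing package-specific), the surprise, in nats, for the three probabilities mentioned earlier:
-log(c(0.9, 0.05, 1e-7))  # roughly 0.105, 3.0, and 16.1 nats of surprise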
Now we can ask: what is the expected surprise? Let $X$ be a Bernoulli random variable with success probability $p$. It has two possible outcomes, 0 and 1. The respective surprise values are
$$\begin{align}
\text{Surprise}(0) &= -\log(1-p) \\
\text{Surprise}(1) &= -\log p \end{align}
$$
so the surprise when observing $X$ is itself a random variable with expectation
$$
p \cdot (-\log p) + (1-p) \cdot (-\log(1-p))
$$
and that is --- surprise! --- the entropy of $X$! So entropy is expected surprise!
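As a quick numerical check in base R that the average surprise matches the entropy formula (the bernoulli_entropy helper below is just the expression above, not from any package):
bernoulli_entropy <- function(p) -p * log(p) - (1 - p) * log(1 - p)
p <- 0.3
set.seed(42)
x <- rbinom(1e5, size = 1, prob = p)    # many draws from Bernoulli(p)
mean(-log(ifelse(x == 1, p, 1 - p)))    # Monte Carlo average surprise, roughly 0.61
bernoulli_entropy(p)                    # exact entropy, 0.6108643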
Now, this question is about maximum entropy. Why would anybody want to use a maximum entropy distribution? Well, it must be because they want to be maximally surprised! Why would anybody want that?
A way to look at it is the following: you want to learn about something, and to that end you set up some learning experiences (or experiments ...). If you already knew everything about the topic, you would always be able to predict perfectly, so you would never be surprised. Then you would never get any new experience, so you would not learn anything new (but then you already know everything, there is nothing to learn, so that is OK). In the more typical situation where you are confused and not able to predict perfectly, there is a learning opportunity! This leads to the idea of measuring the "amount of possible learning" by the expected surprise, that is, by the entropy. So, maximizing entropy is nothing other than maximizing the opportunity for learning. That sounds like a useful concept, which could be helpful in designing experiments and such things.
A poetic example is the well known
Wenn einer eine Reise macht, dann kann er was erzählen ... ("when someone goes on a journey, they have a story to tell")
One practical example: You want to design a system for online tests (online meaning that not everybody gets the same questions, the questions are chosen dynamically depending on previous answers, so optimized, in some way, for each person).
If you make the questions too difficult, so that they are never mastered, you learn nothing; that indicates you must lower the difficulty level. What is the optimal difficulty level, that is, the difficulty level which maximizes the rate of learning? Let the probability of a correct answer be $p$. We want the value of $p$ that maximizes the Bernoulli entropy, and that is $p=0.5$. So you aim to ask questions where the probability of a correct answer (from that person) is 0.5, as the quick check below illustrates.
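A one-line check of that claim, reusing the bernoulli_entropy helper defined above (the interval avoids the endpoints, where log(0) would appear):
bernoulli_entropy <- function(p) -p * log(p) - (1 - p) * log(1 - p)
optimize(bernoulli_entropy, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)
# $maximum is 0.5 (up to numerical tolerance), $objective is log(2), about 0.693 nats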
Now consider the case of a continuous random variable $X$. How can we be surprised by observing $X$? The probability of any particular outcome $\{X=x\}$ is zero, so the $-\log p$ definition is useless. But we will be surprised if the probability of observing something like $x$ is small, that is, if the density value $f(x)$ is small (assuming $f$ is continuous). That leads to the definition
$$ \DeclareMathOperator{\E}{\mathbb{E}}
\text{Surprise}(x) = -\log f(x)
$$
With that definition, the expected surprise from observing $X$ is
$$
\E \{-\log f(X)\} = -\int f(x) \log f(x) \; dx
$$
that is, the expected surprise from observing $X$ is the differential entropy of $X$. It can also be seen as the expected negative loglikelihood.
But this isn't really the same as the first (discrete event) case. To see that, consider an example. Let the random variable $X$ represent the length of a stone throw (say, in a sports competition). To measure that length we need to choose a length unit, since length has no intrinsic scale the way probability does. We could measure in mm or in km, or, more usually, in meters. But our definition of surprise, and hence of expected surprise, depends on the unit chosen, so there is no invariance. For that reason, values of differential entropy are not directly comparable the way Shannon entropy values are. It can still be useful, if one keeps this problem in mind.
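A small numerical check of this unit dependence, using the closed-form differential entropy of a normal distribution, $\tfrac{1}{2}\log(2\pi e \sigma^2)$ (the standard deviation of 2 m for throw lengths is made up purely for illustration):
normal_diff_entropy <- function(sigma) 0.5 * log(2 * pi * exp(1) * sigma^2)  # in nats
normal_diff_entropy(2)                                # throws measured in meters, sd = 2 m
normal_diff_entropy(2000)                             # the same throws measured in mm
normal_diff_entropy(2000) - normal_diff_entropy(2)    # difference is log(1000), about 6.91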
Parameters $m$ and $r$, involved in the calculation of the approximate entropy (ApEn) of a time series, are the window (sequence) length and the tolerance (filter value), respectively. In fact, in terms of $m$, $r$ as well as $N$ (the number of data points), ApEn is defined as the "natural logarithm of the relative prevalence of repetitive patterns of length $m$ as compared with those of length $m + 1$" (Balasis, Daglis, Anastasiadis & Eftaxias, 2011, p. 215):
$$ \text{ApEn}(m, r, N) = \Phi^m(r) - \Phi^{m+1}(r), $$
where
$$ \Phi^m(r) = \frac{1}{N - m + 1} \sum_i \ln C^m_i(r). $$
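To make the formula concrete, here is a minimal sketch of it in plain R (a hypothetical apen_sketch helper, not the pracma implementation; it assumes the Chebyshev distance between template vectors and counts self-matches):
apen_sketch <- function(u, m = 2, r = 0.2 * sd(u)) {
  phi <- function(m) {
    n <- length(u) - m + 1
    # rows of X are the template vectors (u_i, ..., u_{i+m-1})
    X <- matrix(sapply(seq_len(m), function(k) u[k:(k + n - 1)]), nrow = n)
    C <- sapply(seq_len(n), function(i) {
      d <- apply(abs(sweep(X, 2, X[i, ])), 1, max)  # Chebyshev distance to vector i
      mean(d <= r)                                  # C_i^m(r), self-match included
    })
    mean(log(C))                                    # Phi^m(r)
  }
  phi(m) - phi(m + 1)
}
apen_sketch(as.numeric(AirPassengers))  # roughly comparable to pracma's approx_entropy value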
Therefore, it appears that changing the tolerance $r$ allows one to control the (temporal) granularity with which the time series' entropy is determined. Nevertheless, using the default values for both the $m$ and $r$ parameters in the pracma package's entropy functions works fine. The only fix needed to see the correct relation between the entropy values of all three time series (lower entropy for a more well-defined series, higher entropy for more random data) is to increase the length of the random data vector:
library(pracma)
set.seed(10)
all.series <- list(series1 = AirPassengers,
                   series2 = sunspot.year,
                   series3 = rnorm(500))  # <== size increased
sapply(all.series, approx_entropy)
series1 series2 series3
0.5157758 0.7622430 1.4741971
The results are as expected: as the predictability of fluctuations decreases from the most deterministic series1 to the most random series3, their entropy correspondingly increases: ApEn(series1) < ApEn(series2) < ApEn(series3).
In regard to other measures of forecastability, you may want to check the mean absolute scaled error (MASE) - see this discussion for more details. Forecastable component analysis also seems to be an interesting and new approach to determining the forecastability of a time series. And, expectedly, there is an R package for that as well - ForeCA.
library(ForeCA)
sapply(all.series,
       Omega, spectrum.control = list(method = "wosa"))
series1 series2 series3
41.239218 25.333105 1.171738
Here $\Omega \in [0, 1]$ is a measure of forecastability, where $\Omega(\text{white noise}) = 0\%$ and $\Omega(\text{sinusoid}) = 100\%$.
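A quick sanity check of those endpoints (this assumes, as in the sapply call above, that Omega accepts a single numeric series; a little noise is added to the sinusoid to keep the spectral estimate well-behaved):
set.seed(1)
Omega(rnorm(500), spectrum.control = list(method = "wosa"))      # should be close to 0
Omega(sin(2 * pi * (1:500) / 20) + rnorm(500, sd = 0.05),
      spectrum.control = list(method = "wosa"))                  # should be close to 100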
References
Balasis, G., Daglis, I. A., Anastasiadis, A., & Eftaxias, K. (2011). Detection of dynamical complexity changes in Dst time series using entropy concepts and rescaled range analysis. In W. Liu & M. Fujimoto (Eds.), The Dynamic Magnetosphere, IAGA Special Sopron Book Series 3, 211. doi:10.1007/978-94-007-0501-2_12. Springer. Retrieved from http://members.noa.gr/anastasi/papers/B29.pdf
Goerg, G. M. (2013). Forecastable component analysis. JMLR, W&CP (2), 64-72. Retrieved from http://machinelearning.wustl.edu/mlpapers/papers/goerg13
Best Answer
Assuming you're limiting yourself to stationary processes:
Usually that's how it's done, yes.
No, it's not a matter of whether; it's really more a matter of how few frequencies. The fewer, the better. Any stationary process can be written as a sum of Fourier frequencies with random weights; that's the spectral representation theorem. I would say this entropy is like any other entropy: the higher the value, the larger the average "surprise" (negative log density), and in this case, the flatter the spectral density. A completely flat periodogram is the periodogram of white noise, which is completely unpredictable.
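As a rough illustration of "flatter spectrum, higher entropy, less predictable" (the spec_entropy helper below is ad hoc, not from any package; it simply normalizes the raw periodogram and takes its Shannon entropy):
spec_entropy <- function(x) {
  p <- spec.pgram(x, plot = FALSE)$spec   # raw periodogram from base R
  p <- p / sum(p)                         # normalize it to sum to 1
  -sum(p[p > 0] * log(p[p > 0]))          # Shannon entropy of the normalized spectrum
}
set.seed(1)
spec_entropy(rnorm(500))                                          # high: flat spectrum, white noise
spec_entropy(sin(2 * pi * (1:500) / 20) + rnorm(500, sd = 0.05))  # low: power concentrated at one frequency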
Regarding other assumptions, I'm not sure. I can't rule out things with any certainty, but I would venture to guess they're assuming stationarity.