Solved – A simpler way to calculate Exponentially Weighted Moving Average

algorithms, forecasting, time series, weighted mean

Proposed Method:

Given a time series $x_i$, I want to compute a weighted moving average with an averaging window of $N$ points, where the weightings favour more recent values over older values.

In choosing the weights, I am using the familiar fact that the geometric series $\sum_{k=1}^{\infty} (\tfrac{1}{2})^k$ converges to 1, provided infinitely many terms are taken.

To get a finite set of weights that sum to unity, I am simply taking the first $N$ terms $(\tfrac{1}{2})^k$, $k = 1, \dots, N$, of that series, and then normalising by their sum.

When $N=4$, for example, this gives the non-normalised weights (listed from oldest to most recent)

0.0625  0.1250  0.2500  0.5000

which, after normalising by their sum, gives

0.0667  0.1333  0.2667  0.5333

The moving average at each point is then simply the weighted sum of the 4 most recent values, using these normalised weights.

This method generalises in the obvious way to moving windows of length $N$, and seems computationally easy as well.
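
To make the proposed method concrete, here is a minimal sketch in Python/NumPy of the truncated, normalised weights and the resulting moving average; the function name `truncated_ewma` and the NumPy dependency are my own choices, not part of the question.

```python
import numpy as np

def truncated_ewma(x, N):
    """Weighted moving average over the last N points, with weights
    proportional to (1/2)^k (k = 1..N) and normalised to sum to one.
    The most recent point receives the largest weight."""
    w = 0.5 ** np.arange(N, 0, -1)   # (1/2)^N, ..., (1/2)^1: oldest to newest
    w /= w.sum()                     # normalise so the weights sum to unity
    x = np.asarray(x, dtype=float)
    # one weighted sum per window of the N most recent values
    return np.array([w @ x[i - N + 1 : i + 1] for i in range(N - 1, len(x))])

# With N = 4 the normalised weights are 0.0667, 0.1333, 0.2667, 0.5333,
# matching the worked example above.
```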

Question:

Is there any reason not to use this simple way to calculate a weighted moving average using 'exponential weights'?

I ask because the Wikipedia entry for EWMA seems more complicated, which makes me wonder whether the textbook definition of EWMA has some statistical properties that the simple definition above does not, or whether the two are in fact equivalent.

Best Answer

I've found that computing exponentially weighted running averages using $\overline{x} \leftarrow \overline{x} + \alpha (x - \overline{x})$, $\alpha<1$ is

  • a simple one-line method,
  • that is easily, if only approximately, interpretable in terms of an "effective number of samples" $N=\alpha^{-1}$ (compare this form with the recursive update for an ordinary running average, in which $\alpha$ is replaced by $1/n$),
  • only requires the current datum (and the current mean value), and
  • is numerically stable.
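
For concreteness, here is a minimal sketch of this one-line update; the function names and the choice to initialise with the first datum are my own, not prescribed by the answer.

```python
def ewma_update(mean, x, alpha):
    """One-line recursive update: mean <- mean + alpha * (x - mean), alpha < 1.
    Roughly, N = 1/alpha acts as the effective number of samples."""
    return mean + alpha * (x - mean)

def ewma_series(xs, alpha):
    """Apply the update over a whole series, initialising with the first datum."""
    mean = xs[0]
    out = [mean]
    for x in xs[1:]:
        mean = ewma_update(mean, x, alpha)
        out.append(mean)
    return out
```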

Technically, this approach does incorporate all history into the average. The two main advantages to using the full window (as opposed to the truncated one discussed in the question) are that in some cases it can ease analytic characterization of the filtering, and it reduces the fluctuations induced if a very large (or small) data value is part of the data set. For example, consider the filter result if the data are all zero except for one datum whose value is $10^6$.
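
As a hypothetical illustration of that last point, using the sketches above with (my own choices of) N = 4 and alpha = 0.25, one can compare the two filters' responses to a single spike:

```python
data = [0.0] * 10
data[3] = 1e6   # all zeros except one very large datum

# Truncated window: the spike's influence decays while it stays in the window,
# then vanishes abruptly once it drops out of the last N points.
print(truncated_ewma(data, N=4))

# Recursive full-history filter: a single jump followed by a smooth geometric
# decay, with no abrupt drop.
print(ewma_series(data, alpha=0.25))
```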
