Solved – Calculating mean of continuous time series

Tags: mean, time series, weighted mean

I'm calculating the arithmetic mean of values in a continuous time series, weighted by duration. Points in the time series are not guaranteed to be evenly spaced.

Example data:

Time    Value
0       1
1000    2
2000    3
3000    4
5000    5

Where Time is the duration since the start of the time series.

My original approach was to take the mean of each pair of adjacent points, weight that value by the duration between those points relative to the total duration, and sum the results:

Mean    Weight    Normalized weight   Result
1.5     1000      0.2                 0.3
2.5     1000      0.2                 0.5
3.5     1000      0.2                 0.7
4.5     2000      0.4                 1.8
                                      ---
                                      3.3

Resulting in a mean of 3.3
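
For concreteness, here is a rough Python sketch of that approach (the function name and plain-list inputs are just for illustration; it is the pairwise-mean weighting described above, which amounts to trapezoidal integration divided by the total duration):

    def interval_mean(times, values):
        """Mean of each adjacent pair of values, weighted by the duration of
        the interval between them (trapezoidal integration / total duration)."""
        total_duration = times[-1] - times[0]
        weighted_sum = 0.0
        for i in range(len(times) - 1):
            interval = times[i + 1] - times[i]
            pair_mean = (values[i] + values[i + 1]) / 2.0
            weighted_sum += pair_mean * interval
        return weighted_sum / total_duration

    times = [0, 1000, 2000, 3000, 5000]
    values = [1, 2, 3, 4, 5]
    print(interval_mean(times, values))  # 3.3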

Thankfully I stumbled across the MATLAB documentation for calculating the mean of timeseries data, which suggests my method is probably incorrect.

Their method assigns weights as such:

  1. Assign a weight to each point's value: the first point gets the duration of the first time interval, t(2) - t(1); the last point gets the duration of the last time interval, t(end) - t(end-1); and every other point gets the duration between the midpoint of the previous time interval and the midpoint of the next time interval.

  2. Normalize the weighting for each time by dividing each weighting by the mean of all weightings.

  3. Multiply the values for each point by its normalized weighting.

This results in the following (here the weights are normalized by their total, 6500, and the products summed, which gives the same result as dividing by the mean weight and then averaging):

Value   Weight   Normalized weight   Result
1       1000     0.153846            0.153846
2       1000     0.153846            0.307692
3       1000     0.153846            0.461538
4       1500     0.230769            0.923076
5       2000     0.307692            1.53846
        ----     --------            --------
        6500     1                   3.384612

Resulting in a mean of 3.384612
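
Running the example through a quick Python translation of those steps (my own sketch, not MATLAB code; it simply computes sum(w * v) / sum(w) with the documented weights) reproduces that value:

    def matlab_style_mean(times, values):
        """Weighted mean using the weighting scheme described in the MATLAB
        timeseries documentation: end points get the full duration of their
        neighbouring interval, interior points get the span between the
        midpoints of the surrounding intervals."""
        n = len(times)
        weights = [0.0] * n
        weights[0] = times[1] - times[0]        # first interval
        weights[-1] = times[-1] - times[-2]     # last interval
        for i in range(1, n - 1):
            weights[i] = (times[i + 1] - times[i - 1]) / 2.0  # midpoint to midpoint
        weighted_sum = sum(w * v for w, v in zip(weights, values))
        return weighted_sum / sum(weights)

    times = [0, 1000, 2000, 3000, 5000]
    values = [1, 2, 3, 4, 5]
    print(matlab_style_mean(times, values))  # ~3.3846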

My question is: Why are the weights in the MATLAB documentation calculated in the way they are? Specifically, why is the first point assigned the full duration of the first interval, and the last point the full duration of the last interval?

If the first and last points were instead assigned only half of those intervals (i.e. up to the midpoint), the resulting mean would be the same as with my original approach. Is my original approach wrong?
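
(To verify that claim: halving the end weights gives 500, 1000, 1000, 1500 and 1000, which total 5000; the weighted sum is 500×1 + 1000×2 + 1000×3 + 1500×4 + 1000×5 = 16500, and 16500 / 5000 = 3.3, exactly what my original approach gives.)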

Best Answer

It's not a matter of one approach being right and the other wrong; MATLAB simply assumes that your first and last samples also extend to their left and right by equal amounts. So, according to MATLAB, your signal effectively starts at t = -500 and ends at t = 6000, half of the first and last intervals beyond the recorded endpoints. This is a zero-order (sample-and-hold) interpolation, in which every sample extends to its left and right by equal amounts.
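
A small sketch of my own (Python/NumPy, using the question's data) makes this concrete: averaging the piecewise-constant, zero-order-hold signal over its extended support reproduces the MATLAB-documented figure, while averaging the linearly interpolated signal over only the recorded span reproduces the original approach.

    import numpy as np

    times = np.array([0, 1000, 2000, 3000, 5000], dtype=float)
    values = np.array([1, 2, 3, 4, 5], dtype=float)

    # Zero-order hold: each sample "owns" the span from the midpoint of the
    # previous interval to the midpoint of the next; the end samples extend
    # outward by half of their neighbouring interval, so the signal covers
    # t = -500 .. 6000.
    edges = np.concatenate((
        [times[0] - (times[1] - times[0]) / 2],
        (times[:-1] + times[1:]) / 2,
        [times[-1] + (times[-1] - times[-2]) / 2],
    ))
    zoh_mean = np.sum(values * np.diff(edges)) / (edges[-1] - edges[0])
    print(zoh_mean)   # ~3.3846, the MATLAB-documented result

    # Linear interpolation between samples (trapezoidal) over the observed
    # span t = 0 .. 5000.
    trap_mean = (np.sum((values[:-1] + values[1:]) / 2 * np.diff(times))
                 / (times[-1] - times[0]))
    print(trap_mean)  # 3.3, the original approach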
