Solved – Standard deviation non centered (normal) data

Tags: centering, dispersion, normal distribution, standard deviation

I'd like to get a prediction interval (or a good measure of dispersion) for my data, but the standard deviation does not seem to be the right way to do it because my data is not centered (or not normally distributed). See the image below:

[Image: Non-normal data]

The grey lines in the background represent my data, the blue dashed lines (upper and lower) represent mean +/- 2*std (a 95% prediction interval), and the center blue line is the mean. Clearly the mean - 2*std line (lower dashed) is far from my data, and the mean + 2*std line (upper dashed) is lower than it should be.

How could I get a good measurement of dispersion on this kind of data? And how could I get a prediction interval for the final data point?

Best Answer

When the variance of your data is not constant we call this heteroskedasticity.

In addition to the changing variance, it is quite clear that the data sit above the mean more than they sit below it. For something like return on investment I'd expect this kind of trend: the more money you have, the more money you can make. If it's impossible to go below -100% return, I'd recommend log-transforming the data so you get $X=\log(100+\text{Return})$. If there was an incident of exactly -100% return, use $X=\log(101+\text{Return})$ instead, to avoid taking $\log(0)$.
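As a rough illustration of the transform, here is a minimal Python sketch. The function names, the `offset` parameter and the sample returns are made up for the example; they are not from the question's data.

```python
import math

def log_transform(return_pct, offset=100.0):
    """Map a percentage return to log(offset + return)."""
    shifted = offset + return_pct
    if shifted <= 0:
        # Happens at -100% with offset=100; offset=101 avoids log(0).
        raise ValueError("offset + return must be positive; try offset=101")
    return math.log(shifted)

def inverse_log_transform(x, offset=100.0):
    """Map a value on the log scale back to a percentage return."""
    return math.exp(x) - offset

returns = [-50.0, 0.0, 25.0, 300.0]  # hypothetical percentage returns
transformed = [log_transform(r) for r in returns]
recovered = [inverse_log_transform(x) for x in transformed]
```

Any interval computed on the transformed scale can be mapped back to returns with `inverse_log_transform`, as long as the same offset is used in both directions.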

Your data is a time series, but I'm going to ignore that and treat it like independent points on each line. Heteroskedasticity is confusing enough without adding the effects of time series.

When dealing with changing variance, you first need to decide how the variance changes from month to month. A common assumption is that there is a linear trend in the variance, so the variance as a function of month is $\sigma^2 = c \cdot \text{[month]}$ for some constant $c$.
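One possible way to estimate $c$ is to compute the sample variance for each month and fit a line through the origin by least squares. The sketch below assumes the data is stored as a dictionary from month to a list of values; the function name `estimate_c` and the toy data are illustrative assumptions.

```python
import statistics

def estimate_c(data_by_month):
    """Least-squares fit through the origin of sample variance against month."""
    num = 0.0
    den = 0.0
    for month, values in data_by_month.items():
        s2 = statistics.variance(values)  # sample variance, needs >= 2 values
        num += month * s2
        den += month * month
    return num / den

# Toy data whose variance is exactly 2 * month (2 at month 1, 8 at month 4).
toy = {1: [-1.0, 1.0], 4: [-2.0, 2.0]}
c_hat = estimate_c(toy)  # 2.0
```

Plotting the monthly variances against month is a quick check of whether the linear assumption is reasonable before relying on the fitted $c$.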

1. Removing changing variance

The simplest way to account for the changing variance is to modify your data to make the variance constant, then calculate a prediction interval as for ordinary regression and convert the results back to the case of changing variance.

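For each month $m$, calculate the mean $\bar{x}_m$ and standardize the data as $x'=\frac{x-\bar{x}_m}{\sigma_m}$, where $\sigma_m=\sqrt{c \cdot \text{[month]}}$ is that month's standard deviation under the linear variance model above.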

Calculate the regression slope and intercept then calculate the prediction bands for the regression (formulas on page 11 of this pdf)

If the lower and upper bands for a month are $L'$ and $U'$, convert them back to the changing-variance scale using $L=\sigma_m L'+ \bar{x}_m$ and $U=\sigma_m U'+ \bar{x}_m$, where $\sigma_m=\sqrt{c \cdot \text{[month]}}$ is that month's standard deviation.
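The three steps above can be sketched in Python as follows. This is a simplified version, not the full procedure: instead of regression prediction bands it uses a plain mean +/- `z` pooled standard deviations on the standardized scale, and it assumes the month's standard deviation is $\sqrt{c \cdot \text{[month]}}$ with every month greater than zero. The function name and toy data are made up for the example.

```python
import math
import statistics

def prediction_bands(data_by_month, c, z=2.0):
    """Return {month: (lower, upper)} assuming variance = c * month (month > 0)."""
    means = {m: statistics.fmean(v) for m, v in data_by_month.items()}
    scales = {m: math.sqrt(c * m) for m in data_by_month}

    # Step 1: standardize so every month has roughly the same variance.
    standardized = [
        (x - means[m]) / scales[m]
        for m, values in data_by_month.items()
        for x in values
    ]

    # Step 2: interval on the constant-variance scale. One degree of freedom
    # is lost per monthly mean, hence the n - k divisor.
    n, k = len(standardized), len(data_by_month)
    s = math.sqrt(sum(v * v for v in standardized) / (n - k))
    lower_std, upper_std = -z * s, z * s

    # Step 3: convert back to the original, changing-variance scale.
    return {
        m: (scales[m] * lower_std + means[m], scales[m] * upper_std + means[m])
        for m in data_by_month
    }

data = {1: [-1.0, 1.0], 4: [-2.0, 2.0]}  # toy data, variance = 2 * month
bands = prediction_bands(data, c=2.0)
```

With this toy data the band at month 4 is twice as wide as the band at month 1, because the standard deviation grows with the square root of the month.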

2. Accounting for changing variance in the regression

A similar question on Stack Exchange has some answers which show how to account for changing variance. Since there's more variation in the data for larger month values, we want to give those less weight and base the regression more on the months with little variation/uncertainty. A common choice of weight is $\frac{1}{\text{month}}$; if month $0$ is in your data, modify this to $\frac{1}{\text{month}+1}$.

This Stack Exchange answer gives the R code to do this.
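For readers who do not use R, the same weighting idea can be sketched in plain Python. This is not the code from the linked answer. The function names are hypothetical, the weights are $\frac{1}{\text{month}}$ as described above, and the interval helper is only an approximation that ignores the uncertainty in the fitted slope and intercept.

```python
import math

def weighted_linear_fit(x, y, w):
    """Weighted least squares for y = intercept + slope * x.

    Returns (slope, intercept, sigma2), where sigma2 is the weighted
    residual variance sum(w * r^2) / (n - 2).
    """
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(wi * r * r for wi, r in zip(w, resid)) / (len(x) - 2)
    return slope, intercept, sigma2

def approx_interval(month, slope, intercept, sigma2, z=2.0):
    """Approximate band at a month, assuming Var(error) = sigma2 * month."""
    center = intercept + slope * month
    half_width = z * math.sqrt(sigma2 * month)
    return center - half_width, center + half_width

months = [1.0, 2.0, 3.0, 4.0]
values = [5.0, 7.0, 9.0, 11.0]       # toy data lying exactly on 3 + 2 * month
weights = [1.0 / m for m in months]  # use 1 / (m + 1) if month 0 occurs
slope, intercept, sigma2 = weighted_linear_fit(months, values, weights)
```

Calling `approx_interval(5.0, slope, intercept, sigma2)` would then give a rough band for a later month. For a proper prediction interval the parameter uncertainty should be added to the variance term.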