Minimize the standard deviation of a list of numbers by subtracting from each value

optimization, probability, standard-deviation, statistics

Suppose we have a list of values, $L = (x_1, x_2, …, x_n)$, and another list of numbers, $Q = (q_1, q_2, …, q_n)$, where $q_i$ is the maximum value we can subtract from $x_i$ (so from $x_i$ we can subtract any number in $[0; q_i]$). I want to find a combination of subtractions such that the standard deviation of the resulting list is minimal. (In other words, I want to find a set of values $R = (r_1, r_2, …, r_n)$, $r_i \in [0; q_i]$, such that the standard deviation of $L' = (x_1-r_1, x_2-r_2, …, x_n-r_n)$ is minimal.)
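To make the problem concrete: it is a box-constrained minimization, so it can be handed to a general-purpose solver. A small numerical sketch (made-up data, assuming NumPy and SciPy are available):

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([3.0, 7.0, 4.0, 9.0])   # hypothetical values x_i
q = np.array([1.0, 2.0, 0.5, 4.0])   # hypothetical budgets q_i

# Find r in the box [0, q] minimizing the standard deviation of x - r.
res = minimize(lambda r: np.std(x - r), x0=q / 2,
               bounds=[(0.0, qi) for qi in q])
r = res.x
print("reduced list:", x - r, " std:", np.std(x - r))
```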

A graphical example (the blue points are the plotted $x_i$):

[figure: scatter plot of the values $x_i$]

Any help would be appreciated. Thanks.

Best Answer

Here's an algorithmic way to do it.

Set up $3n$ registers labelled $ij\;$ ($i=1,...,n;\;j=1,2,3$). Initially $x_i$ is placed in registers $i1$ and $i2$ while $q_i$ is placed in register $i3\,$ ($i=1,...,n$); the content of register $i1$ will be updated at each stage, and called $x_i'$, while the respective contents $x_i$ and $q_i$ of registers $i2$ and $i3$ will be kept as a constant reference. A further register holds the mean $m'=(x_1'+\cdots+x_n')/n$ of the current numbers $x_i'$.

Reduce each (original) $x_i$ by whatever allowed amount minimizes the distance of the reduced number from $m'$, and update the content of register $i1$ with this reduced number. Thus, numbers $x_i'$ less than $m'$ remain unchanged, while a number $x_i'$ that exceeds $m'$ is reduced either to $m'$ (if $x_i-q_i\leqslant m'$) or to $x_i-q_i$ (if $x_i-q_i>m'$). The registers $i1\;$($i=1,...,n$) are then refreshed with the corresponding newly reduced numbers $x_i'$, and the new mean of these numbers replaces the previous mean in the allocated register.

Repeat the above operation until the process converges.
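In code, one pass of this register update is just a clamp of each original $x_i$ to the interval $[x_i-q_i,\,x_i]$ around the current mean. A minimal Python sketch (not from the answer; the tolerance and iteration cap are my own choices):

```python
def minimize_std(x, q, tol=1e-12, max_iter=100_000):
    """Iterate: pull each original x_i as close to the current mean m'
    as the budget q_i allows, i.e. clamp m' to [x_i - q_i, x_i]."""
    x = [float(v) for v in x]
    cur = x[:]                              # registers i1: current values x_i'
    for _ in range(max_iter):
        m = sum(cur) / len(cur)             # register holding the mean m'
        new = [min(max(xi - qi, m), xi) for xi, qi in zip(x, q)]
        if max(abs(a - b) for a, b in zip(new, cur)) < tol:
            return new                      # converged
        cur = new
    return cur
```

Incidentally, the variance is a convex function of the reductions over a box, so a local minimum here is also global; up to tolerance, the limit of this iteration should agree with a general-purpose solver such as the sketch in the question.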

To show the convergence and (at least local) optimality of the above algorithm, define $x_{ij}$ and $m_j$ recursively by $$x_{i1}:=x_i,\qquad m_j:=\frac1n\sum_{i=1}^nx_{ij},$$ $$x_{i,\,j+1}:=\begin{cases}\quad x_{ij}&\text{if}\quad x_{ij}\leqslant m_j,\\ \quad m_j&\text{if}\quad x_i-q_i<m_j<x_i,\\ x_i-q_i&\text{if}\quad m_j\leqslant x_i-q_i,\end{cases}$$ for $i=1,...,n$ and $j=1,2,...$. Since the sequence $(x_{i1},x_{i2},...)$ is non-increasing and bounded below by $x_i-q_i$, it has a limit, which we denote by $l_i$. We denote the mean $(l_1+\cdots +l_n)/n$ by $m$; note that $m_j\to m$ as $j\to\infty$, since each $x_{ij}\to l_i$.
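For a concrete instance of the recursion (a toy example of my own): take $n=2$, $x=(0,2)$, $q=(1,1)$. Then $m_1=1$; the first case gives $x_{12}=x_1=0$, and the third case (with equality) gives $x_{22}=x_2-q_2=1$. Now $m_2=\tfrac12$, both values are again fixed by the first and third cases, and the recursion is stationary: $l=(0,1)$ and $m=\tfrac12$.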

We now show that the variance of the data $l_1,...,l_n$ is minimal, in the sense that a small perturbation of any one of them increases the variance. For simplicity, let us shift the original data so that $m=0$. Then it follows from the definition of $x_{ij}$ that

$$l_i=\begin{cases}\quad x_i&\text{if}\quad x_i\leqslant 0,\\ \quad 0&\text{if}\quad x_i-q_i<0<x_i,\\ x_i-q_i&\text{if}\quad 0\leqslant x_i-q_i.\end{cases}$$ Suppose, for convenience, that the perturbed datum is the first one ($i=1$). The unperturbed variance may be written $$\frac1nl_1^2+\frac1n\sum_{i=2}^nl_i^2.$$ Suppose that $l_1$ becomes $l_1+n\varepsilon$, with $|\varepsilon|$ suitably small, so that the corresponding mean is perturbed to $\varepsilon$, and the variance becomes $$\frac1n(l_1+n\varepsilon-\varepsilon)^2+\frac1n\sum_{i=2}^n(l_i-\varepsilon)^2.$$ The increase in variance is therefore $$\frac2n(n-1)\varepsilon l_1-\frac2n\varepsilon\sum_{i=2}^nl_i+\frac{(n-1)^2+(n-1)}n\varepsilon^2$$ $$=2\varepsilon l_1-\frac2n\varepsilon\sum_{i=1}^nl_i+(n-1)\varepsilon^2$$ $$=2\varepsilon l_1+(n-1)\varepsilon^2,$$ since $\sum_{i=1}^nl_i=nm=0$. Consider the three cases for $i=1$ in the formula for $l_i$ above. In the first case, $l_1=x_1$ has not been reduced at all, so the only allowed perturbation is a reduction, namely with $\varepsilon$ negative, and $l_1$ cannot be positive either. So the first-order increase $2\varepsilon l_1$ in the variance cannot be negative. Similar reasoning shows that the variance increase is non-negative in the third case too, where $l_1=x_1-q_1\geqslant0$ is fully reduced and can only be perturbed upwards. In the middle case, the first-order term vanishes, but there is still a positive increase, from the second-order term $(n-1)\varepsilon^2$.
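This local-optimality argument can also be checked numerically. A self-contained sketch (random test data and tolerances of my own choosing) that runs the clamping iteration and then verifies that no small feasible single-entry perturbation lowers the variance:

```python
import random

def variance(v):
    m = sum(v) / len(v)
    return sum((t - m) ** 2 for t in v) / len(v)

random.seed(1)
x = [random.uniform(0.0, 10.0) for _ in range(8)]
q = [random.uniform(0.0, 3.0) for _ in range(8)]

# Run the clamping iteration to (numerical) convergence.
cur = x[:]
for _ in range(10_000):
    m = sum(cur) / len(cur)
    cur = [min(max(xi - qi, m), xi) for xi, qi in zip(x, q)]

base = variance(cur)
eps = 1e-4
for i in range(len(cur)):
    for d in (-eps, eps):
        t = cur[i] + d
        if x[i] - q[i] <= t <= x[i]:          # perturbation must stay feasible
            trial = cur[:i] + [t] + cur[i + 1:]
            assert variance(trial) >= base - 1e-12
print("no feasible single-entry perturbation of size", eps, "lowers the variance")
```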

Added note: Although the above method converges, there are extreme types of data for which the convergence is very slow. In particular, consider the case when $n=100,\,$ $q_1=\cdots=q_n=1$, $x_1=0,$ and $x_2=\cdots=x_n=1$. Then all the $x_{ij}$ converge to $0$, but after $100$ steps all but one are still about $1/\mathrm e$ from the limit. There is thus plenty of room for tweaking this algorithm to speed it up when the data are like this.
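A quick sketch of this example (same clamp iteration as above) makes the geometric rate visible: each step multiplies the distance of the tied values from their limit $0$ by $99/100$.

```python
n = 100
x = [0.0] + [1.0] * (n - 1)
q = [1.0] * n
cur = x[:]
for step in range(100):
    m = sum(cur) / n
    cur = [min(max(xi - qi, m), xi) for xi, qi in zip(x, q)]
# Entries 2..n now hold 0.99**100 ~= 0.366, still about 1/e from the limit 0.
print(cur[1], 0.99 ** 100)
```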