[Math] Parameter scaling in optimization

change-of-variable, linear-transformations, numerical methods, numerical optimization, optimization

I'm currently working on an iterative approach to solving an optimization problem. The implementation seems to be calculating biased directions, so a colleague suggested I look into parameter scaling. I was able to find some basic material on the matter in a textbook (Gill, Murray, Wright. 1982. Practical Optimization. 7.5: 273-275) as well as this website: http://www.esrf.eu/computing/scientific/FIT2D/MF/node2.html both of which seem to discuss the same idea.

Background:

For $f: \mathbb{R}^n \to \mathbb{R}$, we have the problem: \begin{equation} \underset{\vec{x} \in \mathbb{R}^n}{\text{min}} f(\vec{x}) \tag{1} \end{equation} with minimizer $\vec{x}^*$.

We have the scaled parameter $\vec{y} \in \mathbb{R}^n$ for a scaling matrix $D \in \mathbb{R}^{n \times n}$ and a translation vector $\vec{c} \in \mathbb{R}^n$: \begin{equation} \vec{y} = D^{-1}(\vec{x} - \vec{c}) \tag{2} \end{equation}

from which we can obtain our 'original' parameter $\vec{x}$ using: \begin{equation} \vec{x} = D \vec{y} + \vec{c} \tag{3} \end{equation}

So far, so good. By smartly choosing $D$ and $\vec{c}$, I can have $y_1, y_2, \ldots \in [-1, 1]$, which is exactly what I'd like to have.
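For concreteness, choosing $D$ and $\vec{c}$ from per-parameter bounds can be sketched as follows (a minimal NumPy sketch; the bounds and values are illustrative assumptions, not taken from my actual problem):

```python
import numpy as np

# Hypothetical per-parameter bounds (lo, hi); values are illustrative only.
lo = np.array([0.0, -500.0, 1e-3])
hi = np.array([10.0,  500.0, 1e+3])

# Choose c as the interval midpoints and D as a diagonal matrix of half-widths,
# so that y = D^{-1}(x - c) lands in [-1, 1]^n for any x within the bounds.
c = (hi + lo) / 2.0
D = np.diag((hi - lo) / 2.0)
D_inv = np.diag(2.0 / (hi - lo))

x = np.array([10.0, -500.0, 1e+3])   # a point on the bounds
y = D_inv @ (x - c)                  # Eqn. (2)
x_back = D @ y + c                   # Eqn. (3): exact round trip
```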

Missing information/lack of understanding:

Now, as I seem to have understood it, the idea is to scale $\vec{x}$ to $\vec{y}$ and to solve the problem: \begin{equation} \underset{\vec{y} \in \mathbb{R}^n}{\text{min}} f(D\vec{y} + \vec{c}) \tag{4} \end{equation} and then to scale the solution $\vec{y}^*$ back via Eqn. (3) to obtain our desired solution $\vec{x}^*$.
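Concretely, I imagine the scaled objective as a wrapper around $f$ (a minimal sketch with an assumed ill-scaled quadratic; the function and scaling are my own illustrative choices):

```python
import numpy as np

# Illustrative badly-scaled objective: curvature differs by a factor of 1e6.
def f(x):
    return x[0]**2 + 1e6 * x[1]**2

D = np.diag([1.0, 1e-3])   # assumed scaling chosen to undo that imbalance
c = np.zeros(2)

# The scaled problem minimizes a *new* function of y: f(D y + c), i.e.
# Eqn. (3) is applied before every evaluation of the original f.
def f_scaled(y):
    return f(D @ y + c)
```

With this choice, `f_scaled` is well-conditioned even though `f` is not: both coordinates of $\vec{y}$ now contribute on the same scale.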

However, this doesn't make sense to me: by scaling $\vec{x}$ to $\vec{y}$, have we not simply changed the starting/initial point with which we run the minimization algorithm? The way I see it, we have not changed $f(\vec{x})$ or its gradient at all, which means it is the same problem but with a different starting point.

If I were not looking for a global minimum, how does this idea make sense? What am I overlooking about this concept?

Edit

I followed the suggestion on this site: http://www.fitzgibbon.ie/optimization-parameter-scaling and my algorithm converged correctly. Now I only wish to understand the theory behind how this works. Any help would be much appreciated!

End of edit

Thanks in advance,
MotiveHunter

Best Answer

After reading more about it here and there, and trying out the idea with an example problem, I will attempt to answer this question.

For purposes of study, I selected a bivariate objective function composed of sine functions with a high-frequency X-component. This means the gradient of the function has a larger X-component than Y-component, i.e., a gradient-based algorithm should take larger steps in the X-direction.

In the method itself:

  • the point is first scaled using Eqn. (2) to obtain $\vec{y}$. This is, mathematically speaking, simply an affine transformation (a scaling and a shift) of the initial vector. No "magic" has occurred yet.

  • In every iteration of the algorithm, the inverse scaling operation of Eqn. (3) is applied to the point before evaluating the objective function and/or its gradient. Once the step size has been calculated, however, the algorithm always operates on the scaled parameters.

    • the main intended effect is that, from the perspective of the algorithm, in the ideal case, all parameters exist in the same, small interval.
    • the inverse scaling described by Eqn. (3) also affects the gradient $g(\vec{x})$ of the objective function; by the chain rule: \begin{equation} g_{\text{scaled}} = D^{\top} \, g(\vec{x}) \end{equation} (which equals $D \, g(\vec{x})$ for the usual diagonal, symmetric $D$). Assuming the scaling matrix has been properly chosen, this causes the algorithm to take steps which are, e.g., half as long in the X-direction as those which would be taken without the scaling. This is a bonus when using gradient-based methods.
  • The final solution is scaled back via Eqn. (3) to obtain $\vec{x}^*$.
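The steps above can be sketched with plain gradient descent (an assumed method; the objective, scaling matrix, starting point, and step size are all illustrative choices of mine, not from the original problem):

```python
import numpy as np

# Illustrative badly-scaled objective and its analytic gradient.
def f(x):
    return x[0]**2 + 1e6 * x[1]**2

def grad_f(x):
    return np.array([2.0 * x[0], 2e6 * x[1]])

D = np.diag([1.0, 1e-3])   # assumed scaling: per-parameter half-widths
c = np.zeros(2)

# Eqn. (2): scale the (assumed) starting point into y-space.
y = np.linalg.solve(D, np.array([5.0, 0.004]) - c)

lr = 0.4
for _ in range(100):
    x = D @ y + c                  # Eqn. (3): back to original coordinates
    g_scaled = D.T @ grad_f(x)     # chain rule: gradient of f(D y + c) w.r.t. y
    y = y - lr * g_scaled          # the step itself is taken in scaled space

x_star = D @ y + c                 # scale the final solution back
```

In scaled coordinates this particular objective becomes $y_1^2 + y_2^2$, so the same fixed step size works equally well in both directions; on the unscaled problem, a step size that small in the second coordinate would crawl in the first.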

To a gradient-based algorithm, an arbitrary domain with local maxima and minima which has been discretized with a certain step size (decided by the gradient) looks like this: [figure: the objective sampled at the original step size]

The same domain, when discretized with half the above step size, looks like this:

[figure: the objective sampled at half the step size]

With the scaling, one notices that "hidden" optimum points have suddenly appeared in the interval $[0.7, 1.0]$. It is as if a lens had been placed over the domain.

I hope that everything I've written in this answer is correct. Suggestions/modifications are very welcome and will be incorporated as edits.
