Let us discuss the example you were given. In general, this optimization method (the method of Lagrange multipliers) uses the following strategy. Let $f(x,y,z)$ be the function whose critical points we want to find, subject to the constraint equation $$g(x,y,z)=k$$ for some $k \in \mathbb{R}$. We solve the system
$$\begin{cases}\nabla f(x,y,z) = \lambda \nabla g(x,y,z) \\ g(x,y,z)=k\end{cases}$$
which amounts to four equations in the four unknowns $x$, $y$, $z$, and $\lambda$ (here $\nabla$ denotes the gradient, the vector of partial derivatives with respect to $x$, $y$, and $z$).
In this case, we have $f(x,y,z)=2x+y-2z$ and $g(x,y,z)=x^2+y^2+z^2$ with $k=4$ (the constraint surface is a sphere of radius $2$). Thus, we have the following system of equations:
$$\begin{cases}2 = 2\lambda x & (f_x = \lambda g_x) \\ 1 = 2\lambda y & (f_y=\lambda g_y)\\ -2 = 2\lambda z & (f_z= \lambda g_z)\\ x^2+y^2+z^2=4\end{cases}$$
There are various ways to solve this; we will proceed as follows. Multiplying the first equation by $yz$, the second by $xz$, and the third by $xy$ makes every right-hand side equal to $2\lambda xyz$, so the left-hand sides must all be equal to one another: $$2\lambda xyz = \begin{cases} 2yz \\ xz \\ -2xy \end{cases}$$
Note first that none of $x$, $y$, $z$ can be zero, since otherwise one of the first three equations would read $2=0$, $1=0$, or $-2=0$. So we have $x = 2y$ upon dividing $2yz=xz$ by $z$. We also have $z=-2y$ upon dividing $xz=-2xy$ by $x$. Finally, we have $x=-z$ upon dividing $2yz = -2xy$ by $2y$. Substituting $x=2y$ and $z=-2y$ into the fourth equation gives
$$x^2 +y^2 +z^2 =4 \implies 4y^2 + y^2 + 4y^2 = 9y^2 = 4 \implies y = \pm \frac{2}{3}$$
I will let you solve for the other $3$ unknowns (consider each case separately: assume $y = -\frac{2}{3}$ and solve for $x,z,\lambda$ and then assume $y=\frac{2}{3}$ and solve for $x,z,\lambda$). Recall from before that $z = -2y$ and $x=-z$. You will find the two solutions
$$(x,y,z,\lambda)=\left(\pm \frac{4}{3},\pm \frac{2}{3}, \mp \frac{4}{3},\pm \frac{3}{4}\right) .$$
These solutions $(x,y,z)$ are the critical points of the function $f$ under the constraint $g(x,y,z)=4$, and there are several ways to classify them (as, for instance, maxima, minima, or saddle points). Here, since $f$ is continuous and the sphere is compact, $f$ attains its extrema on it, so comparing the values of $f$ at the two critical points identifies the maximum and the minimum.
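If you want to sanity-check the algebra, here is a short SymPy script (my addition; it assumes SymPy is available) that solves the same four-equation system symbolically:

```python
import sympy as sp

x, y, z, lam = sp.symbols('x y z lam', real=True)

f = 2*x + y - 2*z
g = x**2 + y**2 + z**2

# Lagrange conditions grad f = lam * grad g, plus the constraint g = 4
eqs = [sp.Eq(sp.diff(f, v), lam * sp.diff(g, v)) for v in (x, y, z)]
eqs.append(sp.Eq(g, 4))

for sol in sp.solve(eqs, [x, y, z, lam], dict=True):
    print(sol, '  f =', f.subs(sol))
# (4/3, 2/3, -4/3) with lam = 3/4 gives f = 6 (the maximum);
# (-4/3, -2/3, 4/3) with lam = -3/4 gives f = -6 (the minimum).
```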
I am not a big fan of Lagrangian relaxation. I have never found it generic enough. I would rather go with a linear fusion approach.
Let's introduce more matrices so that we can express a more powerful problem:
- $X_h$ approximates $X^H$ (the conjugate transpose of $X$),
- $X_i$ approximates $X^{-1}$ if it exists, or some well-behaved pseudoinverse $X^{\dagger}$ if it doesn't.
Now we can write $$\|X_i\Sigma - X_h\|_F$$
and add the regularization terms $\alpha\|AX-B\|$, $\beta\|X_i X - I\|$, and $\gamma\|T X_h - X\|$, where $T$ performs the conjugate transpose on the vectorization.
One nonlinearity remains here: the $\beta$ term, which is bilinear in the unknowns $X_i$ and $X$.
So a single linear least-squares solve will not be enough.
But we can do an iterated two-stage linear least squares:
- the first phase optimizes $X$ and $X_h$ with $X_i$ held fixed,
- the second phase optimizes $X_i$ with $X$ and $X_h$ held fixed.
Using Kronecker products, we can express the "$T$" operation above as a matrix acting on the vectorization of $X_h$.
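For real matrices the matrix in question is the commutation matrix $K$, which satisfies $K\,\mathrm{vec}(X_h)=\mathrm{vec}(X_h^T)$; for complex matrices one additionally conjugates, which is still linear over the real and imaginary parts. A minimal NumPy demonstration (my own illustration, not the poster's code):

```python
import numpy as np

def commutation_matrix(m, n):
    # Builds K such that K @ vec(X) == vec(X.T) for any real m x n matrix X,
    # where vec() stacks columns (column-major order).
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # vec(X)[j*m + i] = X[i, j] must land at vec(X.T)[i*n + j]
            K[i * n + j, j * m + i] = 1.0
    return K

vec = lambda M: M.flatten(order='F')   # column-major vectorization
X = np.random.randn(3, 2)
K = commutation_matrix(3, 2)
assert np.allclose(K @ vec(X), vec(X.T))
```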
The linear fusion comes into play when we connect $X$ and $X_h$ through the $\gamma$-regularization, and $X$ and $X_i$ through the $\beta$-regularization, as sketched below.
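Here is one way the iteration could look, as a sketch under simplifying assumptions of my own: everything is real and square $n\times n$ (so $X^H=X^T$ and the $T$ operator is just the commutation matrix $K$), the penalty weights are applied directly to the stacked least-squares rows, and the iteration count is fixed. The function name `fuse` and the initialization of $X_i$ are mine, not the poster's.

```python
import numpy as np

def vec(M):
    return M.flatten(order='F')           # column-major vectorization

def unvec(v, n):
    return v.reshape((n, n), order='F')

def commutation_matrix(n):                # the same K as in the previous snippet
    K = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            K[i * n + j, j * n + i] = 1.0
    return K

def fuse(A, B, Sigma, alpha=1.0, beta=1.0, gamma=1.0, iters=20):
    n = A.shape[1]
    I_n, I_v = np.eye(n), np.eye(n * n)
    Z = np.zeros((n * n, n * n))
    K = commutation_matrix(n)
    X_i = np.eye(n)                       # initial guess for the (pseudo)inverse

    for _ in range(iters):
        # Phase 1: X_i fixed; solve for u = [vec(X); vec(X_h)] -- every term linear.
        M1 = np.vstack([
            np.hstack([Z, I_v]),                        #        X_h = X_i @ Sigma
            alpha * np.hstack([np.kron(I_n, A), Z]),    #        A X = B
            beta  * np.hstack([np.kron(I_n, X_i), Z]),  #      X_i X = I
            gamma * np.hstack([-I_v, K]),               # K vec(X_h) = vec(X)
        ])
        r1 = np.concatenate([vec(X_i @ Sigma), alpha * vec(B),
                             beta * vec(I_n), np.zeros(n * n)])
        u = np.linalg.lstsq(M1, r1, rcond=None)[0]
        X, X_h = unvec(u[:n * n], n), unvec(u[n * n:], n)

        # Phase 2: X and X_h fixed; solve for vec(X_i).
        M2 = np.vstack([np.kron(Sigma.T, I_n),          #  X_i Sigma = X_h
                        beta * np.kron(X.T, I_n)])      #      X_i X = I
        r2 = np.concatenate([vec(X_h), beta * vec(I_n)])
        X_i = unvec(np.linalg.lstsq(M2, r2, rcond=None)[0], n)

    return X, X_h, X_i
```

Each phase is an ordinary linear least-squares solve; the one nonlinear coupling (the $\beta$ term) is handled entirely by the alternation.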
Best Answer
Vectors will be typeset as rows, and matrices typeset as lists of rows.
Theorem. As we vary $X$ over all skew-symmetric matrices, the smallest possible length for $E= AX-B$ is $\|E\|= \frac{|A\cdot B|}{\|A\|}$.
Proof. Use Gram–Schmidt to construct an orthonormal basis for the space, chosen so that your two given vectors $A$ and $B$ take the form $A=(a,0,0,\ldots,0)$ and $B=(b_1,b_2,0,\ldots,0)$. Any skew-symmetric matrix has the structure $X=((0,s,\ldots);(-s,0,\ldots);(\ldots);\ldots)$ for some unknown scalar $s$. (The suppressed terms in this matrix, denoted by $\ldots$, have no effect in the computation that follows.) Note that the row vector $AX=(0, as, \ldots)$, and thus the error vector $E=AX-B= (-b_1,\, as-b_2,\,\ldots)$. The length of this error vector $E$ is minimized when the suppressed terms $\ldots$ are zero and when $s=b_2/a$; with this choice the minimum length is $\|E\|= |b_1|=\frac{|A\cdot B|}{\|A\|}$, since $A\cdot B = ab_1$ and $\|A\|=|a|$.
P.S. In general you can see that, since $AX \perp A$ always holds (proof below), there is no hope that $AX$ can cancel out the component of $B$ that is parallel to $A$.
Proof. If $X$ is skew-symmetric, then $\langle AX, A\rangle = \langle A, AX^t\rangle = -\langle A, AX\rangle = -\langle AX, A\rangle$, so $\langle AX, A\rangle = 0$; that is, $AX\perp A$.
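A quick numerical check of both claims (my addition): expand $X$ in the skew-symmetric basis $E_{ij}-E_{ji}$, which turns the minimization of $\|AX-B\|$ into an ordinary least-squares problem in the coefficients, then compare the residual with $|A\cdot B|/\|A\|$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal(n)               # the row vector A
B = rng.standard_normal(n)

# Basis of skew-symmetric matrices: S = E_ij - E_ji for i < j.
basis = []
for i in range(n):
    for j in range(i + 1, n):
        S = np.zeros((n, n))
        S[i, j], S[j, i] = 1.0, -1.0
        basis.append(S)

# A X is linear in the coefficients c:  A X = sum_k c_k (A S_k).
M = np.column_stack([A @ S for S in basis])
c = np.linalg.lstsq(M, B, rcond=None)[0]
X = sum(ck * S for ck, S in zip(c, basis))

print(np.linalg.norm(A @ X - B))         # minimum error found numerically
print(abs(A @ B) / np.linalg.norm(A))    # the theorem's value: the two agree
print(A @ (A @ X))                       # ~0, confirming A X is orthogonal to A
```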