I found the following paper to be useful:
Edelman, A., Arias, T. A., & Smith, S. T. (1998).
The geometry of algorithms with orthogonality constraints.
SIAM Journal on Matrix Analysis and Applications, 20(2), 303-353.
This paper has far more information than you may need, in terms of differential geometry and higher-order optimization methods.
However, the information needed to answer your question is actually quite straightforward, and is in fact contained mostly in Equations 2.53 and 2.70, which are both of the form
$$\nabla{f}=G-X\Phi$$
where the nominal gradient
$$G=\frac{\partial{f}}{\partial{X}}$$
is corrected to the constrained gradient $\nabla{f}$ by subtracting off its component $X\Phi$ along the current solution $X$. This component is normal to the manifold (analogous to the centripetal direction in circular motion), so removing it ensures the corrected gradient is tangent to the manifold.
Note: These formulas assume you are already on the manifold, i.e. $X^TX=I$. So in practice you will need to make sure your initial condition is appropriate (e.g. $X_0=I$, possibly rectangular). You may also need to occasionally correct accumulated roundoff and/or truncation error (e.g. via the SVD; see the "ZCA" example below).
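A minimal NumPy sketch of that periodic correction step (the function name `retract` and the test data are my own): the polar factor $UV^T$ from the SVD is the nearest matrix with orthonormal columns in the Frobenius norm, so replacing $X$ by it snaps the iterate back onto the manifold.

```python
import numpy as np

def retract(X):
    # Project X back onto the manifold X^T X = I via the SVD:
    # the polar factor U V^T is the nearest orthonormal-column matrix.
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
X = np.eye(4, 2) + 1e-3 * rng.standard_normal((4, 2))  # drifted off the manifold
X = retract(X)
print(np.allclose(X.T @ X, np.eye(2)))  # True
```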
In the unconstrained case, $\Phi=0$, while in the constrained case $\Phi$ takes two forms:
$$\Phi_{\mathrm{\small{G}}}=X^TG\implies\nabla_{\mathrm{\small{G}}}{f}=\left(I-XX^T\right)G$$
which corresponds to the "Grassmann manifold". The distinction here is that $\nabla_{\mathrm{\small{G}}}$ is insensitive to rotations, since for a rotation $Q^TQ=I$ and $X=X_0Q$, we have $XX^T=X_0X_0^T$.
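A quick NumPy sanity check of both claims (the random test data is mine): the Grassmann gradient has no component along $X$, and $XX^T$ is unchanged by a rotation $X\mapsto XQ$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 2
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # point on the manifold, X^T X = I
G = rng.standard_normal((n, p))                   # nominal gradient
grad_G = (np.eye(n) - X @ X.T) @ G                # Grassmann gradient

# Tangency: the corrected gradient has no component along X.
print(np.allclose(X.T @ grad_G, 0))  # True

# Rotation insensitivity: X X^T is invariant under X -> X Q.
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))  # a p x p rotation/reflection
print(np.allclose((X @ Q) @ (X @ Q).T, X @ X.T))  # True
```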
The second form is
$$\Phi_{\mathrm{\small{S}}}=G^TX\implies\nabla_{\mathrm{\small{S}}}{f}=G-XG^TX$$
which corresponds to the "Stiefel manifold", and is sensitive to rotations.
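For the Stiefel case, tangency looks slightly different: on the manifold, $X^T\nabla_{\mathrm{\small{S}}}f = X^TG - G^TX$ is skew-symmetric rather than zero. A small NumPy check (random test data is mine):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 2
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # point on the manifold
G = rng.standard_normal((n, p))                   # nominal gradient
grad_S = G - X @ G.T @ X                          # Stiefel gradient

# Stiefel tangency condition: X^T * grad is skew-symmetric.
M = X.T @ grad_S
print(np.allclose(M, -M.T))  # True
```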
A simple example is approximation of a given matrix $A\in\mathbb{R}^{n\times{p}}$ with $p\leq{n}$ by an orthogonal matrix $X$, minimizing the least-squares error. In this case we have
$$f[X]=\tfrac{1}{2}\|X-A\|_F^2=\tfrac{1}{2}\sum_{ij}\left(X_{ij}-A_{ij}\right)^2\implies{G}=X-A$$
The unconstrained case $\nabla{f}=G$ has solution $X=A$, because we are not concerned with ensuring $X$ is orthogonal.
For the Grassmann case we have
$$\nabla_{\mathrm{\small{G}}}{f}=\left(XX^T-I\right)A=0$$
This can only have a solution if $A$ is square rather than "skinny": when $p<n$, $XX^T$ is a rank-$p$ projection, so $\left(XX^T-I\right)A=0$ forces every column of $A$ into the $p$-dimensional column space of $X$, which a generic $A$ will not satisfy.
For the Stiefel case, we have
$$\nabla_{\mathrm{\small{S}}}{f}=XA^TX-A=0$$
which can be solved even when $p<n$.
These two cases, Grassmann vs. Stiefel, essentially correspond to the difference between "PCA vs. ZCA whitening". In terms of the SVD, if the input matrix is $A=USV^T$, then the solutions are $X_{\mathrm{\small{G}}}=U$ and $X_{\mathrm{\small{S}}}=UV^T$. The PCA solution $X_{\mathrm{\small{G}}}$ only applies to a square input, i.e. $A$ must be the "covariance matrix". However the ZCA solution $X_{\mathrm{\small{S}}}$ can be used when $A$ is a "data matrix". (This is more properly known as the Orthogonal Procrustes problem.)
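The Stiefel/ZCA solution for a skinny $A$ can be verified numerically (random test data is mine): with $A=USV^T$ and $X_{\mathrm{\small{S}}}=UV^T$, we get $XA^TX=UV^T\,VSU^T\,UV^T=USV^T=A$, so the Stiefel gradient vanishes.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3))                  # "data matrix", p < n
U, S, Vt = np.linalg.svd(A, full_matrices=False)
X = U @ Vt                                       # ZCA / Procrustes solution

print(np.allclose(X.T @ X, np.eye(3)))           # orthonormal columns: True
grad_S = X @ A.T @ X - A                         # Stiefel stationarity condition
print(np.allclose(grad_S, 0))                    # True
```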
If your problem is a multiobjective optimization problem with constraints, and the objectives and/or constraints are nonlinear or nonconvex, then an appropriate choice is an evolutionary multiobjective optimization method.
In terms of software, both of the aforementioned packages implement Deb's popular NSGA-II algorithm.
Please tell us if you succeed in using these for your problem and if you have any questions.