I can answer this:
What is the affine transformation converting world coordinates to camera coordinates? (camera world coordinates: $c=(c_x,c_y,c_z)^\top$, visual center world coordinates, $v=(v_x,v_y,v_z)^\top$)
I'm assuming the traditional camera image coordinates (before projection) having $z$ drilling "into" the image, $x$ pointing from left to right, and $y$ pointing downward.
Now let's track how the axes must be rotated without translation:
1. the new $z$ axis ($z'$) will point along $v-c$.
1. the new $x$ axis ($x'$) is perpendicular to $z$ and $z'$
1. the new $y$ axis ($y'$) is perpendicular to $x'$ and $z'$.
You can find three vectors that point along the new axes in world coordinates, normalize them, then put them in the rows of a $3\times 3$ matrix $R$: this converts world coordinates to rotated camera orientation.
Finally, if you know the translation $t$ in world coordinates (here it would be $t=-c=(-10,-10,-10)^\top$, which moves the camera's world position to the origin), then the translation in camera coordinates is $t'=Rt$.
Let's actually carry this out for your example. Let's work on a triad of orthogonal vectors:
$z'=(-1,-1,-1)$, pointing in the direction the camera must face.
$x'=z'\times z=(-1,1,0)^\top$
$y'=z'\times x'=(1,1,-2)^\top$
Normalizing these and using them as the rows of a matrix you get:
$$
R=\frac{1}{\sqrt{6}}\begin{bmatrix}
-\sqrt{3}&\sqrt{3}&0\\
1&1&-2\\
-\sqrt{2}&-\sqrt{2}&-\sqrt{2}
\end{bmatrix}
$$
Then $t'=Rt=(0,0,10\sqrt{3})$.
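If you'd like to check this numerically, here is a minimal sketch in plain Python. The helper names (`sub`, `cross`, `normalize`, `matvec`, `world_to_camera`) are my own, not from any library, and the function assumes the viewing direction $v-c$ is not parallel to the world $z$-axis (otherwise the cross product vanishes).

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(a):
    n = math.sqrt(sum(x * x for x in a))
    return [x / n for x in a]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def world_to_camera(c, v, world_z=(0.0, 0.0, 1.0)):
    """Return (R, t') such that X_cam = R X_world + t'.

    Assumes v - c is not parallel to world_z.
    """
    z_axis = normalize(sub(v, c))              # z' points along v - c
    x_axis = normalize(cross(z_axis, world_z)) # x' = z' x z
    y_axis = cross(z_axis, x_axis)             # y' = z' x x' (unit length already)
    R = [x_axis, y_axis, z_axis]               # new axes go in the rows
    t = [-ci for ci in c]                      # t = -c in world coordinates
    return R, matvec(R, t)

R, t_cam = world_to_camera(c=(10.0, 10.0, 10.0), v=(0.0, 0.0, 0.0))
print(R)      # rows proportional to (-1,1,0), (1,1,-2), (-1,-1,-1)
print(t_cam)  # approximately (0, 0, 10*sqrt(3))
```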
Notice that the angle of declination is an odd angle near $35^\circ$ rather than exactly $45^\circ$. (I had a hard time seeing this at first, but if you draw a cube and check the angle between $(1,1,0)$ and $(1,1,1)$ you'll see what I mean.)
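A quick way to see the "near $35^\circ$" figure is to compute the angle between $(1,1,0)$ and $(1,1,1)$ directly. This is only a small sketch, and the variable names are mine:

```python
import math

a = (1.0, 1.0, 0.0)   # diagonal of the cube's bottom face
b = (1.0, 1.0, 1.0)   # space diagonal of the cube

dot = sum(x * y for x, y in zip(a, b))
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(x * x for x in b))

angle_deg = math.degrees(math.acos(dot / (norm_a * norm_b)))
print(angle_deg)  # roughly 35.26
```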
Now you've converted world coordinates to a rotated frame that is aligned with your camera's frame, but differs from it by a translation.
This gives you the resulting affine transformation $\begin{bmatrix}R&t'\\0_{1\times 3}&1\end{bmatrix}$
which carries world coordinates to camera coordinates.
As a sanity check, you can confirm that the world's origin maps to camera $(0,0,10\sqrt{3})^\top$ and that the world camera location $(10,10,10)$ now maps to the camera's origin. A third check of your choice should be sufficient to convince you this is the right $R$ and $t'$.
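The sanity checks can be sketched in a few lines of plain Python. The $4\times 4$ matrix `A` and the helper `apply` are my own names; the third test point $(0,0,10)^\top$ is just one arbitrary choice.

```python
import math

s = 1.0 / math.sqrt(6.0)
R = [[-math.sqrt(3) * s,  math.sqrt(3) * s,  0.0],
     [ s,                 s,                -2.0 * s],
     [-math.sqrt(2) * s, -math.sqrt(2) * s, -math.sqrt(2) * s]]
t_cam = [0.0, 0.0, 10.0 * math.sqrt(3)]

# Homogeneous matrix [[R, t'], [0, 1]]
A = [R[i] + [t_cam[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def apply(A, p):
    """Apply the affine map A to a 3D point p."""
    h = list(p) + [1.0]
    out = [sum(a * x for a, x in zip(row, h)) for row in A]
    return out[:3]

origin_cam = apply(A, (0.0, 0.0, 0.0))     # approximately (0, 0, 10*sqrt(3))
camera_cam = apply(A, (10.0, 10.0, 10.0))  # approximately (0, 0, 0)
above_cam = apply(A, (0.0, 0.0, 10.0))     # a point above the world origin
print(origin_cam, camera_cam, above_cam)
```

The third point comes out with a negative $y$ and a positive $z$ in camera coordinates, which is what you'd expect with $y$ pointing downward: a point above the visual center lands in the upper half of the image, in front of the camera.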
One caveat: I'm not 100% sure the cross product $z'\times z$ should always be taken in this order. I picked it this way on this occasion because it gave the right orientation for $x'$ and $y'$ in the end; reversing the order flips the sign of $x'$, and with it $y'$. Hopefully that is all consistent, but maybe there is some sign ambiguity after all.
The second question is how to construct the "UP" vector.
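I don't understand what you are asking. If you mean the camera coordinates for the direction of the world $z$-axis, then that would just be $R(0,0,1)^\top$; a direction is unaffected by the translation, whereas the world point $(0,0,1)^\top$ maps to $R(0,0,1)^\top+t'$.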
Finally, I will have to rotate camera as well from "landscape" to "portrait" orientation.
I'm interpreting this to mean that you'd want to rotate the image plane so that the $y$-axis is horizontal, which could be done with a $\pi/2$ rotation in either direction around the camera $z$-axis.
This transformation should be entirely obvious:
$$U=
\begin{bmatrix}
0&-1&0\\
1&0&0\\
0&0&1\end{bmatrix}
$$
$U$ gives the rotation in the clockwise direction around the $z$ axis (which would look to be counterclockwise if you are looking up the $z$ axis into the picture) and $U^\top$ would give the rotation in the other direction.
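To see what $U$ does to the image axes, here is a small sketch in plain Python (integer arithmetic, so the results are exact). The helper names are mine.

```python
U = [[0, -1, 0],
     [1,  0, 0],
     [0,  0, 1]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

x_image = matvec(U, [1, 0, 0])       # [0, 1, 0]: x axis goes to the y axis
y_image = matvec(U, [0, 1, 0])       # [-1, 0, 0]: y axis goes to the -x axis
undo = matmul(transpose(U), U)       # identity: U^T reverses U
full_turn = matmul(matmul(U, U), matmul(U, U))  # identity: four quarter turns
print(x_image, y_image, undo, full_turn)
```

To apply the portrait rotation after the world-to-camera map, you would compose it as $UR$ and $Ut'$; since $t'$ lies along the camera $z$-axis in the example, $Ut'=t'$ there.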
Working in homogeneous coordinates, the Cartesian equation $ax+by+cz+d=0$ can be expressed as $\mathbf\pi^T\mathbf X=0$, where the homogeneous vector $\mathbf\pi=[a:b:c:d]$. If $\mathtt M$ is a nonsingular transformation matrix, then $$\mathbf\pi^T\mathbf X=\mathbf\pi^T\mathtt M^{-1}\mathtt M\mathbf X=(\mathtt M^{-T}\mathbf\pi)^T(\mathtt M\mathbf X) = 0,$$ which shows that the vectors that represent planes are covariant: if points transform as $\mathbf X'=\mathtt M\mathbf X$, then planes transform as $\mathbf\pi'=\mathtt M^{-T}\mathbf\pi$.
In your case, the equation of the plane in camera coordinates is given by the point-normal form $\mathbf N\cdot(\mathbf X-\mathbf P)=0$, so we have $\mathbf\pi_C=[\mathbf N^T;-\mathbf N^T\mathbf P]^T$. We have for the world-to-camera mapping the matrix $\mathtt M = \left[\begin{array}{c|c}\mathtt R & \mathbf T\end{array}\right]$ and so camera-coordinate planes are transformed into world coordinates by $(\mathtt M^{-1})^{-T} = \mathtt M^T$, i.e., $$\mathbf\pi_W = \mathtt M^T\mathbf\pi_C = \left[\begin{array}{c|c} \mathtt R^T & \mathbf 0 \\ \hline \mathbf T^T & 1\end{array}\right]\begin{bmatrix} \mathbf N \\ -\mathbf N^T \mathbf P\end{bmatrix} = \begin{bmatrix} \mathtt R^T \mathbf N \\ \mathbf N^T\mathbf T-\mathbf N^T\mathbf P \end{bmatrix}.$$
For your example, $\mathbf\pi_C = [1,2,1,-9]^T$ and $$\mathbf\pi_W = \left[\begin{array}{rrrr}0&0&1&0\\-1&0&0&0\\0&-1&0&0\\3&3&9&1\end{array}\right]\left[\begin{array}{r}1\\2\\1\\-9\end{array}\right] = \left[\begin{array}{r}1\\-1\\-2\\9\end{array}\right]$$ and so the equation of the plane in world coordinates is $x-y-2z+9=0$.
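As a rough numerical check of $\mathbf\pi_W=\mathtt M^T\mathbf\pi_C$, here is a sketch in plain Python. The values of `R` and `T` are read off from the transposed matrix above (so $\mathbf T=(3,3,9)^T$); the variable names are my own.

```python
# World-to-camera rotation and translation, read off from M^T above
R = [[0, -1,  0],
     [0,  0, -1],
     [1,  0,  0]]
T = [3, 3, 9]

# 4x4 world-to-camera matrix M = [[R, T], [0, 1]]
M = [R[i] + [T[i]] for i in range(3)] + [[0, 0, 0, 1]]
M_t = [list(col) for col in zip(*M)]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

pi_c = [1, 2, 1, -9]        # plane in camera coordinates
pi_w = matvec(M_t, pi_c)    # plane in world coordinates
print(pi_w)                 # [1, -1, -2, 9]

# Incidence check: a world point on pi_w must map to a camera point on pi_c
p_w = [-9, 2, -1, 1]
p_c = matvec(M, p_w)        # [1, 4, 0, 1]
on_world_plane = sum(a * x for a, x in zip(pi_w, p_w))   # 0
on_camera_plane = sum(a * x for a, x in zip(pi_c, p_c))  # 0
```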
Using your approach, transform $\mathbf P_C$ to world coordinates: $$\mathbf P_W = \left[\begin{array}{c|c} \mathtt R^T & -\mathtt R^T\mathbf T \end{array}\right] \begin{bmatrix}\mathbf P_C\\1\end{bmatrix} = \left[\begin{array}{rrrr}0&0&1&-9\\-1&0&0&3\\0&-1&0&3\end{array}\right]\begin{bmatrix}1\\4\\0\\1\end{bmatrix} = \left[\begin{array}{r}-9\\2\\-1\end{array}\right].$$ Compared to what you described in your question, it looks like you only translated $\mathbf P_C$, but to convert to world coordinates it must be both translated and rotated. Normal vectors are covariant, so $$\mathbf N_W = (\mathtt R^{-1})^{-T}\mathbf N_C = \mathtt R^T\mathbf N_C = \left[\begin{array}{rrr}0&0&1\\-1&0&0\\0&-1&0\end{array}\right]\begin{bmatrix}1\\2\\1\end{bmatrix} = \left[\begin{array}{r}1\\-1\\-2\end{array}\right],$$ giving for the world-coordinate equation of the plane $$(1,-1,-2)\cdot(x+9,y-2,z+1)=x-y-2z+9=0$$ as above. Comparing this to your calculation, you transformed the normal vector incorrectly as well.
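The point-normal route can be sketched the same way in plain Python. Again, `R` and `T` are the world-to-camera rotation and translation from this example, and the helper names are mine.

```python
R = [[0, -1,  0],
     [0,  0, -1],
     [1,  0,  0]]
T = [3, 3, 9]
R_t = [list(col) for col in zip(*R)]   # R^T is the inverse of the rotation R

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

p_c = [1, 4, 0]   # point on the plane, camera coordinates
n_c = [1, 2, 1]   # plane normal, camera coordinates

# Points: undo the translation, then the rotation (P_W = R^T (P_C - T))
p_w = matvec(R_t, [pc - t for pc, t in zip(p_c, T)])   # [-9, 2, -1]
# Normals: rotation only
n_w = matvec(R_t, n_c)                                  # [1, -1, -2]
# Plane N_W . (X - P_W) = 0 has constant term d = -N_W . P_W
d_w = -dot(n_w, p_w)                                    # 9
print(p_w, n_w, d_w)
```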
We can check distances, as you suggest: The distance of the plane from the camera (camera-coordinate origin) is $${[1,2,1]\cdot[1,4,0]\over\|[1,2,1]\|} = {9\over\sqrt6}.$$ The world coordinates of the camera are the last column of the camera-to-world matrix, and the world-coordinate distance of this point from the plane is $${|[1,-1,-2]\cdot[-9,3,3]+9|\over\|[1,-1,-2]\|} = {9\over\sqrt6}.$$ You can also check that this is indeed the correct plane by transforming a few points on it to world coordinates and then plugging those coordinates into its world-coordinate equation.
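The two distance computations can be reproduced with a short sketch in plain Python; `plane_point_distance` is a hypothetical helper of my own that takes a plane as $(a,b,c,d)$.

```python
import math

def plane_point_distance(plane, point):
    """Distance from a 3D point to the plane ax + by + cz + d = 0."""
    a, b, c, d = plane
    x, y, z = point
    return abs(a * x + b * y + c * z + d) / math.sqrt(a * a + b * b + c * c)

# Camera frame: the camera sits at the origin
d_cam = plane_point_distance((1, 2, 1, -9), (0, 0, 0))

# World frame: the camera sits at -R^T T = (-9, 3, 3)
d_world = plane_point_distance((1, -1, -2, 9), (-9, 3, 3))

print(d_cam, d_world)  # both equal 9 / sqrt(6), roughly 3.674
```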
Best Answer
I'm on the right track now. I did as I suggested earlier:
but I've also applied the transform to the camera's up-vector to get it right. Result: for any kind of translation or rotation applied to the object, the camera moves properly (Yeah!). The only problem that remains is when I scale the object: if I shrink it, it gets smaller in the camera view after the transform is applied (and vice versa).