I can answer this:
> What is the affine transformation converting world coordinates to camera coordinates? (camera world coordinates: $c=(c_x,c_y,c_z)^\top$; visual center world coordinates: $v=(v_x,v_y,v_z)^\top$)
I'm assuming the traditional camera image coordinates (before projection) having $z$ drilling "into" the image, $x$ pointing from left to right, and $y$ pointing downward.
Now let's track how the axes must be rotated without translation:
1. the new $z$ axis ($z'$) will point along $v-c$.
1. the new $x$ axis ($x'$) is perpendicular to $z$ and $z'$.
1. the new $y$ axis ($y'$) is perpendicular to $x'$ and $z'$.
You can find three vectors that point along the new axes in world coordinates, normalize them, then put them in the rows of a $3\times 3$ matrix $R$: this converts world coordinates to rotated camera orientation.
Finally, if you know the translation $t$ in world coordinates (here $t=(-10,-10,-10)^\top$, which moves the camera's world position to the origin), then the translation in camera coordinates is $t'=Rt$.
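A minimal sketch of this recipe, assuming NumPy (the function name and the choice of world $z$ as the reference direction are mine):

```python
import numpy as np

def world_to_camera_rotation(c, v):
    """Rotation R and translation t' taking world coordinates to camera
    coordinates, for a camera at c looking at v (column-vector convention)."""
    z_new = v - c
    z_new = z_new / np.linalg.norm(z_new)               # camera looks along +z'
    x_new = np.cross(z_new, np.array([0.0, 0.0, 1.0]))  # perpendicular to z and z'
    x_new = x_new / np.linalg.norm(x_new)                # degenerate if the camera looks straight along world z
    y_new = np.cross(z_new, x_new)                       # perpendicular to x' and z'
    R = np.vstack([x_new, y_new, z_new])                 # rows are the new axes
    t_prime = R @ (-c)                                   # t' = Rt with t = -c
    return R, t_prime

R, t_prime = world_to_camera_rotation(np.array([10.0, 10.0, 10.0]), np.zeros(3))
```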
Let's actually carry this out for your example, working out a triad of orthogonal vectors:
$z'=(-1,-1,-1)$, pointing in the direction the camera must face.
$x'=z'\times z=(-1,1,0)^\top$
$y'=z'\times x'=(1,1,-2)^\top$
Normalizing these and using them as the rows of a matrix you get:
$$
R=\frac{1}{\sqrt{6}}\begin{bmatrix}
-\sqrt{3}&\sqrt{3}&0\\
1&1&-2\\
-\sqrt{2}&-\sqrt{2}&-\sqrt{2}
\end{bmatrix}
$$
Then $t'=Rt=(0,0,10\sqrt{3})$.
Notice that the angle of declination is an odd angle near $35^\circ$ rather than exactly $45^\circ$: it is the angle between $(1,1,0)$ and $(1,1,1)$, namely $\arccos(2/\sqrt{6})\approx 35.26^\circ$. (I had a hard time seeing this at first, but if you draw a cube and check that angle you'll see what I mean.)
Now you've converted world coordinates to a rotated frame that is aligned with your camera's frame but differs from it by a translation.
This gives you the resulting affine transformation $\begin{bmatrix}R&t'\\0_{1\times 3}&1\end{bmatrix}$
which carries world coordinates to camera coordinates.
As a sanity check, you can confirm that the world's origin maps to camera $(0,0,10\sqrt{3})^\top$ and that the world camera location $(10,10,10)$ now maps to the camera's origin. A third check of your choice should be sufficient to convince you this is the right $R$ and $t'$.
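In code, those checks might look like this (a sketch, assuming NumPy):

```python
import numpy as np

R = np.array([[-np.sqrt(3),  np.sqrt(3),  0.0],
              [ 1.0,         1.0,        -2.0],
              [-np.sqrt(2), -np.sqrt(2), -np.sqrt(2)]]) / np.sqrt(6)
t_prime = np.array([0.0, 0.0, 10.0 * np.sqrt(3)])

print(R @ np.zeros(3) + t_prime)                    # world origin -> (0, 0, 10*sqrt(3))
print(R @ np.array([10.0, 10.0, 10.0]) + t_prime)   # camera position -> (0, 0, 0)
```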
One caveat: I'm not 100% sure the cross product $z'\times z$ is always taken in this order. I picked this order here because it gave the right orientation for $x'$ and $y'$ in the end. Hopefully that is all consistent, but there may be some sign ambiguity after all.
> The second question is how to construct the "UP" vector.
I don't understand what you are asking. If you mean the camera coordinates for the direction of the world $z$-axis, that would just be $R(0,0,1)^\top$ (a direction transforms by the rotation alone; you would only add $t'$ when mapping a point).
> Finally, I will have to rotate the camera as well from "landscape" to "portrait" orientation.
I'm interpreting this to mean that you'd want to rotate the image plane so that the $y$-axis is horizontal, which can be done with a $\pi/2$ rotation in either direction around the camera $z$-axis. The rotation is simply:
$$U=
\begin{bmatrix}
0&-1&0\\
1&0&0\\
0&0&1\end{bmatrix}
$$
$U$ gives the rotation in the clockwise direction around the $z$ axis (which would look to be counterclockwise if you are looking up the $z$ axis into the picture) and $U^\top$ would give the rotation in the other direction.
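If you want to fold the portrait rotation into the world-to-camera map, the combined transform is $x \mapsto U(Rx+t') = (UR)x + Ut'$; a short sketch, assuming NumPy:

```python
import numpy as np

# 90-degree rotation about the camera z-axis (landscape -> portrait).
U = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])

def portrait_world_to_camera(R, t_prime):
    """Compose the world-to-camera transform (R, t') with the portrait rotation U."""
    return U @ R, U @ t_prime
```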
Instead of trying to debug your code and verify all of those back-mappings, I’m going to describe a way for you to check your own results objectively. If you don’t have a good idea of what the results should be, then I don’t really see how you can tell whether or not they’re “reasonable.”
Assuming that there’s no skew in the camera, the matrix $K$ has the form $$K=\begin{bmatrix}s_x&0&c_x\\0&s_y&c_y\\0&0&1\end{bmatrix}.$$ The values along the diagonal are $x$- and $y$- scale factors, and $(c_x,c_y)$ are the image coordinates of the camera’s axis, which is assumed to be normal to the image plane ($z=1$ by convention). So, in this coordinate system, the direction vector for a point $(x,y)$ in the image is $(x-c_x,y-c_y,1)$ and to get the corresponding direction vector in the (external) camera coordinate system, divide by the respective scale factors: $((x-c_x)/s_x,(y-c_y)/s_y,1)$. This is exactly what you get by applying $K^{-1}$, which is easily found to be $$K^{-1}=\begin{bmatrix}1/s_x&0&-c_x/s_x\\0&1/s_y&-c_y/s_y\\0&0&1\end{bmatrix}$$ using your favorite method. Finally, to transform this vector into world coordinates, apply $R^{-1}$, which is just $R$’s transpose since it’s a rotation. The resulting ray, of course, originates from the camera’s position in world coordinates. It should be a simple matter to code up this cascade explicitly, after which you can compare it to the results that you get by any other method that you’re experimenting with.
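Here is what that cascade might look like in code (a sketch, assuming NumPy; the function name is mine):

```python
import numpy as np

def backproject_direction(K, R, x, y):
    """World-coordinate direction of the ray through image point (x, y).
    K is the 3x3 intrinsic matrix, R the world-to-camera rotation."""
    d_cam = np.linalg.inv(K) @ np.array([x, y, 1.0])  # ((x-cx)/sx, (y-cy)/sy, 1)
    return R.T @ d_cam                                # R^{-1} = R^T for a rotation
```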
In this specific case, $R$ is just the identity matrix, so there’s nothing else to do once you’ve got the direction vector in camera coordinates. We have $$s_x=282.363047 \\ s_y=280.10715905 \\ c_x=166.21515189 \\ c_y=108.05494375$$ so the internal-to-external transformation is approximately $$\begin{align}x&\to x/282.363-0.589 \\ y&\to y/280.107-0.386.\end{align}$$ Applying this to the point $(20,20)$ from your previous question gives $(-0.518,-0.314,1)$, which agrees with the direction vector computed there. Taking $(10,10)$ instead results in $(-0.553,-0.350,1)$, which you can then check against whatever your code produced, and so on.
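As a quick numeric check of those figures ($R$ is the identity here, so $K$ alone suffices):

```python
import numpy as np

K = np.array([[282.363047,     0.0,          166.21515189],
              [  0.0,          280.10715905, 108.05494375],
              [  0.0,            0.0,          1.0       ]])

for x, y in [(20.0, 20.0), (10.0, 10.0)]:
    print(np.linalg.inv(K) @ np.array([x, y, 1.0]))
    # ~(-0.518, -0.314, 1) and ~(-0.553, -0.350, 1)
```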
All that aside, there’s a gotcha when using the pseudoinverse method described by Zisserman. He gives the following equation for the back-mapped ray: $$\mathbf X(\lambda)=P^+\mathbf x+\lambda\mathbf C.$$ Note that the parameter is a coefficient of $\mathbf C$, the camera’s position in world coordinates, not of the result of back-mapping the image point $\mathbf x$. Converted into Cartesian coordinates, there’s a factor of $\lambda+k$ (for some constant $k$) in the denominator, so this isn’t a simple linear parameterization. To extract a direction vector from this, you’ll need to convert $P^+\mathbf x$ into Cartesian coordinates and then subtract $\mathbf C$.
To illustrate, applying $P^+$ to $(10,10,1)$ produces $(-0.553,-0.175,1.0,-0.175)$, so the ray is $(-0.553,\,-\lambda-0.175,\,1.0,\,\lambda-0.175)$. In Cartesian coordinates, the back-mapped point is $(3.161,1.0,-5.713)$ and subtracting the camera’s position gives $(3.161,2.0,-5.713)$. To compare this to the known result above, divide by the third coordinate: $(-0.553,-0.350,1.0)$, which agrees.
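A sketch of that extraction, assuming NumPy; `P` is the camera matrix, `x_img` the homogeneous image point, and `C` the camera’s position in (Cartesian) world coordinates:

```python
import numpy as np

def ray_direction_from_pinv(P, x_img, C):
    """Direction of the back-projected ray using the pseudoinverse method."""
    X = np.linalg.pinv(P) @ x_img   # homogeneous world point on the ray
    X = X[:3] / X[3]                # convert to Cartesian coordinates
    return X - C                    # subtract the camera centre to get a direction
```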
Update 2018.07.31: For finite cameras, which is what you’re dealing with, Zisserman suggests a more convenient back-projection in the very next paragraph in equation (6.14). The underlying idea is that you decompose the camera matrix as $P = \left[M\mid\mathbf p_4\right]$ so that the back-projection of an image point $\mathbf x$ intersects the plane at infinity at $\mathbf D = ((M^{-1}\mathbf x)^T,0)^T$. This gives you the direction vector of the back-projected ray in world coordinates, and, of course, the camera center is at $\tilde{\mathbf C}=-M^{-1}\mathbf p_4$, i.e., the back-projected ray is $$\tilde{\mathbf X}(\mu) = -M^{-1}\mathbf p_4+\mu M^{-1}\mathbf x = M^{-1}(\mu\mathbf x-\mathbf p_4).$$ This parameterization of the ray doesn’t suffer from the non-linearity mentioned above.
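In code, that decomposition might read (a sketch, assuming NumPy and a $3\times 4$ camera matrix `P`):

```python
import numpy as np

def finite_camera_ray(P, x_img):
    """Back-project homogeneous image point x_img through a finite camera
    P = [M | p4]; returns the camera centre and the ray direction in world
    coordinates, as in equation (6.14)."""
    M, p4 = P[:, :3], P[:, 3]
    M_inv = np.linalg.inv(M)
    return -M_inv @ p4, M_inv @ x_img   # centre C~, direction D
```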
I was able to formulate the problem I was trying to solve (I will update the question's details) and came up with a working solution. Before describing my approach, I would like to clear up my confusion regarding the camera extrinsics and the camera pose.
The camera pose is the camera's location/orientation w.r.t. the world, whereas the extrinsics are the inverse of the camera pose transformation. What COLMAP provides is the camera pose for each of the images taken by the camera. To project a 3D point onto the camera plane, one uses the camera extrinsics, whereas to project pixels out into the world, one needs the camera pose. Here is how I approached the problem:
1. I manually (later automated) selected the pixels of interest and stored them in a list, List2D.
2. Augmented List2D (an n×2 array) for homogeneous coordinates (an n×3 array).
3. Transformed each pixel coordinate p into the camera frame using the inverse of the camera intrinsic matrix, K.
4. Transformed the point pc from the camera frame to the world frame using the camera pose, T = [R | t].
5. Transformed the camera origin, cam_origin = (0,0,0), from the camera frame to the world frame.
6. Got a unit vector v along the ray passing from the camera origin through the transformed point. The 3D point of interest lies along this ray, so we can parametrize the ray and move along it.
Code for implementing all of the steps:
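A minimal sketch of such an implementation, assuming NumPy; the function name Get3Dfrom2D, the variable names, and the fixed scale d follow the steps above, with R, t taken as the camera-to-world pose:

```python
import numpy as np

def Get3Dfrom2D(List2D, K, R, t, d):
    """Back-project pixel coordinates to 3D world points a distance d along each ray.

    List2D : n x 2 array of pixel coordinates (step 1)
    K      : 3 x 3 camera intrinsic matrix
    R, t   : camera-to-world pose (rotation and translation)
    d      : scale factor along each ray (stand-in for true per-pixel depth)
    """
    K_inv = np.linalg.inv(K)
    cam_origin_world = R @ np.zeros(3) + t          # step 5: camera origin in world frame
    List3D = []
    for p in np.asarray(List2D, dtype=float):
        p_hom = np.array([p[0], p[1], 1.0])         # step 2: homogeneous pixel
        pc = K_inv @ p_hom                          # step 3: pixel -> camera frame
        pw = R @ pc + t                             # step 4: camera frame -> world frame
        v = pw - cam_origin_world                   # step 6: ray direction
        v = v / np.linalg.norm(v)
        List3D.append(cam_origin_world + d * v)     # move a distance d along the ray
    return np.array(List3D)
```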
In the code above, a constant scaling factor is used for every pixel. This is not strictly correct, but it worked well for my case of recovering the structure of the object of interest, though the result is warped because points that should be farther from the camera are scaled by the same factor. For exact 3D model recovery, the true depth of each pixel is required; instead of the same scale factor (d in the code), the depth value for the corresponding pixel can be used.
As I continue to explore this topic in more detail, I will keep the answer updated with the results from this approach.
Update (COLMAP GitHub issue: https://github.com/colmap/colmap/issues/1476):
COLMAP's camera poses transform from world to camera coordinates, not the other way around. This leads to correctly oriented point clouds, but I still see some offsets. Attached are images with the results from my previous understanding on the left and the current one on the right.
Final Update: In the answer above, I mentioned that COLMAP provides camera poses for each image, and that was the source of the error, as I went with the traditional definition of the term. I opened an issue on GitHub and learned that COLMAP provides camera poses that transform from world to camera coordinates. So, instead of passing [R, t] directly from COLMAP, I now pass [R.T, -R.T@t] to my Get3Dfrom2D() function and everything aligns.
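In terms of the Get3Dfrom2D sketch above, the corrected call simply inverts the rotation and translation read from COLMAP first:

```python
# R, t as read from COLMAP (world -> camera); invert to get the camera-to-world pose.
List3D = Get3Dfrom2D(List2D, K, R.T, -R.T @ t, d)
```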
There can be further improvements, such as filtering the point cloud, reading the refined intrinsics for each camera, etc.