[Math] How to find an all-in-one 2D to 3D Transformation Matrix for perspective projection, rotation, and translation

3dtransformation

I have read Finding a 3D transformation matrix based on the 2D coordinates but I think my situation is different because I think I need a 4×3 matrix, not a 3×3 matrix. I'm not sure but this might be because I have rotation and translation in addition to just the perspective transformation.

Here is the setup:

suppose you have several 2D points in an image:
(x1,y1)
(x2,y2)
(x3,y3)
(x4,y4)

suppose you also have several corresponding 3D points on an arbitrary plane:
(X1,Y1,Z1)
(X2,Y2,Z2)
(X3,Y3,Z3)
(X4,Y4,Z4)

to transform from 2D to 3D using homogeneous coordinates, we can use

(X,Y,Z,W) = M*(x,y,1). Here M must be a 4×3 matrix

So a 2D point in homogeneous coordinates gets transformed into a 3D point in homogeneous coordinates.

Then, I could divide (X,Y,Z,W) by W to get (X,Y,Z,1), which is a form that I can read out the true X,Y,Z values in "regular" coordinates.

Now, here is the problem: I don't know what W is for any of my (X,Y,Z) points. (If I did know each point's W, I think there is a standard linear algebra way of finding M.)

So to find M, I multiply things out like the following:

X = M11*x + M12*y + M13*1

Y = M21*x + M22*y + M23*1

Z = M31*x + M32*y + M33*1

W = M41*x + M42*y + M43*1

but these X,Y,Z,W are the homogeneous coords, so to get the "real" X,Y,Z coords I divide by W:

X = (M11*x + M12*y + M13*1) / (M41*x + M42*y + M43*1)

Y = (M21*x + M22*y + M23*1) / (M41*x + M42*y + M43*1)

Z = (M31*x + M32*y + M33*1) / (M41*x + M42*y + M43*1)

also, I can get rid of one parameter by multiplying the numerator and denominator of each equation by 1/M43 (assuming M43 is nonzero), which leaves the ratios unchanged. Then I can rename the resulting ratios of parameters. I'm left with:

X = (a1*x + a2*y + a3*1) / (a10*x + a11*y + 1)

Y = (a4*x + a5*y + a6*1) / (a10*x + a11*y + 1)

Z = (a7*x + a8*y + a9*1) / (a10*x + a11*y + 1)
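
As a quick sanity check on that normalization, here is a minimal sketch in Python with NumPy. The matrix values and the helper name `to_3d` are made up for illustration; the only point is that M and M/M43 map an image point to the same (X,Y,Z), so one of the twelve entries can be fixed to 1.

```python
import numpy as np

def to_3d(M, x, y):
    """Apply a 4x3 matrix to (x, y, 1) and divide by W (hypothetical helper)."""
    X, Y, Z, W = M @ np.array([x, y, 1.0])
    return np.array([X, Y, Z]) / W

# Made-up example matrix with M43 = 2 (bottom-right entry).
M = np.array([[2.0, 0.4, 1.0],
              [0.2, 3.0, -0.6],
              [0.8, -0.4, 4.0],
              [0.1, 0.04, 2.0]])

# Dividing every entry by M43 gives the a1..a11 parameterization
# with a 1 in the bottom-right corner.
M_normalized = M / M[3, 2]

p = to_3d(M, 1.5, -0.5)
p_norm = to_3d(M_normalized, 1.5, -0.5)
print(np.allclose(p, p_norm))
```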

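finally I plug in all the (X,Y,Z) and (x,y) values that I have into multiple instances of these equations (each correspondence gives three equations, so at least four correspondences are needed for the 11 unknowns) and algebraically re-arrange everything to get the classic A=Bx form, where x is the vector of unknown a's (a1 … a11).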

Once I have a1 through a11, I could go back and work out what the original components of M were. Either way I can now project points from 2D to 3D using perspective transformation even if there is rotation or translation.
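
The procedure above could be sketched roughly as follows in Python with NumPy. This is only an illustration under some assumptions: the function names `fit_plane_map` and `apply_plane_map` are hypothetical, M43 is assumed nonzero so it can be fixed to 1, there are at least four correspondences with no three image points collinear, and the stacked system is solved in the least-squares sense with `np.linalg.lstsq`. Each equation, e.g. X*(a10*x + a11*y + 1) = a1*x + a2*y + a3, is rearranged to a1*x + a2*y + a3 - a10*x*X - a11*y*X = X, which is linear in the a's.

```python
import numpy as np

def fit_plane_map(pts2d, pts3d):
    """Estimate the 4x3 matrix M (with M43 fixed to 1) from 2D-3D pairs."""
    pts2d = np.asarray(pts2d, dtype=float)
    pts3d = np.asarray(pts3d, dtype=float)
    n = len(pts2d)
    A = np.zeros((3 * n, 11))   # three equations per correspondence
    b = np.zeros(3 * n)
    for i, ((x, y), P) in enumerate(zip(pts2d, pts3d)):
        for k in range(3):      # k = 0, 1, 2 selects X, Y, Z
            row = 3 * i + k
            A[row, 3 * k:3 * k + 3] = [x, y, 1.0]          # numerator terms
            A[row, 9:11] = [-x * P[k], -y * P[k]]          # a10, a11 terms
            b[row] = P[k]
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Rows are [a1 a2 a3], [a4 a5 a6], [a7 a8 a9], [a10 a11 1].
    return np.append(a, 1.0).reshape(4, 3)

def apply_plane_map(M, pt2d):
    """Map an image point to 3D: multiply by M, then divide by W."""
    X, Y, Z, W = M @ np.array([pt2d[0], pt2d[1], 1.0])
    return np.array([X, Y, Z]) / W
```

With noisy measurements, supplying more than four correspondences lets the least-squares solve average out some of the error.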

My question is whether this is the best way to find this kind of general 2D to 3D perspective transformation?

Best Answer

It looks like you are trying to solve for a map from 2D points to 3D points, so I'm a bit confused... a projection transformation would map the 3D points to the 2D points (and the inverse is, of course, impossible in general, since each point on the projection plane could lie anywhere on a ray from the camera through the plane).

Next, notice there is no difference between first transforming an object in 3D, and then projecting through a fixed camera, versus leaving the object in place and projecting through a camera of unknown position and orientation. Here I'll take the former approach.

We have some points in 3D and apply an affine transformation to them, then project through a camera at the origin looking down the $z$ axis, with the projection plane passing through $z=1$. This makes the projection matrix $P: (x,y,z) \to (u,v,w)$ easy: it is just the identity.

Before we project we apply some affine transformation $Mq + t$ to the 3D points $q$. Notice that we do not constrain $M$ to only rotate and scale here: to do so we would need to add additional (nonlinear) constraints on the coefficients of $M$. The short of it is that you will need to supply more than the theoretical minimum of four corresponding points to determine the map (and you will get shear if your corresponding points did not come from a bona fide Euclidean motion + projection).

So now the total map can be written as

$$\left[\begin{array}{c}u\\v\\w\end{array}\right] = \left[\begin{array}{cccc}m_{11} & m_{12} & m_{13} & t_x\\m_{21} & m_{22} & m_{23} & t_y\\m_{31} & m_{32} & m_{33} & t_z\end{array}\right]\left[\begin{array}{c}x\\y\\z\\1\end{array}\right].$$

Since $(u,v,w) \sim (u/w,v/w,1)$, this map is scale-invariant, so we might as well set $m_{33} = 1$. We can also write it in block form (which will prove useful) as

$$\left[\begin{array}{c}u\\v\\w\end{array}\right] = \left[\begin{array}{c}N_{uv}\\N_w\end{array}\right]\left[\begin{array}{c}x\\y\\z\\1\end{array}\right].$$

Like you say, we only know $u/w$ and $v/w$ for the corresponding points, not $u,v,w$. Well,

$$\left[\begin{array}{c}u/w\\v/w\end{array}\right] = \frac{N_{uv}\left[\begin{array}{c}x\\y\\z\\1\end{array}\right]}{N_w \left[\begin{array}{c}x\\y\\z\\1\end{array}\right]},$$

where the denominator is the scalar $w$, or, multiplying through by it,

$$\left(N_w \left[\begin{array}{c}x\\y\\z\\1\end{array}\right]\right)\left[\begin{array}{c}u/w\\v/w\end{array}\right] = N_{uv}\left[\begin{array}{c}x\\y\\z\\1\end{array}\right]$$

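which, for each corresponding point, is a system of two linear equations in the 11 unknown entries of $N$. Plugging in $5\frac{1}{2}$ corresponding points (in practice, six or more, solved in the least-squares sense) will let you solve for $N$.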
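
A minimal sketch of that solve, in Python with NumPy, might look like the following. The function names `fit_projection` and `project` are hypothetical, the normalization $m_{33} = 1$ is taken from above, and the code assumes at least six correspondences whose 3D points are not all coplanar (otherwise the linear system is rank-deficient). Writing $U = u/w$ and $V = v/w$, the first equation for a point becomes $m_{11}x + m_{12}y + m_{13}z + t_x - U(m_{31}x + m_{32}y + t_z) = Uz$, and similarly for $V$.

```python
import numpy as np

def fit_projection(pts3d, pts2d):
    """Estimate the 3x4 matrix N (with m33 fixed to 1) from 3D-2D pairs."""
    pts3d = np.asarray(pts3d, dtype=float)
    pts2d = np.asarray(pts2d, dtype=float)
    n = len(pts3d)
    # Unknown order: m11 m12 m13 tx | m21 m22 m23 ty | m31 m32 tz
    A = np.zeros((2 * n, 11))
    b = np.zeros(2 * n)
    for i, ((x, y, z), (U, V)) in enumerate(zip(pts3d, pts2d)):
        q = [x, y, z, 1.0]
        A[2 * i, 0:4] = q                        # first row of N_uv
        A[2 * i, 8:11] = [-U * x, -U * y, -U]    # N_w terms moved to the left
        b[2 * i] = U * z                         # the m33 = 1 term stays right
        A[2 * i + 1, 4:8] = q                    # second row of N_uv
        A[2 * i + 1, 8:11] = [-V * x, -V * y, -V]
        b[2 * i + 1] = V * z
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    N = np.empty((3, 4))
    N[0] = sol[0:4]
    N[1] = sol[4:8]
    N[2] = [sol[8], sol[9], 1.0, sol[10]]
    return N

def project(N, pt3d):
    """Project a 3D point: multiply by N, then divide by w."""
    u, v, w = N @ np.append(np.asarray(pt3d, dtype=float), 1.0)
    return np.array([u / w, v / w])
```

Using `lstsq` rather than an exact solve means the same code handles exactly $5\frac{1}{2}$ points' worth of equations or any larger, overdetermined set.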