[Math] Coordinate system transformation: from world coordinates to camera coordinates


What I know

Say I have several coordinates in the world (Cartesian) coordinate system and their corresponding coordinates in the camera/local (Cartesian) coordinate system. In order to map new points from world to local, I need to derive the translation vector $T$ and rotation matrix $R$ from the given point pairs. To this end, I could form a simple linear system $Ax=b$ to solve for entries of $T$ and $R$.

$$\overbrace{\begin{pmatrix}X_1 & Y_1 & Z_1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\0 & 0 & 0 & X_1 & Y_1 & Z_1 & 0 & 0 & 0 & 0 & 1 & 0\\0 & 0 & 0 & 0 & 0 & 0 & X_1 & Y_1 & Z_1 & 0 & 0 & 1\\&&&&&\vdots\\X_N & Y_N & Z_N & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\0 & 0 & 0 & X_N & Y_N & Z_N & 0 & 0 & 0 & 0 & 1 & 0\\0 & 0 & 0 & 0 & 0 & 0 & X_N & Y_N & Z_N & 0 & 0 & 1\end{pmatrix}}^A\overbrace{\begin{pmatrix}R_{11}\\R_{12}\\R_{13}\\\vdots\\T_1\\T_2\\T_3\end{pmatrix}}^x=\overbrace{\begin{pmatrix}X_1'\\Y_1'\\Z_1'\\\vdots\\X_N'\\Y_N'\\Z_N'\end{pmatrix}}^b
$$
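The system above can be solved directly with a least-squares routine. Here is a minimal NumPy sketch (the helper name `fit_rigid_transform` and the point pairs are illustrative, not an established API):

```python
import numpy as np

def fit_rigid_transform(world_pts, local_pts):
    """Solve A x = b for the 9 rotation entries and 3 translation entries."""
    n = len(world_pts)
    A = np.zeros((3 * n, 12))
    b = np.asarray(local_pts, dtype=float).reshape(-1)
    for i, (X, Y, Z) in enumerate(world_pts):
        for row in range(3):
            A[3 * i + row, 3 * row:3 * row + 3] = (X, Y, Z)  # entries of R
            A[3 * i + row, 9 + row] = 1.0                    # entry of T
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    R = x[:9].reshape(3, 3)
    T = x[9:]
    return R, T
```

With at least four point pairs in general (non-coplanar) position, the system is fully determined. Note that plain least squares does not force the recovered $R$ to be exactly orthogonal, so with noisy data you may want to re-orthogonalize it afterwards.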

What I don't know

Now, instead of the point pairs, I am given the camera's coordinates and rotations w.r.t. the world coordinate system. At first sight, the problem seems easier: I can just apply the same translation [given by $(x_c-0, y_c-0, z_c-0)$, where $_c$ stands for camera] and the same rotations [given by $(\theta_x, \theta_y, \theta_z)$] to any new point I want to map.

But after thinking for a while, I am confused about the order of translation and rotation. Because the (given) origin is such a special point, I can translate first and then rotate, or the other way around, and get the same mapping from the world origin to the camera origin. But what about other points? Which operation should I perform first?

Best Answer

See if this reasoning helps. In 3D computer graphics, everything is rendered from the viewpoint of an imaginary camera, or viewer. The 3D content being viewed is represented by coordinates in a world coordinate system, and the camera's location and orientation are also specified in world coordinates.

The viewing transform (the guts of all 3D graphics) consists of two 3D transforms followed by a 3D-to-2D projection.

Coordinate translate: World to Eye. First, all world points must be put into local viewing coordinates, where the camera defines the origin. This is done as a coordinate translation: the camera's position coordinates are subtracted from each world point's coordinates. The result is that all points now revolve around an origin located at the camera. This is how you want it, since if the camera (viewer) pans or tilts, points stay the same distance away, only changing direction.
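A minimal sketch of this translation step (NumPy, with made-up coordinates):

```python
import numpy as np

cam_pos = np.array([2.0, 5.0, 1.0])          # camera position in world coords
world_pts = np.array([[3.0, 7.0, 1.0],
                      [2.0, 5.0, 1.0]])      # second point sits at the camera

# Coordinate translation: subtract the camera position from every point,
# so the camera becomes the origin of the local frame.
translated = world_pts - cam_pos
```

After this step the point coincident with the camera lands at the local origin, as expected.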

Coordinate rotate: Local world to Eye gaze. In the most general formulation, the camera can adopt any possible 3D orientation. Not only can the camera's center axis point in any 3D direction; held in that pointing direction, the camera can also be spun a full 360 degrees around that axis. (I'm gonna hurl!)

Therefore, a general-purpose way to represent the orientation of the camera (viewer) is a 3x3 rotation matrix (avoiding the use of angles). The camera has its own local x-y-z axes. Some camera definitions have it looking out along the local x-axis (others along the local y-axis), with z pointing out the top of the camera. In any event, the orientation of the camera in world coordinates is represented by three direction vectors comprising the rows of the 3x3 matrix:

3x3 matrix R =

[ camera_x_axis_dir
  camera_y_axis_dir
  camera_z_axis_dir ]

Visually, you can imagine what these numbers mean. Imagine the camera coming with its 3 axes as 1-meter-long sticks that point out from the camera, forming perfect right angles: x sticks out to the right, y sticks out forward, and z sticks out upward. Wherever the camera is set up, at whatever orientation, slide the camera over to the world origin without upsetting its orientation. Construct a unit sphere about the origin. Where does the cam's x-axis stick touch the unit sphere? Those [ x y z ] world coordinates are the numbers in camera_x_axis_dir. Same for the y and z sticks. All camera orientations can be represented numerically in this simple manner. This is a 1:1 representation: there is exactly one set of 9 numbers for each unique camera orientation. This is better for software than using angles to represent orientation, since angles are not 1:1. (The information about 3D orientation is said to be overcompressed in roll-pitch-yaw angle form.)
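As a sketch, here is how such a matrix might be assembled from the three axis "sticks" (made-up orientation; this assumes the convention where the camera's axis directions, expressed in world coordinates, form the rows of R):

```python
import numpy as np

# Camera axis "sticks" expressed in world coordinates (made-up example:
# the camera is yawed 90 degrees, so its forward axis is world +x).
cam_x = np.array([0.0, -1.0, 0.0])   # right
cam_y = np.array([1.0,  0.0, 0.0])   # forward (the optical axis here)
cam_z = np.array([0.0,  0.0, 1.0])   # up

# Stack the axis directions as the rows of R.
R = np.vstack([cam_x, cam_y, cam_z])

# A valid orientation matrix is orthonormal: R @ R.T is the identity.
assert np.allclose(R @ R.T, np.eye(3))
```

With this convention, multiplying a world direction by R expresses it in the camera's own axes.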

Once you know the camera orientation, the next step in rendering a 3D point is to coordinate-rotate the point so that the apparent location of the point is expressed in camera-oriented coordinates. If I'm looking directly at a 3D point, after coordinate rotation the point will fall on the camera's local y-axis (x and z coords will be zero). If the point is directly behind the camera, its transformed coordinates put it on the negative y-axis. If the camera is looking at the world upside-down, the coordinate rotate will flip the world upside down.
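Putting the two steps together (a minimal sketch, assuming the convention above where the rows of R are the camera's axis directions in world coordinates; `world_to_camera` is an illustrative name):

```python
import numpy as np

def world_to_camera(p_world, cam_pos, R):
    """Express a world point in camera coordinates: translate first,
    then rotate.  R's rows are the camera's x/y/z axis directions
    expressed in world coordinates."""
    return R @ (np.asarray(p_world, dtype=float) - cam_pos)
```

This also answers the order question from the post: subtract the camera position first, then apply the rotation. A point the camera is looking straight at ends up on the local +y axis, as described above.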

The last step (assuming the optical axis is +y) is to place the viewing screen directly in front of the viewer, perpendicular to the optical axis, and project points onto it. The ray going out from the camera through the point is intersected (conceptually) with the view screen, and the intersection coordinates (x, z) say where to draw the 3D point on the 2D view screen. (Da Vinci was the first tech-artist to popularize this model of the projective transform.)
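The projection step reduces to similar triangles. A hedged sketch (screen placed at a made-up distance `screen_dist` along the +y optical axis):

```python
def project(p_cam, screen_dist=1.0):
    """Perspective projection onto a screen at y = screen_dist, with the
    optical axis along +y.  Similar triangles: scale x and z by
    screen_dist / y."""
    x, y, z = p_cam
    if y <= 0:
        return None  # behind the camera; removed by clipping
    return (screen_dist * x / y, screen_dist * z / y)
```

Doubling a point's distance along +y halves its screen offsets, which is exactly the familiar perspective foreshortening.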

Not mentioned above, but there is also some clipping of 3D content that is unviewable because it lies behind the camera. The rotate transform makes this decision easy: all 3D content with negative y-coordinates (in local, rotated camera coordinates) is in the viewer's backspace.
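In rotated camera coordinates this clipping is a one-line filter (NumPy sketch with made-up points):

```python
import numpy as np

pts_cam = np.array([[1.0,  2.0, 0.5],
                    [0.0, -3.0, 1.0]])   # second point is behind the camera

# Keep only points in front of the camera (positive y along the optical axis).
visible = pts_cam[pts_cam[:, 1] > 0]
```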

I hope this clarifies what the 3D viewing transform steps are, why they are needed, and the order in which they are performed.
