Back projecting a 2D pixel from an image to its corresponding 3D point.

computational-geometry, computer-vision, geometry, linear-algebra, projective-geometry

I have been trying to understand how to project a 3D point to a 2D image and vice versa. I had images of a building taken by a drone, and I reconstructed the building using COLMAP (an SfM tool). For those who are not aware, COLMAP reconstructs the entire scene in 3D, producing a point cloud and the camera pose for each image in the dataset. So, after running it through COLMAP, I have the following things available to me:

  1. A scaled point cloud of the scene.
  2. Transformations (R, t) from the world coordinate system to each camera coordinate system.
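
For reference, if the poses come from COLMAP's text export (images.txt), each pose is stored as a quaternion (QW, QX, QY, QZ) and a translation (TX, TY, TZ). Below is a minimal sketch of turning one such line into (R, t), assuming SciPy is available; the function name and parsing details are illustrative only:

import numpy as np
from scipy.spatial.transform import Rotation

def pose_from_images_txt_line(line):
    # images.txt data line: IMAGE_ID QW QX QY QZ TX TY TZ CAMERA_ID NAME
    fields = line.split()
    qw, qx, qy, qz = map(float, fields[1:5])
    tx, ty, tz = map(float, fields[5:8])

    # SciPy expects quaternions in (x, y, z, w) order
    R = Rotation.from_quat([qx, qy, qz, qw]).as_matrix()  # 3x3 rotation
    t = np.array([[tx], [ty], [tz]])                      # 3x1 translation

    # As stated above, this (R, t) maps world coordinates to camera
    # coordinates: x_cam = R @ x_world + t
    return R, t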

Just to get more intuition about the process, I am trying to project a pixel back out to the world **(up to scale)** using the Camera Pose and the Camera Intrinsics. I just need a unit vector pointing in the right direction towards the 3D point, not the absolute 3D point coordinates.

Here is my current understanding of the approach:

R_inv = R.T
M_inv = [R_inv | -R_inv*t]
# also augment M_inv with the row [0 0 0 1] so its size is (4,4)
p = [u, v, 1]
p_camera = K_inv * p
p_camera_homogeneous = [p_camera[0], p_camera[1], p_camera[2], 1]
P = M_inv * p_camera_homogeneous

Here P = [X, Y, Z, W], so the point in 3D is (X/W, Y/W, Z/W). But when I plot the points in a 3D graph, I don't get the correct structure. I understand I am missing something fundamental in my approach, as I do not have a scale parameter in my equations either.
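
For concreteness, here is a direct NumPy transcription of the approach described above (a sketch only; it assumes K, R and t are NumPy arrays of shapes (3,3), (3,3) and (3,1), and the function name is illustrative):

import numpy as np

def backproject_pixel(u, v, K, R, t):
    # Inverse of the extrinsic matrix [R | t], augmented to 4x4
    M_inv = np.eye(4)
    M_inv[:3, :3] = R.T
    M_inv[:3, 3:] = -R.T @ t

    # Pixel in homogeneous coordinates, mapped through K^-1
    p = np.array([[u], [v], [1.0]])
    p_camera = np.linalg.inv(K) @ p

    # Augment to homogeneous coordinates and map to the world frame
    p_camera_h = np.vstack([p_camera, [[1.0]]])
    P = M_inv @ p_camera_h            # P = [X, Y, Z, W]
    return P[:3, 0] / P[3, 0]         # (X/W, Y/W, Z/W)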

Please let me know what I am doing wrong. I can share the code, but the problem is with my understanding of the concept.

Note: "x.T" represents the transpose of a vector x.

Best Answer

I was able to formulate the problem I was trying to solve (I will update the question's details) and came up with a working solution. Before describing my approach, I would like to clear up my confusion regarding the Camera Extrinsics and the Camera Pose.

The Camera Pose is the camera's location/orientation w.r.t. the world, whereas its Extrinsics are the inverse of the Camera Pose transformation. What COLMAP provides is the Camera Pose for each of the images taken by the camera. For projecting a 3D point into the camera plane, one would use the Camera Extrinsics, whereas for projecting pixels out into the world, one needs the Camera Pose. Here is how I approached the problem:
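
To make that distinction concrete, the two are related by a rigid-transform inversion. A minimal sketch (the function and variable names are illustrative, not part of the pipeline below):

import numpy as np

# If the Camera Pose maps camera coordinates to world coordinates,
#   x_world = R_pose @ x_cam + t_pose,
# then the Extrinsics are its inverse, mapping world to camera,
#   x_cam = R_ext @ x_world + t_ext.
def pose_to_extrinsics(R_pose, t_pose):
    return R_pose.T, -R_pose.T @ t_pose

# The same inversion works in the other direction as well
def extrinsics_to_pose(R_ext, t_ext):
    return R_ext.T, -R_ext.T @ t_ext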

  1. I manually (later automated) selected the pixels of interest and stored them in a list, List2D.

  2. Augmented List2D (an n x 2 array) with homogeneous coordinates (an n x 3 array).

  3. Transform each pixel coordinate, p, into the camera frame using the inverse of the camera intrinsic matrix, K.

    p_c = K^(-1) p

  4. Transform the point p_c from the camera frame to the world frame, using the Camera Pose, T = [R | t]

    p_w = T p_c

  5. Transform the camera origin, cam_origin = (0, 0, 0), from the camera frame to the world frame

    cam_origin_w = T cam_origin

  6. Get a unit vector, v, along the ray passing from the camera origin to the 3D point. Our 3D point of interest lies along this ray. To traverse the ray, we can parametrize the vector and move along it.

The code implementing all of the steps is:

import numpy as np

def Get3Dfrom2D(List2D, K, R, t, d=1.75):
    # List2D : n x 2 array of pixel locations in an image
    # K : Intrinsic matrix of the camera
    # R : Rotation matrix describing the rotation of the camera frame
    #     w.r.t. the world frame
    # t : Translation vector describing the translation of the camera frame
    #     w.r.t. the world frame
    # [R | t] combined is known as the Camera Pose.
    # d : constant distance along each back-projected ray

    List2D = np.array(List2D)
    List3D = []
    # t.shape = (3,1)

    for p in List2D:
        # Homogeneous pixel coordinate
        p = np.array([p[0], p[1], 1]).reshape(3, 1)
        # print("pixel: \n", p)

        # Transform pixel into the camera coordinate frame
        pc = np.linalg.inv(K) @ p
        # print("pc : \n", pc, pc.shape)

        # Transform pixel into the world coordinate frame
        pw = t + (R @ pc)
        # print("pw : \n", pw, t.shape, R.shape, pc.shape)

        # Transform camera origin into the world coordinate frame
        cam = np.zeros((3, 1))
        cam_world = t + R @ cam
        # print("cam_world : \n", cam_world)

        # Find a ray from the camera to the 3D point
        vector = pw - cam_world
        unit_vector = vector / np.linalg.norm(vector)
        # print("unit_vector : \n", unit_vector)

        # Point scaled along this ray
        p3D = cam_world + d * unit_vector
        # print("p3D : \n", p3D)
        List3D.append(p3D)

    return List3D

In the code above, a constant scaling factor is used for every pixel, which is not correct but worked well enough for my case of recovering the structure of the object of interest. The result was warped, though, since points that were supposed to be farther away from the camera were scaled by the same factor. For exact 3D model recovery, the true depth for each pixel is required; instead of the same scale factor (d in the code), the depth value for the corresponding pixel can be used.
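
As a rough sketch of that variation (Depths here is just an illustrative name for an n-element array of per-pixel depths, e.g. sampled from a depth map):

import numpy as np

def Get3Dfrom2D_with_depth(List2D, K, R, t, Depths):
    # Same inputs as Get3Dfrom2D, except Depths gives the distance
    # along the back-projected ray for each pixel instead of a single d.
    List2D = np.array(List2D)
    List3D = []
    K_inv = np.linalg.inv(K)

    for p, d in zip(List2D, Depths):
        p = np.array([p[0], p[1], 1]).reshape(3, 1)
        pc = K_inv @ p

        pw = t + R @ pc          # point on the pixel ray, in the world frame
        cam_world = t            # camera origin in the world frame

        vector = pw - cam_world
        unit_vector = vector / np.linalg.norm(vector)
        List3D.append(cam_world + d * unit_vector)

    return List3D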

As I continue to explore this topic in more detail, I will keep the answer updated with the results from this approach.

Updates (COLMAP GitHub issue: https://github.com/colmap/colmap/issues/1476):

COLMAP's camera poses transform from world to camera coordinates, not the other way around. This leads to correctly oriented point clouds, but I still see some offsets. Attaching images with the results from my previous understanding on the left and the current one on the right.

[Image: Consecutive Sequence]

[Image: Random images in dataset]

Final Update: In the answer, I mentioned that COLMAP provides camera poses for each image, and that was the source of the error, as I went by the traditional definition of the term. I opened an issue on GitHub and learned that COLMAP provides camera poses that transform from world to camera coordinates. So, instead of passing [R, t] directly from COLMAP, I now pass [R.T, -R.T@t] to my Get3Dfrom2D() function and everything aligns:

[Image: Consecutive Sequence - Aligned]

[Image: Random images in dataset - Aligned]
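
In code, the correction amounts to inverting COLMAP's world-to-camera transform before calling the function (a minimal sketch; R_colmap and t_colmap are illustrative names for the values read from COLMAP):

# R_colmap, t_colmap : world-to-camera transformation as read from COLMAP
R_pose = R_colmap.T
t_pose = -R_colmap.T @ t_colmap

List3D = Get3Dfrom2D(List2D, K, R_pose, t_pose, d=1.75)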

There can be further improvements, such as filtering the point cloud, reading the refined intrinsics for each camera, etc.
