Solved – how to detect the exact size of an object in an image using machine learning

image processing, machine learning, object detection

I have a problem where I get an image from an underwater camera. The image is quite large compared to the objects shown, so it is mostly background (seafloor). Objects in the image are, for instance, corals or sponges.

In the image I would like to detect the exact size (maybe in pixels or some other measure) of one class of objects, let's say corals, across a sequence of images.

Which approach should I take here? I know there are a lot of tools out there and I would be grateful for any advice on where to start and what to try, as I have mostly worked with tabular data instead of images.

Does deep learning with a CNN work best here?

Thanks a lot for your help!

Best Answer

This task of depth estimation is part of a hard and fundamental problem in computer vision called 3D reconstruction. Recovering metric information from images is sometimes called photogrammetry. It's hard because when you move from the real world to an image you lose information.

Specifically, the projective camera transformation $T$ that takes your 3D point $p$ to your 2D image point $x$ via $x = Tp$ does not preserve distance. Because $T$ maps three dimensions down to two (in homogeneous coordinates it is a $3 \times 4$ camera matrix), it has no inverse: solving $T^{-1}x = p$ is an underdetermined inverse problem. A consequence of this is that pixel lengths are not generally going to be meaningful in terms of real-world distances. You can see a simple example of why 3D reconstruction is tricky by considering the forced perspective of the Ames room optical illusion:

[Image: the Ames room forced-perspective illusion. Source: Ian Stannard, https://flic.kr/p/8Pw5Rd]
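To make the loss of depth concrete, here is a minimal NumPy sketch (the focal length and principal point are made-up values): two 3D points at different depths along the same viewing ray project to exactly the same pixel, so the image alone cannot distinguish a small nearby object from a large distant one.

```python
import numpy as np

# Hypothetical pinhole camera: focal length 800 px, principal point (320, 240).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
# Camera at the origin looking down +Z: projection matrix P = K [I | 0], a 3x4 matrix.
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])

def project(point_3d):
    """Project a 3D point (metres) to pixel coordinates."""
    x = P @ np.append(point_3d, 1.0)   # homogeneous projection
    return x[:2] / x[2]                # divide out the depth

near = np.array([0.1, 0.05, 1.0])      # 1 m away
far  = near * 5.0                      # 5 m away, on the same viewing ray

print(project(near))   # same pixel (400, 280) ...
print(project(far))    # ... as this one: the depth information is gone
```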

Your visual processing system and many algorithms use cues such as shading and parallel lines to estimate depth, but these can be tricked. Generally you need to know the camera location and have something of a known size observable in the image. If you want really accurate length measurements from photography you have to plan for it in the data collection process (it is very helpful to include a calibration chessboard in the camera's field of view).
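For example, if a chessboard of known square size is visible in a few frames, OpenCV's standard calibration routine recovers the camera intrinsics, which you need before pixel measurements can be tied to metric ones. This is only a sketch: the folder name and the board geometry (9×6 inner corners, 25 mm squares) are assumptions for illustration.

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)        # inner corners per row/column (assumed)
square_mm = 25.0        # physical square size (assumed)

# 3D coordinates of the board corners in the board's own frame (Z = 0 plane).
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square_mm

obj_points, img_points = [], []
for path in glob.glob("calibration_frames/*.png"):   # hypothetical folder
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Assumes at least one frame contained the board.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("Camera matrix:\n", K)
```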

Here are a bunch of well-studied subproblems:

  • If you have one image you need to estimate everything from the image cues mentioned before. This is called monocular reconstruction or depth estimation.
  • If you have two overlapping images taken at the same time from different cameras then you can estimate the disparity between the images and triangulate using that (see the sketch after this list). This is called stereo reconstruction.
  • If you have multiple images taken from a single camera which is moving around, you can estimate the camera location and then triangulate. This is called monocular simultaneous localization and mapping (monoSLAM).
  • If you have many overlapping images then you can identify common points, estimate the camera locations, and triangulate as with stereo reconstruction or monoSLAM, but you do an extra step called bundle adjustment to correct for error propagation. This is called 3D reconstruction from multiple images.
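As a sketch of the stereo route mentioned above (the file names, focal length, and baseline are made-up assumptions): given a rectified pair from two calibrated cameras, a block-matching disparity map plus the baseline gives a depth map, and depth plus the focal length turns pixel extents into metric sizes.

```python
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # hypothetical rectified pair
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block matching is the simplest classical disparity estimator in OpenCV.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed-point output

focal_px = 800.0      # focal length in pixels (from calibration; assumed here)
baseline_m = 0.12     # distance between the two cameras in metres (assumed)

valid = disparity > 0
depth_m = np.zeros_like(disparity)
depth_m[valid] = focal_px * baseline_m / disparity[valid]   # Z = f * B / d

print("Median scene depth (m):", np.median(depth_m[valid]))
```

Once the depth $Z$ at an object is known, an object spanning $w$ pixels corresponds to roughly $w \cdot Z / f$ metres, which is the kind of size measure you asked about.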

These methods vary in whether they recover the scene geometry up to a projective transformation, up to an affine transformation, or up to a Euclidean transformation.

There is a nice list of papers and software covering the whole topic of 3D reconstruction. A classic reference textbook is:

Hartley, Richard, and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

This paper gives an example of depth estimation from a single RGB image using a CNN (the code is also available):

Laina, Iro, et al. "Deeper depth prediction with fully convolutional residual networks." 2016 Fourth International Conference on 3D Vision (3DV). IEEE, 2016.
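That specific model is not required to try the idea; below is a hedged sketch using a different, publicly available pretrained monocular depth network (MiDaS, loaded via torch.hub) on a hypothetical frame. Note that such models predict relative depth, not metric depth, so you still need a known scale (calibration target or another reference of known size) to get real-world measurements.

```python
import cv2
import torch

# Load a small pretrained monocular depth model and its matching preprocessing.
model = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
model.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

img = cv2.cvtColor(cv2.imread("seafloor.jpg"), cv2.COLOR_BGR2RGB)  # hypothetical frame

with torch.no_grad():
    prediction = model(transform(img))
    # Resize the prediction back to the input resolution.
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False).squeeze().cpu().numpy()

print("Relative depth map shape:", depth.shape)
```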
