Two things, for starters.
One, definitely do not work in RGB. Your default should be Lab (aka CIE L*a*b*) colorspace. Discard L. From your image it looks like the a coordinate gives you the most information, but you should probably do a principal component analysis on a and b and work along the first (most important) component, just to keep things simple. If this does not work, you can try switching to a 2D model.
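A minimal sketch of that idea, assuming OpenCV and scikit-learn are available (the filename and variable names here are just placeholders):

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA

# Load the image and convert BGR -> CIE L*a*b*.
# Note: OpenCV stores 8-bit Lab with the a and b channels offset by +128.
img = cv2.imread("coins.jpg")
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)

# Discard L (index 0); keep a and b as a (num_pixels, 2) float array.
ab = lab[:, :, 1:3].reshape(-1, 2).astype(np.float32)

# PCA on the a/b plane: the first component is the direction of largest color variation.
pca = PCA(n_components=2).fit(ab)
print("explained variance ratios:", pca.explained_variance_ratio_)

# Project every pixel onto the first component -> a single 1D "color coordinate" per pixel.
coord = pca.transform(ab)[:, 0].reshape(lab.shape[:2])
```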
Just to get a feeling for it: in the a channel the three yellowish coins have standard deviations below 6 and means of 137 ("gold"), 154, and 162 -- they should be distinguishable.
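To reproduce numbers like those, you could measure the a channel inside a mask for each coin; a rough sketch, where coin_masks is a hypothetical dict of boolean masks you would get from segmentation or draw by hand:

```python
import cv2
import numpy as np

img = cv2.imread("coins.jpg")
a_channel = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)[:, :, 1].astype(np.float32)

# `coin_masks`: hypothetical dict mapping a label to a boolean mask
# (same height/width as the image) covering one coin each.
for name, mask in coin_masks.items():
    vals = a_channel[mask]
    print(f"{name}: mean a = {vals.mean():.1f}, std a = {vals.std():.1f}")
```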
Two, the lighting issue. Here you'll have to define your problem carefully. If you want to distinguish close colors under any lighting and in any context -- you can't, not like this, anyway. If you are only worried about local variations in brightness, Lab will mostly take care of this. If you want to be able to work both under daylight and under incandescent light, can you ensure a uniform white background, like in your example image? Generally, what are your lighting conditions?
Also, your image was taken with a fairly cheap camera, by the looks of it. It probably has some sort of automatic white balance feature, which messes up the colors pretty badly -- turn it off if you can. It also looks like the image was coded in YCbCr at some point (which happens a lot if it's a video camera) or in a similar JPEG variant; the color information is severely undersampled. In your case that might actually be good -- it means the camera has done some denoising for you in the color channels. On the other hand, it probably means that at some point the color information was also quantized more strongly than brightness -- that's not so good. The main thing here is: the camera matters, and what you do should depend on the camera you are going to use.
If anything here does not make sense -- leave a comment.
This is a well known phenomenon. A good discussion can be found in the paper Deep Residual Learning for Image Recognition, especially Figure 1. The short summary is that when a neural network is deeper than a given problem needs, its later layers tend to try to recreate the identity. This is because the first portion of the network has already found an effective set of weights that optimize the objective, and the latter portion is now essentially adding noise. So the latter portion attempts to learn an identity function, which works poorly because you're trying to build an identity function out of a nonlinear set of activations. As an analogy, it's like approximating a line with polynomials of degree > 1: you get a wavy mess. The paper proposes ResNet, a deep architecture with skip connections that let a block's input bypass its activations, which significantly improves the quality of deeper networks.
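To make the skip-connection idea concrete, here is a minimal residual block in PyTorch; this is an illustrative sketch, not the exact block from the paper:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the block only has to learn the residual F,
    so representing 'do nothing' (the identity) is trivially easy."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # skip connection: add the input back in

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```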
This task of depth estimation is part of a hard and fundamental problem in computer vision called 3D reconstruction. Recovering metric information from images is sometimes called photogrammetry. It's hard because when you move from the real world to an image you lose information.
Specifically, the projective transformation $T$ that takes your 3D point $p$ to your 2D image point $x$ via $x = Tp$ (a $3 \times 4$ matrix in homogeneous coordinates) does not preserve distance and has no inverse: solving $Tp = x$ for $p$ is an underdetermined inverse problem, since a whole ray of 3D points maps to the same pixel. A consequence of this is that pixel lengths are not generally going to be meaningful in terms of real world distances. You can see a simple example of why 3D reconstruction is tricky by considering the forced perspective of the Ames room optical illusion:
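Here is a tiny numpy illustration of that loss of information, using an assumed pinhole camera with identity intrinsics: every 3D point along the same ray through the camera center lands on the same pixel, so the pixel alone cannot tell you the depth.

```python
import numpy as np

# Pinhole projection in homogeneous coordinates: a 3x4 matrix P (here with
# identity intrinsics and the camera at the origin, purely for illustration).
P = np.hstack([np.eye(3), np.zeros((3, 1))])

def project(point_3d):
    """Map a 3D point [X, Y, Z] to pixel coordinates [u, v]."""
    X = np.append(point_3d, 1.0)  # homogeneous 3D point
    x = P @ X                     # homogeneous 2D point
    return x[:2] / x[2]           # divide by depth

# Two different points on the same ray from the camera center...
print(project([1.0, 2.0, 5.0]))   # [0.2 0.4]
print(project([2.0, 4.0, 10.0]))  # [0.2 0.4]  -- same pixel, different depth
```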
(Photo of an Ames room; source: Ian Stannard, https://flic.kr/p/8Pw5Rd)
Your visual processing system and many algorithms use cues such as shading and parallel lines to estimate depth, but these can be tricked. Generally you need to know the camera location, and to have something of known size observable in the image. If you want really accurate length measurements from photography, you have to plan for it in the data collection process; for example, it is very helpful to include a calibration pattern such as a checkerboard in the camera's field of view.
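If you can plan the capture, OpenCV's standard calibration routines recover the camera intrinsics from a few checkerboard photos. A rough sketch (the board size, square size and filenames are assumptions):

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)     # inner corners of the checkerboard (assumed)
square_size = 0.024  # side length of one square, in metres (assumed)

# 3D coordinates of the corners in the board's own plane (Z = 0).
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square_size

obj_points, img_points = [], []
for fname in glob.glob("calib_*.jpg"):  # hypothetical filenames
    gray = cv2.cvtColor(cv2.imread(fname), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Recover the intrinsic matrix and distortion coefficients.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection error:", ret)
print("camera matrix:\n", K)
```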
There are a number of well studied subproblems in this area; they vary as to whether they recover the scene geometry up to a projective transformation, up to an affine transformation, or up to a Euclidean transformation.
There is a nice list of papers and software on the whole topic of 3D reconstruction available online. A classic reference textbook is:
Hartley, Richard, and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
This paper gives an example of depth estimation from a single RGB image using a CNN (the code is also available):
Laina, Iro, et al. "Deeper Depth Prediction with Fully Convolutional Residual Networks." 2016 Fourth International Conference on 3D Vision (3DV). IEEE, 2016.