In the 19th century, the physicists Young and Helmholtz proposed a trichromatic theory of color, in which the eye was modeled as three filters with overlapping ranges. This is essentially a physical model of the pigments in the eye, and it predicts the response of the nerve cells at the retina. Helmholtz did related work on sound and timbre. Ca. 1950, Hering, Hurvich, and Jameson proposed significant modifications to the trichromatic theory, called opponent processing. This models a later stage in the processing of the signals, after the retinal response but before the more sophisticated stages of processing in the brain. Both the trichromatic model and opponent processing are needed in order to describe certain phenomena in human color perception.
The complete theory can be modeled by two functions depending on wavelength. I'll call these $RG(\lambda)$ and $BY(\lambda)$. These functions are drawn here. They both oscillate between positive and negative values. For any given pure wavelength $\lambda$, the net result of pigment-filtering plus the later neurological processing produces these two numbers, which can be thought of as the final signals that go on to later processing in the brain. I'm calling them $RG$ and $BY$ for the following reasons. Let's pretend, for the sake of simplicity, that these functions oscillated between -1 and +1. Then the pair $(RG,BY)=(1,0)$ produces the sensation of red, (-1,0) is green, (0,1) is blue, and (0,-1) is yellow. There is various psychological evidence for this model, e.g., no color is perceived as reddish-green or yellowish-blue. Roughly speaking, what seems to be happening is that the eye-brain system is taking differences between signal levels of different cone cells. This sort of makes sense because, for example, the red and green pigments have response curves that overlap a lot, so if you want to place a pure-wavelength color on the spectrum, the difference between them is a more direct measure of what you want to know than the individual signals.
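The idea of taking differences between cone signals can be sketched numerically. The following is a toy model only: the Gaussian sensitivity curves, peak wavelengths, and opponent weightings are illustrative assumptions, not measured data. Note that this simple version does not reproduce the secondary short-wavelength peak of the $RG$ function discussed below; that requires extra structure in the long-wavelength cone's response.

```python
import math

# Hypothetical Gaussian cone sensitivities. Peak wavelengths (in nm) and the
# width are rough illustrative values, not measured pigment curves.
def cone(lam, peak, width=60.0):
    return math.exp(-((lam - peak) / width) ** 2)

def opponent_signals(lam):
    """Toy opponent-process model: RG and BY as differences of cone responses.
    The particular weightings here are assumptions for illustration only."""
    L = cone(lam, 565)  # long-wavelength ("red") cone
    M = cone(lam, 535)  # medium-wavelength ("green") cone
    S = cone(lam, 445)  # short-wavelength ("blue") cone
    RG = L - M              # positive -> reddish, negative -> greenish
    BY = S - (L + M) / 2.0  # positive -> bluish, negative -> yellowish
    return RG, BY

# A long wavelength comes out reddish (RG > 0), a short one bluish (BY > 0).
print(opponent_signals(650))
print(opponent_signals(450))
```

Even this crude sketch shows why the differences discriminate better than the raw signals: in the heavily overlapping red/green region, $L - M$ changes sign as the wavelength crosses between the two peaks.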
The $RG$ function actually has two different peaks, one at the red end of the spectrum and one, surprisingly, at the blue end. This implies that by mixing blue and red, you can produce an $(RG,BY)$ pair similar to what you would have gotten with monochromatic violet. If you look at other sources, e.g., this one (figure 3.3), they seem to agree on the secondary short-wavelength peak of the $RG$ function, but the details of how the two functions are drawn at the short wavelengths are different and seem to make for a less convincing explanation of the observed perceptual similarity between violet and a red-blue mixture.
I don't know if there is a valid reductionist explanation of the short-wavelength peak of the $RG$ function. Like a lot of things produced by evolution, it may basically be an accident that got frozen in. However, it's possible that it serves the evolutionary purpose of helping us to distinguish different shades of blue and violet. If the $RG$ function were simply zero over the whole short-wavelength end of the spectrum, then the $BY$ function would be the only information we'd get for those wavelengths. But the $BY$ function has a maximum, simply because the eye's sensitivity to light fades out as you get into the UV. Near this maximum, the ability of the $BY$ function to discriminate between colors becomes zero. In the York University graph, it appears that the short-wavelength extrema of the $RG$ and $BY$ functions are offset from one another, which would allow some color discrimination in this region. The physical information being preserved by the $RG$ function would then be the difference in response between the blue and green cones. But the Briggs graphs don't appear to show any such offset of the extrema, so it's possible that the explanation I'm giving is a bogus "just-so story."
There may be a good analogy here with sound. The sound spectrum is linear, but there is a psychological phenomenon of octave identification, which makes the spectrum "wrap around," so that frequencies $f$ and $2f$ are perceptually similar and can often be mistaken for one another even by trained musicians. Similarly, the predictive power of the "color wheel" model shows that to some approximation we can think of the trichromatic/opponent process model as resulting in a wrapping around of the visible segment of the EM spectrum into a circle. But in both cases, the wrap-around is only an approximation. In terms of pitch, $f$ and $2f$ are perceptually similar but not indistinguishable. For color, we have the 1976 CIELUV color diagram, which is a modification of the 1931 diagram meant to represent at least somewhat accurately the degree of perceptual similarity between different points based on the distance between them. The monochromatic spectrum constitutes part of the outer boundary of this diagram, and is more of a "V" than a circle; there is quite a large gap between monochromatic violet and monochromatic red.
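The octave wrap-around can be made concrete with a one-line "pitch class" function: taking $\log_2 f$ modulo 1 maps every frequency into a single octave, so $f$ and $2f$ land on exactly the same point. This is a standard construction, shown here as a minimal sketch:

```python
import math

def pitch_class(f):
    """Position within the octave, in [0, 1): log2(f) modulo 1.
    Frequencies an octave apart map to the same value."""
    return math.log2(f) % 1.0

# 220 Hz, 440 Hz, and 880 Hz are all the note A; they share a pitch class.
print(pitch_class(220), pitch_class(440), pitch_class(880))
```

The color wheel plays the analogous role for hue, but, as noted above, the identification of the two ends of the visible spectrum is much rougher than the identification of $f$ with $2f$.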
It is trivially true that any such diagram has a boundary that is a closed curve. If the diagram is not constrained to give any accurate depiction of the sizes of the perceptual differences between colors, then it can be distorted arbitrarily, and we can arbitrarily define it such that its boundary is a circle. In this sense, the success of the color wheel model is guaranteed, and it follows from nothing more than the fact that humans are trichromats, so that the color space is three-dimensional, and controlling for luminance produces a two-dimensional space. But this fails to explain why there is some degree of perceptual similarity between the red and violet ends of the monochromatic spectrum; for that you need the opponent processing model.
There is also a slight variation in the absorbance of the pigment in the red cones at the blue end of the spectrum. I don't think this is sufficient to explain the perceptual similarity between violet and red, or the even closer similarity between violet and a mixture of red and blue light, i.e., I don't think you can explain these facts using only the trichromatic theory without opponent processing. The classic direct measurements of the filter curves of cone-cell pigments were done with cone cells from carp by Tomita ca. 1965, but AFAIK the only direct measurement using human cone cells was Bowmaker 1981. Bowmaker's red-cell absorbance curve has a very slight rise at short wavelengths, but it's not very pronounced at all. You will see various other curves on the internet, often without any attribution or explanation of where they came from, and some of these show a much more pronounced bump rather than Bowmaker's slight rise. Possibly some of these are from people using the CIE 1931 curves, which were never intended to be physical models of the actual human cone-cell pigments. It should be clear, however, that the red and green pigments' curves must have some variation near the violet end of the spectrum. If they did not, then the dimensionality of the color space would be reduced there, and the human eye would be unable to distinguish different wavelengths in this region, which is contrary to fact.
Bowmaker, "Visual pigments and colour vision in man and monkeys," J R Soc Med. 1981 May; 74(5): 348, freely accessible at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1438839/
You are confusing additive and subtractive colour mixing. If you mix paints together you should get black, not white.
In additive mixing (as used in TVs and monitors), you create light, which is then mixed. When you mix the three primary colours (red, green and blue), you produce white. Other mixes produce other colours, for example red and green combine to produce yellow.
When you use paints, you are using an external light source (the sun or a light bulb) and each paint reflects some of the wavelengths and absorbs others. For example, yellow paint absorbs the blue wavelengths, leaving red and green, which mix to yellow. This is called subtractive mixing, and the primaries are cyan, magenta and yellow; when you mix paints of these colours, the result is black. Adding additional colours to this mix keeps the result black, as there is no more light to reflect. Other colours are made up by mixing the primaries.
With both additive and subtractive mixing, the result of mixing colours depends on the purity of the primaries. No paints are "perfect" cyan, magenta or yellow, and as a result the mix will not be completely black. You may get a dark brown or purple, depending on the paints you use. This is one (of several) reasons why printers use black as well as CMY.
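The difference between the two kinds of mixing can be summarized in a toy RGB model: additive mixing sums light in each channel, while subtractive mixing multiplies the fraction of light each layer lets through. The channel values and idealized primaries below are assumptions for illustration; real paints and phosphors have broad, imperfect spectra, which is exactly why the mixes described above come out brown or off-white in practice.

```python
# Toy RGB model of additive vs. subtractive mixing. Channels are in [0, 1];
# the primaries below are idealized, not measured paint or phosphor spectra.

def additive_mix(*lights):
    """Additive mixing (emitted light): channel-wise sum, clipped to 1."""
    return tuple(min(1.0, sum(c[i] for c in lights)) for i in range(3))

def subtractive_mix(*filters):
    """Subtractive mixing (paints/filters): each layer multiplies the
    fraction of the illuminant surviving in each channel."""
    out = [1.0, 1.0, 1.0]  # start with a white illuminant
    for f in filters:
        out = [out[i] * f[i] for i in range(3)]
    return tuple(out)

RED, GREEN, BLUE = (1, 0, 0), (0, 1, 0), (0, 0, 1)
CYAN, MAGENTA, YELLOW = (0, 1, 1), (1, 0, 1), (1, 1, 0)

print(additive_mix(RED, GREEN, BLUE))          # white (all channels on)
print(additive_mix(RED, GREEN))                # yellow
print(subtractive_mix(CYAN, MAGENTA, YELLOW))  # black (all channels blocked)
```

Replacing the ideal primaries with slightly "impure" ones (e.g., a cyan of (0.1, 0.9, 0.9)) makes the subtractive mix come out as a dark muddy color rather than true black, matching the paint behavior described above.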
The same goes for monitors: you never get "pure white" - which is typically defined as light with a colour temperature of 5500K, about the same as sunlight. Some monitors can be set for different temperatures. Some are set to 9000K, giving white a bluish cast. Interestingly, the colours that can be displayed on a monitor do not match those of a printer (or paint). A monitor can display colours that a printer cannot print, and vice versa. Every device has its own colour gamut, usually smaller than the eye's gamut, so with any device there are colours we can see but which the device cannot produce.
The reason why all this mixing occurs is because our retina has sensors for red, green and blue, and the brain mixes these inputs to tell us what colour we are seeing. This is why the primaries are RGB, or CMY.
Yes - we are surrounded by a "sea of photons".
An individual object that reflects light (let's assume a Lambertian reflector, i.e., something that scatters incident photons in all directions) sends some fraction of the incident photons in every direction. "Some fraction" because the surface will absorb some light (there is no such thing as 100% white).
The propagation of photons follows linear laws (at normal light intensities) so that two photons, like waves, can travel on intersecting paths and continue along their way without disturbing each other.
Finally it is worth calculating how many photons hit a unit area per unit time. If we assume sunlight, we know that the intensity of the light is about 1 kW / m$^2$. For the purpose of approximation, if we assume every photon had a wavelength of 500 nm, it would have an energy of $E = \frac{hc}{\lambda} = 3.97 \cdot 10^{-19}\ J$. So one square meter is hit with approximately $2.5\cdot 10^{21}$ photons per second. Let's assume your grey column reflects just 20% of these and that the visible component of light is about 1/10th of the total light (for the sake of this argument I can be off by an order of magnitude... this is for illustration only).
At a distance of 200 m, these photons would have spread over a sphere with a surface of $4\pi R^2 \approx 500,000\ m^2$, or $10^{14}$ photons per square meter per second.
If your pupil has a diameter of 4 mm, giving an area of about $12\ mm^2$, it will be hit by about $1.2\cdot 10^{9}$ photons per second from one square meter of grey surface illuminated by the sun from 200 m away.
At that distance, the angular size of that object is about 1/200th of a radian. "Normal" vision is defined as the ability to resolve objects that are about 5 minutes of arc (there are 60 minutes to a degree and about 57 degrees to a radian). In other words, you should be able to resolve about 1/(57*(60/5)), or roughly 1/700 of a radian. That's still lots of photons...
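The whole chain of estimates above can be checked in a few lines. Every number here is an order-of-magnitude assumption carried over from the text (1 kW/m$^2$ sunlight, 500 nm photons, 20% reflectance, 1/10 visible, 200 m distance, 4 mm pupil):

```python
import math

# Back-of-the-envelope photon count, following the estimate in the text.
h = 6.626e-34        # Planck constant, J*s
c = 3.0e8            # speed of light, m/s
wavelength = 500e-9  # assume every photon has a wavelength of 500 nm

E_photon = h * c / wavelength    # ~3.97e-19 J per photon
flux = 1000.0 / E_photon         # photons/s hitting 1 m^2 of sunlit surface (~2.5e21)

reflected = flux * 0.20 * 0.10   # 20% reflectance, ~1/10 of the light is visible
R = 200.0                        # distance to the grey surface, m
per_m2 = reflected / (4 * math.pi * R**2)  # spread over a sphere (~1e14 per m^2 per s)

pupil_area = math.pi * (2e-3) ** 2  # 4 mm diameter pupil, in m^2
into_eye = per_m2 * pupil_area      # photons per second entering the eye (~1e9)

print(f"E_photon = {E_photon:.2e} J, photons into eye = {into_eye:.2e} /s")
```

Running this reproduces the figures quoted above: roughly $10^{14}$ photons per square meter per second at 200 m, and on the order of $10^9$ photons per second through the pupil.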
Finally you ask "how do we distinguish what photons are reflected from what"? For this we have to thank the lens in our eye. A photon has a particular direction, and thanks to the lens its energy ends up on a particular part of the retina (this is what we call "focusing"). Photons from different directions end up in a different place. Nerves on the back of the retina tell us where the photons landed - and even what color they were. The visual cortex (part of the brain) uses that information to make a picture of the surrounding world in our mind.
It's nothing short of miraculous.