Speech sounds can be either periodic, like "aaah," or nonperiodic, like "sh." Periodic means that the pattern repeats over and over with a certain frequency. Here's a graph of sound pressure versus time for me singing the vowel "ah" at a fixed pitch:
This kind of graph is referred to as a "time domain" representation of the sound, because it has time on the x axis. Because the waveform repeats and therefore has a definite frequency, the sound also has a definite pitch. A sound like "sh" doesn't have a clearly defined pitch. Usually the periodic sounds are the ones that schoolchildren in the US are taught to call vowels, although, e.g., "r" is also periodic (and is used as a vowel in Mandarin, for example).
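To make the link between "repeats" and "has a pitch" concrete, here is a minimal sketch that recovers the repetition rate of a waveform from its autocorrelation. The signal is synthetic, not my actual recording, and the sample rate, pitch, and harmonic strengths are all made-up values chosen for illustration.

```python
import numpy as np

fs = 8000                      # sample rate in Hz (assumed)
f0 = 100.0                     # true pitch of the synthetic tone (assumed)
t = np.arange(2000) / fs       # 0.25 s of sample times

# Periodic test tone: a fundamental plus a few weaker harmonics (made-up strengths).
x = sum(a * np.sin(2 * np.pi * k * f0 * t)
        for k, a in [(1, 1.0), (2, 0.5), (3, 0.3), (4, 0.2)])

# Autocorrelation for non-negative lags (index 0 is lag 0).
acf = np.correlate(x, x, mode="full")[len(x) - 1:]

# Look for the lag (in samples) at which the waveform best matches itself,
# restricted to periods corresponding to pitches between 20 Hz and 400 Hz.
min_lag, max_lag = fs // 400, fs // 20
best_lag = min_lag + int(np.argmax(acf[min_lag:max_lag]))
estimated_pitch = fs / best_lag

print(best_lag, estimated_pitch)
```

A noise-like sound such as "sh" would give no strong autocorrelation peak at any nonzero lag, which is one way of saying it has no clearly defined pitch.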
Most musical instruments are designed to have a definite pitch, so they produce periodic waves. However, there are also unpitched instruments, such as most drums.
It's also possible to view a sound on a graph where the x axis is frequency rather than time. This is similar to what you'd see on a graphical equalizer, but with higher resolution. Here's a sample, which, IIRC, is also me singing "ah."
This is called a frequency-domain graph. Whenever the sound is periodic in the time domain, the frequency-domain representation looks like this: an evenly spaced "picket fence." The bottom frequency is called the fundamental. The higher ones, which are integer multiples of it, are called the harmonics. A musician would call these the overtone series. Although all these different frequencies are present, your ear-brain system hears them fused into a single sensation of tone; you normally can't "hear out" the overtones.
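The picket fence is easy to reproduce numerically. The sketch below builds a synthetic periodic waveform and takes its Fourier transform with NumPy. The sample rate, the 110 Hz fundamental, and the harmonic amplitudes are assumptions picked so that the peaks fall exactly on frequency bins.

```python
import numpy as np

fs = 8000                  # sample rate in Hz (assumed)
f0 = 110.0                 # fundamental frequency in Hz (assumed)
t = np.arange(fs) / fs     # exactly one second of sample times

# A periodic waveform: the fundamental plus three harmonics (made-up strengths).
amplitudes = [1.0, 0.5, 0.3, 0.2]
signal = sum(a * np.sin(2 * np.pi * (k + 1) * f0 * t)
             for k, a in enumerate(amplitudes))

# Magnitude spectrum, scaled so a unit-amplitude sine gives a peak of height 1.
spectrum = np.abs(np.fft.rfft(signal)) / (len(signal) / 2)
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# Frequencies whose bins carry significant energy: the "pickets".
peak_freqs = freqs[spectrum > 0.1]
print(peak_freqs)
```

The peaks sit at the fundamental and its integer multiples, and their heights are the amplitudes that went in. Changing `amplitudes` changes the timbre but not where the pickets are.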
If you make a graph like this for a musical instrument that produces periodic waves, it will also be a picket fence. However, the pattern of intensity of the peaks will be different. If you look at the graph of me singing, you'll notice that the peaks have an envelope that starts high on the left, then goes down, comes back up, and goes down again. The humps in this envelope are called formants, and they're caused by resonances in the vocal tract. I believe the resonances are roughly analogous to Helmholtz resonances, which are what you get when you blow over the mouth of a beer bottle. Their frequency depends on parameters such as the length of the bottle's neck and the volume of the bottle; this is different from examples like a flute, where the frequency is almost entirely determined by the length of the air column.
The different vowel sounds have different formants. The formant structure is what your ear-brain system uses in order to detect that what it's hearing is human speech, that it's a vowel, and which vowel it is.
To change what vowel you're making, you do things like raising and lowering your tongue. The vocal tract contains several different resonating cavities, one of which is the mouth. Oversimplifying a lot, you could imagine that raising your tongue would decrease the volume of your mouth, and if it was acting like a Helmholtz resonator, the decreased volume would cause its resonant frequency to go up (like a smaller beer bottle). If you do this while continuing to sing the same note, the picket fence in the frequency domain will keep its peaks at the same frequencies, but we could imagine (in this simplified analysis) that one of the formants would move upward, so that the relative intensities of the peaks would change.
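As a rough illustration of the beer-bottle picture, the textbook formula for an ideal Helmholtz resonator is f = (c / 2π) · sqrt(A / (V · L)), where A is the neck's cross-sectional area, L its length, and V the volume of the cavity. The sketch below just evaluates it. The dimensions are hypothetical, end corrections are ignored, and nothing here is meant as a realistic model of the mouth.

```python
import math

def helmholtz_frequency(neck_area, neck_length, volume, c=343.0):
    """Resonant frequency in Hz of an idealized Helmholtz resonator.

    neck_area in m^2, neck_length in m, volume in m^3, c = speed of sound in m/s.
    End corrections to the neck length are ignored.
    """
    return c / (2 * math.pi) * math.sqrt(neck_area / (volume * neck_length))

# Hypothetical bottle-like dimensions, chosen only for illustration.
neck_area = 3.0e-4    # m^2
neck_length = 0.05    # m

big_cavity = helmholtz_frequency(neck_area, neck_length, volume=7.5e-4)
small_cavity = helmholtz_frequency(neck_area, neck_length, volume=3.75e-4)

print(big_cavity, small_cavity)
```

Halving the volume raises the resonance by a factor of sqrt(2), which is the direction of the effect described above: a smaller cavity moves that resonance (and, in the simplified picture, the corresponding formant) upward while the sung pitch stays where it is.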
For one ear alone, the sound you hear will be the sum of all sounds at that point. So there's not really such a concept as "out of phase" there... "out of phase" relative to what? If you have two speakers generating two sounds that are identical except for a 180 degree phase offset when they arrive at one ear, the perceived volume at that ear will be zero or close to it.
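Here is a quick numerical check of the two-speaker case, as a minimal sketch: two synthetic 440 Hz tones of equal amplitude (my choice of frequency and sample rate) arriving at one point with a 180 degree offset.

```python
import numpy as np

fs = 44100                      # sample rate in Hz (assumed)
t = np.arange(fs) / fs          # one second of sample times

a = np.sin(2 * np.pi * 440 * t)             # speaker 1 at the ear
b = np.sin(2 * np.pi * 440 * t + np.pi)     # speaker 2, shifted by 180 degrees

def rms(x):
    """Root-mean-square level of a signal."""
    return np.sqrt(np.mean(x ** 2))

total = a + b                   # what the single ear actually receives
print(rms(a), rms(total))
```

The ear only ever gets `total`. It has no access to `a` and `b` separately, which is why "out of phase" isn't something one ear can detect about a sound; it can only hear the (here, nearly silent) sum.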
The story becomes a bit different when you talk about two ears. If your ears hear the same sound but it has a slight phase offset in one ear compared to the other, your brain uses this (plus a few other factors) to judge the direction to the source of the sound. E.g. a sound coming from a single point off to your side is likely to have a slightly different phase at each ear; this is part of the set of info your brain uses to figure out where the sound came from (as well as e.g. frequency filtering from the shape of your ears, amplitude differences, visual information, logical conclusions about what "makes sense" in the current situation, etc.).
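To get a feel for the size of the offset, here is a back-of-the-envelope sketch. It assumes a distant source, a straight-line path difference of d·sin(θ) between the ears, and an ear separation of 0.2 m. All of these are simplifying assumptions of mine; real heads are not two points in free space.

```python
import math

def interaural_delay(angle_deg, ear_separation=0.2, c=343.0):
    """Arrival-time difference in seconds between the two ears.

    angle_deg is measured from straight ahead (0) toward one side (90).
    Far-field, straight-line model; ear_separation (m) is an assumed value.
    """
    return ear_separation * math.sin(math.radians(angle_deg)) / c

def interaural_phase(angle_deg, freq_hz, ear_separation=0.2, c=343.0):
    """Phase offset in radians between the ears for a pure tone."""
    return 2 * math.pi * freq_hz * interaural_delay(angle_deg, ear_separation, c)

for angle in (0, 30, 90):
    print(angle, interaural_delay(angle), interaural_phase(angle, 500.0))
```

In this model the delays come out as fractions of a millisecond at most, and the same delay corresponds to a different phase offset at each frequency.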
So, if you're asking if one ear can distinguish between "sounds of different phase" from a single source, that doesn't really make any sense.†
If you're asking if one ear can distinguish between "sounds of different phase" from multiple sources: not really. You're only really aware of the end result.
If you're asking if you can distinguish between "sounds of different phase" across both your ears: yes, you do it all the time. It's one of the things that helps you locate the source of a sound.
For the one-ear case, though, it's a lot easier to identify two sounds whose phase is changing relative to each other. It's a common, distinctly recognizable audio effect, especially with guitars; you probably recognize the sound, e.g. https://youtu.be/pvScdOldfc8?t=154.
† By "doesn't really make any sense", I actually mean: You wouldn't be able to tell unless you knew what the sound was supposed to sound like "normally" as a reference for comparison. There'd be nothing inherently identifiable about such a sound, you'd need a mental reference. If I played two identical waveforms, overlapped but with one shifted slightly, of a sound that was completely unfamiliar to you, you would not be able to identify that as any kind of "phase shifting" - it's just a waveform like any other, but if I did it to a human voice, you'd be able to tell something is odd, because you know what a voice should sound like.
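The "two identical waveforms, one shifted slightly" case can be sketched as a simple comb filter: adding a copy delayed by τ multiplies each frequency component by 1 + exp(-2πi·f·τ). This is a minimal model of mixing a signal with a delayed copy (closer to a flanger than to a true phaser pedal, which uses all-pass filters), and the 1 ms delay is an arbitrary value picked for illustration.

```python
import numpy as np

def comb_gain(freq_hz, delay_s):
    """Gain applied to each frequency when a signal is summed with a delayed copy."""
    return np.abs(1 + np.exp(-2j * np.pi * freq_hz * delay_s))

delay = 1e-3                                    # 1 ms shift (assumed)
freqs = np.array([0.0, 500.0, 1000.0, 1500.0])  # Hz
gains = comb_gain(freqs, delay)

print(gains)
```

Some frequencies are doubled and others cancel completely. The result is still "just a waveform," only with a different spectrum; it is the sweeping of those notches over time, or your familiarity with the original sound, that makes the effect recognizable.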
All these modes are oscillations in the conserved densities (particle number, energy, momentum, etc.) of an interacting many-body system in approximate thermal equilibrium.
Consider first ordinary (first) sound. An ordinary fluid has five conserved densities: mass (particle number), energy, and the three components of momentum. The corresponding hydrodynamic theory then describes five modes of excitation. Three of these are diffusive (non-propagating): heat, and the two transverse components of the momentum density (shear modes). The longitudinal component of the momentum density couples to the mass and energy densities and forms a pair ($\omega=\pm c_s k$) of sound modes that propagate with velocity $c_s$, where $c_s^2=(\partial P)/(\partial\rho)|_{s/n}$. These sound modes are weakly damped, but the damping grows as the mean free path increases (typically, as the temperature is lowered).
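As a sanity check on the formula for the sound velocity, here is a minimal sketch that evaluates the derivative of pressure with respect to density at fixed entropy per particle numerically, for the simplest case I can think of: an ideal gas, where the adiabatic equation of state is $P = K\rho^\gamma$. The numbers for air (pressure, density, $\gamma$) are approximate room-condition values that I am supplying, not part of the argument above.

```python
import math

gamma = 1.4          # adiabatic index of air (approximate)
P0 = 101325.0        # ambient pressure in Pa (assumed)
rho0 = 1.204         # density of air near 20 C in kg/m^3 (approximate)

# Adiabat through the ambient state: P = K * rho**gamma at fixed entropy per particle.
K = P0 / rho0 ** gamma

def pressure(rho):
    """Pressure along the adiabat passing through (rho0, P0)."""
    return K * rho ** gamma

# Central finite difference for dP/drho along the adiabat.
h = 1e-4
dP_drho = (pressure(rho0 + h) - pressure(rho0 - h)) / (2 * h)

c_numeric = math.sqrt(dP_drho)
c_closed_form = math.sqrt(gamma * P0 / rho0)   # ideal-gas result: c^2 = gamma * P / rho

print(c_numeric, c_closed_form)
```

Both come out close to the familiar speed of sound in air, a bit over 340 m/s.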
At low temperature most fluids solidify, but some substances (most notably $^4\mathrm{He}$, a boson, and $^3\mathrm{He}$, a fermion) remain liquid and become quantum fluids. A Bose fluid (like $^4\mathrm{He}$) eventually becomes superfluid, and the sound mode of the normal fluid continues into the superfluid, where we can eventually understand it as a quantized excitation of the superfluid (a phonon). In a Fermi fluid (like $^3\mathrm{He}$) ordinary sound becomes strongly damped, but there is a particle-hole mode (which we can understand as an oscillation of the Fermi surface) that has the quantum numbers of ordinary sound, and is known as zero sound. The transition is smooth, but it is apparent from the behavior of the sound velocity and attenuation.
The image below is for liquid helium three, and shows a transition from zero to first sound at approximately 10 mK.
In a superfluid there is another hydrodynamic mode, associated with the Goldstone mode $\varphi$ related to spontaneous $U(1)$ breaking. The superfluid velocity is the gradient of this field, $\vec{v}_s=\hbar\vec\nabla\varphi/m$. Excitations of this mode mix with ordinary sound, and diagonalizing the system of equations leads to two propagating modes, known as first and second sound. The first sound mode is the one that connects to ordinary sound at the critical temperature $T_c$, whereas the speed of second sound goes to zero there.
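For a rough feel for why second sound disappears at $T_c$, here is the standard two-fluid estimate, quoted rather than derived. It assumes that the two modes approximately decouple (negligible thermal expansion), and the symbols $\rho_s$, $\rho_n$, $s$, and $c_v$ are notation I am introducing here:

$$c_2^2 \simeq \frac{\rho_s}{\rho_n}\,\frac{T s^2}{c_v},$$

where $\rho_s$ and $\rho_n$ are the superfluid and normal mass densities, $s$ is the entropy per unit mass, and $c_v$ is the specific heat per unit mass. Since $\rho_s\to 0$ as $T\to T_c$, this gives $c_2\to 0$, while the first sound velocity stays finite.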
In liquid helium ordinary sound is, to a good approximation, a density wave, whereas second sound is an oscillation of the normal fluid against the superfluid in which the density is approximately constant, but the entropy density oscillates. As a result, I can excite first sound with a vibrating plate, and second sound with a pulsed heater (and then check that the velocities behave as expected).
Below is an image of the two modes in an ultracold atomic gas (top panel: first sound; lower panel: second sound). We observe that second sound is slower, and that it cannot enter the normal fluid regime (dashed line).
Finally, third and fourth sound only exist in special geometries (thin films and channels).