Your voice, like any sound, is a combination of many frequencies.
Physically, your voice consists of pressure waves. If we plot the pressure as a function of time, we see that it goes up and down in a way that looks somewhat random.
You can measure these pressure waves with a microphone, then visualize them with an oscilloscope. Here's a YouTube video where they do this, starting 4:50 into the video.
You may be able to do this at home using the microphone on your computer and some software like Audacity.
The data collected by your microphone is a time series. The pressure is a function of time.
If you sang a pure note (or a reasonable approximation thereof), like you hear from an electronic tuner, the pressure would just be a sine wave.
You could imagine a more complicated sound that was two sine waves on top of each other. This could produce beats.
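As a small sketch (using NumPy, with arbitrary example frequencies), summing two sine waves that are close in frequency produces exactly this beating: the result is a tone at the average frequency whose loudness swells and fades at the difference frequency.

```python
import numpy as np

# Sample two sine waves at 440 Hz and 444 Hz for one second.
fs = 8000                       # sample rate in Hz
t = np.arange(fs) / fs          # one second of time points
f1, f2 = 440.0, 444.0
signal = np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)

# By the sum-to-product identity, this equals a 442 Hz tone whose
# amplitude envelope oscillates at |f2 - f1| = 4 beats per second.
beat_frequency = abs(f2 - f1)
print(beat_frequency)  # 4.0
```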
As you add more and more frequencies, more and more complicated sounds become possible.
It is a remarkable result that, in fact, any sound can be represented as a sum of infinitely many sine waves of different periods added on top of each other. This is Fourier's Theorem.
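You can see this decomposition in action with a discrete Fourier transform. The sketch below (example frequencies and amplitudes are my own choices) builds a "complicated" waveform from three sine components and then recovers those same components with NumPy's FFT:

```python
import numpy as np

fs = 1024                         # sample rate in Hz; one second of signal
t = np.arange(fs) / fs
# Build a "complicated" waveform from three known sine components.
components = [(50, 1.0), (120, 0.5), (300, 0.25)]   # (frequency Hz, amplitude)
x = sum(a * np.sin(2 * np.pi * f * t) for f, a in components)

# The FFT recovers exactly those frequencies and amplitudes.
spectrum = np.fft.rfft(x) / (fs / 2)    # scale so peak magnitude = amplitude
freqs = np.fft.rfftfreq(fs, d=1 / fs)
peaks = [(int(freqs[i]), round(float(abs(spectrum[i])), 2))
         for i in np.argsort(np.abs(spectrum))[-3:][::-1]]
print(peaks)   # [(50, 1.0), (120, 0.5), (300, 0.25)]
```

Real sounds need infinitely many (or at least very many) such components, but the principle is the same: the spectrum tells you how much of each frequency is present.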
A human voice thus consists of many sine waves combined simultaneously. Presumably, each individual voice has some special patterns to the way these frequencies are combined, assisting us in recognizing voices. However, speaker recognition is probably based on other information as well. I don't know too much about it, but you can check out the Wikipedia article.
We frequently try to isolate the different frequencies in a sound. This is done electronically through electronic filters. A crude example is "turning up the bass" - amplifying the low-frequency components of a sound. Of course, a professional music studio has far more sophisticated control of the various frequencies. This control can also be mimicked digitally through music sequencers.
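A crude "turn up the bass" can be imitated digitally in a few lines. This is only a sketch of the idea (a one-pole low-pass filter of my own choosing, not how a studio equalizer actually works): isolate the low-frequency content and mix an amplified copy of it back in.

```python
import numpy as np

def bass_boost(x, fs, cutoff=200.0, gain=2.0):
    """Crude digital 'turn up the bass': isolate content below `cutoff` Hz
    with a one-pole low-pass filter, then mix an amplified copy of that
    low band back into the original signal."""
    alpha = 1.0 - np.exp(-2 * np.pi * cutoff / fs)  # smoothing coefficient
    low = np.empty_like(x)
    acc = 0.0
    for i, sample in enumerate(x):      # y[i] = y[i-1] + alpha*(x[i] - y[i-1])
        acc += alpha * (sample - acc)
        low[i] = acc
    return x + (gain - 1.0) * low       # original plus extra bass

fs = 8000
t = np.arange(fs) / fs
bass = np.sin(2 * np.pi * 60 * t)       # 60 Hz component
treble = np.sin(2 * np.pi * 2000 * t)   # 2 kHz component
boosted = bass_boost(bass + treble, fs)
```

After the boost, the 60 Hz component is roughly doubled while the 2 kHz component is nearly untouched.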
On a cruder level, you could simply talk directly into an open piano. The string in the piano will be excited by your voice. The strings each have a specific frequency, so the strings that are excited the most tell you that their particular frequency is present the most in your voice.
Your ear accomplishes a similar task. The cochlea has many small hairs, similar to piano strings, which are tuned to different frequencies. When they vibrate, they mechanically trigger an ion channel to open, beginning an action potential that is eventually interpreted as sound by your brain. So, in essence, you are distinguishing the various frequencies in people's voices already.
It isn't possible to create an audio source in mid-air using the method you've described. This is because the two ultrasonic waves would create an audible source if the listener were standing at that spot, but those waves would continue to propagate in the same direction afterwards. You would need, as I point out below, some sort of medium that scatters the waves in all directions to make it seem as if the sound were coming from the point at which the two waves interfere.
It is possible, however, to make the listener perceive the sound as coming from a specific location, but it isn't as easy as the author makes it seem. I can think of two different ways. First of all, as described by @reirab, you can get audio frequencies by interfering two sound waves of high frequency. When they interfere they will generate a beat note whose frequency is the difference between the two frequencies. I.e., if you send a sound beam with frequency $f_1=200\ \text{kHz}$ and another beam with $f_2=210\ \text{kHz}$, the frequency heard in the region where they combine will be $\Delta f=f_2-f_1=10\ \text{kHz}$, which is in the audio band of humans.
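A quick numerical sketch of this (the square-law nonlinearity below is my own crude stand-in for the nonlinear propagation that demodulates the beams in air): combine the two ultrasonic tones, apply the nonlinearity, and the strongest component in the audio band comes out at the 10 kHz difference frequency.

```python
import numpy as np

fs = 2_000_000                    # 2 MHz sampling, enough for 200 kHz tones
t = np.arange(fs // 100) / fs     # 10 ms of signal
f1, f2 = 200_000.0, 210_000.0     # the two ultrasonic beams
combined = np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)

# A square-law nonlinearity demodulates the pair:
# sin(a)*sin(b) contains a cos(a - b) term at the difference frequency.
demodulated = combined ** 2
spectrum = np.abs(np.fft.rfft(demodulated))
freqs = np.fft.rfftfreq(len(t), d=1 / fs)
band = (freqs > 0) & (freqs <= 20_000)          # the audible range
audible_peak = freqs[band][np.argmax(spectrum[band])]
print(round(audible_peak))   # 10000 -> the audible 10 kHz difference tone
```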
There is an additional difficulty. You will need the sound to come out in a well-defined, narrow (collimated) beam, and this is not terribly easy to do. A typical speaker emits sound in all directions. There are many techniques for generating such beams, but one is to use a phased array.
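The essence of a phased array is simple: each element is driven with a small time delay so that the wavefronts add up coherently in the desired direction. As a sketch (the function name and the 40 kHz example array are my own, purely illustrative choices):

```python
import math

def steering_delays(n_elements, spacing_m, angle_deg, c=343.0):
    """Time delay (seconds) for each element of a linear phased array
    so that the emitted wavefronts add coherently in the direction
    `angle_deg` off broadside, at sound speed c (m/s)."""
    path_diff = spacing_m * math.sin(math.radians(angle_deg))
    delays = [i * path_diff / c for i in range(n_elements)]
    shift = min(delays)                 # keep all delays non-negative
    return [d - shift for d in delays]

# 8 elements at half-wavelength spacing for a 40 kHz ultrasonic array,
# steered 30 degrees off axis.
wavelength = 343.0 / 40_000             # ~8.6 mm
delays = steering_delays(8, wavelength / 2, 30.0)
```

With half-wavelength spacing at 40 kHz and a 30-degree steer, the per-element delay step works out to 6.25 microseconds.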
How can you use this to make a person perceive the sound as coming from a specific point?
Sending Two Different Volumes to the Two Ears
What does it mean to perceive sound as coming from a specific location? Our ears are just microphones with cones which accept sound mostly from one direction (excepting low frequencies). A large part of the way we determine where the sound came from is just the relative volume in our two ears. So, you could use the interference effect described above with beams which are narrow enough that you can target each ear. By using two separate sets of beams targeting each ear with different volumes, you could make the person perceive the sound as coming from a specific location; at least as well as a 3D movie makes a person perceive images in 3D.
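This is the same trick stereo mixing uses. A constant-power pan law (a standard convention, sketched here with my own function name) splits a signal between two channels so the louder side is perceived as the direction of the source:

```python
import math

def pan_gains(pan):
    """Constant-power pan law: pan = -1.0 (full left) .. +1.0 (full right).
    Returns (left_gain, right_gain); the ear receiving more energy is
    perceived as being on the side of the source."""
    angle = (pan + 1.0) * math.pi / 4      # map [-1, 1] -> [0, pi/2]
    return math.cos(angle), math.sin(angle)

left, right = pan_gains(0.5)               # source halfway to the right
```

At pan = 0 both channels get equal gain (about 0.707 each), and the squared gains always sum to 1, so the total power stays constant as the apparent source moves.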
Hitting a Material Which Scatters the Sound Isotropically
The second method is to use the same interference effect, but this time combining the two beams at a point where a material scatters the sound waves in all directions. I'm going to be honest, I'm not sure how realistic such materials are, but let's assume they exist for now. If you did so, the two sound beams would be scattered with equal amplitude in all directions, and the person you are trying to fool would perceive the sound as coming from this point. This method has the advantage of truly sounding to the person as if the sound came from that direction in all respects, including reflections, phasing, etc.
In summary, the idea is definitely possible (maybe there are more ways than I've given), but it isn't as simple as the passage in the book makes it out to be.
The human voice box produces a fundamental frequency and its harmonics because the mechanism is like that of a relaxation oscillator. However, we have limited control over the relative amplitude of the harmonics (we do have some - that is how we change the "color" of a tone we sing, and the sound of vowels).
In order to produce the Shepard scale, you need to be able to control the relative amplitude of the different harmonics - especially the ratio of the lowest two harmonics. To a limited extent we do this when we change the vowel that we sing - with the "oo" sound having few "really high" harmonics, while the "ah" has lots. For example, from the hyperphysics site we get this image:
showing that there is a lot of harmonic content in the voice. But it's not "evenly distributed" - so if you were to drop by an octave, you would be creating a sound that is sufficiently different that you don't really get the feeling that you have an "eternal" scale.
I suspect the most important problem is that you would want to re-introduce the lowest harmonic with a slowly increasing amplitude, so that the note "returns to the lower range" without ever appearing to jump there. But the mechanism of the vocal cords is too simple to allow it.
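That amplitude control is exactly what a synthesized Shepard tone does. A sketch of the standard construction (the bell-shaped envelope and the specific parameters here are my own illustrative choices): the components are octaves of a base frequency, weighted by an envelope that is fixed in log-frequency space, so as the base rises, new low components enter near-silently while the top ones fade out.

```python
import math

def shepard_amplitudes(base_freq, n_octaves=7, center=440.0, sigma=1.0):
    """Amplitudes for the octave components of one Shepard tone.
    The Gaussian envelope is fixed relative to `center`, not to the
    note: components far from the center (in octaves) are near-silent,
    which is what hides the octave jump as the scale repeats."""
    amps = []
    for k in range(n_octaves):
        f = base_freq * 2 ** k
        octaves_from_center = math.log2(f / center)
        amps.append(math.exp(-0.5 * (octaves_from_center / sigma) ** 2))
    return amps

low = shepard_amplitudes(27.5)    # components at 27.5, 55, 110, ... Hz
```

For a base of 27.5 Hz, the component nearest 440 Hz is the loudest, while the 27.5 Hz component itself has an almost negligible amplitude; a singer's voice box offers no comparable way to fade individual harmonics in and out.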
Incidentally, when sopranos sing very high notes, many people lose the ability to distinguish what vowel they are singing, since the harmonics are further apart. The ear distinguishes between vowels by estimating the shape of the frequency envelope in the range up to a few kHz; when there are very few harmonics in that range, the shape cannot be determined. The "high C" (C7) has a frequency of 2093 Hz, so there might be just a couple of harmonics available to figure out the sound. That makes vowels in the highest register hard to distinguish.
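A quick arithmetic check makes the point concrete (the 5 kHz cutoff is my own rough stand-in for "up to a few kHz"):

```python
def harmonics_below(fundamental_hz, limit_hz=5000.0):
    """List the harmonics of a sung note that fall in the frequency
    range the ear uses to identify vowels (roughly up to a few kHz)."""
    h, result = 1, []
    while h * fundamental_hz <= limit_hz:
        result.append(h * fundamental_hz)
        h += 1
    return result

print(len(harmonics_below(262)))    # middle C (C4): 19 harmonics below 5 kHz
print(len(harmonics_below(2093)))   # high C (C7): only 2 (2093 and 4186 Hz)
```

Nineteen harmonics give the ear plenty of envelope shape to work with; two give it almost nothing, which is why the vowel becomes ambiguous.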