It isn't possible to create an audio source in mid-air using the method you've described. This is because the two ultrasonic waves would create an audible source if the listener were standing at that spot, but those waves would continue to propagate in the same direction afterwards. You would need, as I point out below, some sort of medium which scattered the waves in all directions to make it seem as if the sound were coming from the point at which you interfered the two waves.
It is possible, however, to make the user perceive the sound as coming from a specific location, but it isn't as easy as the author makes it seem. I can think of two different ways. First of all, as described by @reirab, you can get audio frequencies by interfering two sound waves of high frequency. When they interfere they will generate a beat note whose frequency is the difference between the two frequencies. For example, if you send a sound beam with frequency $f_1=200\ \text{kHz}$ and another beam with $f_2=210\ \text{kHz}$, the frequency heard in the region where they combine will be $\Delta f=f_2-f_1=10\ \text{kHz}$, which is in the audio band of humans.
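As a rough numerical check of the beat-note arithmetic above, here is a minimal Python sketch. The function names are made up for illustration, and it only covers the ideal sum of two equal-amplitude sine waves. It uses the identity $\sin A + \sin B = 2\sin\frac{A+B}{2}\cos\frac{A-B}{2}$ to show that the sum is a fast carrier inside a slow envelope whose loudness rises and falls $\Delta f$ times per second.

```python
import math

def beat_frequency(f1_hz, f2_hz):
    """Difference (beat) frequency of two tones, in Hz."""
    return abs(f2_hz - f1_hz)

def two_tone_sum(f1_hz, f2_hz, t):
    """Direct sum of two unit-amplitude sine waves at time t (seconds)."""
    return math.sin(2 * math.pi * f1_hz * t) + math.sin(2 * math.pi * f2_hz * t)

def envelope_times_carrier(f1_hz, f2_hz, t):
    """The same signal rewritten as a slow envelope times a fast carrier."""
    envelope = 2 * math.cos(math.pi * (f2_hz - f1_hz) * t)
    carrier = math.sin(math.pi * (f1_hz + f2_hz) * t)
    return envelope * carrier

# Frequencies taken from the example in the answer.
f1, f2 = 200e3, 210e3
print("beat frequency in Hz:", beat_frequency(f1, f2))

# The two forms agree at arbitrary sample times (here every 0.7 microseconds).
for n in range(5):
    t = n * 0.7e-6
    print(two_tone_sum(f1, f2, t), envelope_times_carrier(f1, f2, t))
```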
There is an additional difficulty. You will need the sound to come out in a well-defined, narrow (collimated) beam, and this is not terribly easy to do. A typical speaker emits sound in all directions. There are many techniques for generating such beams, but one is to use a phased array.
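To give a feel for how a phased array steers a beam, here is a minimal sketch assuming an idealized uniform linear array: identical emitters on a line, equally spaced, radiating into air with a speed of sound of about 343 m/s. The function name, the 5 mm spacing, and the 30 degree angle are illustrative assumptions, not values from the question. Each element fires slightly later than its neighbour so that the wavefronts line up in the chosen direction.

```python
import math

def steering_delays(n_elements, spacing_m, angle_deg, speed_of_sound=343.0):
    """Firing delays (seconds) that steer a uniform linear array.

    angle_deg is measured from broadside (the direction perpendicular
    to the line of emitters). Assumes ideal, identical point emitters.
    """
    # Extra path length between neighbouring elements is spacing * sin(angle).
    step = spacing_m * math.sin(math.radians(angle_deg)) / speed_of_sound
    delays = [n * step for n in range(n_elements)]
    # Shift so the earliest element fires at t = 0 (handles negative angles).
    earliest = min(delays)
    return [d - earliest for d in delays]

# Hypothetical array: 8 emitters, 5 mm apart, beam steered 30 degrees off broadside.
for index, delay in enumerate(steering_delays(8, 0.005, 30.0)):
    print(index, f"{delay * 1e6:.2f} microseconds")
```

Steering alone does not make the beam narrow. The beam width depends on how large the array is compared with the wavelength, which is one reason short-wavelength ultrasound is attractive here.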
How can you use this to make a person perceive the sound as coming from a specific point?
Sending Two Different Volumes to the Two Ears
What does it mean to perceive sound as coming from a specific location? Our ears are just microphones with cones which accept sound mostly from one direction (excepting low frequencies). A large part of the way we determine where the sound came from is just the relative volume in our two ears. So, you could use the interference effect described above with beams which are narrow enough that you can target each ear. By using two separate sets of beams targeting each ear with different volumes, you could make the person perceive the sound as coming from a specific location; at least as well as a 3D movie makes a person perceive images in 3D.
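As an illustration of the relative-volume idea, here is a minimal sketch that splits a signal between the two ears using a constant-power pan law, a common audio-mixing convention that I am swapping in for illustration (it is not something the book or the question specifies). The function names and the `pan` parameter are assumptions. It models only the level difference between the ears, not timing or reflections.

```python
import math

def ear_gains(pan):
    """Left and right gains for pan in [0, 1].

    pan = 0.0 is fully left, 0.5 is centered, 1.0 is fully right.
    Constant-power law: left**2 + right**2 is always 1.
    """
    if not 0.0 <= pan <= 1.0:
        raise ValueError("pan must be between 0 and 1")
    angle = pan * math.pi / 2
    return math.cos(angle), math.sin(angle)

def level_difference_db(left_gain, right_gain):
    """Right-minus-left level difference in decibels (both gains must be > 0)."""
    return 20 * math.log10(right_gain / left_gain)

# Hypothetical source placed somewhat to the listener's right.
left, right = ear_gains(0.75)
print(left, right, level_difference_db(left, right))
```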
Hitting a Material Which Scattered the Sound Isotropically
The second method is to use the same interference effect, but this time combining the two beams at a point where a material scatters the sound waves in all directions. I'm going to be honest, I'm not sure how realistic such materials are, but let's assume they exist for now. If you did so, the two sound beams would be scattered with equal amplitude in all directions and the person you are trying to fool would perceive the sound as coming from this point. This method has the advantage of truly sounding to the person as if the sound came from that direction in all respects including reflections, phasing, etc.
In summary, the idea is definitely possible (maybe there are more ways than I've given), but it isn't as simple as the passage in the book makes it out to be.
To expand on Xcheckr's answer:
The full equation for a single-frequency traveling wave is
$$f(x,t) = A \sin\left(2\pi ft - \frac{2\pi}{\lambda}x\right),$$
where $f$ is the frequency, $t$ is time, $\lambda$ is the wavelength, $A$ is the amplitude, and $x$ is position. This is often written as
$$f(x,t) = A \sin(\omega t - kx)$$
with $\omega = 2\pi f$ and $k = \frac{2\pi}{\lambda}$. If you look at a single point in space (hold $x$ constant), you see that the signal oscillates up and down in time. If you freeze time (hold $t$ constant), you see the signal oscillate up and down as you move along it in space. If you pick a point on the wave and follow it as time goes forward (hold the value $f(x,t)$ constant, that is, keep the phase $\omega t - kx$ fixed, and let $t$ increase), you have to move in the positive $x$ direction to keep up with the point on the wave.
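To make "keeping up with the point on the wave" concrete: holding $\omega t - kx$ fixed means moving at the speed $v = \omega/k = f\lambda$. Here is a minimal sketch that checks this numerically. The function names and the sample values (a 1 kHz tone with a 0.343 m wavelength, roughly sound in air) are assumptions chosen for illustration.

```python
import math

def wave(x, t, amplitude, freq_hz, wavelength_m):
    """Single-frequency traveling wave A*sin(omega*t - k*x)."""
    omega = 2 * math.pi * freq_hz
    k = 2 * math.pi / wavelength_m
    return amplitude * math.sin(omega * t - k * x)

def phase_speed(freq_hz, wavelength_m):
    """Speed at which a point of constant phase moves: omega/k = f*lambda."""
    return freq_hz * wavelength_m

# Illustrative values.
A, f, lam = 1.0, 1000.0, 0.343
v = phase_speed(f, lam)

x0, t0 = 0.1, 0.0
for step in range(4):
    dt = step * 1e-4
    # Moving with the wave (x increases by v*dt) keeps the value unchanged.
    print(dt, wave(x0 + v * dt, t0 + dt, A, f, lam), wave(x0, t0, A, f, lam))
```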
This only describes a wave of a single frequency. In general, anything of the form
$$f(x,t) = w(\omega t - kx),$$
where $w$ is any function, describes a traveling wave.
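One way to see why this form always describes a traveling wave, assuming $w$ is twice differentiable: write $u = \omega t - kx$ and apply the chain rule,

$$\frac{\partial^2 f}{\partial t^2} = \omega^2 w''(u), \qquad \frac{\partial^2 f}{\partial x^2} = k^2 w''(u),$$

so that

$$\frac{\partial^2 f}{\partial t^2} = \left(\frac{\omega}{k}\right)^2 \frac{\partial^2 f}{\partial x^2}.$$

This is the one-dimensional wave equation with speed $v = \omega/k$. The value of $w(u)$ stays the same wherever $u$ stays the same, that is, along $x = (\omega/k)\,t + \text{const}$, so the whole shape slides in the positive $x$ direction without changing.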
Sinusoids turn up very often because the vibrating sources of the disturbances that give rise to sound waves are often well-described by
$$\frac{\partial^2 s}{\partial t^2} = -a^2 s.$$
In this case, $s$ is the distance from some equilibrium position and $a$ is some constant. This describes the motion of a mass on a spring, which is a good model for guitar strings, speaker cones, drum membranes, saxophone reeds, vocal cords, and on and on. The general solution to that equation is
$$s(t) = A\cos(a t) + B\sin(a t).$$
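As a quick check, differentiating this solution twice with respect to time gives

$$\frac{\partial s}{\partial t} = -aA\sin(a t) + aB\cos(a t),$$

$$\frac{\partial^2 s}{\partial t^2} = -a^2 A\cos(a t) - a^2 B\sin(a t) = -a^2 s(t),$$

which is the equation above. The constants follow from the starting conditions: $A = s(0)$ is the initial displacement, and $B = \frac{1}{a}\left.\frac{\partial s}{\partial t}\right|_{t=0}$ is the initial velocity divided by $a$.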
Comparing this with the traveling wave equations at a fixed value of $x$ (the source isn't moving, unless you want to consider Doppler effects), one can see that $a$ is the angular frequency $\omega$.
For objects more complicated than a mass on a spring, there are multiple $a$ values, so that object can vibrate at multiple frequencies at the same time (think harmonics on a guitar). Figuring out the contributions of each of these frequencies is the purpose of a Fourier transform.
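As a small illustration of picking out those contributions, here is a minimal sketch using a plain discrete Fourier transform written with the Python standard library. The function name, the 800 Hz sample rate, and the test signal (a 110 Hz fundamental plus a weaker 220 Hz harmonic) are assumptions for illustration. A real analysis would normally use an FFT library.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Amplitude of each frequency bin from 0 up to the Nyquist bin.

    Scaled so a sine of amplitude A that falls exactly on a bin reads A
    (this scaling is not valid for bin 0 or the Nyquist bin).
    """
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(samples[m] * cmath.exp(-2j * math.pi * k * m / n)
                    for m in range(n))
        mags.append(abs(total) * 2 / n)
    return mags

# Hypothetical signal: fundamental at 110 Hz plus a half-strength harmonic at 220 Hz.
sample_rate = 800.0   # samples per second
n_samples = 80        # 0.1 s of signal, so bins are 10 Hz apart
signal = [math.sin(2 * math.pi * 110 * m / sample_rate)
          + 0.5 * math.sin(2 * math.pi * 220 * m / sample_rate)
          for m in range(n_samples)]

mags = dft_magnitudes(signal)
bin_width = sample_rate / n_samples
for k, mag in enumerate(mags):
    if mag > 0.01:
        print(f"{k * bin_width:.0f} Hz: amplitude {mag:.3f}")
```

The two bins that stand out are the two frequencies that were mixed in, with their relative strengths preserved.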
Best Answer
Reflections from hardish surfaces and diffraction. The housing and the object on which the loudspeaker stands will also vibrate and produce sound waves.