I did some experiments related to this back in 1994, so it's going to take a bit of recall.
The idea of a flute is that you create standing waves whose frequency depends on the (variable) geometry. They are standing waves because the ends impose fixed boundary conditions. In particular, the acoustic pressure satisfies p=0 at an open end.
Now, consider a standing wave in a flute with a wavelength that is a fraction of the flute length. That means there are several pressure nodes in the middle. If you opened a key at a node, there would be no effect. If you opened one near a node, the pitch would change slightly.
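To make the node argument concrete, here is a minimal sketch that treats the flute as an idealized tube open at both ends (p=0 at each end). That model, the speed of sound, the tube length, and the function names are all assumptions for illustration; a real flute has end corrections and tone holes of finite size.

```python
# Idealized open-open pipe: the pressure is zero at both ends.
SPEED_OF_SOUND = 343.0  # m/s, air at roughly 20 C (assumed value)


def mode_frequency(length_m, n, c=SPEED_OF_SOUND):
    """Frequency in Hz of the n-th standing-wave mode, f_n = n * c / (2 * L)."""
    return n * c / (2.0 * length_m)


def pressure_nodes(length_m, n):
    """Positions in metres where p = 0 for mode n, including both open ends."""
    return [k * length_m / n for k in range(n + 1)]


length = 0.6  # hypothetical tube length in metres
for n in (1, 2, 3):
    interior = pressure_nodes(length, n)[1:-1]  # drop the two open ends
    print(n, round(mode_frequency(length, n), 1), [round(x, 3) for x in interior])
```

In this model the second mode has a pressure node at the midpoint, so a hole opened exactly there imposes a condition (p=0) that the wave already satisfies.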
The human voice box produces a fundamental frequency and its harmonics because the mechanism is like that of a relaxation oscillator. However, we have limited control over the relative amplitude of the harmonics (we do have some - that is how we change the "color" of a tone we sing, and the sound of vowels).
In order to produce the Shepard scale, you need to be able to control the relative amplitude of the different harmonics, especially the ratio of the lowest two harmonics. To a limited extent we do this when we change the vowel that we sing: the "oo" sound has few "really high" harmonics, while the "ah" has lots. For example, an image on the HyperPhysics site shows that there is a lot of harmonic content in the voice. But it's not "evenly distributed", so if you were to drop by an octave, you would create a sound that is sufficiently different that you don't really get the feeling that you have an "eternal" scale.
I suspect the most important problem is that you would want to re-introduce the lowest harmonic with a slowly increasing amplitude, so that the note "returns to the lower range" without ever appearing to jump there. But the mechanism of the vocal cords is too simple to allow it.
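For comparison, this is roughly the amplitude control a synthesized Shepard tone relies on: octave-spaced components under a fixed bell-shaped envelope over log frequency, so the lowest component fades in as the highest fades out. This is a minimal sketch, not a reference implementation; the raised-cosine envelope, the 27.5 Hz lower edge, the eight-octave band, and the function name are assumptions chosen for illustration.

```python
import math


def shepard_components(pitch_class, n_octaves=8, f_low=27.5):
    """Return (frequency_hz, amplitude) pairs for one Shepard tone.

    pitch_class is the position within the octave in semitones (0 to 12).
    f_low and n_octaves set the band edges and are assumed values.
    """
    components = []
    for k in range(n_octaves):
        # Position of this component above the lower band edge, in octaves.
        octaves_up = k + pitch_class / 12.0
        freq = f_low * 2.0 ** octaves_up
        # Raised-cosine envelope: zero at both band edges, one in the middle.
        amp = 0.5 * (1.0 - math.cos(2.0 * math.pi * octaves_up / n_octaves))
        components.append((freq, amp))
    return components


start = [c for c in shepard_components(0) if c[1] > 1e-12]
after_octave = [c for c in shepard_components(12) if c[1] > 1e-12]
print(start == after_octave)
```

After rising twelve semitones the audible components are the same as at the start, which is what makes the scale appear endless. The voice cannot shape its lowest harmonics independently in this way.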
Incidentally, when sopranos sing very high notes, many listeners can no longer tell which vowel is being sung. The ear distinguishes between vowels by estimating the shape of the frequency envelope in the range up to a few kHz, and at high pitch the harmonics are so far apart that very few fall in that range, so the shape cannot be determined. The "high C" (C7) has a frequency of 2093 Hz, so there might be just a couple of harmonics available to figure out the sound.
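The arithmetic behind that can be sketched in a few lines. The 5 kHz cutoff stands in for "a few kHz" and is an assumption, as are the function name and the 220 Hz comparison note.

```python
def harmonics_below(f0_hz, f_max_hz=5000.0):
    """List the harmonics of f0_hz that do not exceed f_max_hz.

    f_max_hz is an assumed upper edge for the band the ear uses for vowels.
    """
    count = int(f_max_hz // f0_hz)
    return [n * f0_hz for n in range(1, count + 1)]


for label, f0 in (("220 Hz", 220.0), ("C7, 2093 Hz", 2093.0)):
    print(label, len(harmonics_below(f0)), "harmonics in range")
```

With this cutoff a 220 Hz note puts about twenty harmonics in the band, enough to trace out the envelope, while a 2093 Hz note puts only two there.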
Best Answer
The physiology of the human ear (and perhaps the brain) makes sounds with a frequency around 3000 Hz seem louder than sounds at higher or lower frequencies with the same sound pressure; see https://en.wikipedia.org/wiki/Equal-loudness_contour