I know that when we speak into a microphone, the pitch of our voice causes a magnet in the microphone to vibrate, which generates an electrical signal of varying voltage.

But my question is: when we speak, our voice contains not only pitches but also the content of our words. How are the words (such as "hello" and "how are you") turned into electrical signals?

I am trying to figure out how our voice data is delivered as packets when we have a voice chat over the Internet. What specifically do the packets contain?

Here is an image of the waveform of my voice saying "hello":

[Image: waveform of the word "hello"]

The blue line corresponds to a vibration in the air (pressure wave) but it's easier to imagine it as the amount a speaker cone needs to be displaced at a given time or the amount your eardrum will be displaced at a given time.

The simplest common digital way to encode this sound as data is a technique called "Pulse Code Modulation" (PCM), which is the format that common "wave" or ".wav" files use. To encode something as PCM, you slice time into very short intervals, and for each interval you record (or "sample") the displacement of the microphone membrane at that instant. Playing these displacements back through a speaker cone reproduces the sound.
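As a rough illustration of the sampling step, here is a minimal Python sketch. It samples a pure 440 Hz tone as a stand-in for a real microphone signal and stores each sample as a signed 16-bit integer. The sample rate, the tone, and the function names `sample_tone` and `to_pcm_bytes` are assumptions chosen for the example, not part of any particular recording software.

```python
import math
import struct

SAMPLE_RATE = 8000      # samples per second (assumed; roughly telephone quality)
MAX_AMPLITUDE = 32767   # largest value a signed 16-bit sample can hold

def sample_tone(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Sample a sine wave at regular intervals and quantize to 16-bit integers."""
    n_samples = round(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                                  # time of this sample
        displacement = math.sin(2 * math.pi * freq_hz * t)   # between -1.0 and 1.0
        samples.append(round(displacement * MAX_AMPLITUDE))  # quantize
    return samples

def to_pcm_bytes(samples):
    """Pack the samples as little-endian signed 16-bit values (raw PCM)."""
    return struct.pack("<%dh" % len(samples), *samples)

samples = sample_tone(440, 0.01)   # 10 ms of a 440 Hz tone
pcm = to_pcm_bytes(samples)
print(len(samples), "samples,", len(pcm), "bytes")
```

Ten milliseconds at 8000 samples per second gives 80 samples, and at two bytes per sample that is 160 bytes of raw PCM data. A recording of speech would be handled the same way, only the numbers would trace the shape of the voice waveform instead of a sine wave.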

Packets transmitted with this waveform information just contain long lists of numbers, one for each sampled displacement of the waveform. No "meaning" is transmitted. Your brain is what interprets meaning from the vibrations.
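To make the "lists of numbers" idea concrete, here is a small Python sketch that cuts a list of samples into packets and reads one back. The packet layout (a sequence number and a sample count, followed by the samples) and the names `packetize` and `unpacketize` are invented for this illustration. Real voice chat applications typically use an established protocol such as RTP and compress the audio before sending it.

```python
import struct

HEADER = "<IH"  # hypothetical header: 4-byte sequence number, 2-byte sample count
HEADER_SIZE = struct.calcsize(HEADER)

def packetize(samples, samples_per_packet=160):
    """Split 16-bit samples into packets of header + raw sample bytes."""
    packets = []
    for seq, start in enumerate(range(0, len(samples), samples_per_packet)):
        chunk = samples[start:start + samples_per_packet]
        header = struct.pack(HEADER, seq, len(chunk))
        payload = struct.pack("<%dh" % len(chunk), *chunk)
        packets.append(header + payload)
    return packets

def unpacketize(packet):
    """Recover the sequence number and the list of samples from one packet."""
    seq, count = struct.unpack_from(HEADER, packet, 0)
    samples = list(struct.unpack_from("<%dh" % count, packet, HEADER_SIZE))
    return seq, samples

voice = [0, 1200, -3400, 5600, -32768]   # made-up sample values
packets = packetize(voice, samples_per_packet=2)
print(len(packets), "packets")
print(unpacketize(packets[1]))
```

The receiver only ever sees numbers like these. It hands them to the sound card in order, the speaker moves accordingly, and the listener's brain does the rest.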

