Last Updated: April 1, 2009 by Pascal Clark
*Every sound file on this page was generated using the Matlab script "application1_speechAnalysis," which is included with the Modulation Toolbox.
We acknowledge the support of the U.S. Air Force Office of Scientific Research and the U.S. Office of Naval Research in the development of this toolbox.
This page presents an informal comparison of how coherent and incoherent demodulation methods represent
the linguistic content in recorded human speech. Specifically, how much of speech intelligibility is
in the envelopes (or "modulators"), and how much is in the fine structure (or "carriers")? As shown next, the
answer depends on which modulation decomposition we use. The harmonic coherent model cleanly
defines envelopes and carriers such that only envelopes contain speech information, leaving the carriers to
represent the speaker's pitch. By contrast, the distinction is less clear in the incoherent Hilbert
envelope decomposition, in which the carriers are extremely noisy yet remain intelligible even when
played by themselves. The following is a series of demonstrations to support these claims.
Harmonic Speech Reconstruction
Comparison with Incoherent Demodulation
Try Your Own Experiments
We begin by using harmonic coherent analysis to decompose a recorded speech signal into harmonically related carriers and complex-valued envelopes. Rather than play the original speech signal, we will instead reconstruct the signal from its harmonic modulation decomposition, one component at a time. Below is the spectrogram for one carrier, which is a frequency-modulated tone that follows the speaker's fundamental frequency.
Next, we multiply the carrier with its corresponding modulator in the time domain. The result is a temporal pattern in the spectrogram, as well as an audible speech-like rhythm in the sound playback.
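The carrier-times-modulator step can be sketched in a few lines of Python. This is an illustration only, not the toolbox's implementation: the function names, the 16 kHz sampling rate, the pitch track, and the modulator shape are all assumptions chosen to keep the example small.

```python
import cmath
import math

def harmonic_carrier(f0_track, k, fs):
    """Complex carrier for harmonic k: exp(j*k*phi[n]), where phi[n]
    accumulates 2*pi*f0[n]/fs (the running phase of the fundamental)."""
    carrier = []
    phase = 0.0  # phase of the fundamental, in radians
    for f0 in f0_track:
        carrier.append(cmath.exp(1j * k * phase))
        phase += 2.0 * math.pi * f0 / fs
    return carrier

def modulate(modulator, carrier):
    """Sample-by-sample product of modulator and carrier; the real part
    is the audible component."""
    return [(m * c).real for m, c in zip(modulator, carrier)]

# Toy data (hypothetical): a slowly rising pitch and a smooth on/off envelope.
fs = 16000
num_samples = 400
f0_track = [120.0 + 0.01 * n for n in range(num_samples)]
modulator = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / num_samples)
             for n in range(num_samples)]

component = modulate(modulator, harmonic_carrier(f0_track, 1, fs))
```

Passing `k = 2` with a different modulator would give the second component, and summing such components gives the multi-harmonic reconstruction.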
The second component has its own unique envelope and a carrier frequency that is twice that of the fundamental, seen below.
Continuing in this fashion, we arrive at four modulated harmonics. The audio is recognizable as speech, but is it intelligible?
Eight components might be sufficient for recognition by native speakers.
The Nyquist sampling constraint limits us to 16 harmonic components, which are all added together in the spectrogram below.
What we have seen up until now is the synthesis of speech from a time-varying filterbank in which individual subbands track harmonics in voiced speech. Each harmonic "subband signal" consists of a high-frequency carrier (the "fine structure") that multiplies a low-frequency modulator (the "envelope"). How much does speech intelligibility depend upon the carriers? Replacing each modulator with a flat DC term, we can listen to the "carriers-only" signal below:
The carriers-only representation is not intelligible and therefore contains no linguistic information (aside from secondary pitch cues). The harmonic decomposition has thus reduced the speech signal to a high-frequency, harmonic buzz plus a collection of low-bandwidth, information-bearing modulators.
Perhaps a better test for intelligibility of the envelopes is to replace the carriers with fixed-frequency sinusoids that themselves contain no speech information. Despite occasional tonal inflections in the modulators (a result of imperfect fundamental-tracking), the fixed-carrier synthesis is completely intelligible. This implies that the envelopes, not the carriers, contain the information in speech.
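The two substitution tests (flat modulators, then fixed carriers) amount to swapping one factor in each product before summing. The sketch below uses hypothetical helper names and made-up frequencies; the page does not state which fixed frequencies were used, and the "original" carriers here are simple stand-ins rather than tracked harmonics.

```python
import cmath
import math

def synthesize(modulators, carriers):
    """Sum over components of Re{modulator * carrier}, sample by sample."""
    n = len(modulators[0])
    return [sum((m[t] * c[t]).real for m, c in zip(modulators, carriers))
            for t in range(n)]

def fixed_carriers(freqs_hz, n, fs):
    """Constant-frequency complex tones, one per component."""
    return [[cmath.exp(2j * math.pi * f * t / fs) for t in range(n)]
            for f in freqs_hz]

def flat_modulators(count, n):
    """Modulators replaced by a constant (DC) value of 1."""
    return [[1.0 + 0j] * n for _ in range(count)]

# Toy data (hypothetical values throughout).
fs = 16000
n = 160
modulators = [
    [0.5 * (1.0 - math.cos(2.0 * math.pi * t / n)) + 0j for t in range(n)],
    [0.25 + 0j] * n,
]
original_carriers = fixed_carriers([140.0, 280.0], n, fs)  # stand-in carriers

# Carriers-only: keep the carriers, flatten the modulators.
carriers_only = synthesize(flat_modulators(2, n), original_carriers)

# Fixed-carrier: keep the modulators, replace the carriers with neutral tones.
fixed_carrier = synthesize(modulators, fixed_carriers([150.0, 300.0], n, fs))
```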
As a final test, we compute broadband spectrograms for the 16-harmonic original-carrier synthesis and the 16-harmonic fixed-carrier synthesis. In the side-by-side plots below, the spectrograms depict very similar resonant structure in the time-frequency plane (only the first 2 seconds are shown). This is a visual confirmation that replacing the original carriers with neutral tones has not changed the speech content of the signal.
The conventional demodulation method for speech is to use a fixed filterbank to obtain subband signals, and then rectify and smooth each subband to find its envelope. Such methods are "incoherent" because they do not use an explicit carrier estimate to demultiply the subband signal; instead, the envelope is defined as the subband magnitude and the carrier takes on whatever fine structure remains. The often-used Hilbert envelope is one such method of incoherent envelope detection. We now analyze the same speech signal as above using incoherent Hilbert demodulation.
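For a single subband, the Hilbert decomposition can be sketched as follows: form the analytic signal, take its magnitude as the envelope, and take the cosine of its phase as the carrier. This is a minimal standard-library illustration with a deliberately slow DFT and a synthetic test signal, not the toolbox's code; all names are assumptions.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (fine for short signals)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out

def analytic_signal(x):
    """Zero the negative frequencies and double the positive ones."""
    n = len(x)
    weights = [0.0] * n
    weights[0] = 1.0                      # DC is kept as-is
    if n % 2 == 0:
        weights[n // 2] = 1.0             # so is the Nyquist bin
        for k in range(1, n // 2):
            weights[k] = 2.0
    else:
        for k in range(1, (n + 1) // 2):
            weights[k] = 2.0
    spectrum = dft(x)
    return dft([s * w for s, w in zip(spectrum, weights)], inverse=True)

def hilbert_demod(subband):
    """Incoherent split: envelope = |analytic|, carrier = cos(phase)."""
    z = analytic_signal(subband)
    envelope = [abs(v) for v in z]
    carrier = [math.cos(cmath.phase(v)) for v in z]
    return envelope, carrier

# Synthetic subband: a slow amplitude modulation on a tone.
N = 64
subband = [(1.0 + 0.5 * math.cos(2.0 * math.pi * 2 * t / N))
           * math.cos(2.0 * math.pi * 16 * t / N) for t in range(N)]
envelope, carrier = hilbert_demod(subband)
```

By construction the product `envelope[t] * carrier[t]` returns the subband sample, which is why everything not captured by the magnitude ends up in the carrier.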
Using 16 fixed subbands spaced evenly between 0 Hz and the Nyquist frequency of 8000 Hz, we sum together the Hilbert carriers to produce the incoherent carriers-only signal below. From the spectrogram and the audio, it is clear that the Hilbert carriers contain a large amount of broadband noise. More surprising is the fact that the speech itself is still audible in this carriers-only synthesis!
Finally, we examine the broadband spectrogram of the incoherent carriers-only synthesis, in which spectral resonances can be seen. This is a visual confirmation of the presence of speech information in the incoherent carriers even without modulation. Thus the incoherent decomposition does not clearly distinguish between envelopes and carriers with respect to intelligibility, as the coherent decomposition does.
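As a quick check on the filterbank layout, 16 evenly spaced subbands between 0 Hz and 8000 Hz are each 500 Hz wide. The helper below is hypothetical and only computes nominal band edges; the actual filter shapes and overlap are not specified on this page.

```python
def uniform_band_edges(num_bands, upper_hz):
    """Nominal (low, high) edges in Hz for evenly spaced subbands from 0 Hz."""
    width = upper_hz / num_bands
    return [(i * width, (i + 1) * width) for i in range(num_bands)]

edges = uniform_band_edges(16, 8000.0)
```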
The Modulation Toolbox for Matlab is publicly available for non-profit research purposes. All of the sound files on this page were
generated using the toolbox, which includes the
modulation spectrogram GUI, as well as standalone functions that can be used for a variety of