Marnix Van Soom - Speaker

Bart De Boer - Contributor

Formants are characteristic frequency components in human speech that are caused
by resonances in the vocal tract during speech production. Fant (1960) systematized
the then relatively young science of acoustic phonetics with his acoustic theory of
speech production, often called the source-filter model, which has since become the
dominant paradigm. The source-filter model provides the theoretical justification for
deriving formants from power spectra as prominent local maxima in the power
spectral envelope of appropriately windowed and processed speech signals. From
this point of view, each formant is characterized by three parameters describing the
local maximum associated with it: the maximum's center frequency (called the
formant frequency), its bandwidth and its peak amplitude.
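The definition above can be made concrete with a small sketch. It assumes the spectral envelope has already been estimated and sampled on a frequency grid, and reads the three parameters off each local maximum: the peak's frequency, its -3 dB (half-power) bandwidth, and its height. A synthetic envelope built from two hypothetical resonances (500 Hz with 60 Hz bandwidth, 1500 Hz with 90 Hz bandwidth, 10 kHz sampling rate) stands in for a real one. The helper names are ours, and simple peak picking is only one way of doing this, not a method from the talk.

```python
import cmath
import math

HALF_POWER_DB = 10 * math.log10(2)  # about 3.01 dB

def resonator_envelope_db(freqs_hz, formants, fs):
    """Spectral envelope (dB) of a cascade of two-pole resonators.
    `formants` lists hypothetical (frequency, bandwidth) pairs in Hz."""
    env = []
    for f in freqs_hz:
        z = cmath.exp(2j * math.pi * f / fs)  # point on the unit circle
        gain = 1.0
        for F, B in formants:
            # pole radius from the bandwidth, pole angle from the frequency
            pole = cmath.rect(math.exp(-math.pi * B / fs), 2 * math.pi * F / fs)
            gain /= abs((1 - pole / z) * (1 - pole.conjugate() / z))
        env.append(20 * math.log10(gain))
    return env

def _crossing(f0, f1, e0, e1, level):
    """Linearly interpolate the frequency where the envelope equals `level`."""
    return f0 + (level - e0) * (f1 - f0) / (e1 - e0)

def formant_peaks(freqs_hz, env_db):
    """(frequency, -3 dB bandwidth, peak amplitude in dB) for each local maximum."""
    peaks = []
    last = len(env_db) - 1
    for i in range(1, last):
        if not (env_db[i - 1] < env_db[i] >= env_db[i + 1]):
            continue
        level = env_db[i] - HALF_POWER_DB
        lo = i
        while lo > 0 and env_db[lo] > level:  # walk left to the -3 dB point
            lo -= 1
        hi = i
        while hi < last and env_db[hi] > level:  # walk right to the -3 dB point
            hi += 1
        if env_db[lo] > level or env_db[hi] > level:
            continue  # no 3 dB drop on one side: bandwidth undefined here
        f_lo = _crossing(freqs_hz[lo], freqs_hz[lo + 1], env_db[lo], env_db[lo + 1], level)
        f_hi = _crossing(freqs_hz[hi - 1], freqs_hz[hi], env_db[hi - 1], env_db[hi], level)
        peaks.append((freqs_hz[i], f_hi - f_lo, env_db[i]))
    return peaks

fs = 10000.0
freqs = [float(f) for f in range(0, 5001)]  # 1 Hz grid up to the Nyquist frequency
env = resonator_envelope_db(freqs, [(500.0, 60.0), (1500.0, 90.0)], fs)
for F, B, A in formant_peaks(freqs, env):
    print(f"F = {F:.0f} Hz, B = {B:.1f} Hz, peak = {A:.1f} dB")
```

Because the envelope here is noise-free and the resonances are well separated, the recovered frequencies and bandwidths sit close to the values used to build it; with a measured envelope, the result depends on how that envelope was obtained.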
The concept of a formant is fundamental to phonetics and automated speech
processing. For example, formants are considered to be primary features for
distinguishing vowel classes, for speech perception, and for inferring speaker identity,
sex and age. Despite this fundamental status, and despite a long history of work on
vowel formants, the issue of making accurate measurements of the formant
parameters, which we dub "the formant measurement problem" for convenience, is
not yet considered to be fully resolved (e.g. Maurer 2016). Accordingly, a large
number of formant measurement methods exist in the literature. The fundamental
reason for the formant measurement problem is that most of these
methods yield formant frequency estimates (the main quantity of interest) that are
sensitive to various user-made choices, such as the shape and length of the tapering
window or the number of poles in linear predictive analysis. In other words,
measuring formants requires careful fine-tuning, yet speech is notoriously
variable. In addition, there currently seems to be no way to put error bars on the
formant frequency, bandwidth and amplitude measurements; in fact, the bandwidth is
typically taken to represent the accuracy of the formant frequency, with a small
bandwidth indicating high accuracy and vice versa.
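The dependence on user-made choices can be illustrated with linear predictive (LPC) formant estimation, one common method of the kind referred to above. This sketch, written under our own assumptions, fits an all-pole model with the autocorrelation method (Levinson-Durbin recursion), finds the roots of the prediction polynomial by Durand-Kerner iteration, and converts each root z in the upper half-plane to a formant frequency F = arg(z)·fs/(2π) and bandwidth B = -ln|z|·fs/π. The model order, the optional Hamming window, and the thresholds `max_bw` and `margin` are exactly the sort of settings the text mentions; their values are illustrative, and all function names are hypothetical.

```python
import cmath
import math

def all_pole_coeffs(formants, fs):
    """Coefficients [1, a1, ..., a2K] of A(z) for a cascade of two-pole
    resonators given as hypothetical (F, B) pairs in Hz."""
    a = [1.0]
    for F, B in formants:
        r = math.exp(-math.pi * B / fs)
        sec = [1.0, -2.0 * r * math.cos(2.0 * math.pi * F / fs), r * r]
        # polynomial multiplication (convolution) of a and the section
        a = [sum(a[i] * sec[k - i] for i in range(len(a)) if 0 <= k - i < 3)
             for k in range(len(a) + 2)]
    return a

def all_pole_filter(e, a):
    """y[n] = e[n] - sum_{k>=1} a[k] y[n-k], i.e. e filtered by 1/A(z)."""
    y = []
    for n, en in enumerate(e):
        y.append(en - sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0))
    return y

def lpc_autocorrelation(x, order):
    """LPC polynomial [1, a1, ..., a_order] via the Levinson-Durbin recursion."""
    n = len(x)
    r = [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        k_m = -(r[m] + sum(a[k] * r[m - k] for k in range(1, m))) / err
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + k_m * prev[m - k]
        a[m] = k_m
        err *= 1.0 - k_m * k_m  # prediction error after stage m
    return a

def poly_roots(c, iters=500):
    """Roots of z^p + c[1] z^(p-1) + ... + c[p] by Durand-Kerner iteration."""
    p = len(c) - 1
    roots = [(0.4 + 0.9j) ** k for k in range(p)]
    def f(z):
        v = 0j
        for ck in c:  # Horner evaluation
            v = v * z + ck
        return v
    for _ in range(iters):
        new = []
        for i, zi in enumerate(roots):
            den = 1 + 0j
            for j, zj in enumerate(roots):
                if j != i:
                    den *= zi - zj
            new.append(zi - f(zi) / den)
        roots = new
    return roots

def lpc_formants(x, order, fs, hamming=False, max_bw=400.0, margin=90.0):
    """(F, B) pairs in Hz from the LPC roots; max_bw and margin are
    illustrative thresholds, not standard values."""
    if hamming:
        n = len(x)
        x = [v * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
             for i, v in enumerate(x)]
    out = []
    for z in poly_roots(lpc_autocorrelation(x, order)):
        if z.imag <= 0:
            continue  # keep one root of each complex-conjugate pair
        F = cmath.phase(z) * fs / (2 * math.pi)
        B = -math.log(abs(z)) * fs / math.pi
        if margin < F < fs / 2 - margin and B < max_bw:
            out.append((F, B))
    return sorted(out)

fs = 10000.0
a_true = all_pole_coeffs([(500.0, 60.0), (1500.0, 90.0)], fs)
impulse_response = all_pole_filter([1.0] + [0.0] * 1999, a_true)
print(lpc_formants(impulse_response, 4, fs))

# 30 ms of a 100 Hz pulse train through the same filter, Hamming-windowed
frame = all_pole_filter([1.0 if n % 100 == 0 else 0.0 for n in range(300)], a_true)
for order in (4, 8, 12):
    print(order, lpc_formants(frame, order, fs, hamming=True))
```

On the noise-free impulse response with order 4, the all-pole model matches the synthetic one, so the estimates agree closely with the resonances used to build it. For the windowed pulse train, comparing the printed estimates across orders shows how the answer depends on that single setting.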
We believe it may be time for a fresh start on remedying these issues. Our
approach to the formant measurement problem consists of replacing the source-filter
model by the arguably more general transient theory of voice production (e.g.
Ladefoged 1996) in order to open up the possibility of applying Jaynes’ Bayesian
spectrum analysis (Bretthorst 1988). We expected a priori that a Bayesian analysis
would yield quite sharp formant frequency estimates, now equipped with error bars,
since acoustic phonetics provides a wealth of prior information about the problem at
hand, including parametric models of the speech waveform; our preliminary
results confirm this expectation. Two important applications of high-accuracy
measurements of the formant parameters are forensic speaker identification and
medical diagnosis.
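To show what an error bar from Bayesian spectrum analysis looks like, here is a sketch of the simplest textbook case treated by Bretthorst (1988): one stationary sinusoid with unknown amplitude and phase in Gaussian noise of unknown level, with a uniform prior on the frequency grid. After marginalizing the amplitudes and the noise level, the posterior for the angular frequency ω is approximately p(ω|D) ∝ [1 - 2C(ω)/(N·mean(d²))]^((2-N)/2), where C(ω) is the Schuster periodogram, so its mean and standard deviation give an estimate with an error bar. This is not the formant model of the talk, which rests on the transient theory of voice production and whose details are not given here; the data are synthetic and the function names are hypothetical.

```python
import math
import random

def frequency_posterior(data, omegas):
    """Normalised posterior over a grid of angular frequencies (rad/sample)
    for the model d_t = B1 cos(w t) + B2 sin(w t) + noise, with amplitudes and
    noise level marginalised out. Assumes cos and sin are nearly orthogonal
    over the data, which holds for w away from 0 and pi."""
    n = len(data)
    mean_d2 = sum(d * d for d in data) / n
    log_post = []
    for w in omegas:
        re = sum(d * math.cos(w * t) for t, d in enumerate(data))
        im = sum(d * math.sin(w * t) for t, d in enumerate(data))
        c = (re * re + im * im) / n  # Schuster periodogram C(w)
        base = max(1.0 - 2.0 * c / (n * mean_d2), 1e-300)  # guard against log(<=0)
        log_post.append(0.5 * (2 - n) * math.log(base))
    top = max(log_post)  # subtract the maximum before exponentiating
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    return [v / total for v in weights]

def mean_and_sd(grid, post):
    """Posterior mean and standard deviation (the error bar) on the grid."""
    mu = sum(w * p for w, p in zip(grid, post))
    var = sum((w - mu) ** 2 * p for w, p in zip(grid, post))
    return mu, math.sqrt(var)

random.seed(1)
true_w, n = 0.3, 64
data = [math.cos(true_w * t + 0.7) + random.gauss(0.0, 0.1) for t in range(n)]
grid = [0.28 + 1e-5 * i for i in range(4001)]  # 0.28 ... 0.32 rad/sample
post = frequency_posterior(data, grid)
mu, sd = mean_and_sd(grid, post)
print(f"omega = {mu:.4f} +/- {sd:.4f} rad/sample (true value {true_w})")
```

The grid is restricted to a band around the periodogram peak only to keep the example fast; a full analysis would search the whole band first and convert ω to hertz via f = ω·fs/(2π).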
1 Jul 2019

Event (Conference)

Title: MAXENT 2019
City: Garching bei München
Degree of recognition: International event
