Sigma-Lognormal Modeling of Speech

Carmona-Duarte, C.; Ferrer, M. A.; Plamondon, R.; Gómez-Rodellar, A.; Gómez-Vilda, P.

doi:10.1007/s12559-020-09803-8


Open access
Published: 07 February 2021

Volume 13, pages 488–503, (2021)
Cognitive Computation

C. Carmona-Duarte¹ (ORCID: orcid.org/0000-0002-4441-6652), M. A. Ferrer¹, R. Plamondon², A. Gómez-Rodellar³ & P. Gómez-Vilda³


Abstract

Human movement studies and analyses have been fundamental in many scientific domains, ranging from neuroscience to education, pattern recognition to robotics, health care to sports, and beyond. Previous speech motor models were proposed to understand how speech movement is produced and how the resulting speech varies when some parameters are changed. However, the inverse approach, in which the muscular response parameters and the subject’s age are derived from real continuous speech, is not possible with such models. Instead, in the handwriting field, the kinematic theory of rapid human movements and its associated Sigma-lognormal model have been applied successfully to obtain the muscular response parameters. This work presents a speech kinematics-based model that can be used to study, analyze, and reconstruct complex speech kinematics in a simplified manner. A method based on the kinematic theory of rapid human movements and its associated Sigma-lognormal model is applied to describe and to parameterize the asymptotic impulse response of the neuromuscular networks involved in speech as a response to a neuromotor command. The method used to carry out transformations from formants to a movement observation is also presented. Experiments carried out with the (English) VTR-TIMIT database and the (German) Saarbruecken Voice Database, including people of different ages, with and without laryngeal pathologies, corroborate the link between the extracted parameters and aging, on the one hand, and the proportion between the first and second formants required in applying the kinematic theory of rapid human movements, on the other. The results should drive innovative developments in the modeling and understanding of speech kinematics.


Introduction

For decades, human movement studies and analyses have been fundamental in many scientific domains, ranging from neuroscience to education, pattern recognition to robotics, health care to sports, and beyond. The primary goal of these studies has always been to parameterize and assess human movements, providing information on the basic processes involved in fine motor control and their variability. In speech, computational systems that synthesize and assess speech motor control provide answers to some questions regarding the articulator movements used by humans to produce speech sounds, speech rate effects, or for example, how infants acquire the motor skills needed to produce the speech sounds of their native language [1]. However, many questions regarding the modeling and automatic assessment of natural neuromotor decline in healthy speech or the parameterization of neuromotor commands and muscular responses from fast continuous speech are still open.

Many neurocomputational models have been proposed to understand how speech movement is produced and how the resulting speech varies when some parameters are changed [2]. To this end, motor control models inspired by computer programs have often been used. Under this paradigm, motor commands are generated based on a central motor plan [2] and executed by a speech generator. Several speech motor models, such as GEPETTO [3, 4], ACT [5], DIVA [6], Task Dynamics [7], State Feedback [8], and FACTS [9], have been developed in this context in recent years. These models start from an action plan (planner) and then adjust a set of parameters, moving a set of articulators, to get the ideal output (feedforward models). In some of them, the acoustic output signal is then compared with the reference input signal from the planner to generate an error signal that allows the movement to be corrected (feedback models).

Previous works have been oriented toward the modeling of learned speech of healthy speakers. Some of the above models (DIVA, FACTS, and ACT) can, following adjustments of some parameters, model certain aspects of development and aging. However, to the best of our knowledge, neuromotor decline has never been modeled by such systems [2]. Moreover, the inverse approach, in which the muscular response parameters are derived from real continuous speech, is not possible with such models.

Among the models that study human movement production in general [10], the kinematic theory of rapid human movements and its associated Sigma-lognormal model [11,12,13] have been applied successfully in several fields [14] to model numerous human movements such as handwriting and signatures, as well as eye, finger, wrist, hand, head, and trunk movements [14,15,16,17]. The theory has also been used to evaluate the effect of exercise on global neuromotor control [18], to detect and monitor neuromuscular disorders, and to study and synthesize handwriting motor control changes in humans with age [19,20,21]. The Sigma-lognormal model has thus demonstrated its capacity to obtain muscular response and neuromotor command parameters from online handwriting, to assess neuromotor aging, and to synthesize new handwriting samples.

In handwriting, the Sigma-lognormal model decomposes a complex movement, obtained from the temporal trajectory captured with a digital tablet, into a sum of simple time-overlapped primitives with a lognormal velocity profile. This method provides information about how every single movement is generated and synchronized, modeling the end effector (set of muscles involved in the movement) as a black box. Thus, the lognormal-shaped impulse response of the end effector, used as a primitive, is not linked to any specific articulation, but rather, to a large number of coupled subsystems. Moreover, the movement primitive is not necessarily confined to movements with a single velocity peak, as is still often assumed in many models [22].

Given the above advantages, in this paper, we propose a novel methodology based on the Sigma-lognormal model to parameterize the speech kinematics and the muscular response produced by the complex set of muscles involved in achieving the target sound, as well as to study aging effects. One question that does arise, though, is how the kinematic theory can be applied to speech modeling. The answer to this question is by no means straightforward. As a first proof of concept, preliminary works directly applied the kinematic theory of rapid human movements to diphthongs and sustained vowels uttered in neuromotor disease analysis [23,24,25,26], suggesting the possibility of applying the Sigma-lognormal model to speech. However, obtaining a general model would require a representation of a target map, a trajectory mapping, and a velocity representation, all assuming a lognormal impulse response that would need to be related to some speech features. To address these issues, we assume a high-level goal as the target map (a map of sounds that can be discriminated from one another), inspired by the work on the spatial model proposed by Moser et al. [27, 28], instead of a fixed desired position of each individual speech articulator. As such, the velocity representation can be obtained from the sound transitions (trajectory map). The model is explained in detail in “Sigma-Lognormal Parameterization Method”.

To test the validity of the proposed method, we present two sets of experiments. The first one aims to illustrate the meaning of the lognormal decomposition in simple movements in a continuous speech signal. In the second one, the goal is to evaluate the model’s ability to identify significant differences in some parameters when modeling aging in the speech of subjects with or without laryngeal pathologies. In certain studies related to handwriting [19], it has been observed that the time between lognormals and their number increases with age. Timing effects have also been reported in speech, where an fMRI study suggested that the motor control of timing during speech production declines with age [29]. So, if the proposed Sigma-lognormal model describes the speech kinematics well, then we should expect results obtained in speech to be similar to those obtained in handwriting [19] if proper experiments are run. Moreover, since laryngeal dysfunction only affects the sound source (glottis), and not the global end effector movements, the time between lognormals should not be affected in this case, unlike in the case of aging. In the experiment section, these hypotheses will also be tested.

The present work is structured as follows. After an overview of the kinematic theory of rapid human movements in “Overview of the Sigma-Lognormal Model”, “Sigma-Lognormal Parameterization Method” describes the method for estimating speech kinematics and how it is parameterized. “Evaluation, Results, and Discussion” evaluates the model and discusses the results obtained. Finally, we summarize our findings in “Conclusions”.

Overview of the Sigma-Lognormal Model

The Sigma-lognormal model explains how an action plan comprising a sequence of circular arcs between virtual target points (VTP) can be activated to generate a spatiotemporal trajectory. Virtual target points are defined as the positions targeted by a lognormal, but that are not necessarily reached because of the temporal overlapping of the next lognormal [30]. Virtual targets are thus related to the learning process and how the movement is programmed by the brain. A starting and an ending angle define each arc linking virtual target points. Each ending VTP is the starting VTP of the next arc. To generate smooth movements from this discontinuous action plan, the instantiation of a command at a given VTP must start before the previous stroke reaches that VTP. In other words, each arc has a starting time but finishes later than the starting time of the next one. Therefore, successive resulting strokes are temporally overlapped. Each arc is executed following a lognormal-shaped velocity curve, and the whole trajectory is made up of the vector summation of the individual strokes.

Mathematically, the lognormal velocity profile of a simple movement is defined by [7]

$$\left|\overrightarrow{v}_{j}(t;t_{oj})\right| = D_{j}\,\Lambda_{j}\left(t;t_{oj},\mu_{j},\sigma_{j}\right) = \frac{D_{j}}{\sigma_{j}\sqrt{2\pi}\,(t-t_{oj})}\exp\left(-\frac{\left(\ln\left(t-t_{oj}\right)-\mu_{j}\right)^{2}}{2\sigma_{j}^{2}}\right)$$

(1)

where $D_{j}$ is the length of the movement, $t_{oj}$ is the time of occurrence of the movement command, $\mu_{j}$ is the log time delay, $\sigma_{j}$ is the log response time, and j indicates the index of the movement. The velocity profile of a complex movement $\overrightarrow{v_{r}}(t)$ is given by the time superposition of NbLog lognormals [9] as follows:

$$\overrightarrow{v_{r}}(t)=\sum_{j=1}^{NbLog}\overrightarrow{v_{j}}(t)=\sum_{j=1}^{NbLog}D_{j}(t)\left[\begin{array}{c}\cos\phi_{j}(t)\\ \sin\phi_{j}(t)\end{array}\right]\Lambda_{j}\left(t;t_{oj},\mu_{j},\sigma_{j}\right)$$

(2)

where $\phi_{{\text{j}}} (t)$ is the angular position, defined as

$${\phi }_{j}\left(t\right)={\Theta }_{\mathrm{sj}}\left(t\right)+\frac{{\Theta }_{\mathrm{ej}}\left(t\right)-{\Theta }_{\mathrm{sj}}(t)}{2}\left[1+erf\left(\frac{\mathrm{ln}\left(t-{t}_{\mathrm{oj}}\right)-{\mu }_{\mathrm{j}}}{{\sigma }_{\mathrm{j}}\sqrt{2}}\right)\right]$$

(3)

and $\Theta_{sj}(t)$ and $\Theta_{ej}(t)$ are the starting and ending angular directions of the j-th simple movement or stroke, and erf is the error function.
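To make Eqs. 1 to 3 concrete, the following minimal Python sketch evaluates the lognormal profile, the angular position, and the summed velocity vector. It assumes, for simplicity, that $D_j$, $\Theta_{sj}$, and $\Theta_{ej}$ are constant for each stroke; the function names and the dictionary keys are illustrative choices and do not come from ScriptStudio or iDeLog.

```python
import math


def lognormal(t, t0, mu, sigma):
    """Lognormal impulse response of Eq. 1 (taken as zero for t <= t0)."""
    if t <= t0:
        return 0.0
    x = math.log(t - t0)
    norm = sigma * math.sqrt(2.0 * math.pi) * (t - t0)
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / norm


def angle(t, t0, mu, sigma, theta_s, theta_e):
    """Angular position of Eq. 3; stays at theta_s before the command onset."""
    if t <= t0:
        return theta_s
    z = (math.log(t - t0) - mu) / (sigma * math.sqrt(2.0))
    return theta_s + 0.5 * (theta_e - theta_s) * (1.0 + math.erf(z))


def velocity(t, strokes):
    """Vector sum of Eq. 2 over strokes given as dicts with the keys
    D, t0, mu, sigma, theta_s, theta_e (all assumed constant per stroke)."""
    vx = vy = 0.0
    for s in strokes:
        speed = s["D"] * lognormal(t, s["t0"], s["mu"], s["sigma"])
        phi = angle(t, s["t0"], s["mu"], s["sigma"], s["theta_s"], s["theta_e"])
        vx += speed * math.cos(phi)
        vy += speed * math.sin(phi)
    return vx, vy
```

Because the lognormal integrates to one over time, the area under the speed of a single stroke equals $D_j$, which is why $D_j$ can be read as the length of the movement.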

Finally, the trajectory is worked out as

$$\overrightarrow {{s_{{_{{\text{r}}} }} }} (t) = \left[ {\begin{array}{*{20}c} {x_{{\text{r}}} (t)} \\ {y_{{\text{r}}} (t)} \\ \end{array} } \right] = \sum\limits_{j = 1}^{NbLog} {\frac{{D_{{\text{j}}} }}{{\Theta_{{{\text{ej}}}} - \Theta_{{{\text{sj}}}} }}} \left[ {\begin{array}{*{20}c} {\sin \phi_{{\text{j}}} (t) - \sin \Theta_{{{\text{sj}}}} } \\ { - \cos \phi_{{\text{j}}} (t) + \cos \Theta_{{{\text{sj}}}} } \\ \end{array} } \right]$$

(4)

This expression converts angles into circular arcs that are temporally overlapped. Specifically, the j-th term of the summation represents the arc that links consecutive virtual target points, $VTP_{j-1}$ and $VTP_{j}$, which are defined by

$$VTP_{{\text{j}}} = VTP_{{{\text{j}} - 1}} + \frac{{D_{{\text{j}}} }}{{\Theta_{{{\text{ej}}}} - \Theta_{{{\text{sj}}}} }}\left[ {\begin{array}{*{20}c} {\sin \phi_{{\text{j}}} (T) - \sin \Theta_{{{\text{sj}}}} } \\ { - \cos \phi_{{\text{j}}} (T) + \cos \Theta_{{{\text{sj}}}} } \\ \end{array} } \right]$$

(5)

with T being the total temporal duration of the spatiotemporal sequence.
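The arc geometry of Eqs. 4 and 5 can be sketched as follows. This is only an illustration under two assumptions that are not stated in the text: the stroke is taken as fully executed at time T, so that $\phi_j(T) \approx \Theta_{ej}$, and a straight stroke ($\Theta_{ej} = \Theta_{sj}$) is handled through the limit of Eq. 4 as the arc angle tends to zero. The function names are hypothetical.

```python
import math


def stroke_displacement(D, theta_s, theta_e, progress=1.0):
    """Displacement contributed by one stroke (one term of Eq. 4).

    progress in [0, 1] is the executed fraction of the lognormal, that is
    0.5 * (1 + erf(...)) from Eq. 3; progress = 1 gives the step of Eq. 5.
    """
    dtheta = theta_e - theta_s
    if abs(dtheta) < 1e-9:
        # Straight stroke: limit of Eq. 4 when the arc angle tends to zero.
        return (D * progress * math.cos(theta_s), D * progress * math.sin(theta_s))
    phi = theta_s + dtheta * progress
    radius = D / dtheta
    return (radius * (math.sin(phi) - math.sin(theta_s)),
            radius * (-math.cos(phi) + math.cos(theta_s)))


def virtual_targets(start, strokes):
    """Chain Eq. 5: each VTP is the previous one plus a fully executed arc.

    strokes is a list of (D, theta_s, theta_e) tuples.
    """
    points = [start]
    for D, theta_s, theta_e in strokes:
        dx, dy = stroke_displacement(D, theta_s, theta_e)
        x, y = points[-1]
        points.append((x + dx, y + dy))
    return points
```

For instance, a quarter circle of unit radius ($D = \pi/2$, $\Theta_s = 0$, $\Theta_e = \pi/2$) moves the virtual target by one unit along each axis.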

A sequence of virtual target points, along with their starting and ending angles and their lognormal velocity parameters, can be analytically extracted through reverse engineering (Fig. 1). Using the extracted action plan, the corresponding spatiotemporal sequence can be reconstructed from its set of parameters:

$${P=\{{D}_{j},{t}_{\mathrm{oj}},{\mu }_{\mathrm{j}},{\sigma }_{\mathrm{j}},{\Theta }_{\mathrm{ej}}, {\Theta }_{\mathrm{sj}},{VTP}_{\mathrm{j}-1}\}}_{j=1}^{NbLog}$$

(6)

Classically, these parameters are calculated from the sampled 2D spatiotemporal sequence with software such as ScriptStudio [31] or iDeLog [32].

Once the original velocity $v_{o}(t)$ has been reconstructed as a summation of lognormals, $\overrightarrow{v_{r}}(t)$, the quality of the reconstruction can be evaluated using the signal-to-noise ratio (SNR) between them. Specifically, the SNR is defined as [30]

$$SNR = 20\log \left( \frac{\int_{0}^{T} v_{o}^{2}(t)\,dt}{\int_{0}^{T} |v_{o}(t) - v_{r}(t)|^{2}\,dt} \right)$$

(7)

It is commonly accepted that when SNR < 15 dB, the reconstruction is not appropriate due to either ScriptStudio [31] or iDeLog [32] not having managed to find an adequate solution or to the spatiotemporal sequence not corresponding to the model [30]. In the latter case, as the lognormal is accepted as a neuromotor model, we could also say that the spatiotemporal sequence does not correspond to the timing conditions under which lognormals emerge, as predicted by the central limit theorem [33].
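For uniformly sampled velocities, the integrals of Eq. 7 reduce to sums and the sampling step cancels in the ratio. The sketch below follows the factor of 20 as printed in Eq. 7 and assumes a base-10 logarithm, since the result is expressed in dB; the function name and the default threshold argument are illustrative.

```python
import math


def snr_db(v_orig, v_rec):
    """Discrete counterpart of Eq. 7 for two equally sampled speed profiles."""
    if len(v_orig) != len(v_rec):
        raise ValueError("profiles must have the same number of samples")
    signal = sum(v * v for v in v_orig)
    noise = sum((a - b) ** 2 for a, b in zip(v_orig, v_rec))
    if noise == 0.0:
        return math.inf  # perfect reconstruction
    return 20.0 * math.log10(signal / noise)


def is_acceptable(v_orig, v_rec, threshold_db=15.0):
    """Apply the 15 dB acceptance rule quoted in the text."""
    return snr_db(v_orig, v_rec) >= threshold_db
```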

Sigma-Lognormal Parameterization Method

The scheme for applying the Sigma-lognormal model in speech is presented in Fig. 2. The model divides the speech generation into two steps: planning of the sequence of sounds (effector-independent) and execution of the sequence via the end effector (effector-dependent) [20, 21, 34]. Firstly, in the effector-independent step, a sound map (higher-level goal) is defined, assuming that each simple learned sound has a corresponding position on a hexagonal grid. Note that in this map (different for each person), the targets are sounds, and not phonemes, since a phoneme can be defined either as a simple sound or as a group of different sounds. Processing a sequence of sounds (for example, [uiau] in Fig. 2) involves moving through different positions on the grid and generating a trajectory through the selected sounds from a series of commands. Secondly, the effector-dependent module is linked to the neuromuscular system itself (end effector) and is defined by its impulse response to each command. The end effector movement causes the vocal tract shape to vary, thus changing the resonance frequencies, and therefore, the formants (resonant frequencies of the oronasopharyngeal tract) over time [35,36,37].

In this work, we use a reverse engineering approach, starting from the formants and moving up to the sound trajectory; from variations in the formants, we estimate both the parameters that model the commands that determine the transition from one grid position to another (simple movement) and the muscular response to each command. Thus, we will model neither the initial positions on the grid nor the physical constraints of the vocal tract.

To perform a Sigma-lognormal analysis of speech, a spatiotemporal sequence that globally represents the speech kinematics is required, as previously depicted in Fig. 2. To this end, we rely on the resonance tube model. Indeed, in speech synthesis, the vocal tract (from the glottis to lips) can be represented as a concatenation of lossless acoustic tubes, where the shape and the volume of the vocal tract vary for each sound [36, 38]. An increment or decrement of the section and length of the tubes produces a change in the resonance frequencies, and accordingly, a change of formants in the output speech, as we can see in Fig. 3. This means that each motor command that the brain produces to generate synchronous muscle movement required to go from one acoustic position to another changes the resonant cavities. Thus, a relationship can be established between an increment of the formant and the increment of the resonant areas or between the formant tracks and the movements of muscles [39]. Then, if the estimated velocity is integrated, a kinematic trajectory to be analyzed by our model can be obtained. Note that the resonant cavities of each subject are different, depending on the morphology and length of the organs that comprise it. Therefore, when a language is learned, the articulatory position of each sound is set as a function of the resonant cavities needed to produce the sound closest to the ideal one the person is trying to learn, and of how the sound is perceived [40,41,42].

This kinematic trajectory of the formants can be considered as the movement of a reference center (RC) of a speech end effector over the acoustic space, much the same as the movement of the pencil tip over paper represents the movement produced by the end effector during handwriting.

According to this analogy, the model parameters can be recovered from a speech signal in three steps: (1) tracking of the formants; (2) derivation of the end effector kinematics from the formant sequence; and (3) parameterization of the resulting trajectory using the Sigma-lognormal model.

Formant Estimation

To estimate the speech kinematics from the acoustic space in a non-invasive fashion, the formants are evaluated from the speech recorded with a microphone. This procedure is similar to the one used in handwriting, where the movements of the pencil tip are captured with a digitizing tablet.

Formants can be tracked using many methods proposed in the literature. In this paper, we use some of the methods implemented in the PRAAT software [43] to ensure experimental repeatability and to test the dependence of the proposed methodology on the formant estimation procedure.

Since there are no clear formants for unvoiced consonants, in fluent speech, they are usually co-articulated with a voiced sound [35], and so we assume that the missing formant information can be interpolated as a movement from the positions of the previous and posterior voiced phonemes.
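One simple way to realize this interpolation is sketched below. It assumes a straight-line transition between the neighboring voiced frames and holds the nearest voiced value at the edges of the utterance; both choices, and the use of `None` to mark frames without a formant estimate, are assumptions made for illustration only.

```python
def fill_unvoiced(track):
    """Fill None entries of a formant track (in Hz) by linear interpolation.

    Gaps between two voiced frames are bridged with a straight line;
    leading and trailing gaps hold the nearest voiced value.
    """
    voiced = [i for i, f in enumerate(track) if f is not None]
    if not voiced:
        raise ValueError("track contains no voiced frames")
    out = [float(f) if f is not None else None for f in track]
    # Hold the nearest voiced value at both edges.
    for i in range(voiced[0]):
        out[i] = out[voiced[0]]
    for i in range(voiced[-1] + 1, len(out)):
        out[i] = out[voiced[-1]]
    # Straight-line interpolation inside each gap.
    for a, b in zip(voiced, voiced[1:]):
        for i in range(a + 1, b):
            w = (i - a) / (b - a)
            out[i] = (1.0 - w) * out[a] + w * out[b]
    return out
```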

Formants to End Effector Kinematics

Speech kinematics can be computed from the speech formants since the formant track is related to the movement in the tube resonance model and its velocity. Usually, the first two formants of the voice (F₁ and F₂) can give a spatial representation of the most frequent sounds and can be used to estimate the movement of the end effector needed to go from one sound to another. As can be seen in Fig. 4 (left), increments or decrements in the first or the second formant are related to changes in the pronounced sound. These changes can be represented as a trajectory drawn on an imaginary axis (Fig. 4 (right)). Since the proportion of the contribution of the first and second formants to the kinematic space is an ill-posed problem [39, 44,45,46], a transfer coefficient $c_{i}$ is added to the mapping equation. Hence, the conversion from the acoustic space to the kinematic space can be approximated by a linear transform such as

$$\left\{ {\begin{array}{*{20}c} {\frac{\partial y(t)}{{\partial t}} = c_{1} \frac{{\partial F_{1} (t)}}{\partial t}} \\ {\frac{\partial x(t)}{{\partial t}} = c_{2} \frac{{\partial F_{2} (t)}}{\partial t}} \\ \end{array} } \right.$$

(8)

where F₁(t) and F₂(t) are the tracks of the first and second formants, $c_{i}$ are the transfer coefficients, x(t) and y(t) are the trajectories along the two imaginary axes in the kinematic space, $\partial x(t)/\partial t$ denotes the derivative of the generic sequence x(t), and $\partial y(t)/\partial t$ denotes the derivative of the generic sequence y(t).

Once x(t) and y(t) are calculated from the formants, the approximate velocity v_f(t) is estimated as

$$v_{f}(t) = \sqrt{\left(\frac{\partial x(t)}{\partial t}\right)^{2} + \left(\frac{\partial y(t)}{\partial t}\right)^{2}} = \sqrt{\left(\frac{\partial (c_{2} F_{2}(t))}{\partial t}\right)^{2} + \left(\frac{\partial (c_{1} F_{1}(t))}{\partial t}\right)^{2}}$$

(9)

The end effector reference center (RC) trajectory can thus be obtained by integrating Eq. 8, which leads to:

$$\left\{ {\begin{array}{*{20}c} {y(t) = c_{1} F_{1} (t)} \\ {x(t) = c_{2} F_{2} (t)} \\ \end{array} } \right.$$

(10)

We assume that the initial conditions, which are irrelevant for the Sigma-lognormal analysis, are equal to zero. Then, x(t) and y(t) refer to the spatiotemporal sequence that represents the end effector movement. Thus, c₁ and c₂ can be seen as the weights that map the formants F₁ and F₂ into their spatial representation. To evaluate which proportion between F₁ and F₂ conveys the most information about the articulatory movement to which the kinematic theory of rapid human movements should be applied, we propose calculating these weights as

$$\left\{ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {c_{1} = k(1 - \alpha )} \\ {c_{2} = k\alpha } \\ \end{array} } & {0 \le \alpha \le 1} \\ \end{array} } \right.$$

(11)

where k is the scale constant and α is the proportion parameter that defines the relative contribution of F₁ and F₂ to the kinematic space. It depends on the shape of the vocal tract.
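As an illustration of Eqs. 8 to 11, the sketch below maps two formant tracks to a trajectory and a speed profile. It assumes uniformly sampled tracks, a forward finite difference for the derivative, and a trajectory shifted to start at the origin (our reading of the zero initial conditions). The default values of `alpha` and `k` are the rounded figures quoted in this paper (0.3 and 0.04 mm/Hz); the function name is hypothetical.

```python
import math


def formants_to_kinematics(f1, f2, dt, alpha=0.3, k=0.04):
    """Map formant tracks (Hz) to a kinematic trajectory and its speed.

    Returns x, y (same length as the tracks) and the speed between
    consecutive samples (one element shorter).
    """
    if len(f1) != len(f2) or len(f1) < 2:
        raise ValueError("need two tracks of equal length with at least 2 samples")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    c1 = k * (1.0 - alpha)  # weight of F1 (Eq. 11)
    c2 = k * alpha          # weight of F2 (Eq. 11)
    # Eq. 10, shifted so that the trajectory starts at the origin.
    x = [c2 * (f - f2[0]) for f in f2]
    y = [c1 * (f - f1[0]) for f in f1]
    # Eq. 9 with forward differences in place of the derivatives.
    speed = [math.hypot((x[i + 1] - x[i]) / dt, (y[i + 1] - y[i]) / dt)
             for i in range(len(x) - 1)]
    return x, y, speed
```

With these defaults, a rise of 100 Hz in F₁ over 10 ms with F₂ held constant corresponds to a displacement of 2.8 mm along y.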

To calculate α and k, we assume that the acoustic space could be transformed into a hexagonal kinematic space (Fig. 4, right). Based on previous handwriting synthesis studies [20], and inspired by the hexagonal grid cell distribution proposed by Moser et al. [27, 28], the vowel triangle limits are fitted with an equilateral triangle. In this work, we hypothesize that α can be approximated as the value that keeps L₁ = L₂, with L₁ being the distance between the /i/ position and /a/, and L₂, the distance between the /i/ and /u/ (Fig. 4, right).

To this end, as the external vowels of the kinematic space are usually /a/, /i/, and /u/ (see Fig. 4), we define F_1a, the first formant of the vowel /a/, F_1i and F_2i, the first and second formants of the vowel /i/, respectively, and F_2u, the second formant of the vowel /u/.

The height of the triangle can be calculated as

$$h = c_{1} (F_{1a} - F_{1i} )$$

(12)

Considering the triangle as an equilateral triangle, it means that

$$L_{1} = \frac{2}{\sqrt 3 }h = \frac{2}{\sqrt 3 }c_{1} (F_{1a} - F_{1i} )$$

(13)

And

$$L_{2} = c_{2} (F_{{2{\text{i}}}} - F_{{2{\text{u}}}} )$$

(14)

As L₁ = L₂,

$$\frac{2}{\sqrt 3 }c_{1} (F_{1a} - F_{1i} ) = c_{2} (F_{{2{\text{i}}}} - F_{{2{\text{u}}}} )$$

(15)

Replacing c₁ and c₂ by their values:

$$\frac{2}{\sqrt 3 }k(1 - \alpha )(F_{{1{\text{a}}}} - F_{{{\text{1i}}}} ) = k\alpha (F_{{{\text{2i}}}} - F_{{{\text{2u}}}} )$$

(16)

Obtaining α (the proportion between F₁ and F₂):

$$\alpha = \frac{{(F_{1a} - F_{1i} )}}{{(F_{1a} - F_{1i} ) + \frac{\sqrt 3 }{2}(F_{2i} - F_{2u} )}}$$

(17)

It should be noted that both the value of α and the formant values are speaker-dependent. Table 1 shows the α values obtained with Eq. 17 using the formant values given by Hillenbrand et al. in [47] (English vowels) and by Pätzold et al. [48] (German vowels). We can see that the values range from 0.25 to 0.33.
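Eq. 17 can be checked numerically with a short sketch. The formant values used in the usage note are round, illustrative numbers chosen by us; they are not the values reported in [47] or [48].

```python
import math


def proportion_alpha(f1_a, f1_i, f2_i, f2_u):
    """Proportion parameter of Eq. 17 from the corner vowels /a/, /i/, /u/ (Hz)."""
    height = f1_a - f1_i  # F1 span of the vowel triangle
    base = f2_i - f2_u    # F2 span of the vowel triangle
    if height <= 0 or base <= 0:
        raise ValueError("expected F1a > F1i and F2i > F2u")
    return height / (height + (math.sqrt(3.0) / 2.0) * base)


def triangle_sides(alpha, k, f1_a, f1_i, f2_i, f2_u):
    """Side lengths L1 (Eq. 13) and L2 (Eq. 14) for given alpha and k."""
    c1 = k * (1.0 - alpha)
    c2 = k * alpha
    l1 = 2.0 / math.sqrt(3.0) * c1 * (f1_a - f1_i)
    l2 = c2 * (f2_i - f2_u)
    return l1, l2
```

For example, with the hypothetical values F₁ₐ = 750 Hz, F₁ᵢ = 300 Hz, F₂ᵢ = 2300 Hz, and F₂ᵤ = 900 Hz, `proportion_alpha` returns approximately 0.27, and `triangle_sides` confirms that L₁ and L₂ coincide for that α.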

Table 1 Values of α estimated from different previous works

The constant k is a scale factor that converts the estimated values of x(t) and y(t) to centimeters. Unlike with the proportion parameter α, this constant is not necessary for the Sigma-lognormal model. However, a reasonable value of k facilitates understanding of the model.

To find this reasonable value, we can use the already known relationship between ${L}_{2}$ and $k$ given by:

$$L_{2} = c_{2} (F_{{2{\text{i}}}} - F_{{{\text{2u}}}} ) = k\alpha (F_{{{\text{2i}}}} - F_{{2{\text{u}}}} )$$

(18)

$k$ is thus obtained as

$$k = \frac{{L_{2} }}{{\alpha (F_{{{\text{2i}}}} - F_{{{\text{2u}}}} )}}$$

(19)

To calculate the k value associated with a real movement, the L₂ value can be obtained from the results presented by Whitfield et al. [49], where the movement needed to utter the sentence “It’s time to shop for two new suits” was measured in 20 subjects with sensors. We take the values obtained with the tongue front marker (TF) in mm, the mean of the range of F₂, and the parameter α rounded to 0.3. This leads to a k factor of about 0.04 mm/Hz, which keeps the peak velocities similar to the ones presented in [50].
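Eq. 19 amounts to a single division, sketched below. The numbers in the usage note are placeholders selected only to show the units; they are not the measurements reported in [49].

```python
def scale_factor(l2_mm, alpha, f2_i, f2_u):
    """Scale constant k of Eq. 19, in mm/Hz when L2 is given in mm."""
    f2_range = f2_i - f2_u
    if alpha <= 0 or f2_range <= 0:
        raise ValueError("alpha and the F2 range must be positive")
    return l2_mm / (alpha * f2_range)
```

With a hypothetical displacement of 9 mm, α = 0.3, and an F₂ range of 1000 Hz, the function returns 0.03 mm/Hz.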

Sigma-Lognormal Analysis

Once the trajectory has been estimated, it is modeled with the kinematic theory of rapid human movements through the Sigma-lognormal model, as is explained in “Overview of the Sigma-Lognormal Model”. The kinematic theory is applied in an attempt to model the speech kinematics as a synchronized summation of simple overlapped movements, inspired by how the brain issues time-spaced commands to the articulatory organs. As such, speech is modeled as a global movement instead of a single muscle or group of muscles modeled independently.

The hypothesis underlying the application of this model to speech posits that a lognormal in speech has a similar meaning as in handwriting, a primitive that has been widely tried and tested. Therefore, in the case of speech, the number of lognormals would be related to the number of simple articulatory movements for natural, healthy speech. Hence, the number of lognormals should be related to the number of speech sounds uttered and their timing. It is expected that a neuromotor dysfunction will affect the number, shape, and time of occurrence of the lognormals, as is the case in handwriting [19]. These neuromotor dysfunctions can be due to normal aging or neurodegenerative diseases. In the special case of laryngeal pathologies, which affect only the closing of the glottis and the voice source, the timing parameters and the lognormal shape should not be affected for subjects of the same age, but such pathologies could result in more simple movements due to the effort needed to talk and to the pauses in the pronunciation of a sentence.

Beyond the sequence of lognormal parameters $P = \left\{ D_{j}, t_{oj}, \mu_{j}, \sigma_{j}, \Theta_{ej}, \Theta_{sj}, VTP_{j-1} \right\}_{j = 1}^{NbLog}$, it makes sense to define and use supplementary parameters related to the timing intervals between lognormals and lognormal shapes. Such parameters can help improve our understanding of some diseases. Examples of these parameters include:

$\overline{{\Delta t_{{\text{o}}} }}$: the mean of the time between successive lognormals, that is, the mean of the time difference between the current lognormal and the previous one:

$$\overline{{\Delta t_{{\text{o}}} }} = \frac{{\sum\nolimits_{{{\text{j}} = 2}}^{NbLog} {(t_{{{\text{oj}}}} - t_{{{\text{o}}(j - 1)}} )} }}{NbLog}$$

(20)

$\overline{V_{p}}$: the average of the maximum velocity of the NbLog lognormals:

$$\overline{V_{p}} = \frac{\sum\nolimits_{j = 1}^{NbLog} \max (\overrightarrow{v_{j}}(t))}{NbLog}$$

(21)

$\overline{\mu}$: the mean of the log time delay:

$$\overline{\mu} = \frac{\sum\nolimits_{j = 1}^{NbLog} \mu_{j}}{NbLog}$$

(22)

$\overline{\sigma}$: the mean of the lognormal response time:

$$\overline{\sigma} = \frac{\sum\nolimits_{j = 1}^{NbLog} \sigma_{j}}{NbLog}$$

(23)

$\overline{D}$: the mean of the lognormal distance covered in the kinematic space:

$$\overline{D} = \frac{\sum\nolimits_{j = 1}^{NbLog} |D_{j}|}{NbLog}$$

(24)
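The supplementary parameters above can be computed from an extracted parameter set as in the following sketch. Two points are assumptions on our part: the peak speed of each lognormal is obtained in closed form from the mode of Eq. 1, located at $t = t_{oj} + e^{\mu_j - \sigma_j^2}$, and the peak speeds are averaged over all NbLog lognormals, following the verbal definition. The mean onset interval divides by NbLog, as printed in Eq. 20. Function and key names are hypothetical.

```python
import math


def peak_speed(D, mu, sigma):
    """Maximum of |D| times the lognormal of Eq. 1, evaluated at its mode."""
    return abs(D) * math.exp(-mu + 0.5 * sigma ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def summary_parameters(strokes):
    """Summary measures for strokes given as dicts with keys D, t0, mu, sigma."""
    n = len(strokes)
    if n == 0:
        raise ValueError("at least one lognormal is required")
    onsets = sorted(s["t0"] for s in strokes)
    return {
        # Eq. 20: sum of successive onset differences divided by NbLog.
        "mean_dt0": sum(b - a for a, b in zip(onsets, onsets[1:])) / n,
        # Eq. 21: average peak speed of the individual lognormals.
        "mean_vp": sum(peak_speed(s["D"], s["mu"], s["sigma"]) for s in strokes) / n,
        # Eqs. 22 to 24: plain averages of mu, sigma and |D|.
        "mean_mu": sum(s["mu"] for s in strokes) / n,
        "mean_sigma": sum(s["sigma"] for s in strokes) / n,
        "mean_D": sum(abs(s["D"]) for s in strokes) / n,
    }
```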

Evaluation, Results, and Discussion

The evaluation of the model is aimed at answering the following three questions:

1. What is the meaning of each lognormal in speech?
2. Which range of α (the proportion between F₁ and F₂) is adequate to apply the Sigma-lognormal model?
3. Do the speech lognormal parameters model aging phenomena in speech?

Databases

In handwriting, a lognormal expresses a primitive movement, related to a simple stroke. If a lognormal in speech retains a similar meaning, strokes should be associated with simple speech movements that are linked to the movements needed to pronounce a speech sound. To check this hypothesis, we used the VTR-TIMIT database [51, 52]. The advantage of this database is that the formants it contains have been manually annotated and the phonemes labeled, providing the background that allows correlating lognormals to phonemes.

The VTR-TIMIT [51] database is composed of 538 English utterances from the TIMIT corpus [52], with phonetically compact sentences (SX) and phonetically diverse sentences (SI). The VTR-TIMIT database is labeled by phonemes. In this experiment, we use the complete dataset of 197 speakers and 538 utterances in total. The database is balanced in terms of speakers, dialects, gender, and phonemes [51].

Furthermore, as the Sigma-lognormal model links lognormals to the impulse response of a neuromotor system, it is assumed that only neurodegenerative or neuromotor diseases will affect the lognormal parameters in fluent speech. To assess this premise, we used the Saarbruecken Voice Database [53]. This database contains healthy speakers as well as speakers with laryngeal pathologies. The database is labeled with the speaker’s age and the kinds of pathologies they have.

The Saarbruecken Voice Database [53] is a collection of German speech recordings from more than 2000 speakers. The sentence recorded is “Guten Morgen, wie geht es Ihnen?” (“Good morning, how are you?”). For our experiments, we divided this database into three groups:

Young speakers’ group, which encompassed speakers aged between 20 and 30. It included both healthy speakers and speakers with laryngeal pathologies. There was a total of 609 speakers, with 236 males and 373 females.
Middle speakers’ group, which encompassed speakers aged between 40 and 50, containing healthy speakers and speakers with laryngeal pathologies. There was a total of 352 speakers: 177 males and 175 females.
Older speakers’ group, which included all speakers aged 60 to 80 years old. All in all, there were 466 speakers in this group: 262 males and 204 females.

All the recordings were made in a controlled environment at a sampling frequency of 50 kHz and a 16-bit resolution. The recordings contained 71 different laryngeal pathologies, including both organic and functional ones.

The term laryngeal pathologies (LP) comprises a wide range of disorders, the most frequent ones being organic, and affecting the morphology of the excitation organs and producing irregular vibration patterns [54]. Some examples of these disorders are polyps, nodules, edemas, and carcinomas. The phonation in these cases is characterized by noisy bands in the spectrogram, instability in the vibration frequency of the vocal cords, irregular airflow, and the presence of turbulent noise.

Experiment 1: Meaning of Lognormal in Speech and $\alpha$ Empirical Estimation

To assess the meaning of a lognormal in speech, the first experiment aimed to study the relationship between lognormals and phonemes. Additionally, as the velocity is a function of α (the proportion between formants), the optimum values of this constant are estimated in this experiment to be compared with the theoretical estimation in “Formants to End Effector Kinematics.”

For this assessment, we employed the publicly available VTR-TIMIT database of continuous speech, which is labeled by phonemes, thus providing the number of phonemes (N_p) in each sentence. All the sentences of this database were analyzed by ScriptStudio [31] and decomposed into a sequence of lognormals. The number and timing of lognormals were compared with the phonemic labels of the database. The velocity was obtained from the formant track provided by the dataset.
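As a reminder of what each extracted component looks like, the following minimal sketch evaluates a single lognormal speed profile in the usual notation of the kinematic theory (amplitude D, onset time t0, log-time delay μ, log-response time σ). It is not the ScriptStudio implementation. The function names are hypothetical, and the closed-form bounds are a direct consequence of the lognormal expression rather than something stated in the article; the 5% fraction is the one used for the lognormal bounds later in this experiment.

```python
import math

def lognormal_speed(t, D, t0, mu, sigma):
    """Speed of one lognormal stroke at time t (zero at or before the onset t0)."""
    if t <= t0:
        return 0.0
    x = t - t0
    gauss = math.exp(-((math.log(x) - mu) ** 2) / (2.0 * sigma ** 2))
    return D / (sigma * math.sqrt(2.0 * math.pi) * x) * gauss

def peak_time(t0, mu, sigma):
    """Time of maximum speed: the mode of the lognormal, shifted by t0."""
    return t0 + math.exp(mu - sigma ** 2)

def bounds_at_fraction(t0, mu, sigma, fraction=0.05):
    """Times at which the speed equals `fraction` of its peak value.

    In log-time the profile is Gaussian-shaped around the mode, so the two
    crossings lie at mode * exp(+/- sigma * sqrt(-2 ln(fraction))).
    """
    half_width = sigma * math.sqrt(-2.0 * math.log(fraction))
    centre = mu - sigma ** 2
    start = t0 + math.exp(centre - half_width)
    end = t0 + math.exp(centre + half_width)
    return start, end

# Illustrative parameter values only (not taken from the article).
D, t0, mu, sigma = 1.0, 0.1, -2.0, 0.3
start, end = bounds_at_fraction(t0, mu, sigma)
print(f"peak at {peak_time(t0, mu, sigma):.4f} s, 5% bounds {start:.4f} s to {end:.4f} s")
```

The bounds are asymmetric around the peak, which reflects the skewed shape of the lognormal: the speed rises faster than it decays.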

An example of such an analysis is shown in Fig. 5. It corresponds to an excerpt (“Their records”) from the English sentence “How permanent are their records?”. In this figure, we can observe the speech waveform, the spectrogram with its formant track, and the lognormal decomposition of the velocity. Fig. 5 shows that there are almost as many phonemes as there are lognormals. In addition, the lognormals are temporally ahead of the phonemes, as the movement between two phonemes precedes the sound. This is shown in Fig. 6, which is a zoomed-in view of Fig. 5. Further, we can observe how the velocity peak usually appears alongside the phoneme transition, since a fast change in the resonance cavities is required to pronounce the next sound. We can also see that when the duration of a phoneme is long, or when its articulation requires more than one simple movement, more than one lognormal appears, as is the case between 1.1 and 1.2 s in Fig. 5.

To illustrate how the correspondence between the phonemes and lognormals is obtained from a sentence, a study was carried out, looking at one phoneme after the other. For each phoneme, four possibilities were considered:

1. True positive (TP): A lognormal of the sentence that overlaps the phoneme is assigned to it. In this case, TP_i = 1, with i being the index of the phoneme.
2. False positive (FP): Other lognormals of the sentence that overlap the phoneme under study. In this case, FP_i is set to the number of lognormals that overlap the phoneme minus 1.
3. False negative (FN): If no lognormal overlaps the phoneme, FN_i is set to 1.
4. True negative (TN): The lognormals of the sentence that do not overlap the phoneme. In this case, TN_i is set to the number of lognormals that do not overlap the phoneme.

Note that TP_i + FP_i + TN_i is equal to the number of lognormals of the sentence (NbLog). The bounds of each lognormal are taken at 5% of its peak value. The matching between the phonemes of a sentence and the lognormals obtained from it is measured by the true positive rate and the true negative rate of the sentence, calculated as $TPR_{\text{s}} = \left( \sum_{i=1}^{N_{\text{p}}} TP_{i} \right) / N_{\text{p}}$ and $TNR_{\text{s}} = \left( \sum_{i=1}^{N_{\text{p}}} TN_{i} \right) / \left( NbLog - 1 \right)$, respectively. The TPR and TNR of the VTR-TIMIT dataset are obtained by averaging the TPRs and TNRs of all its sentences. Figure 7 shows the TPR and TNR curves per gender, and the mean value of both, as a function of α. Although the α value used to work out the velocity from the formant track (Eqs. 9-11) was obtained theoretically in Eq. 17, it can be validated empirically by computing the TPR and TNR for different α values.
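The counting rules above can be sketched as follows. This is a minimal illustration, assuming that phonemes and lognormals are both available as (start, end) intervals in seconds, with the lognormal intervals already cut at 5% of the peak value. The function names and the strict-overlap test are assumptions, not the authors' code. Only TPR_s is computed here; the TN_i counts are returned so that TNR_s can be formed with the normalization given in the text.

```python
def overlaps(a, b):
    """True if two (start, end) intervals share a stretch of positive length."""
    return a[0] < b[1] and b[0] < a[1]

def phoneme_counts(phonemes, lognormals):
    """Return one dict of TP, FP, FN, TN counts per phoneme of a sentence."""
    nb_log = len(lognormals)
    counts = []
    for phoneme in phonemes:
        n_over = sum(1 for log in lognormals if overlaps(phoneme, log))
        tp = 1 if n_over >= 1 else 0   # one overlapping lognormal is assigned
        fp = max(n_over - 1, 0)        # any further overlapping lognormals
        fn = 1 - tp                    # no lognormal overlaps the phoneme
        tn = nb_log - n_over           # lognormals that do not overlap it
        assert tp + fp + tn == nb_log  # identity stated in the text
        counts.append({"TP": tp, "FP": fp, "FN": fn, "TN": tn})
    return counts

def true_positive_rate(counts):
    """TPR_s: fraction of phonemes with at least one overlapping lognormal."""
    return sum(c["TP"] for c in counts) / len(counts)

# Illustrative intervals only (not taken from VTR-TIMIT).
phonemes = [(0.00, 0.10), (0.10, 0.20), (0.20, 0.30)]
lognormals = [(0.05, 0.15), (0.12, 0.18), (0.40, 0.50)]
counts = phoneme_counts(phonemes, lognormals)
print(counts, true_positive_rate(counts))
```

In this toy sentence the second phoneme is overlapped by two lognormals (one TP, one FP) and the third by none (one FN), so two of the three phonemes are matched.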

Moreover, to assess how closely the velocity peak occurrence (t_v) matches the phoneme transition (t_p), the relationship between them, shown in Fig. 8, is measured through the error rate ($\varepsilon_{\text{t}}$):

$$\varepsilon_{\text{t}} = \sqrt{\frac{1}{N_{\text{p}}} \sum_{i=1}^{N_{\text{p}}} \left( t_{\text{v},i} - t_{\text{p},i} \right)^{2}}$$

(25)
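Equation 25 is a root-mean-square difference between paired times, which can be sketched as below. The pairing of each phoneme transition with its nearest velocity peak is an assumption made for this illustration, since the text does not specify how the pairs are formed; both function names are hypothetical.

```python
import math

def nearest_peak_times(peak_times, transition_times):
    """For each phoneme transition, pick the closest velocity peak (assumed pairing)."""
    return [min(peak_times, key=lambda t: abs(t - tp)) for tp in transition_times]

def timing_error(t_v, t_p):
    """Root-mean-square difference between paired peak and transition times (Eq. 25)."""
    if len(t_v) != len(t_p) or not t_p:
        raise ValueError("t_v and t_p must be non-empty and of equal length")
    squared = sum((v - p) ** 2 for v, p in zip(t_v, t_p))
    return math.sqrt(squared / len(t_p))

# Illustrative times in seconds (not taken from the article).
peaks = [0.10, 0.32, 0.55]
transitions = [0.12, 0.30]
paired = nearest_peak_times(peaks, transitions)
print(paired, timing_error(paired, transitions))
```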

For the experiments, although the value of k (see Eq. 19) does not affect the velocity profile shape or the result, it is approximated to 0.04 to keep a velocity peak close to 200 mm/s as measured by the sensors in [50].

We can see in Fig. 7 that the TPR curves reach their maximum values for α around 0.35. Further, as seen in Fig. 8, $\varepsilon_{\text{t}}$ reaches its minimum values for α between 0.2 and 0.4, which means that the lognormal peak is closest to the phoneme transition in healthy adults. These results show that, for both males and females, this procedure is effective and not overly sensitive to the value of α in the 0.2 ≤ α ≤ 0.4 range, which is similar to the value proposed in “Formants to End Effector Kinematics.” Note that to pronounce some phonemes, more than one simple movement is required, and each subject could need a different α value.

Experiment 2: Speech Lognormals, Aging, and Laryngeal Pathologies

As speakers get older, their neuromotor systems deteriorate and movements require additional effort and become slower. In handwriting, this implies additional short strokes and slow handwriting. The same should apply to speech: a greater number of short movements or small lognormals and more time between these lognormals than in young speakers.

In this context, to gain insights into the meaning of a lognormal representation in speech, the second experiment compared the lognormals detected in young and older speakers, including subjects with laryngeal pathologies.

The experiment was run with the Saarbruecken Voice Database, which labels recorded sentences with the age of the speakers and thus allows comparing the results obtained for the groups of young and older speakers. For the parameters that showed a significant difference (NbLog, $\overline{\Delta t_{\text{o}}}$, $\overline{\mu}$, SNR), the experiments were repeated in order to evaluate the evolution of the parameters across the three age groups (young, middle, and older) (Table 5). Gender is omitted in the analyses that follow since the experiments in “Experiment 1: Meaning of Lognormal in Speech and α Empirical Estimation” show similar results for males and females. Moreover, gender is reasonably balanced in the database, and the effects of age and gender cannot be confounded [19].

As the Saarbruecken Voice Database does not provide formant tracks, these were obtained with the following two formant estimation methods (available in the Praat software package [43]):

- “From speech to formant (sl)”: This algorithm is based on the implementation of the Split Levinson algorithm proposed by Willems [55]. It always finds the requested number of formants in every frame, even if they do not exist.

- “From speech to formant (keep all)”: In this case, Praat applies a Gaussian-like window and computes the formant from the LPC spectrum obtained through the Burg algorithm [56, 57].

The following settings were used in the Praat software for both methods to determine the first two formants in all the sentences of the Saarbruecken Voice Database: time step of 0.01 s, maximum number of formants of 5, and window length of 0.025 s.

To calculate the speech kinematics, based on the previous result, the parameter α was set to 0.3 and k to 0.04. The speech trajectory was processed with ScriptStudio® [30] to decompose the speech kinematics into lognormals.
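Once a sentence has been decomposed, the per-sentence quantities compared in this experiment can be summarized along the following lines. This is only a sketch under stated assumptions: the dictionary keys are hypothetical (the ScriptStudio output format is not described here), and $\overline{\Delta t_{\text{o}}}$ is assumed to be the mean difference between the onset times t0 of consecutive lognormals, following its description as the time between commands.

```python
from statistics import mean

def summarize_lognormals(lognormals):
    """Summarize one sentence from its lognormal parameters.

    `lognormals` is a non-empty list of dicts with the (hypothetical) keys
    't0', 'mu', 'sigma' and 'D', one dict per extracted lognormal.
    """
    if not lognormals:
        raise ValueError("at least one lognormal is required")
    ordered = sorted(lognormals, key=lambda p: p["t0"])
    onsets = [p["t0"] for p in ordered]
    # Assumed definition: gaps between consecutive command onset times.
    gaps = [later - earlier for earlier, later in zip(onsets, onsets[1:])]
    return {
        "NbLog": len(ordered),
        "mean_dt0": mean(gaps) if gaps else float("nan"),
        "mean_mu": mean(p["mu"] for p in ordered),
        "mean_sigma": mean(p["sigma"] for p in ordered),
        "mean_D": mean(p["D"] for p in ordered),
    }

# Illustrative parameter values only (not taken from the database).
sentence = [
    {"t0": 0.00, "mu": -2.0, "sigma": 0.30, "D": 1.0},
    {"t0": 0.10, "mu": -1.8, "sigma": 0.25, "D": 2.0},
    {"t0": 0.30, "mu": -1.9, "sigma": 0.35, "D": 3.0},
]
print(summarize_lognormals(sentence))
```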

The results are graphically shown in Figs. 9 and 10, and numerically in Tables 2-5. These tables also include the averaged values and the standard deviation of all the lognormal parameters, along with a one-way ANOVA (analysis of variance) [58]. Multiple comparison tests with Bonferroni correction are used when three classes (young, middle, and older) are analyzed. In this type of analysis, two groups are considered as statistically different if the residual p value is below 0.05 and statistically similar if the p value is above 0.05 [58].
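For readers unfamiliar with the test, the one-way ANOVA statistic and the Bonferroni adjustment can be sketched with the standard library alone. The group values below are invented for illustration and are not results from the article. The p value itself would be read from an F distribution with the returned degrees of freedom, for instance with a statistics package, and is not computed here.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples, one list per group."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level after Bonferroni correction."""
    return alpha / n_comparisons

# Invented toy values standing in for one parameter in three age groups.
young, middle, older = [1, 2, 3], [2, 3, 4], [5, 6, 7]
print(one_way_anova_f([young, middle, older]))
# Three pairwise comparisons among three groups.
print(bonferroni_alpha(0.05, 3))
```

With three groups there are three pairwise comparisons, so each one is tested at 0.05/3 rather than 0.05 to keep the overall error rate at the nominal level.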

Table 2 Averaged and standard deviation value of the lognormal parameters for young and older groups (parameter with statistically significant difference in italic)

Table 3 Averaged value of the lognormal parameters for young healthy speakers and speakers with laryngeal pathologies (parameter with statistically significant difference in italic)

Table 4 Averaged value of the lognormal parameters for older healthy speakers and speakers with laryngeal pathologies

Table 5 Averaged and STD values of the lognormal parameters for young, middle, and older speakers with laryngeal pathologies (“To formant (sl)”)

The findings can be summarized as follows:

1. $\overline{\Delta t_{\text{o}}}$ is sensitive to the speaker’s age. While there is a significant difference between the young and the older speakers (p value < 0.001), there is no statistically significant variation between healthy speakers and speakers with laryngeal pathologies for this parameter (Tables 3-5). This means that the time between commands increases with age [19], due to the increase in $\Delta t_{\text{o}}$, but not with laryngeal pathologies. This is consistent with the well-known fact regarding slower reaction times in these conditions.
2. The number of lognormals (NbLog) is greater for the group of older speakers than for the group of young speakers (Tables 2 and 5; Fig. 9). The p value is lower than 0.001 with both formant estimation algorithms. This is consistent with the results observed in handwriting, where the kinematic theory was used to evaluate aging. The results might suggest that the deterioration of motor control with aging is associated with the development of compensatory strategies such as emitting more motor commands to generate an adequate movement for a given task [19]. A significant difference in the number of lognormals is also observed between young healthy and LP speakers (Fig. 10, Table 3). This type of disease should, therefore, influence the number of lognormals due to increases in the number of simple movements following necessary pauses and silences in a sentence.
3. The SNR parameter decreases and the number of lognormals (NbLog) increases in both older people and the LP group in young people (Table 3), with the difference being more significant with age than with laryngeal disease.
4. The $\overline{\mu}$ parameter increases with age, indicating that the impulse response of the system is slower in the case of older speakers. This difference is only observed with the “from speech to formant (sl)” method (Table 2), which could be because this formant extraction method always gives the requested number of formants in every frame, allowing the best interpolation of the complete movement in the case of consonants.
5. Regarding the parameters $\overline{\sigma}$, $\overline{V_{\text{p}}}$, and $\overline{D}$ in Tables 2-4, no significant difference can be seen between the two age groups.
6. When the three age groups are compared (Table 5), significant differences between the three classes are found only for the NbLog parameter (Fig. 9).

Discussion

The results show how the Sigma-lognormal model can be applied to model neuromuscular aging in speech. When the speech is modeled, each of the lognormals obtained reflects a group of commands and their end muscular response shapes. Neurological diseases, learning processes, or aging can affect this command sequence, changing the proportion of final movements, the speech rate, or the muscular response shape, which is consistent with the lognormality principle [19].

In the above results, the parameters related to the time between commands ($\overline{{\Delta t_{{\text{o}}} }}$) and the delay in the muscular response ($\overline{\mu }$) are longer in older people, as the movements become slower with age. Moreover, the experiments show a clear relationship between the number of simple movements found by the model and the number of pronounced phonemes. This relationship is conditional on the proportion between the first and second formants used to estimate the trajectory, as we tested with the experiments. Also, the method used to detect the formants can affect the parameters obtained, providing the Sigma-lognormal method with information on how the formant extractor is able to follow muscular movements.

The model was tested with two different languages (English and German) and seems to be language-independent, as has also been observed when the lognormal model is applied in handwriting [59].

Conclusions

A Sigma-lognormal representation for modeling speech kinematics has been presented. The speech kinematics is estimated from the formant tracks and decomposed into simple lognormal movements by applying the kinematic theory of rapid human movements. Moreover, besides the Sigma-lognormal parameters, a set of derived parameters is proposed to describe the timing and the neuromotor impulse response.

The experiments conducted illustrate the meaning of a lognormal in speech and indicate the adequate relation between the first and second formants for obtaining the kinematic information. The first experiment shows the link between a lognormal and a transition between phonemes, where the number of lognormals is similar to that of phonemes. In this experiment, the optimum proportion between the first and second formants was also verified. The second experiment links the lognormal to the generation of each end effector movement, showing that the parameter $\overline{\Delta t_{\text{o}}}$, as in handwriting, increases significantly from young to older speakers, and that it is independent of dysfunction, such as problems in the larynx or glottis closure. This allows modeling aging in speech production as a delay between commands and the end effector responses.

The results show that it is possible to model speech with the kinematic theory, which provides biological information about the simple movements involved in speech.

As future lines of research, the model could be applied to speech synthesis, speech recognition, speech rehabilitation, as well as to the design of systems to help in the screening and monitoring of some neurodegenerative diseases. The model could also permit the use of features similar to those obtained from studying other human movements, such as handwriting. Moreover, investigating the use of more formants to estimate speech kinematics is an unresolved issue that is yet to be addressed.

References

1. Guenther FH. Speech sound acquisition, coarticulation, and rate effects in a neural network model of speech production. Psychol Rev. 1995;102(3):594–621.
2. Parrell B, Lammert AC, Ciccarelli G, Quatieri TF. Current models of speech motor control: a control-theoretic overview of architectures and properties. J Acoust Soc Am. 2019;145(3):1456–81.
3. Perrier P, Ma L, Payan Y. Modeling the production of VCV sequences via the inversion of a biomechanical model of the tongue. 9th Eur Conf Speech Commun Technol. 2019;1041–4.
4. Patri JF, Diard J, Perrier P. Optimal speech motor control and token-to-token variability: a Bayesian modeling approach. Biol Cybern. 2015;109(6):611–26.
5. Kröger BJ, Kannampuzha J, Neuschaefer-Rube C. Towards a neurocomputational model of speech production and perception. Speech Commun. 2009;51(9):793–809.
6. Tourville JA, Guenther FH. The DIVA model: a neural theory of speech acquisition and production. Lang Cogn Process. 2011;26(7):952–81.
7. Saltzman EL, Munhall KG. A dynamical approach to gestural patterning in speech production. Ecol Psychol. 1989;1(4):333–82.
8. Houde JF, Nagarajan SS. Speech production as state feedback control. Front Hum Neurosci. 2011;5(October):1–14.
9. Parrell B, Ramanarayanan V, Nagarajan S, Houde J. The FACTS model of speech motor control: fusing state estimation and task-based control. PLoS Comput Biol [Internet]. 2019;15(9):1–26. Available from: https://doi.org/10.1371/journal.pcbi.1007321.
10. Plamondon R, O’Reilly C, Galbally J, Almaksour A, Anquetil É. Recent developments in the study of rapid human movements with the kinematic theory: applications to handwriting and signature synthesis. Pattern Recognit Lett. 2014;35(1):225–35.
11. Plamondon R. A kinematic theory of rapid human movements. Part I: Movement representation and generation. Biol Cybern [Internet]. 1995;72(4):295–307. Available from: https://www.ncbi.nlm.nih.gov/pubmed/7748959.
12. Plamondon R. A kinematic theory of rapid human movements. Part II: Movement time and control. Biol Cybern. 1995;72(4):309–20.
13. Plamondon R. A kinematic theory of rapid human movements. Part III: Kinematic outcomes. Biol Cybern. 1998;78(2):133–45.
14. Plamondon R, Pirlo G, Anquetil É, Rémi C, Teulings HL, Nakagawa M. Personal digital bodyguards for e-security, e-learning and e-health: a prospective survey. Pattern Recognit. 2018;81:633–59.
15. Leiva LA, Martín-Albo D, Plamondon R. The kinematic theory produces human-like stroke gestures. Interact Comput. 2017;29(4):552–65.
16. Lebel K, Nguyen H, Duval C, Plamondon R, Boissy P. Capturing the cranio-caudal signature of a turn with inertial measurement systems: methods, parameters robustness and reliability. Front Bioeng Biotechnol [Internet]. 2017;5:1–13. Available from: http://journal.frontiersin.org/article/10.3389/fbioe.2017.00051/full.
17. Martín-Albo D, Leiva LA, Huang J, Plamondon R. Strokes of insight: user intent detection and kinematic compression of mouse cursor trails. Inf Process Manag. 2016;52(6):989–1003.
18. Nadeau A, Lungu O, Duchesne C, Robillard MÈ, Bore A, Bobeuf F, et al. A 12-week cycling training regimen improves gait and executive functions concomitantly in people with Parkinson’s disease. Front Hum Neurosci [Internet]. 2017;10:1–10. Available from: http://journal.frontiersin.org/article/10.3389/fnhum.2016.00690/full.
19. Plamondon R, O’Reilly C, Rémi C, Duval T. The lognormal handwriter: learning, performing, and declining. Front Psychol. 2013;4:1–14.
20. Carmona-Duarte C, Ferrer MA, Parziale A, Marcelli A. Temporal evolution in synthetic handwriting. Pattern Recognit. 2017;68.
21. Ferrer MA, Diaz M, Carmona C, Morales A. A behavioral handwriting model for static and dynamic signature synthesis. IEEE Trans Pattern Anal Mach Intell [Internet]. 2016;8828(c):1. Available from: http://ieeexplore.ieee.org/document/7494603/.
22. Woch A, Plamondon R. Using the framework of the kinematic theory for the definition of a movement primitive. Mot Control. 2004;8(4):547–57.
23. Carmona-Duarte C, Gómez-Vilda P, Ferrer MA, Plamondon R, Londral A. Study of several parameters for the detection of amyotrophic lateral sclerosis from articulatory movement. Loquens. 2017;4(January):1–5.
24. Carmona-Duarte C, Ferrer M, Gómez-Vilda P, Van Gemmert AWA, Plamondon R. A common framework to evaluate Parkinson’s disease in voice and handwriting. In: ICPRAI 2018 - International Conference on Pattern Recognition and Artificial Intelligence. 2018.
25. Carmona-Duarte C, Plamondon R, Gómez-Vilda P, Ferrer MA, Alonso JB, Londral ARM. Application of the lognormal model to the vocal tract movement to detect neurological diseases in voice. In: Chen YW, Tanaka S, Howlett RJL, editors. Innovation in Medicine and Healthcare 2016 Smart Innovation, Systems and Technologies. Switzerland: Springer; 2016. p. 25–35.
26. Carmona-Duarte C, Alonso JB, Diaz M, Ferrer MA, Gómez-Vilda P, Plamondon R, et al. Kinematic modeling of diphthong articulation. In: Esposito A, Faundez-Zanuy M, Esposito AM, Cordasco G, Drugman T, Solé-Casals J, et al., editors. Recent Advances in Nonlinear Speech Processing. Cham: Springer; 2016. p. 53–60.
27. Hafting T, Fyhn M, Molden S, Moser M, Moser EI. Microstructure of a spatial map in the entorhinal cortex. Nature. 2005;436(7052):801–6.
28. Moser EI, Moser MB, Roudi Y. Network mechanisms of grid cells. Philos Trans R Soc B Biol Sci. 2014;369:1635.
29. Tremblay P, Sato M, Deschamps I. Age differences in the motor control of speech: an fMRI study of healthy aging. Hum Brain Mapp. 2017;38(5):2751–71.
30. O’Reilly C, Plamondon R. Development of a sigma-lognormal representation for on-line signatures. Pattern Recognit [Internet]. 2009;42(12):3324–37. Available from: https://doi.org/10.1016/j.patcog.2008.10.017.
31. Djioua M, Plamondon R. A new algorithm and system for the characterization of handwriting strokes with delta-lognormal parameters. IEEE Trans Pattern Anal Mach Intell. 2009;31(11):2060–72.
32. Ferrer MA, Diaz M, Carmona-Duarte C, Plamondon R. iDeLog: Iterative dual spatial and kinematic extraction of sigma-lognormal parameters. IEEE Trans Pattern Anal Mach Intell. 2018;PP(c):1.
33. Plamondon R, Feng C, Woch A. A kinematic theory of rapid human movement. Part IV: A formal mathematical proof and new insights. Biol Cybern. 2003;89(2):126–38.
34. Marcelli A, Parziale A, Senatore R. Some observations on handwriting from a motor learning perspective. CEUR Workshop Proc. 2013;1022:6–10.
35. Deng L, Acero A, Bazzi I. Tracking vocal tract resonances using a quantized nonlinear function embedded in a temporal constraint. IEEE Trans Audio, Speech Lang Process. 2006;14(2):425–34.
36. Rabiner LR. Digital Processing of Speech Signal. Prentice-Hall; 1978.
37. Schroeder MR. Determination of the geometry of the human vocal tract by acoustic measurements. J Acoust Soc Am [Internet]. 1967;41(5):1283–94. Available from: https://doi.org/10.1121/1.1910429.
38. Atal BS, Chang JJ, Mathews MV, Tukey JW. Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer-sorting technique. J Acoust Soc Am [Internet]. 1978;63(5):1535–55. Available from: https://doi.org/10.1121/1.381848.
39. Gómez-Vilda P, Gómez-Rodellar A, Vicente JMF, Mekyska J, Palacios-Alonso D, Rodellar-Biarge V, et al. Neuromechanical modelling of articulatory movements from surface electromyography and speech formants. Int J Neural Syst. 2019;29(02):1850039.
40. Gómez-Vilda P, Ferrández-Vicente JM, Rodellar-Biarge V. Simulating the phonological auditory cortex from vowel representation spaces to categories. Neurocomputing. 2013;114:63–75.
41. Gómez-Vilda P, Ferrández-Vicente JM, Rodellar-Biarge V, Álvarez-Marquina A, Mazaira-Fernández LM, Martínez Olalla R, et al. Neuromorphic detection of speech dynamics. Neurocomputing. 2011;74(8):1191–202.
42. Gómez-Vilda P, Ferrández-Vicente JM, Rodellar-Biarge V, Fernández-Baíllo R. Time-frequency representations in speech perception. Neurocomputing. 2009;72(4–6):820–30.
43. Boersma P, Weenink D. Praat: doing phonetics by computer [Internet]. 2019. Available from: http://www.praat.org/.
44. Dromey C, Jang GO, Hollis K. Assessing correlations between lingual movements and formants. Speech Commun [Internet]. 2013;55(2):315–28. Available from: http://dx.doi.org/10.1016/j.specom.2012.09.001.
45. Gómez P, Mekyska J, Gómez A, Palacios D, Rodellar V, Álvarez A. Characterization of Parkinson’s disease dysarthria in terms of speech articulation kinematics. Biomed Signal Process Control [Internet]. 2019;52:312–20. Available from: https://doi.org/10.1016/j.bspc.2019.04.029.
46. Gómez-Vilda P, Londral ARM, Rodellar-Biarge V, Ferrández-Vicente JM, de Carvalho M. Monitoring amyotrophic lateral sclerosis by biomechanical modeling of speech production. Neurocomputing [Internet]. 2015;151(P1):130–8. Available from: https://doi.org/10.1016/j.neucom.2014.07.074.
47. Hillenbrand J, Getty LA, Clark MJ, Wheeler K. Acoustic characteristics of American English vowels. J Acoust Soc Am [Internet]. 1995;97(5):3099–111. Available from: http://asa.scitation.org/doi/10.1121/1.411872.
48. Pätzold M, Simpson AP. Acoustic analysis of German vowels in the Kiel Corpus of Read Speech. Arbeitsberichte des Instituts für Phonetik und Digit Sprachverarbeitung Univ Kiel [Internet]. 1997;32(1978):215–47. Available from: http://www.ipds.uni-kiel.de/kjk/pub_exx/aipuk32/mpas.pdf.
49. Whitfield J, Dromey C, Palmer P. Examining acoustic and kinematic measures of articulatory working space: effects of speech intensity. J Speech, Lang Hear Res. 2018;61(May):1–14.
50. Kuberski SR, Gafos AI. The speed-curvature power law in tongue movements of repetitive speech. PLoS ONE. 2019;14(3):1–25.
51. Li Deng, Xiaodong Cui, Pruvenok R, Huang J, Momen S, Yanyi Chen, et al. A database of vocal tract resonance trajectories for research in speech processing. 2006;I-369-I–372.
52. Garofolo JS, Lamel LF, Fisher WM, Fiscus JG, Pallett DS, Dahlgren NL. TIMIT acoustic-phonetic continuous speech corpus LDC93S1. Philadelphia: Linguistic Data Consortium; 1993.
53. Barry WJ, Putzer M. Saarbruecken Voice Database [Internet]. Institute of Phonetics, Univ. of Saarland; Available from: http://www.stimmdatenbank.coli.uni-saarland.de/.
54. Godino-Llorente JI, Gomez-Vilda P, Blanco-Velasco M. Dimensionality reduction of a pathological voice quality assessment system based on Gaussian mixture models and short-term cepstral parameters. IEEE Trans Biomed Eng. 2006;53(10):1943–53.
55. Willems L. Robust formant analysis. IPO Rep. 1986;529:1–25.
56. Childers DG. Modern spectrum analysis. IEEE Press; 1978. p. 252–255.
57. Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical recipes in C: The art of scientific computing. 2nd ed. Cambridge University Press; 1992.
58. Hogg RV, Ledolter J. Engineering Statistics. New York: MacMillan; 1987.
59. Bhattacharya U, Plamondon R, Dutta Chowdhury S, Goyal P, Parui SK. A sigma-lognormal model-based approach to generating large synthetic online handwriting sample databases. Int J Doc Anal Recognit. 2017;20(3):155–71.

Funding

This study was funded by the Spanish government’s MIMECO TEC2016-77791 research project and European Union FEDER program/funds, Teca-Park/MonParLoc FGCSIC CENIE-0348_CIE_6_E (InterReg Programme) to Pedro Gomez-Vilda, the NSERC-Canada Grant RGPIN-2015-06409 to R. Plamondon. C. Carmona-Duarte was supported by a Juan de la Cierva contract (IJCI-2016-27682), Viera y Clavijo grant from ULPGC and the “José Castillejo” mobility grant from the Spanish government CAS18/00315.

Author information

Authors and Affiliations

Instituto Universitario Para El Desarrollo Tecnológico Y La Innovación en Comunicaciones, Universidad de Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain
C. Carmona-Duarte & M. A. Ferrer
Laboratoire Scribens, Département de Génie Électrique, Polytechnique Montréal, Montreal, QC, Canada
R. Plamondon
Facultad de Informática, Universidad Politécnica de Madrid, Campus de Monte-Gancedo, s/n, 28660 Boadilla del Monte, Madrid, Spain
A. Gómez-Rodellar & P. Gómez-Vilda

Corresponding author

Correspondence to C. Carmona-Duarte.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Ethical Standards

This article does not contain any studies with human participants performed by any of the authors.

Informed Consent

Informed consent was obtained from all individual participants included in the study.

Research Involving Human and Animal Rights

This paper does not contain any studies with animals performed by any of the authors.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Carmona-Duarte, C., Ferrer, M.A., Plamondon, R. et al. Sigma-Lognormal Modeling of Speech. Cogn Comput 13, 488–503 (2021). https://doi.org/10.1007/s12559-020-09803-8

Received: 04 September 2020
Accepted: 30 November 2020
Published: 07 February 2021
Issue Date: March 2021
DOI: https://doi.org/10.1007/s12559-020-09803-8
