1 Introduction

Speech is an integral component of how humans interact with digital devices these days, be it text to speech [1], keyword spotting [2], voice-controlled assistants such as Apple's Siri, Microsoft's Cortana, and Google's Google Home, or smart gadgets. Speech, in contrast to gesture- or touch-based systems, is a natural way of communicating with these devices. The accuracy of voice-controlled devices depends strongly on speech variability, including speaking rate and speaking style. Speaking very fast or very slowly, for instance, can easily lower the recognition accuracy of devices that are not tuned for it.

The late 1990s saw a great deal of research on the effects of speaking rate on speech recognition performance, e.g. [3,4,5]. The studies at that time verified that speech recognition performance, e.g. Word Error Rate (WER), degrades significantly at a fast speaking rate [6, 7]. Later, the performance degradation caused by variations in speaking rate was also confirmed for speaker recognition in studies carried out in [8, 9]. Attempts were made to determine precisely how speaking rate affects Automatic Speech Recognition (ASR) performance by showing a direct correlation between the local average Hidden Markov Model (HMM) score and the local speech rate [10].

After almost twenty years, however, research is still at a preliminary stage for most speech-based applications, such as speaker authentication [11], where it is argued that an important reason for performance degradation is the distorted spectrum caused by variations in speaking rate [12], particularly for slow speaking rates. The rate at which people speak depends on many speaker characteristics, such as gender, age, and psychological state. For instance, a study presented in [13] showed that, on average, older people speak more slowly than young ones and females talk more slowly than males. Moreover, deviating speaking rates are often observed in daily life: people usually speak fast when they are in a hurry or angry, and they may speak slowly when they are tired, sad, or sick [14]. Speaking rate patterns also differ between native and non-native speakers. Research has shown that non-native speakers talk much more slowly than native speakers [15], and more recent studies revealed that non-native speakers exhibit more variation in speaking rate [16]. On the other hand, a study of this suprasegmental characteristic in spontaneous speech suggests that non-native speakers are less variable than native speakers [17].

Speaking rate variability affects the mapping between the acoustic properties of speech and the linguistic interpretation of an utterance [18]. ASR systems employing supervised machine learning and deep learning methods can efficiently learn phonetic patterns. However, speaking rate variability can drastically decrease the performance of ASR systems that are not tuned for it. While human listeners adapt naturally to changes in speaking rate and maintain phonetic constancy, applying rate normalization in ASR systems so that phonetic patterns are captured reliably remains a challenging task.

This paper, therefore, investigates speaking rate variability from two perspectives: (a) which speech features perform best under variable speaking rate conditions? and (b) which DNN architecture obtains the highest accuracy on a frame classification task for speech recognition?

The remainder of the paper is structured as follows. Section 2 presents related work, followed by Sect. 3, where we describe the DNN. Section 4 gives an overview of the experimental setup. Results and their analysis are presented in Sect. 5, while Sect. 6 concludes the paper with an outlook on future work.

2 Related Work

Meyer et al. reported one of the earliest works that exploited the logatome speech database discussed in Sect. 4.1. They conducted a study comparing the performance of ASR with Human Speech Recognition (HSR) for several intrinsic variabilities such as speaking rate, speaking effort, and dialect [19]. Their HMM-based ASR model uses three states per phoneme. Describing each phoneme by a binary voicing feature and ternary features defining manner and place of articulation, they observed that misclassification of voicing and manner of articulation were the major causes of recognition errors. In similar work [20], the authors address reducing the gap between ASR and human listeners, with particular emphasis on intrinsic variations of speech. The work was further extended to use a DNN as the ASR backend, where phoneme confusion matrices obtained by ASR models for Mel Frequency Cepstral Coefficients (MFCC), filterbank energy (FBE), and Perceptual Linear Prediction (PLP) features were compared against those obtained by human subjects [21]. FBE and PLP showed the highest correlation coefficient scores between ASR and human subjects for various Signal to Noise Ratio (SNR) values.

Varghese and Mathew [22] used a reservoir computing technique in a two-layered Recurrent Neural Network (RNN) for classifying 39 phoneme classes on the TIMIT database. They used Relative Spectral Transformation Perceptual Linear Prediction (Rasta-PLP) and MFCC features for frame-level classification, where MFCC performed marginally better than Rasta-PLP. A comparison of MFCC and supervised Isomap on the task of phoneme recognition is carried out in [23]. The authors also proposed a supervised manifold learning algorithm that outperforms both the baseline MFCC and the supervised Isomap. The authors of [24] compared the performance of MFCC, PLP, and Rasta-PLP using fuzzy logic and Deep Belief Networks (DBN) on an African language phoneme classification task. MFCC and Rasta-PLP results were far better than PLP, while fuzzy logic classified consonants better than vowels compared to the DBN. A similar study on the phonetic analysis of Arabic speech is presented in [25], which compares six acoustic features: Linear Predictive Coding (LPC), MFCC, PLP, FBE, Mel-filter bank coefficients (MELSPEC), and Linear Prediction Reflection Coefficients (LREFC). A five-state HMM is used to model each phoneme with a mixture of sixty-four Gaussian distributions. FBE achieved the highest accuracy, with MELSPEC marginally behind, followed by PLP and MFCC.

Comparisons between different acoustic features have been made on different datasets and for various speech-related tasks, e.g. digit recognition [26], event detection [27], and emotional speech classification [28], among others. Little can be found in the literature on how these features perform under variable speaking rates. This paper, therefore, addresses the question of how they compare to each other for different architecture combinations, context sizes, and speaking rates.

3 DNN

A deep neural network is an artificial neural network with several hidden layers. A Multi Layer Perceptron (MLP) consisting of two or more hidden layers is often used as a baseline DNN, unlike a vanilla network that consists of a single hidden layer. An MLP is a feedforward neural network in which the neurons in one layer are typically fully connected to the neurons in the adjacent layers. The model uses two phases for estimating the weights: first, initial values for the weights are found by an unsupervised method, and then, in the second phase, the initialized weights are updated by a supervised technique called backpropagation. The first phase is called pre-training and the second fine-tuning. The training procedure of the DNN is described in the following subsections.

3.1 Pre-training

Initializing the weights of a network with multiple hidden layers is challenging and affects the convergence of training. The main idea behind pre-training is to find initial weights by fitting a generative DBN to the input data [29]. The DBN can be trained in a greedy layer-by-layer fashion in which each pair of adjacent layers is treated as a Restricted Boltzmann Machine (RBM). An RBM has two layers: one contains the visible nodes (\(v=[v_1, v_2, ..., v_K]^T\)) and the other the hidden nodes (\(h=[h_1, h_2, ..., h_L]^T\)). Different variants of the RBM are used depending on the type of the input data: Gaussian-Bernoulli RBMs for real-valued inputs and Bernoulli-Bernoulli RBMs for binary inputs. The difference between these two RBMs lies in the definition of the energy function. For the Bernoulli-Bernoulli RBM, the energy function is defined as:

$$\begin{aligned} E(v,h)=-\sum _{k=1}^{K}{\sum _{l=1}^{L}}v_k h_l w_{kl} -\sum _{k=1}^{K} v_k a_k -\sum _{l=1}^{L} h_l b_l \end{aligned}$$
(1)

where \(w_{kl}\) is the weight between the visible unit \(v_k\) and the hidden unit \(h_l\), \(a_k\) is the bias of the visible unit \(v_k\), and \(b_l\) is the bias of the hidden unit \(h_l\). The weights and biases are real-valued, while the hidden and visible units are binary-valued. For the Gaussian-Bernoulli RBM, the energy function is defined as:

$$\begin{aligned} E(v,h)=-\sum _{k=1}^{K}{\sum _{l=1}^{L}}\frac{v_k}{\sigma _k} h_l w_{kl} -\sum _{k=1}^{K} \frac{(v_k-a_k)^2}{2\sigma _k^2} -\sum _{l=1}^{L} h_l b_l \end{aligned}$$
(2)

where \(\sigma _k\) is the standard deviation of the Gaussian noise for the real-valued visible unit \(v_k\). The joint probability of the visible and hidden units is defined as follows:

$$\begin{aligned} p(v,h)=\frac{exp(-E(v,h))}{Z} \end{aligned}$$
(3)

where Z is the partition function, obtained by summing over all configurations of v and h:

$$\begin{aligned} Z=\sum _{v,h} e^{-E(v,h)} \end{aligned}$$
(4)

The weights, biases, and standard deviations are estimated during training by maximizing the expected log probability of the visible units, given in (5), using the contrastive divergence (CD) algorithm [29].

$$\begin{aligned} \hat{\varvec{\theta }}= \mathrm {arg}\mathrm { \max _{\varvec{\theta }}\ } \varvec{\mathrm E}[\mathrm {log\ }p(v)]=\mathrm {arg}\mathrm { \max _{\varvec{\theta }}\ } \varvec{\mathrm E}[\mathrm {log\ }\sum _{h}p(v,h)] \end{aligned}$$
(5)

where \(\varvec{\theta }\) contains the weights, biases, and standard deviations, \(\hat{\varvec{\theta }}\) denotes the estimated parameter values, and \(\varvec{\mathrm E}[\cdot ]\) is the expectation over its argument. After training the first RBM on the input data, which form the visible units (\(v_1\)), the hidden units (\(h_1\)) are inferred. The inferred units are used as the visible units of the next RBM (\(v_2 = h_1\)) to estimate its hidden units (\(h_2\)). One RBM is trained for each hidden layer of the DBN, and the trained RBMs are stacked on top of each other. Figure 1 shows the stacked RBMs and the resulting DBN.

Fig. 1. Graphical model of the DBN obtained by stacking RBMs.
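
As a concrete illustration of this greedy pre-training, the following is a minimal NumPy sketch of a single contrastive divergence (CD-1) update for a Bernoulli-Bernoulli RBM; the function names and the learning rate are our own placeholders and do not come from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.01):
    """One CD-1 step for a Bernoulli-Bernoulli RBM.

    v0 : (batch, K) binary visible data
    W  : (K, L) weights, a : (K,) visible biases, b : (L,) hidden biases
    """
    # Positive phase: hidden probabilities and samples given the data.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (np.random.rand(*ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to a reconstruction.
    pv1 = sigmoid(h0 @ W.T + a)
    ph1 = sigmoid(pv1 @ W + b)
    # Gradient approximation: data statistics minus reconstruction statistics.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    a += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b
```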

3.2 Fine-Tuning

After the unsupervised learning stage has produced initial values for the network parameters, supervised learning is performed by adding the labels as output units on top of the DBN. The output weights are randomly initialized, and the cross-entropy between the estimated outputs and the labels is minimized using the backpropagation algorithm. Because this is a multiclass problem, a softmax function is used at the output layer to estimate the probability of each class for a given input sample.
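
For reference, the softmax output and the cross-entropy cost minimized in this phase can be written as follows (standard definitions; the symbols \(z_j\), \(y_j\), and \(t_j\) are introduced here and are not taken from the original text):

$$\begin{aligned} y_j(n)=\frac{exp(z_j(n))}{\sum _{j'} exp(z_{j'}(n))}, \qquad C=-\sum _{n}\sum _{j} t_j(n)\,\mathrm {log\ }y_j(n), \end{aligned}$$

where \(z_j(n)\) is the pre-activation of output unit j for frame n, \(y_j(n)\) its softmax output, and \(t_j(n)\) the corresponding one-hot target label.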

3.3 Architecture Configurations

Figure 2 shows the model architecture of a three-hidden-layer DNN with 1024 neurons in each layer. The following DNN parameters are used in the experiments: loss function, categorical cross-entropy; learning rate, 0.01; optimizer, Stochastic Gradient Descent (SGD); activation function, sigmoid; batch size for training and prediction, 1024. A softmax function is used at the output layer.

Fig. 2. Model architecture of a three-hidden-layer DNN with 1024 neurons in each layer.
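
A minimal sketch of this configuration in Keras is given below, assuming 24 output classes (Sect. 5) and a spliced input of dimension \(D\times (2M+1)\) (Sect. 4.3); it reflects only the stated hyperparameters and uses Keras' random initialization, whereas the DBN pre-training of Sect. 3.1 would replace that initialization.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(input_dim, n_classes=24, n_hidden_layers=3, n_units=1024):
    """Feedforward DNN matching the configuration in Sect. 3.3."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(input_dim,)))
    for _ in range(n_hidden_layers):
        model.add(layers.Dense(n_units, activation="sigmoid"))
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=0.01),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Example: FBE features (D = 40) with context size M = 3 -> input dim 40 * 7.
model = build_dnn(input_dim=40 * (2 * 3 + 1))
```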

4 Experimental Setup

4.1 Dataset

The Oldenburg Logatome (OLLO) corpus is used for the experiments. It is a speech database containing simple nonsense combinations of consonants (C) and vowels (V), referred to as logatomes. 150 different CVC and VCV combinations were spoken by 40 German and 10 French speakers. The VCVs combine fourteen central consonants with five outer vowels, while the CVCs combine eight consonants with ten central vowels. In both cases, the two outer phonemes of a logatome are identical.

Four different dialects are covered by the German speakers: no dialect, Bavarian, East Frisian, and East Phalian. The database contains logatomes spoken at a normal pace as well as with the variabilities 'fast', 'slow', 'loud', 'soft', and 'questioning'. These variabilities can be grouped into three categories: (i) speaking rate (fast, slow, and normal), (ii) speaking style (question and statement), and (iii) speaking effort (loud, soft, and normal). Each of the 150 logatomes was repeated three times by each speaker. An equal number of male and female speakers recorded the database to cover gender variability. The sampling frequency of the utterances is 16 kHz. OLLO has mostly been used for comparisons between HSR and ASR [19, 30]. We chose this dataset primarily for the following reasons:

(a) The database makes it possible to evaluate different variabilities and their effects on ASR systems.

(b) OLLO may also be useful for identifying how dialect and accent influence speech recognition performance.

In the following experiments, the ten speakers with no dialect have been chosen. The variabilities fast, slow and normal are used.

4.2 Speech Features

This study uses four popular acoustic features: FBE, MFCC, LPC, and Line Spectral Frequencies (LSF). Features are extracted at multiple resolutions in both the time and frequency domains, resulting in different frame shifts and different feature dimensions, respectively.

FBE. FBE features are extracted with a filter bank of 40 filters of uniform bandwidth on the mel frequency scale, which closely resembles the frequency sensitivity of the human auditory system. The FBE features are computed by taking the logarithm of the filterbank energies.
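
A minimal sketch of this FBE extraction using librosa is shown below; the 25 ms frame length and 10 ms shift are taken from Sect. 5, while the remaining settings (power spectrum, flooring constant) are our assumptions.

```python
import numpy as np
import librosa

def extract_fbe(wav_path, n_mels=40):
    """Log mel filterbank energies with 25 ms frames and 10 ms shift."""
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),       # 25 ms frame length
        hop_length=int(0.010 * sr),  # 10 ms frame shift
        n_mels=n_mels,
        power=2.0,
    )
    return np.log(mel + 1e-10)       # shape: (n_mels, n_frames)
```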

MFCC. MFCCs are obtained by applying the DCT to the FBE features. As a result of this transformation, the features become nearly uncorrelated. To preserve the information in both FBE and MFCC, all coefficients are kept after the DCT, i.e. there is no dimensionality reduction.
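
Continuing the sketch above, the MFCCs can be obtained by a type-II DCT over the filterbank axis; keeping all 40 coefficients, as described, means no truncation is applied.

```python
from scipy.fftpack import dct

def fbe_to_mfcc(log_mel):
    """Type-II DCT over the mel-filter axis; all 40 coefficients are kept."""
    return dct(log_mel, type=2, axis=0, norm="ortho")
```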

LPC. According to the source-filter model of speech, the vocal tract acts as a filter on the excitation signal produced by the lungs and vocal cords [31]. An all-pole filter is used to model the vocal tract frequency response, and the resulting coefficients are the LPC features. These coefficients are extracted from short-time windowed signals to satisfy the quasi-stationarity assumption. The filter order is set to 40 to match the dimensionality of the FBE and MFCC features, which is higher than for typical LP analysis of speech.
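
A per-frame LPC extraction along these lines could look as follows; the framing utility and the Hamming window are our assumptions, while the order of 40 follows the text.

```python
import numpy as np
import librosa

def extract_lpc(wav_path, order=40):
    """Frame-wise LPC coefficients (order 40, 25 ms frames, 10 ms shift)."""
    y, sr = librosa.load(wav_path, sr=16000)
    frame_length, hop_length = int(0.025 * sr), int(0.010 * sr)
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
    window = np.hamming(frame_length)
    # librosa.lpc returns [1, a_1, ..., a_order]; drop the leading 1.
    return np.stack(
        [librosa.lpc(frame * window, order=order)[1:] for frame in frames.T],
        axis=1,
    )  # shape: (order, n_frames)
```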

LSF. Line Spectral Frequencies, also known as Line Spectral Pairs (LSP), are another representation of the LPC coefficients that is less sensitive to quantization noise. The LSF order is kept the same as the LPC order, i.e. 40.
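
The LPC-to-LSF conversion has no standard SciPy routine, so the sketch below follows the usual textbook construction (angles of the roots of the palindromic sum and anti-palindromic difference polynomials); it is our illustration, not code from the paper.

```python
import numpy as np

def poly2lsf(a):
    """Convert LPC coefficients a = [1, a_1, ..., a_p] to LSFs in (0, pi)."""
    a = np.asarray(a, dtype=float)
    ext = np.concatenate([a, [0.0]])
    p_poly = ext + ext[::-1]   # sum polynomial P(z)
    q_poly = ext - ext[::-1]   # difference polynomial Q(z)
    lsf = []
    for poly in (p_poly, q_poly):
        angles = np.angle(np.roots(poly))
        # Keep one root of each conjugate pair; drop the trivial roots at 0 and pi.
        lsf.extend(w for w in angles if 1e-6 < w < np.pi - 1e-6)
    return np.sort(np.array(lsf))  # p line spectral frequencies
```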

4.3 Context Dependent Feature Representation

The input to the DNN is a context-dependent feature vector \(\varvec{x}_c(n)\), computed by including frames to the left and right of the current frame \(\varvec{x}(n)\). M is the number of preceding and following frames that are concatenated with the current frame \(\varvec{x}(n)\) to form the DNN input vector. The left and right context sizes can vary, but in these experiments both are kept the same. The concatenated input vector, shown in (6), is of size \(D\times (2M+1)\), where D is the feature vector dimension and M is the context size. In the experiments, four context sizes (\(M=3, 5, 7\) and 10) are considered to assess the effect of context on frame classification accuracy.

$$\begin{aligned} \varvec{x}_c(n)=[\varvec{x}(n-M)^T, \dots , \varvec{x}(n)^T, \dots , \varvec{x}(n+M)^T]^T, \end{aligned}$$
(6)

where T denotes the transpose operator.
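
A sketch of this frame splicing in NumPy is shown below; handling the edge frames by replication is our assumption, since the paper does not state how utterance boundaries are treated.

```python
import numpy as np

def splice_frames(features, M):
    """Stack each frame with its M left and M right neighbours.

    features : (D, N) feature matrix, M : context size.
    Returns a (D * (2M + 1), N) matrix of context-dependent vectors.
    """
    D, N = features.shape
    padded = np.pad(features, ((0, 0), (M, M)), mode="edge")  # replicate edges
    return np.concatenate(
        [padded[:, offset:offset + N] for offset in range(2 * M + 1)],
        axis=0,
    )
```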

5 Results and Discussion

To evaluate the effect of different speaking rates on frame classification performance, several experiments were conducted. The four feature types are extracted from slow, normal, and fast speech using a 25 ms frame length and a 10 ms frame shift. The frame classification task is performed by sequentially selecting one speaker for testing and the remaining speakers for training the classifier, which makes it a speaker-independent frame classification task. The number of phone classes is 24. A DNN is used as the classifier, trained according to Sect. 3. Several experiments were conducted by varying the number of neurons per layer (128, 256, 512, 1024) and the number of hidden layers (2, 3, 4, 5) for all feature sets.
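
The leave-one-speaker-out protocol described here can be sketched with scikit-learn's LeaveOneGroupOut; the variable names, the epoch count, and the surrounding data handling are our assumptions and not taken from the paper.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# X: (n_frames, D * (2M + 1)) spliced features, y: (n_frames,) phone labels 0..23,
# speakers: (n_frames,) speaker id of each frame.
def speaker_independent_accuracy(X, y, speakers, build_model):
    accuracies = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
        model = build_model(input_dim=X.shape[1])  # e.g. the build_dnn sketch above
        model.fit(X[train_idx], np.eye(24)[y[train_idx]],  # one-hot targets
                  batch_size=1024, epochs=10, verbose=0)
        preds = model.predict(X[test_idx], batch_size=1024).argmax(axis=1)
        accuracies.append((preds == y[test_idx]).mean())
    return float(np.mean(accuracies))
```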

Table 1. Frame accuracy rate for different features and different structures for normal speaking style and context size \(M=3\)

Our findings revealed that networks with 512 and 1024 neurons per layer achieve the highest accuracy rates for 3 and 4 hidden layers. This paper, therefore, presents results only for 3- and 4-layer architectures with 512 and 1024 neurons. Tables 1, 2, 3 and 4 show the frame accuracy rates on the training and test data for the normal speaking style. FBE performs better than the other feature types, whereas LPC performs worst. Regarding the effect of context size, moving from \(M=3\) to higher values increases performance significantly. From context size \(M=5\) to \(M=10\) there is little further improvement for LSF and MFCC, whose performance more or less saturates, but FBE gains almost one percentage point in accuracy.

Table 2. Frame accuracy rate for different features and different structures for normal speaking style and context size \(M=5\)
Table 3. Frame accuracy rate for different features and different structures for normal speaking style and context size \(M=7\)
Table 4. Frame accuracy rate for different features and different structures for normal speaking style and context size \(M=10\)
Fig. 3. Rates for correct and incorrect classification of consonants and vowels. The green bar shows the correct classification rate for consonants, the blue bar the misclassification of a consonant as another consonant, and the red bar the confusion of consonants with vowels. The yellow bar shows the correct classification rate for vowels, the cyan bar the misclassification of a vowel as a consonant, and the purple bar the confusion of a vowel with another vowel. (Color figure online)

Fig. 4. Vowel part of the confusion matrix for different speaking rates.

In order to examine whether speaking rate has a different impact on consonant and vowel recognition, we looked at the average frame classification accuracy for vowels and consonants separately. In addition, the misclassifications were broken down into two categories for both vowels and consonants: confusions within the broad class (e.g. a consonant misclassified as another consonant) and confusions between the classes (e.g. a vowel misclassified as a consonant). Figure 3 shows these performance measures for test data with different speaking rates, using FBE features with context size \(M=10\) as input to a DNN with 1024 hidden nodes in each of its three hidden layers. Taking the normal speaking rate as the reference point, the correct classification rate for consonants is lowest at the fast speaking rate, where consonants are confused more often with vowels. Likewise, the correct classification rate for vowels is lowest at the slow speaking rate, where the confusions among vowels increase.
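
The broad-class rates shown in Fig. 3 can be derived from a phone-level confusion matrix as sketched below; the convention that rows correspond to reference phones and columns to hypotheses is our assumption.

```python
import numpy as np

def broad_class_rates(conf, vowel_idx, consonant_idx):
    """Summarize a phone confusion matrix of counts (rows = reference,
    cols = hypothesis) into correct / within-class / cross-class rates."""
    rates = {}
    for name, idx, other in (("consonant", consonant_idx, vowel_idx),
                             ("vowel", vowel_idx, consonant_idx)):
        block = conf[np.ix_(idx, idx)]     # confusions within the broad class
        total = conf[idx].sum()            # all frames whose reference is in the class
        rates[name] = {
            "correct": np.trace(block) / total,
            "within_class_error": (block.sum() - np.trace(block)) / total,
            "cross_class_error": conf[np.ix_(idx, other)].sum() / total,
        }
    return rates
```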

To support these claims, we look at the confusion matrices for the slow and normal speaking rates. Figure 4 shows that the slow speaking rate causes more confusion between long and short vowels than the normal speaking rate. It is worth mentioning that the rows of the confusion matrices in Fig. 4 do not sum to one, because the confusions with consonants are not shown.

The results in Tables 5, 6, 7 and 8 are from the same networks as in the previous experiments. The only difference is that the networks are trained on the normal speaking rate, while performance is evaluated on the slow and fast speaking test sets. In these experiments, the FBE results are again superior to the other feature types. The context size has no effect on the LPC results, but for the other features the performance increases moderately, and FBE yields higher performance even with the smaller context sizes. For FBE, the accuracy rate for the fast speaking rate is always better than for the slow speaking rate, whereas for the other feature types the slow speaking rate performs better than the fast one at the larger context sizes.

Table 5. Frame accuracy rate for fast (V1) and slow (V2) speaking styles on the networks trained on normal speaking style, context size \(M=3\)
Table 6. Frame accuracy rate for fast (V1) and slow (V2) speaking styles on the networks trained on normal speaking style, context size \(M=5\)
Table 7. Frame accuracy rate for fast (V1) and slow (V2) speaking styles on the networks trained on normal speaking style, context size \(M=7\)
Table 8. Frame accuracy rate for fast (V1) and slow (V2) speaking styles on the networks trained on normal speaking style, context size \(M=10\)

6 Conclusion

The paper provides a comparative analysis of the acoustic features LPC, LSF, MFCC, and FBE, using DNNs trained on slow, fast, and normal speaking rate utterances. Different DNN architectures, obtained by varying the number of layers and the number of nodes per layer, are tested. Three-layer architectures with 512 and 1024 nodes per layer performed well. Further experiments varying the context window size for each feature were performed. Our initial findings revealed that, across context sizes, FBE achieved the highest frame classification accuracy for the normal speaking style. A similar trend was observed when the classifier was trained on the normal speaking rate and tested on the slow and fast speaking rates. It was also observed that the bigger the context window, the better the classification accuracy.

Future work should focus on evaluating other deep learning classifiers, in particular those suited to modeling time series, e.g. long short-term memory networks, to study the effect of variable speaking rates on phoneme recognition.