The goal of this experiment is to develop a procedure for selecting utterances from a list of predicted responses and to evaluate the effects of different language models, pronunciation lexicons, and acoustic models.
4.1. Method
4.1.1. Material
The speech material for the present experiments was taken from the JASMIN speech corpus [45], which contains speech of children, non-natives, and elderly people. Since the non-native component of the JASMIN corpus was collected with the aim of facilitating the development of ASR-based language learning applications, it is particularly suited to our purpose. Speech from speakers with different mother tongues was collected, because this realistically reflects the situation in Dutch L2 classes. These speakers have relatively low proficiency levels, namely A1, A2, and B1 of the Common European Framework (CEF), because it is for these levels that ASR-based CALL applications appear to be most needed.
The JASMIN corpus contains speech collected in two different modalities: read speech and human-machine dialogues. The latter were used for our experiments because they more closely resemble the situation we will encounter in our CALL application. The JASMIN dialogues were collected through a Wizard-of-Oz-based platform and were designed such that the wizard was in control of the dialogue and could intervene when necessary. In addition, recognition errors were simulated and difficult questions were asked to elicit some typical phenomena of human-machine interaction that are known to be problematic in the development of spoken dialogue systems, such as hyperarticulation, restarts, filled pauses, self-talk, and repetitions.
The material we used for the present experiments consists of speech from 45 speakers, 40% male and 60% female, with 25 different L1 backgrounds. Ages range from 19 to 55, with a mean of 33. Each speaker answered 39 questions about a journey. We first removed from the corpus the utterances that contain crosstalk, background noise, or whispering. After deletion of these utterances, the material consists of 1325 utterances. The mean signal-to-noise ratio (SNR) of the material is 24.9 dB, with a standard deviation of 5.1 dB.
Considering all these characteristics, we can state that the JASMIN non-native dialogues are similar to the speech we will encounter in our CALL application for various reasons: (1) they contain answers to relatively constrained questions, (2) they contain semispontaneous speech, (3) produced by non-natives with different L1s, (4) featuring spontaneous phenomena such as filled pauses and disfluencies. However, since hesitation phenomena were purposefully induced in the JASMIN dialogues, their incidence is probably higher than in typical non-native dialogues.
4.1.2. Speech Recognizer
The speech recognizer we used in this research is SPRAAK [46], an open-source hidden Markov model (HMM)-based ASR package. The input speech, sampled at 16 kHz, is divided into overlapping 32-millisecond Hamming windows with a 10-millisecond shift and a preemphasis factor of 0.95. Twelve Mel-frequency cepstral coefficients (MFCCs) plus C0, and their first and second order derivatives, were calculated, and cepstral mean subtraction (CMS) was applied. The constrained language models and pronunciation lexicons are implemented as finite state machines (FSMs).
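For illustration, a comparable front-end can be sketched in Python with librosa. This is a minimal approximation of the pipeline just described under the stated parameters, not SPRAAK's implementation, and the function name is ours:

```python
import numpy as np
import librosa

def extract_features(wav_path):
    """Approximate the front-end described above: 16 kHz input, 32 ms
    Hamming windows with a 10 ms shift, pre-emphasis 0.95, 13 cepstra
    (C0 + 12 MFCCs) with first and second order derivatives, and
    per-utterance cepstral mean subtraction."""
    signal, sr = librosa.load(wav_path, sr=16000)
    signal = librosa.effects.preemphasis(signal, coef=0.95)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                n_fft=512,        # 32 ms at 16 kHz
                                hop_length=160,   # 10 ms shift
                                window="hamming")
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),             # first derivatives
                       librosa.feature.delta(mfcc, order=2)])   # second derivatives
    feats -= feats.mean(axis=1, keepdims=True)  # cepstral mean subtraction
    return feats.T  # one 39-dimensional feature vector per frame
```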
To simulate the ASR task in our CALL application, we generated lists of the answers given by each speaker to each of the 39 questions. These lists mimic the predicted responses in our CALL application task because they contain (a) responses to relatively closed questions and (b) morphologically and syntactically correct and incorrect responses.
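A minimal sketch of how such per-question response lists could be assembled; the field names `question_id` and `transcription` are assumptions about the data layout, not JASMIN's actual format:

```python
from collections import defaultdict

def answers_per_question(utterances):
    """Group the orthographic transcriptions by question to obtain one
    list of unique responses per question, mimicking the predicted
    responses in the CALL application task."""
    lists = defaultdict(set)
    for utt in utterances:
        lists[utt["question_id"]].add(utt["transcription"])
    return {q: sorted(answers) for q, answers in lists.items()}
```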
4.1.3. Language Modelling
Our approach is to use a constrained language model (LM) to restrict the search space. In total 39 LMs were generated based on the responses to each of the 39 questions. These responses were manually transcribed at the orthographic level. Filled pauses, restarts, and repetitions were also annotated.
Filled pauses are common in everyday spontaneous speech and generally do not hamper communication. It seems, therefore, that students using a CALL application should be allowed to produce a limited number of filled pauses. In our material 46% of the utterances contain one or more filled pauses, and almost 13% of all transcribed units are filled pauses.
However, 11% of the utterances contain one or more other disfluencies, such as restarts, repairs, and repetitions. While these also occur in normal speech, albeit less frequently, we think that in a CALL application for training oral proficiency students should be stimulated to produce fluent speech. On these grounds, we decided not to tolerate restarts, repetitions, and repairs and to ask the students to try again when one of these phenomena is produced. Therefore, we did not focus on restarts, repairs, and repetitions in our research; we only included their orthographic transcriptions in the LM and their manual phonetic transcriptions in the lexicon.
The LMs are implemented as FSMs with parallel paths of orthographic transcriptions of every unique answer to the question. A priori, each path is equally likely. An example of such a question is "Hoe wilt u naar deze stad reizen?" (How do you want to travel to this city?), and a small part of the responses is:
(1) /ik gaat met de vliegtuig/ (/I am going by plane/*),
(2) /ik ga met de trein/ (/I am going by train/),
(3) /met de vliegtuig/ (/by plane/*),
(4) /met het vliegtuig/ (/by plane/).
The baseline LM that is generated from this list is depicted in Figure 2. Each of the parallel paths with words on the arcs represents a unique answer to a question. Silence is possible before and after each word (not shown).
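For illustration, such a parallel-path LM can be sketched as a plain list of weighted arcs; this data structure is ours for expository purposes and does not reflect SPRAAK's actual FSM format:

```python
def build_baseline_lm(responses):
    """Build a parallel-path FSM over unique responses: state 0 is the
    start state, state 1 the final state, and each unique response gets
    its own path with an a priori equal weight on its first arc."""
    arcs = []           # (source_state, target_state, word, weight)
    next_state = 2
    unique = sorted(set(responses))
    prior = 1.0 / len(unique)
    for response in unique:
        words = response.split()
        src = 0
        for i, word in enumerate(words):
            last = (i == len(words) - 1)
            dst = 1 if last else next_state
            if not last:
                next_state += 1
            arcs.append((src, dst, word, prior if i == 0 else 1.0))
            src = dst
    return arcs

# Example: two of the responses from the list above
lm_arcs = build_baseline_lm(["ik ga met de trein", "met het vliegtuig"])
```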
To be able to decode possible filled pauses between words, we generated another LM with self-loops added at every node. Filled pauses are represented in the pronunciation lexicon as /@/ or /@m/, the phonetic representations of the two most common filled pauses in Dutch. The filled pause loop penalty was empirically optimized. An example of this language model is depicted in Figure 3.
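Continuing the sketch above, filled pause self-loops can be added at every state; the penalty value below is a placeholder, not the empirically optimized one:

```python
def add_filled_pause_loops(arcs, fp_entries=("@", "@m"), penalty=0.1):
    """Add a self-loop for each filled pause entry at every state, so
    that filled pauses can be decoded between any two words."""
    states = set()
    for src, dst, _, _ in arcs:
        states.update((src, dst))
    loops = [(s, s, fp, penalty) for s in sorted(states) for fp in fp_entries]
    return arcs + loops
```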
To examine whether filled pause loops are an adequate way of modelling filled pauses, we also experimented with an oracle LM. This is an LM containing the reference orthographic transcriptions, which include the manually annotated filled pauses without filled pause loops.
4.1.4. Acoustic Modelling
We trained three-state tied Gaussian Mixture Models (GMM). Baseline triphone models were trained on 42 hours of native read speech from the CGN corpus [47]. In total 11 660 triphones were created, using 32 738 Gaussians.
As discussed in Section 2.1, several studies have observed that decoding performance can be increased by adapting or retraining native acoustic models (AMs) with non-native speech. To investigate whether this is also the case in a constrained task such as the one described in this paper, we retrained the baseline acoustic models with non-native speech.
New AMs were obtained by doing a one-pass Viterbi training based on the native AMs with 6 hours of non-native read speech from the JASMIN corpus. These utterances were spoken by the same speakers as those in our test material (comparable to an enrollment phase).
Triphone AMs are the de facto choice for most researchers in speech technology. However, the expected performance gain from modelling context dependency by using triphones over monophones might be minimal in a constrained task. Therefore, we also experimented with non-native monophone AMs trained on the same non-native read speech.
4.1.5. Lexical Modelling
The baseline pronunciation lexicon contains canonical phonemic representations extracted from the CGN lexicon. The distribution of sizes of the 39 lexicons is depicted in Figure 4.
As explained in Section 2.1, non-native pronunciation generally deviates from native pronunciation, both at the phonetic and the phonemic level. To model pronunciation variation at the phonemic level, we added pronunciation variants to the lexicon.
To derive pronunciation variants, we extracted context-dependent rewrite rules from an alignment of canonical and realized phonemic representations of non-native speech from the JASMIN corpus (the test material was excluded). Prior probabilities of these rules were estimated by taking the relative frequency of rule applications in their context.
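A minimal sketch of this rule extraction under simplifying assumptions (substitution and deletion rules only; '#' marks a word boundary):

```python
from collections import Counter

def extract_rules(alignments):
    """Derive context-dependent rewrite rules from aligned canonical and
    realized phone sequences. Each alignment is a list of (canonical,
    realized) pairs, with '-' marking a deleted phone; insertions are
    ignored here for brevity. A rule (left, focus, right) -> realized
    gets as its prior the relative frequency of that rewrite in that
    context."""
    context_counts, rule_counts = Counter(), Counter()
    for pairs in alignments:
        aligned = [(c, r) for c, r in pairs if c != "-"]  # drop insertions
        canon = [c for c, _ in aligned]
        for i, (c, r) in enumerate(aligned):
            left = canon[i - 1] if i > 0 else "#"
            right = canon[i + 1] if i < len(canon) - 1 else "#"
            context_counts[(left, c, right)] += 1
            if r != c:  # an actual rewrite in this context
                rule_counts[(left, c, right, r)] += 1
    return {rule: count / context_counts[rule[:3]]
            for rule, count in rule_counts.items()}
```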
We generated pronunciation variants by successively applying the derived rewrite rules to the canonical representations in the baseline lexicon. Variant probabilities were calculated by multiplying the applied rule probabilities. Canonical representations have a standard probability of one. Afterwards, probabilities of pronunciation variants per word were normalized so that these probabilities sum to one.
By introducing a cutoff probability, pronunciation lexicons were created that contain only variants above this cutoff. In this way lexicons with on average 2, 3, 4, and 5 variants per word were created.
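The variant generation, normalization, and cutoff pruning can be sketched as follows (again substitution rules only; the default cutoff is a placeholder value, not one of those actually used):

```python
def generate_variants(canonical, rules, cutoff=0.1):
    """Successively apply matching rewrite rules to a canonical
    representation (a list of phones). A variant's probability is the
    product of the applied rule probabilities; the canonical form starts
    at 1.0. Probabilities are then normalized per word and variants
    below the cutoff are pruned."""
    variants = {tuple(canonical): 1.0}
    padded = ["#"] + list(canonical) + ["#"]
    for i in range(1, len(padded) - 1):  # position i-1 in the variant tuples
        context = (padded[i - 1], padded[i], padded[i + 1])
        for (left, focus, right, realized), prob in rules.items():
            if (left, focus, right) == context:
                for var, var_prob in list(variants.items()):
                    new = var[:i - 1] + (realized,) + var[i:]
                    variants[new] = max(variants.get(new, 0.0), var_prob * prob)
    total = sum(variants.values())
    return {var: p / total for var, p in variants.items() if p / total >= cutoff}
```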
4.1.6. Evaluation
We evaluated the speech decoding setups using the utterance error rate (UER), which is the percentage of utterances where the 1-Best decoding result deviates from the transcription. Filled pauses are not taken into account during evaluation. That is, decoding results and reference transcriptions were compared after deletion of filled pauses. For each UER the 95% confidence interval was calculated to evaluate whether UERs between conditions were significantly different.
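A minimal sketch of this evaluation metric, using a normal-approximation confidence interval (the filled pause token symbols are assumptions about the transcription conventions):

```python
import math

def utterance_error_rate(hypotheses, references, fp_tokens=("@", "@m")):
    """UER: fraction of utterances whose 1-Best result deviates from the
    reference transcription, compared after deleting filled pauses.
    Returns the UER and a 95% confidence interval."""
    def strip_fp(utterance):
        return [w for w in utterance.split() if w not in fp_tokens]

    n = len(references)
    errors = sum(strip_fp(h) != strip_fp(r)
                 for h, r in zip(hypotheses, references))
    uer = errors / n
    half_width = 1.96 * math.sqrt(uer * (1.0 - uer) / n)
    return uer, (uer - half_width, uer + half_width)
```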
As explained in the introduction, we do not expect our method to carry out a detailed phonetic analysis in the first phase. Since it is not necessary to discriminate between phonetically close responses at this stage, a decoding result can be classified as correct when its phonetic distance to the corresponding transcription is below a threshold. The phonetic distance was calculated through an alignment program that uses a dynamic programming algorithm to align transcriptions on the basis of distance measures between phonemes represented as combinations of phonetic features [48]. These phonemic transcriptions were made using the canonical pronunciation variants from the words in the orthographic transcriptions.
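The alignment program itself is described in [48]; a generic sketch of such a dynamic-programming alignment, with the feature-based phoneme distance left as a pluggable cost function, is:

```python
def phonetic_distance(a, b, sub_cost, indel=1.0):
    """Weighted edit distance between two phone sequences a and b via
    dynamic programming; sub_cost(x, y) stands in for a distance based
    on shared phonetic features (cf. [48])."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel
    for j in range(1, n + 1):
        d[0][j] = j * indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + indel,      # deletion
                          d[i][j - 1] + indel,      # insertion
                          d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return d[m][n]
```

With `sub_cost = lambda x, y: 0.0 if x == y else 1.0` this reduces to plain Levenshtein distance; the feature-based measure of [48] would plug in at that point.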
4.2. Results
In Table 1, the UERs for the different language models and acoustic models can be observed. In all cases, the LM with filled pause loops performed significantly better than the LM without loops. Furthermore, the oracle LM with manually annotated filled pauses (with positions) did not perform significantly better than the LM with loops.
Table 1: UERs for the different language models (without FP loops, with FP loops, and with FP positions) and for different acoustic models (trained on native speech (triphone) and retrained on non-native speech (triphone and monophone)). All setups used the baseline canonical lexicon. The columns 0, 5, 10, and 15 indicate at what phonetic distance to the reference transcription a decoding result is classified as correct.
Decoding setups with AMs retrained on non-native speech performed significantly better than those with AMs trained on native speech. The performance difference between monophone and triphone AMs was not significant.
As expected, error rates are lower when evaluating using clusters of phonetically similar responses. To better appreciate the results in Table 1, it is important to get an idea of the meaning of these distances. The distances between the example responses in Section 4.1.3 are shown in Table 2. The density of the phonetic distances between all response pairs to all questions is depicted in Figure 5. Since there are only a few responses with a phonetic distance smaller than 5, differences between the 0 and 5 conditions are marginal. Performance differences between 0 (the 1-Best equals the transcription) and 10 (one of the answers within a phonetic distance of 10 of the 1-Best equals the transcription), and between 5 and 15, were significant.
Table 2: Phonetic distances between the example responses: (1) "ik gaat met de vliegtuig", (2) "ik ga met de trein", (3) "met de vliegtuig", (4) "met het vliegtuig".
As can be seen in Table 3, performance decreased when using lexicons with pronunciation variants generated using data-driven methods. The more variants are added, the worse the performance. Furthermore, there is no significant difference between using equal priors or estimated priors.
Table 3: UERs for different lexicons: canonical, and 2–5 variants with and without priors. These rates were obtained using non-native triphone acoustic models and language models with filled pause loops.
4.3. Discussion
The results presented in the previous section indicate that large and significant improvements could be obtained by optimizing the language model and the acoustic models. On the other hand, pronunciation modelling at the level of the lexicon did not produce significant improvements. On the contrary, adding variants to the lexicon caused a decrease in performance. Adding estimated prior probabilities to the variants improved the results somewhat, but the error rates still remain higher than those for the canonical lexicon. These results might seem surprising because, in general, adding a limited number of carefully selected pronunciation variants to the lexicon improves performance [29, 30]. However, in the case of non-native speech this strategy is not always successful [31]. Possible explanations might be sought in the nature of the variation that characterizes non-native speech. Non-native speakers are likely to replace target language phonemes by phonemes from their mother tongue [3, 5]. When the non-native speech is heterogeneous in the sense that it is produced by speakers with different mother tongues, as in our case, it may be extremely difficult to capture the rather diffuse pattern of variation by including variants in the lexicon (see also [4]).
The findings that better results are obtained with non-native acoustic models and with a language model with filled pause loops are not surprising; after all, the utterances were spoken by non-natives, recorded in the same environment, and contain many filled pauses. In fact, these results do not differ significantly from the results obtained with an oracle language model, in which the exact position of the filled pauses is copied from the manual transcriptions. This is an important result because non-natives are known to produce numerous filled pauses in unprepared, extemporaneous speech [12]. From these results we can conclude that external filled pause detection, for which better results were found in a large vocabulary task [49], is not necessary in this case.
Another reassuring result is that performance improved with non-native acoustic models. These were obtained by retraining native models on a relatively small amount (around 8 minutes per speaker) of non-native read speech material. It appears that this was sufficient to obtain significantly better results. In the final application we might then use a relatively short enrollment phase and do acoustic model retraining (and/or online speaker adaptation) to obtain better recognition results.
While in this experiment the correct transcription of the response was always present in the language model, our system must also be able to reject utterances that are not present in the language model, while still accepting correctly recognized utterances. This is the topic of the experiment presented in the following section.