Automatic Phonetic Segmentation Using the Kaldi Toolkit

Matoušek, Jindřich; Klíma, Michal

doi:10.1007/978-3-319-64206-2_16


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10415)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue


Abstract

In this paper, we explore the possibilities of hidden Markov model based automatic phonetic segmentation with the Kaldi toolkit. We compare the Kaldi toolkit and the Hidden Markov Model Toolkit (HTK) in terms of segmentation accuracy. A well-tuned HTK-based phonetic segmentation framework was taken as the baseline and compared to a newly proposed segmentation framework built from the default examples and recipes available in the Kaldi repository. Since the segmentation accuracy of the HTK-based system was significantly higher than that of the Kaldi-based system, the default Kaldi setting was modified with respect to the pause model topology, the way phonetic questions are generated for clustering, and the number of Gaussian mixtures used during modeling. The modified Kaldi-based system achieved results comparable to those obtained by HTK: slightly worse for small segmentation errors but better for gross segmentation errors. We also confirmed that, for both toolkits, the standard three-state left-to-right model topology was significantly outperformed by a modified five-state left-to-right topology, especially with respect to small segmentation errors.
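As an illustration of what "small" and "gross" segmentation errors can mean in practice, the following minimal sketch scores automatically placed phone boundaries against reference boundaries by the share that falls within a tolerance. It is not the authors' evaluation code: the function names, the boundary times, and the 10 ms and 25 ms tolerances are all assumptions chosen for the example.

```python
from typing import List, Sequence


def boundary_errors(reference: Sequence[float], automatic: Sequence[float]) -> List[float]:
    """Absolute deviation (in seconds) between paired phone boundaries."""
    if len(reference) != len(automatic):
        raise ValueError("boundary lists must be aligned one-to-one")
    return [abs(r - a) for r, a in zip(reference, automatic)]


def share_within(errors: Sequence[float], tolerance: float) -> float:
    """Fraction of boundaries whose deviation is at most `tolerance` seconds."""
    if not errors:
        raise ValueError("no boundaries to evaluate")
    return sum(e <= tolerance for e in errors) / len(errors)


# Hypothetical boundary times in seconds (illustrative only, not data from the paper).
reference = [0.100, 0.250, 0.400, 0.620]
automatic = [0.104, 0.262, 0.395, 0.660]

errors = boundary_errors(reference, automatic)
small_ok = share_within(errors, 0.010)  # assumed tolerance for "small" errors
gross_ok = share_within(errors, 0.025)  # assumed tolerance for "gross" errors

print(f"within 10 ms: {small_ok:.0%}, within 25 ms: {gross_ok:.0%}")
```

A tight tolerance is sensitive to fine boundary placement, while a loose one only counts large misplacements, so a system can score lower on the first measure and higher on the second, as the abstract reports for the modified Kaldi-based system.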

This research was supported by the Technology Agency of the Czech Republic (TA CR), project No. TH02010307.



Author information

Authors and Affiliations

Department of Cybernetics, Faculty of Applied Sciences, New Technology for the Information Society (NTIS), University of West Bohemia, Plzeň, Czech Republic
Jindřich Matoušek & Michal Klíma

Authors

Jindřich Matoušek
Michal Klíma

Corresponding author

Correspondence to Jindřich Matoušek.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein
University of West Bohemia, Pilsen, Czech Republic
Václav Matoušek


About this paper

Cite this paper

Matoušek, J., Klíma, M. (2017). Automatic Phonetic Segmentation Using the Kaldi Toolkit. In: Ekštein, K., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2017. Lecture Notes in Computer Science, vol 10415. Springer, Cham. https://doi.org/10.1007/978-3-319-64206-2_16


DOI: https://doi.org/10.1007/978-3-319-64206-2_16
Published: 29 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64205-5
Online ISBN: 978-3-319-64206-2
eBook Packages: Computer Science, Computer Science (R0)
