Automatic Phonetic Segmentation Using the Kaldi Toolkit

  • Conference paper
Text, Speech, and Dialogue (TSD 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10415)

Abstract

In this paper we explore the possibilities of hidden Markov model based automatic phonetic segmentation with the Kaldi toolkit. We compare the Kaldi toolkit and the Hidden Markov Model Toolkit (HTK) in terms of segmentation accuracy. The well-tuned HTK-based phonetic segmentation framework was taken as the baseline and compared to a newly proposed segmentation framework built from the default examples and recipes available in the Kaldi repository. Since the segmentation accuracy of the HTK-based system was significantly higher than that of the Kaldi-based system, the default Kaldi setting was modified with respect to pause model topology, the way of generating phonetic questions for clustering, and the number of Gaussian mixtures used during modeling. The modified Kaldi-based system achieved results comparable to those obtained by HTK—slightly worse for small segmentation errors but better for gross segmentation errors. We also confirmed that, for both toolkits, the standard three-state left-to-right model topology was significantly outperformed by a modified five-state left-to-right topology, especially with respect to small segmentation errors.
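
The distinction between small and gross segmentation errors can be illustrated with a minimal sketch. It assumes that accuracy is measured as the share of automatically placed boundaries that fall within a tolerance of manually placed reference boundaries. The function names, the boundary times, and the 10 ms and 20 ms thresholds are hypothetical choices for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def boundary_errors(reference_ms: Sequence[int], automatic_ms: Sequence[int]) -> List[int]:
    """Absolute deviation (in ms) between paired reference and automatic boundaries."""
    if len(reference_ms) != len(automatic_ms):
        raise ValueError("boundary sequences must have the same length")
    return [abs(r - a) for r, a in zip(reference_ms, automatic_ms)]


def share_within(errors_ms: Sequence[int], tolerance_ms: int) -> float:
    """Fraction of boundaries whose deviation does not exceed the tolerance."""
    if not errors_ms:
        raise ValueError("no boundaries to evaluate")
    return sum(e <= tolerance_ms for e in errors_ms) / len(errors_ms)


# Hypothetical boundary times in milliseconds (not data from the paper).
reference_ms = [100, 250, 400, 620]
automatic_ms = [105, 262, 400, 675]

errors_ms = boundary_errors(reference_ms, automatic_ms)

SMALL_TOLERANCE_MS = 10  # assumed threshold for "small" errors
GROSS_TOLERANCE_MS = 20  # assumed threshold beyond which an error counts as gross

accuracy_small = share_within(errors_ms, SMALL_TOLERANCE_MS)
gross_error_rate = 1.0 - share_within(errors_ms, GROSS_TOLERANCE_MS)

print(f"within {SMALL_TOLERANCE_MS} ms: {accuracy_small:.2f}")
print(f"beyond {GROSS_TOLERANCE_MS} ms: {gross_error_rate:.2f}")
```

Under this reading, a system can score worse at the tight tolerance while still producing fewer boundaries beyond the loose one, which is the pattern the abstract reports for the modified Kaldi-based system relative to HTK.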

This research was supported by the Technology Agency of the Czech Republic (TA CR), project No. TH02010307.

Author information

Correspondence to Jindřich Matoušek.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Matoušek, J., Klíma, M. (2017). Automatic Phonetic Segmentation Using the Kaldi Toolkit. In: Ekštein, K., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2017. Lecture Notes in Computer Science, vol 10415. Springer, Cham. https://doi.org/10.1007/978-3-319-64206-2_16

  • DOI: https://doi.org/10.1007/978-3-319-64206-2_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64205-5

  • Online ISBN: 978-3-319-64206-2

  • eBook Packages: Computer Science, Computer Science (R0)
