Automatic Phonetic Segmentation Using the Kaldi Toolkit

  • Jindřich Matoušek
  • Michal Klíma
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10415)


In this paper we explore the possibilities of hidden Markov model-based automatic phonetic segmentation with the Kaldi toolkit. We compare the Kaldi toolkit and the Hidden Markov Model Toolkit (HTK) in terms of segmentation accuracy. A well-tuned HTK-based phonetic segmentation framework was taken as the baseline and compared to a newly proposed segmentation framework built from the default examples and recipes available in the Kaldi repository. Since the segmentation accuracy of the HTK-based system was significantly higher than that of the Kaldi-based system, the default Kaldi settings were modified with respect to pause model topology, the way of generating phonetic questions for clustering, and the number of Gaussian mixtures used during modeling. The modified Kaldi-based system achieved results comparable to those obtained by HTK: slightly worse for small segmentation errors but better for gross segmentation errors. We also confirmed that, for both toolkits, the standard three-state left-to-right model topology was significantly outperformed by a modified five-state left-to-right topology, especially with respect to small segmentation errors.
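The distinction between small and gross segmentation errors can be illustrated with a minimal sketch. It assumes that accuracy is measured as the absolute deviation between automatically placed and reference phone boundaries, which is a common convention but is not spelled out in the abstract. The function names, the 10 ms and 20 ms thresholds, and the boundary times are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def boundary_errors(reference_ms: Sequence[int], predicted_ms: Sequence[int]) -> List[int]:
    """Absolute deviation (in ms) between matching reference and predicted boundaries."""
    if len(reference_ms) != len(predicted_ms):
        raise ValueError("boundary lists must have the same length")
    return [abs(p - r) for r, p in zip(reference_ms, predicted_ms)]


def fraction_within(errors_ms: Sequence[int], tolerance_ms: int) -> float:
    """Share of boundaries whose deviation does not exceed the tolerance."""
    if not errors_ms:
        raise ValueError("no boundaries to evaluate")
    return sum(e <= tolerance_ms for e in errors_ms) / len(errors_ms)


# Hypothetical boundary times in milliseconds (not data from the paper).
reference = [100, 250, 400, 620]
predicted = [105, 245, 430, 700]

errors = boundary_errors(reference, predicted)

# Assumed thresholds: 10 ms for "small" errors, 20 ms for "gross" errors.
small_ok = fraction_within(errors, 10)
gross = 1.0 - fraction_within(errors, 20)

print(f"within 10 ms: {small_ok:.0%}, gross errors (> 20 ms): {gross:.0%}")
```

Under this kind of measure, a system can be better at one tolerance and worse at another, which is how results can be slightly worse for small errors yet better for gross errors.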


Keywords: Automatic phonetic segmentation, HTK, Kaldi, Hidden Markov models



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Department of Cybernetics, Faculty of Applied Sciences, New Technology for the Information Society (NTIS), University of West Bohemia, Plzeň, Czech Republic
