Comparison of Speech Recognition Performance Between Kaldi and Google Cloud Speech API

  • Takashi Kimura
  • Takashi Nose
  • Shinji Hirooka
  • Yuya Chiba
  • Akinori Ito
Conference paper
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 110)


In recent years, the number of systems with a speech interface has grown. Such interfaces often include a spoken dialogue function, and high performance is required of the spoken dialogue system. A spoken dialogue system includes a speech recognition module. In this study, we focus on that module and aim to improve the spoken dialogue system by enhancing the performance of speech recognition. Among the available speech recognition systems, Kaldi is widely used in many kinds of research. In addition, several speech recognition services are offered as Web APIs, such as IBM Watson Speech to Text, Microsoft Bing Speech API, and Google Cloud Speech API, the last of which is known for its high performance. This paper compares the speech recognition performance of Kaldi and Google Cloud Speech API in terms of word error rate (WER) and real-time factor (RTF) and confirms the recognition performance of each system.
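
For readers unfamiliar with the two metrics, the following is a minimal Python sketch of how WER and RTF are commonly computed. It is not the authors' evaluation code: the function names `word_error_rate` and `real_time_factor` are illustrative, and the inputs are assumed to be already split into words (for Japanese this would require a prior segmentation step that is not shown here).

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(reference)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (below 1 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

A lower WER indicates more accurate recognition, and an RTF below 1 means the recognizer processes audio faster than it is spoken. For a cloud service, the measured processing time would typically also include network latency, which is one reason RTF comparisons between a local toolkit and a Web API need care.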


Keywords: Speech recognition, Kaldi, Google Cloud Speech API



Part of this work was supported by JSPS KAKENHI Grant Number JP17H00823.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Takashi Kimura (1)
  • Takashi Nose (1)
  • Shinji Hirooka (2, 3)
  • Yuya Chiba (1)
  • Akinori Ito (1), corresponding author
  1. Graduate School of Engineering, Tohoku University, Sendai-shi, Japan
  2. R&D Center, Hmcomm Co., Ltd., Tokyo, Japan
  3. Faculty of Science, Chiba University, Chiba-shi, Japan
