Comparison of Speech Recognition Performance Between Kaldi and Google Cloud Speech API
In recent years, the number of systems with a speech interface has grown. Such interfaces include a spoken dialogue function, and high performance is required of the spoken dialogue system. A spoken dialogue system includes a speech recognition module. In this study, we focus on this module and aim to improve the spoken dialogue system by enhancing its speech recognition performance. Among speech recognition systems, Kaldi is widely used in many kinds of research. On the other hand, several speech recognition services are provided as Web APIs, such as IBM Watson Speech to Text, Microsoft Bing Speech API, and Google Cloud Speech API, which is known for its high performance. This paper compares the speech recognition performance of Kaldi and Google Cloud Speech API in terms of word error rate (WER) and real-time factor (RTF), and confirms the recognition performance of each system.
Keywords: Speech recognition, Kaldi, Google Cloud Speech API
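The two evaluation measures named in the abstract, WER and RTF, can be illustrated with a minimal sketch based on their standard definitions. This is not the authors' evaluation code: the function names `word_error_rate` and `real_time_factor` are hypothetical, and the sketch assumes that both transcripts are already split into words by whitespace (Japanese text would first need word segmentation). WER is taken as the word-level edit distance divided by the number of reference words, and RTF as processing time divided by audio duration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumes whitespace-separated words (an assumption of this sketch).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent recognizing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # 2.5 s of processing for 10 s of audio.
    print(real_time_factor(2.5, 10.0))
```

A lower WER indicates more accurate recognition, and an RTF below 1 means that recognition runs faster than real time. For a Web API, the measured processing time would normally also include network latency, which is worth keeping in mind when comparing it with a locally run recognizer.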
Part of this work was supported by JSPS KAKENHI Grant Number JP17H00823.