
Data Augmentation and Teacher-Student Training for LF-MMI Based Robust Speech Recognition

  • Asadullah
  • Tanel Alumäe
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11107)

Abstract

Deep neural networks (DNNs) have played a key role in the development of state-of-the-art speech recognition systems. In recent years, the lattice-free MMI (LF-MMI) objective has become a popular method for training DNN acoustic models. However, domain adaptation of DNNs from clean to noisy data remains a challenging problem. In this paper, we compare and combine two methods for adapting LF-MMI-based models to a noisy domain that do not require transcribed noisy data: multi-condition training and teacher-student style domain adaptation. For teacher-student training, we use lattices obtained by decoding untranscribed clean speech as supervision for adapting the model to the noisy domain. For noise augmentation in both multi-condition and teacher-student training, we use in-domain noise extracted from a large untranscribed speech corpus using voice activity detection. We show that combining multi-condition training and lattice-based teacher-student training gives better results than either method alone. Furthermore, we show the benefits of using in-domain noise instead of general noise profiles for noise augmentation. Overall, we obtain a 7.4% relative improvement in word error rate over a standard multi-condition baseline.
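To make the noise-augmentation step concrete, below is a minimal NumPy sketch of the two operations the abstract describes: pulling non-speech segments out of untranscribed in-domain audio with a simple energy-based voice activity detector, and mixing those segments into clean training utterances at a random SNR. The helper names (extract_noise_segments, mix_at_snr), the energy threshold, and the 5–20 dB SNR range are illustrative assumptions; the paper's actual pipeline is built with Kaldi, not this code.

```python
# Illustrative sketch of in-domain noise extraction and augmentation.
# All names and parameter values here are hypothetical, not the paper's.
import numpy as np

def extract_noise_segments(signal, sr, frame_len=0.025, threshold_db=-35.0):
    """Crude energy-based VAD: keep frames whose log-energy is at least
    |threshold_db| dB below the utterance peak, i.e. likely non-speech."""
    hop = int(frame_len * sr)
    frames = [signal[i:i + hop] for i in range(0, len(signal) - hop, hop)]
    energies = np.array([10 * np.log10(np.mean(f ** 2) + 1e-10) for f in frames])
    mask = energies < energies.max() + threshold_db
    noise = [f for f, keep in zip(frames, mask) if keep]
    return np.concatenate(noise) if noise else np.zeros(0)

def mix_at_snr(clean, noise, snr_db):
    """Add a noise signal to a clean utterance at a target SNR (in dB)."""
    if len(noise) < len(clean):  # loop the noise if it is too short
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_pow = np.mean(clean ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-10
    # Scale the noise so that clean_pow / (scale^2 * noise_pow) = 10^(snr/10).
    scale = np.sqrt(clean_pow / (noise_pow * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Usage: corrupt each clean training utterance at a random SNR, mirroring
# multi-condition training with in-domain noise.
rng = np.random.default_rng(0)
# clean, sr = load_wav("utt.wav")                      # placeholder I/O
# noise = extract_noise_segments(in_domain_audio, sr)  # untranscribed corpus
# noisy = mix_at_snr(clean, noise, snr_db=rng.uniform(5.0, 20.0))
```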

Keywords

Speech activity detection · Noise augmentation · Domain adaptation · Weighted prediction error · Deep neural networks


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Laboratory of Language Technology, Tallinn University of Technology, Tallinn, Estonia
