A Deep Neural Network Approach for Missing-Data Mask Estimation on Dual-Microphone Smartphones: Application to Noise-Robust Speech Recognition

  • Iván López-Espejo
  • José A. González
  • Ángel M. Gómez
  • Antonio M. Peinado
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8854)


The inclusion of two or more microphones in smartphones is becoming quite common. Although these microphones were originally intended for noise reduction, little benefit has yet been taken from them for noise-robust automatic speech recognition (ASR). In this paper we propose a novel system for estimating missing-data masks for robust ASR on dual-microphone smartphones. The system is based on deep neural networks (DNNs), which have proven to be a powerful tool in several areas of ASR. To assess the performance of the proposed technique, spectral reconstruction experiments are carried out on a dual-channel database derived from Aurora-2. Our results show that the DNN is better able to exploit the dual-channel information, yielding a word-accuracy improvement of more than 6% over state-of-the-art single-channel mask estimation techniques.
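As a rough illustration of the approach described in the abstract, the sketch below shows how a small feed-forward network could map one frame of dual-channel log-Mel features to a per-bin soft missing-data mask (1 = speech-dominated/reliable, 0 = noise-dominated/missing). This is a minimal NumPy stand-in with made-up layer sizes and untrained random weights, not the authors' implementation; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MaskDNN:
    """Tiny MLP mapping dual-channel log-Mel features for one frame
    to a per-bin soft mask in [0, 1]."""

    def __init__(self, n_bins=23, n_hidden=64):
        # Input is the two channels' log-Mel vectors concatenated.
        self.W1 = rng.normal(0.0, 0.1, (2 * n_bins, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_bins))
        self.b2 = np.zeros(n_bins)

    def forward(self, x_primary, x_secondary):
        """Return a soft mask, one value per Mel bin."""
        x = np.concatenate([x_primary, x_secondary])
        h = np.tanh(x @ self.W1 + self.b1)     # hidden layer
        return sigmoid(h @ self.W2 + self.b2)  # per-bin mask in [0, 1]

# Usage with random stand-ins for one frame of dual-channel features:
net = MaskDNN()
frame_primary = rng.normal(size=23)    # primary-microphone log-Mel frame
frame_secondary = rng.normal(size=23)  # secondary-microphone log-Mel frame
soft_mask = net.forward(frame_primary, frame_secondary)
reliable = soft_mask > 0.5  # binarised reliability for missing-data imputation
```

In a real system the network would be trained on frames with oracle masks, and bins flagged unreliable would be re-estimated by a spectral-reconstruction (imputation) stage before recognition.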


Keywords: Dual-microphone · Robust speech recognition · Mask estimation · Smartphone · Deep neural network · Missing-data imputation





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Iván López-Espejo (1)
  • José A. González (2)
  • Ángel M. Gómez (1)
  • Antonio M. Peinado (1)
  1. Dept. of Signal Theory, Telematics and Communications, University of Granada, Spain
  2. Dept. of Computer Science, University of Sheffield, UK
