Representation Sharing and Transfer in Deep Neural Networks

Chapter
Part of the Signals and Communication Technology book series (SCT)

Abstract

We have emphasized in the previous chapters that each hidden layer in a deep neural network (DNN) is a new representation of the raw input, and that the representations at higher layers are more abstract than those at lower layers. In this chapter, we show that these feature representations can be shared and transferred across related tasks through techniques such as multitask learning and transfer learning. We use multilingual and crosslingual speech recognition, built on a shared-hidden-layer DNN architecture, as the main example to demonstrate these techniques.
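As a concrete illustration of the shared-hidden-layer architecture the abstract refers to, below is a minimal sketch: the hidden layers act as a feature extractor shared by all languages, while each language keeps its own softmax output layer. The framework (PyTorch), the class name `SharedHiddenLayerDNN`, the sigmoid activations, the layer sizes, and the language codes and senone counts are illustrative assumptions, not the chapter's reference implementation.

```python
# A minimal sketch (illustrative assumptions, not the chapter's reference
# implementation) of a shared-hidden-layer multilingual DNN: the hidden
# layers form a representation shared across languages, while each
# language has its own softmax output layer.
import torch
import torch.nn as nn


class SharedHiddenLayerDNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_hidden_layers, senones_per_lang):
        super().__init__()
        # Hidden layers shared by all languages: this stack learns a
        # language-universal representation of the acoustic input.
        layers, dim = [], input_dim
        for _ in range(num_hidden_layers):
            layers += [nn.Linear(dim, hidden_dim), nn.Sigmoid()]
            dim = hidden_dim
        self.shared = nn.Sequential(*layers)
        # One language-specific output layer per language, mapping the
        # shared representation to that language's senone set.
        self.heads = nn.ModuleDict(
            {lang: nn.Linear(hidden_dim, n) for lang, n in senones_per_lang.items()}
        )

    def forward(self, x, lang):
        # Returns logits; softmax is applied inside the cross-entropy loss.
        return self.heads[lang](self.shared(x))


# Hypothetical sizes: 429 = 39-dim acoustic features spliced over 11 frames;
# per-language senone counts are made up for illustration.
model = SharedHiddenLayerDNN(429, 2048, 5, {"FRA": 1800, "DEU": 2000})
batch = torch.randn(32, 429)
logits = model(batch, "FRA")
loss = nn.functional.cross_entropy(logits, torch.randint(0, 1800, (32,)))
```

Under this setup, multitask training would interleave batches from all languages so the shared layers benefit from every language's data; for crosslingual transfer to a new language, one would typically keep `self.shared` fixed (or fine-tune it with a small learning rate) and train only a freshly initialized head on the target-language data.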

Keywords

Acoustics · Extractor


Copyright information

© Springer-Verlag London 2015

Authors and Affiliations

  1. Microsoft Research, Bothell, USA
  2. Microsoft Research, Redmond, USA
