
Ensemble Learning Approaches in Speech Recognition

  • Chapter in Speech and Audio Processing for Coding, Enhancement and Recognition

Abstract

This chapter provides an overview of the ensemble learning efforts that have emerged in automatic speech recognition in recent years. Approaches based on different machine learning techniques and targeting various levels and components of speech recognition are described, and their effectiveness is discussed in terms of the direct performance measure of word error rate and the indirect measures of classification margin, diversity, and bias and variance. In addition, methods for reducing the storage and computation costs of ensemble models for practical deployment of speech recognition systems are discussed. Ensemble learning for speech recognition has been largely fruitful, and it is expected to continue to progress along with advances in machine learning, speech and language modeling, and computing technology.
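
To make the abstract's notion of ensemble combination concrete, the following minimal sketch (not taken from the chapter; the toy hypotheses and helper names such as word_error_rate and majority_vote are illustrative assumptions) combines the word sequences of several recognizers with a simplified position-wise majority vote, in the spirit of ROVER-style system combination, and compares the individual and combined word error rates. A full combination scheme would first align the hypotheses with dynamic programming; equal-length outputs are assumed here for brevity.

```python
# Minimal sketch (illustrative only): combine the outputs of several
# recognizers by position-wise majority voting and compare word error
# rates (WER) before and after combination.
from collections import Counter


def word_error_rate(ref, hyp):
    """WER as word-level Levenshtein distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)


def majority_vote(hypotheses):
    """Position-wise majority vote over equal-length word sequences."""
    split = [h.split() for h in hypotheses]
    voted = [Counter(words).most_common(1)[0][0] for words in zip(*split)]
    return " ".join(voted)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    systems = [                      # toy outputs of three recognizers
        "the cat sat on a mat",
        "the bat sat on the mat",
        "the cat sat in the mat",
    ]
    for i, hyp in enumerate(systems, 1):
        print(f"system {i} WER: {word_error_rate(reference, hyp):.2f}")
    combined = majority_vote(systems)
    print("combined hypothesis:", combined)
    print(f"combined WER: {word_error_rate(reference, combined):.2f}")
```

In this toy case each system makes one error in a different position, so the vote recovers the reference exactly: the combined WER is 0.00 while each individual system has a WER of about 0.17, illustrating why diversity among ensemble members matters.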



Author information

Correspondence to Yunxin Zhao.


Copyright information

© 2015 Springer Science+Business Media New York

About this chapter

Cite this chapter

Zhao, Y., Xue, J., Chen, X. (2015). Ensemble Learning Approaches in Speech Recognition. In: Ogunfunmi, T., Togneri, R., Narasimha, M. (eds) Speech and Audio Processing for Coding, Enhancement and Recognition. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1456-2_5

  • DOI: https://doi.org/10.1007/978-1-4939-1456-2_5

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4939-1455-5

  • Online ISBN: 978-1-4939-1456-2

  • eBook Packages: Engineering, Engineering (R0)
