Abstract
This chapter presents an overview of the ensemble learning approaches that have emerged in automatic speech recognition in recent years. Approaches based on different machine learning techniques, targeting various levels and components of speech recognition, are described, and their effectiveness is discussed in terms of the direct performance measure of word error rate and the indirect measures of classification margin, diversity, and bias and variance. Methods for reducing the storage and computation costs of ensemble models in practical deployments of speech recognition systems are also discussed. Ensemble learning for speech recognition has been largely fruitful, and it is expected to progress further along with advances in machine learning, speech and language modeling, and computing technology.
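To make the two core measures in the abstract concrete, the sketch below computes word error rate (Levenshtein distance over words) and combines several recognizer hypotheses by word-level majority voting, in the spirit of ROVER-style system combination. This is a minimal illustration, not any system described in the chapter: real ROVER first aligns hypotheses of differing lengths into a word transition network, whereas this toy assumes equal-length hypotheses so words can be voted position by position.

```python
from collections import Counter

def word_error_rate(ref: str, hyp: str) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

def vote(hypotheses):
    """Toy word-level majority vote over equal-length hypotheses."""
    return " ".join(
        Counter(words).most_common(1)[0][0]
        for words in zip(*(h.split() for h in hypotheses))
    )

# Each system makes a different error; voting lets the errors cancel out.
hyps = ["the cat sat on the mat",
        "the cat sag on the mat",
        "a cat sat on the mat"]
ref = "the cat sat on the mat"
combined = vote(hyps)
print(combined, word_error_rate(ref, combined))
```

Here two of the three systems each have a one-word error (WER 1/6), yet the voted output matches the reference exactly, which is the basic intuition behind combining diverse recognizers to lower word error rate.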
Copyright information
© 2015 Springer Science+Business Media New York
Cite this chapter
Zhao, Y., Xue, J., Chen, X. (2015). Ensemble Learning Approaches in Speech Recognition. In: Ogunfunmi, T., Togneri, R., Narasimha, M. (eds) Speech and Audio Processing for Coding, Enhancement and Recognition. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1456-2_5
Print ISBN: 978-1-4939-1455-5
Online ISBN: 978-1-4939-1456-2