Abstract
This chapter presents an overview of the ensemble learning approaches that have emerged in automatic speech recognition in recent years. Approaches based on different machine learning techniques, targeting various levels and components of speech recognition, are described, and their effectiveness is discussed in terms of the direct performance measure of word error rate and the indirect measures of classification margin, diversity, and bias and variance. Methods for reducing the storage and computation costs of ensemble models in practical deployments of speech recognition systems are also discussed. Ensemble learning for speech recognition has been largely fruitful, and it is expected to progress further along with advances in machine learning, speech and language modeling, and computing technology.
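To make the two core measures in the abstract concrete, the sketch below computes word error rate (Levenshtein distance over words) and combines several recognizer hypotheses by word-level majority voting, in the spirit of ROVER-style system combination. This is a minimal illustration, not any system described in the chapter: real ROVER first aligns hypotheses of differing lengths into a word transition network, whereas this toy assumes equal-length hypotheses so words can be voted position by position.

```python
from collections import Counter

def word_error_rate(ref: str, hyp: str) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

def vote(hypotheses):
    """Toy word-level majority vote over equal-length hypotheses."""
    return " ".join(
        Counter(words).most_common(1)[0][0]
        for words in zip(*(h.split() for h in hypotheses))
    )

# Each system makes a different error; voting lets the errors cancel out.
hyps = ["the cat sat on the mat",
        "the cat sag on the mat",
        "a cat sat on the mat"]
ref = "the cat sat on the mat"
combined = vote(hyps)
print(combined, word_error_rate(ref, combined))
```

Here two of the three systems each have a one-word error (WER 1/6), yet the voted output matches the reference exactly, which is the basic intuition behind combining diverse recognizers to lower word error rate.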
Copyright information
© 2015 Springer Science+Business Media New York
Cite this chapter
Zhao, Y., Xue, J., Chen, X. (2015). Ensemble Learning Approaches in Speech Recognition. In: Ogunfunmi, T., Togneri, R., Narasimha, M. (eds) Speech and Audio Processing for Coding, Enhancement and Recognition. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1456-2_5
Print ISBN: 978-1-4939-1455-5
Online ISBN: 978-1-4939-1456-2