Abstract
Speech is the most natural means of communication in human-to-human interaction. Automatic Speech Recognition (ASR) is the application of technology to build machines that can autonomously transcribe speech into text in real time. This paper presents a short review of ASR systems. Fundamentally, the design of a speech recognition system involves three major processes: feature extraction, acoustic modeling, and classification. Accordingly, emphasis is laid on describing the essential principles of the various techniques employed in each of these processes. The paper also presents the milestones in speech processing research to date.
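To make the three-stage pipeline concrete, the following is a minimal, hypothetical sketch in Python (NumPy only). The front end computes per-frame log power spectra as a simplified stand-in for MFCC/PLP feature extraction, and the classifier scores an utterance by mean distance to stored templates, a toy stand-in for the HMM/DTW decoding the paper reviews. All function names, frame sizes, and the tone-based "words" are illustrative assumptions, not a method from the paper.

```python
import numpy as np

def extract_features(signal, frame_len=256, hop=128):
    """Slice the signal into overlapping frames and compute one
    log-power-spectrum feature vector per frame (a simplified
    stand-in for an MFCC or PLP front end)."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    feats = []
    for f in frames:
        windowed = f * np.hamming(frame_len)       # taper frame edges
        power = np.abs(np.fft.rfft(windowed)) ** 2  # power spectrum
        feats.append(np.log(power + 1e-10))         # log compression
    return np.array(feats)

def classify(features, templates):
    """Nearest-template classification: score each label by the mean
    Euclidean distance between its template frames and the utterance
    frames (a toy stand-in for HMM or DTW decoding)."""
    def dist(label):
        t = templates[label]
        n = min(len(features), len(t))
        return np.mean(np.linalg.norm(features[:n] - t[:n], axis=1))
    return min(templates, key=dist)

# Toy usage: two hypothetical "words" represented by pure tones.
sr = 8000
t = np.arange(sr) / sr
templates = {
    "A": extract_features(np.sin(2 * np.pi * 200 * t)),   # low tone
    "B": extract_features(np.sin(2 * np.pi * 1200 * t)),  # high tone
}
test_utt = extract_features(np.sin(2 * np.pi * 210 * t))
print(classify(test_utt, templates))  # a 210 Hz tone matches "A"
```

Real systems replace each stage with the techniques surveyed here (mel filter banks and cepstral coefficients in the front end; HMM/GMM, hybrid HMM/ANN, or SVM models in place of template matching), but the data flow is the same.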
© 2016 Springer India
About this paper
Cite this paper
Ujwala Rekha, J., Shahu Chatrapati, K., Vinaya Babu, A. (2016). A Study on Speech Processing. In: Satapathy, S., Mandal, J., Udgata, S., Bhateja, V. (eds) Information Systems Design and Intelligent Applications. Advances in Intelligent Systems and Computing, vol 435. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2757-1_22
Print ISBN: 978-81-322-2756-4
Online ISBN: 978-81-322-2757-1
eBook Packages: Engineering (R0)