Search in Voice Control Systems

  • Andrey V. Savchenko
Part of the SpringerBriefs in Optimization book series (BRIEFSOPTI)


In this chapter the segment homogeneity testing methodology is applied to a voice control system in which only a small amount of user speech data is available. The error rate of automatic speech recognition is decreased by requiring the speaker to stress all vowels in a command. Sequential three-way decisions are applied to speed up the classification procedure. In the remainder of the chapter this approach is extended to audiovisual voice command recognition. Experimental results for the Russian language show that our approach achieves better accuracy and much lower search time than known speech recognition methods.
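The sequential three-way decision idea mentioned above can be sketched as follows. This is a minimal illustrative sketch, not the chapter's exact procedure: the function names, the distance-ratio acceptance rule, and the threshold value are all assumptions introduced here for clarity. The core idea it demonstrates is shared with the chapter: classify with cheap coarse features first, accept a command early when one candidate is decisively closest, and defer only the ambiguous cases to more detailed (more expensive) matching.

```python
def three_way_classify(distance_fns, classes, accept_ratio=0.7):
    """Sequential three-way decision classification (illustrative sketch).

    distance_fns: list of functions, one per granularity level (coarse
        to fine), each mapping a class label to a non-negative distance.
    classes: iterable of candidate command labels.
    accept_ratio: a candidate is accepted when its distance is at most
        this fraction of the runner-up's distance (assumed threshold).
    """
    candidates = list(classes)
    for level, dist in enumerate(distance_fns):
        scored = sorted((dist(c), c) for c in candidates)
        best = scored[0]
        second = scored[1] if len(scored) > 1 else (float("inf"), None)
        # Accept region: the best candidate is decisively closer than
        # the runner-up, or no finer level remains to defer to.
        if best[0] <= accept_ratio * second[0] or level == len(distance_fns) - 1:
            return best[1]
        # Boundary region: defer, keeping only plausible candidates
        # for the next (more detailed) level.
        candidates = [c for d, c in scored if d <= best[0] / accept_ratio]
    return best[1]
```

With two levels of hypothetical distances, an ambiguous coarse comparison is deferred and resolved at the finer level, while a clear-cut one is accepted immediately without evaluating the expensive level at all; this early exit is the source of the search-time reduction.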


Keywords: Automatic speech recognition · Acoustic model · Voice command · Maximum likelihood linear regression · Speaker adaptation



Copyright information

© The Author(s) 2016

Authors and Affiliations

  • Andrey V. Savchenko
    Laboratory of Algorithms and Technologies for Network Analysis, National Research University Higher School of Economics, Nizhny Novgorod, Russia
