Multi-view Representation Learning via Canonical Correlation Analysis for Dysarthric Speech Recognition

  • Myungjong Kim
  • Beiming Cao
  • Jun Wang
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 856)


Although automatic speech recognition (ASR) has been commercially used by the general public, it still does not perform sufficiently well for people with speech disorders (e.g., dysarthria). Multimodal ASR, which involves multiple sources of signals, has recently shown potential to improve the performance of dysarthric speech recognition. When multiple views (sources) of data (e.g., acoustic and articulatory) are available for training while only one view (e.g., acoustic) is available for testing, a better representation can be learned by simultaneously analyzing the multiple sources of data. Although multi-view representation learning has recently been used in normal speech recognition, it has rarely been studied in dysarthric speech recognition. In this paper, we investigate the effectiveness of multi-view representation learning via canonical correlation analysis (CCA) for dysarthric speech recognition. A representation of the acoustic data is learned using CCA from the multi-view data (acoustic and articulatory); the articulatory data were recorded simultaneously with the acoustic data using an electromagnetic articulograph. Experimental evaluation on a database collected from nine patients with dysarthria due to Lou Gehrig's disease demonstrated the effectiveness of multi-view representation learning via CCA in deep neural network-based speech recognition systems.


Dysarthria · Canonical correlation analysis · Speech recognition · Deep neural network · Multi-view representation learning



This work was supported by the National Institutes of Health of the United States through grants R03DC013990 and R01DC013547 and by the American Speech-Language-Hearing Foundation through a New Century Scholar Research Grant. We would like to thank Dr. Jordan R. Green, Dr. Thomas F. Campbell, and Dr. Yana Yunusova, as well as Jennifer McGlothlin, Kristin Teplansky, and the volunteering participants.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Speech Disorders and Technology Lab, Department of Bioengineering, University of Texas at Dallas, Richardson, USA
  2. Callier Center for Communication Disorders, University of Texas at Dallas, Richardson, USA
