Articulatory Gesture Rich Representation Learning of Phonological Units in Low Resource Settings

  • Conference paper
Statistical Language and Speech Processing (SLSP 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9918)

Abstract

Recent literature presents evidence that both linguistic (phonemic) and non-linguistic (speaker identity, emotional content) information resides on a lower-dimensional manifold embedded in higher-dimensional spectral features such as MFCC and PLP. Linguistic or phonetic units of speech can be broken down into an inventory of articulatory gestures shared across several phonemes according to their manner of articulation. We aim to discover a subspace that is rich in the gestural information of speech and captures the invariance of similar gestures. In this paper, we investigate which unsupervised techniques are best suited for learning such a subspace. The main contribution of the paper is an approach that learns a gesture-rich representation of speech automatically from data in a completely unsupervised manner. The study compares the representations obtained through a convolutional autoencoder (ConvAE) with those from standard unsupervised dimensionality reduction techniques such as manifold learning and Principal Component Analysis (PCA), using phoneme classification as the evaluation task. The manifold learning techniques evaluated are Locally Linear Embedding (LLE), Isomap, and Laplacian Eigenmaps. Representations that best separate different gestures are suitable for discovering subword units under low- or zero-resource speech conditions. We further evaluate the representations using the Zero Resource Speech Challenge's ABX discriminability measure. Results indicate that the representations obtained through the ConvAE and Isomap outperform baseline MFCC features on both phoneme classification and the ABX measure, and induce separation between sounds composed of different sets of gestures. We then cluster the representations using a Dirichlet Process Gaussian Mixture Model (DPGMM) to learn the cluster distribution of the data automatically, and show that these clusters correspond to groups with similar manner of articulation. The DPGMM distribution is used as a prior to obtain correspondence terms for robust ConvAE training.
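The pipeline described in the abstract — unsupervised dimensionality reduction of spectral frames followed by nonparametric DPGMM clustering — can be sketched with scikit-learn, which the paper itself uses. This is a minimal illustration, not the authors' implementation: the synthetic arrays stand in for 13-dimensional MFCC frames, and the component and neighbor counts are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding, SpectralEmbedding
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Stand-in for 13-dim MFCC frames: two synthetic "gesture-like" clusters.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 13)),
    rng.normal(4.0, 1.0, size=(200, 13)),
])

# The four unsupervised reducers compared in the paper, each mapping the
# frames to a low-dimensional subspace (3 components here, for illustration).
reducers = {
    "pca": PCA(n_components=3),
    "isomap": Isomap(n_neighbors=10, n_components=3),
    "lle": LocallyLinearEmbedding(n_neighbors=10, n_components=3),
    # scikit-learn's SpectralEmbedding implements Laplacian Eigenmaps.
    "laplacian": SpectralEmbedding(n_neighbors=10, n_components=3),
}
embeddings = {name: r.fit_transform(X) for name, r in reducers.items()}

# DPGMM clustering: a truncated Dirichlet-process mixture that infers the
# effective number of clusters from the data (up to n_components).
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(embeddings["isomap"])
labels = dpgmm.predict(embeddings["isomap"])
```

In the paper, such cluster assignments are then used to group frames by manner of articulation and to derive correspondence terms for ConvAE training; here `labels` simply assigns each frame to one of the inferred mixture components.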



Acknowledgments

The authors would like to thank ITRA, Media Lab Asia for funding this research.


Corresponding author

Correspondence to Brij Mohan Lal Srivastava.


Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Srivastava, B.M.L., Shrivastava, M. (2016). Articulatory Gesture Rich Representation Learning of Phonological Units in Low Resource Settings. In: Král, P., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2016. Lecture Notes in Computer Science, vol. 9918. Springer, Cham. https://doi.org/10.1007/978-3-319-45925-7_7

  • DOI: https://doi.org/10.1007/978-3-319-45925-7_7
  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45924-0

  • Online ISBN: 978-3-319-45925-7
