Unsupervised Speech Unit Discovery Using K-means and Neural Networks

  • Céline Manenti
  • Thomas Pellegrini
  • Julien Pinquier
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10583)


Unsupervised discovery of sub-lexical units in speech is a problem of current interest to speech researchers. In this paper, we report experiments in which we perform phone segmentation and then cluster the resulting segments with k-means and a convolutional neural network (CNN). This yields an annotation of the corpus in pseudo-phones, from which we then derive pseudo-words. We compare results for two different segmentations, manual and automatic, and, to assess the portability of our approach, for three languages (English, French and Xitsonga). The originality of our work lies in using neural networks in an unsupervised way, which differs from the common auto-encoder-based approach to unsupervised speech unit discovery. On the Xitsonga corpus, for instance, we obtained phone-level purity scores of 46% with manual segmentation and 42% with automatic segmentation, using 30 pseudo-phones. From the inferred pseudo-phones, we discovered about 200 pseudo-words.
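The pipeline the abstract describes (cluster fixed-length representations of phone segments into pseudo-phones, then look for recurring pseudo-phone sequences as pseudo-word candidates) can be illustrated with a minimal sketch. This is not the paper's actual system: the CNN, the feature extraction, and the segmentation step are omitted, segments are represented here simply as mean feature vectors, and all function names are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means: assign each segment vector in X to one of k
    pseudo-phone clusters. Returns an integer label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # nearest-center assignment by squared Euclidean distance
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):  # keep old center if the cluster emptied
                centers[j] = pts.mean(axis=0)
    return labels

def pseudo_words(labels, n=2):
    """Collapse repeated labels, then count recurring n-grams of
    pseudo-phone labels as pseudo-word candidates."""
    seq = [labels[0]] + [l for prev, l in zip(labels, labels[1:]) if l != prev]
    counts = {}
    for i in range(len(seq) - n + 1):
        gram = tuple(seq[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    # a sequence seen more than once is a pseudo-word candidate
    return {g: c for g, c in counts.items() if c > 1}
```

In the paper's setting, X would hold one vector per (manually or automatically) segmented phone, k would be the number of pseudo-phones (e.g. 30 for Xitsonga), and the CNN would refine the cluster assignments rather than this one-shot k-means pass.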


Keywords: Neural representation of speech and language · Unsupervised learning · Speech unit discovery · Neural network · Sub-lexical units · Phone clustering



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Céline Manenti (1)
  • Thomas Pellegrini (1)
  • Julien Pinquier (1)

  1. IRIT, Université de Toulouse, UPS, Toulouse, France
