
A feature selection-based speaker clustering method for paralinguistic tasks

  • Short Paper
  • Published in: Pattern Analysis and Applications

Abstract

In recent years, computational paralinguistics has emerged as a new topic within speech technology. It concerns extracting non-linguistic information from speech, such as emotions, the level of conflict, or whether the speaker is intoxicated. It was recently shown that many methods applied in this area benefit from speaker clustering; for example, the features extracted from the utterances can be normalized speaker-wise rather than globally. In this paper, we propose a speaker clustering algorithm that combines standard clustering approaches such as K-means with feature selection. By applying this speaker clustering technique in two paralinguistic tasks, we significantly improved the accuracy scores of several machine learning methods, and we also gained insight into which features can be used efficiently to separate the different speakers.
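The idea of speaker-wise normalization via clustering can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper, whose details are not given in the abstract. It assumes that each utterance is a small feature vector, that a hand-picked subset of features (`feature_idx`, hypothetical) separates the speakers, and that plain K-means with a deterministic farthest-point initialisation is good enough for toy data. All function names and data values are invented for illustration.

```python
import math


def select(point, feature_idx):
    """Keep only the features assumed to be speaker-discriminative."""
    return [point[i] for i in feature_idx]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    # Assumption: deterministic farthest-point initialisation, so the toy
    # example is reproducible (not taken from the paper).
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def normalize_per_cluster(points, labels):
    """Z-normalize every feature within each cluster (estimated speaker)."""
    normalized = [None] * len(points)
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        cols = list(zip(*(points[i] for i in idx)))
        means = [sum(col) / len(col) for col in cols]
        # A zero standard deviation is replaced by 1.0 to avoid division by zero.
        stds = [
            math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
            for col, m in zip(cols, means)
        ]
        for i in idx:
            normalized[i] = [(x - m) / s for x, m, s in zip(points[i], means, stds)]
    return normalized


# Toy data (invented): feature 0 differs strongly between the two speakers.
utterances = [
    [1.0, 5.0, 0.2], [1.2, 5.5, 0.9], [0.8, 4.5, 0.4],  # speaker A
    [9.0, 5.1, 0.3], [9.3, 4.9, 0.8], [8.7, 5.3, 0.5],  # speaker B
]
feature_idx = [0]  # hypothetical outcome of a feature selection step

labels = kmeans([select(u, feature_idx) for u in utterances], k=2)
normalized = normalize_per_cluster(utterances, labels)
print(labels)
```

Clustering runs only on the selected features, while normalization is applied to the full feature vectors. After this step, the speaker-specific offset in feature 0 no longer dominates the data passed to a downstream classifier.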

Acknowledgements

This publication is supported by the European Union and co-funded by the European Social Fund. Project title: Telemedicine-oriented research activities in the fields of mathematics, informatics and medical sciences. Project number: TÁMOP-4.2.2.A-11/1/KONV-2012-0073.

Author information

Correspondence to Gábor Gosztolya or László Tóth.

About this article

Cite this article

Gosztolya, G., Tóth, L. A feature selection-based speaker clustering method for paralinguistic tasks. Pattern Anal Applic 21, 193–204 (2018). https://doi.org/10.1007/s10044-017-0612-0

  • DOI: https://doi.org/10.1007/s10044-017-0612-0
