A feature selection-based speaker clustering method for paralinguistic tasks

Gosztolya, Gábor; Tóth, László

doi:10.1007/s10044-017-0612-0

Short Paper
Published: 23 March 2017

Volume 21, pages 193–204 (2018)

Pattern Analysis and Applications

Gábor Gosztolya¹ & László Tóth¹,²

Abstract

In recent years, computational paralinguistics has emerged as a new topic within speech technology. It concerns the extraction of non-linguistic information from speech, such as emotions, the level of conflict, or whether the speaker is drunk. It has recently been shown that many methods applied in this area can be assisted by speaker clustering; for example, the features extracted from the utterances can be normalized speaker-wise instead of globally. In this paper, we propose a speaker clustering algorithm based on feature selection and standard clustering approaches such as K-means. By applying this speaker clustering technique in two paralinguistic tasks, we were able to significantly improve the accuracy scores of several machine learning methods, and we also gained insight into which features can be used efficiently to separate the different speakers.
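
The idea of clustering speakers on a subset of features and then normalizing speaker-wise can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the features are selected, so the `selected` index list, the toy `utterances` data, and the helper names `nearest`, `kmeans` and `normalize_per_cluster` are all assumptions made for illustration. The sketch runs plain K-means on the selected columns only, treats each resulting cluster as one speaker, and z-normalizes every feature within its cluster.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    dists = [sum((x - m) ** 2 for x, m in zip(point, c)) for c in centroids]
    return dists.index(min(dists))


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd-style K-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def normalize_per_cluster(features, labels):
    """Z-normalize each feature using the mean and std of its own cluster."""
    normalized = [None] * len(features)
    for c in set(labels):
        idx = [i for i, l in enumerate(labels) if l == c]
        cols = list(zip(*(features[i] for i in idx)))
        means = [sum(col) / len(col) for col in cols]
        # Fall back to 1.0 when a feature is constant inside the cluster.
        stds = [
            math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
            for col, m in zip(cols, means)
        ]
        for i in idx:
            normalized[i] = [(x - m) / s for x, m, s in zip(features[i], means, stds)]
    return normalized


# Hypothetical toy data: feature 0 separates two speakers, feature 1 carries
# the task-related variation on top of a speaker-dependent offset.
utterances = [
    [0.0, 1.0], [0.2, 3.0], [0.1, 2.0],        # speaker A
    [10.0, 11.0], [10.2, 13.0], [10.1, 12.0],  # speaker B
]
selected = [0]  # assumed output of some feature selection step

labels = kmeans([[u[j] for j in selected] for u in utterances], k=2)
normalized = normalize_per_cluster(utterances, labels)
```

In this toy setup the speaker-dependent offset in feature 1 disappears after cluster-wise normalization, which a single global mean and variance would not achieve.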

Acknowledgements

This publication is supported by the European Union and co-funded by the European Social Fund. Project title: Telemedicine-oriented research activities in the fields of mathematics, informatics and medical sciences. Project number: TÁMOP-4.2.2.A-11/1/KONV-2012-0073.

Author information

Authors and Affiliations

MTA-SZTE Research Group on Artificial Intelligence of the Hungarian Academy of Sciences, University of Szeged, 103 Tisza Lajos krt., Szeged, Hungary
Gábor Gosztolya & László Tóth
Institute of Informatics, University of Szeged, Szeged, Hungary
László Tóth

Corresponding authors

Correspondence to Gábor Gosztolya or László Tóth.

About this article

Cite this article

Gosztolya, G., Tóth, L. A feature selection-based speaker clustering method for paralinguistic tasks. Pattern Anal Applic 21, 193–204 (2018). https://doi.org/10.1007/s10044-017-0612-0

Received: 09 December 2015
Accepted: 16 March 2017
Published: 23 March 2017
Issue Date: February 2018
DOI: https://doi.org/10.1007/s10044-017-0612-0
