Abstract
The paper describes an application of a classifier based on Gaussian mixture models (GMMs) to reverse identification of the original speaker from emotionally transformed speech in Czech and Slovak. We investigate whether the identification score produced by the GMM classifier depends on the type and structure of the speech features used. A comparison with results obtained on German and English sentences shows that the structure and balance of the speech database influence identification accuracy, whereas the language used has practically no effect. The evaluation experiments confirm that the developed text-independent GMM original-speaker identifier works for closed-set classification tasks.
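As a minimal illustration of the closed-set GMM identification scheme the abstract refers to, the sketch below fits one GMM per enrolled speaker and assigns a test utterance to the speaker whose model yields the highest log-likelihood. It uses scikit-learn's GaussianMixture on synthetic feature frames; the real system works with cepstral and prosodic features of Czech and Slovak speech, so the data, speaker names, and parameter choices here are illustrative assumptions, not the authors' actual setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-frame speech features (e.g. cepstral
# coefficients). Each speaker's frames come from a distinct distribution;
# a real front-end would extract these from recorded speech.
def make_frames(mean, n=300, dim=8):
    return rng.normal(loc=mean, scale=1.0, size=(n, dim))

train = {"spk1": make_frames(0.0), "spk2": make_frames(2.5)}

# Enrollment: fit one GMM per speaker on that speaker's training frames.
models = {
    spk: GaussianMixture(n_components=4, covariance_type="diag",
                         random_state=0).fit(X)
    for spk, X in train.items()
}

def identify(frames):
    # Closed-set decision: score the utterance's frames against every
    # enrolled model and return the speaker with the highest average
    # per-frame log-likelihood.
    scores = {spk: m.score(frames) for spk, m in models.items()}
    return max(scores, key=scores.get)

test_utt = make_frames(2.5, n=100)  # frames drawn like "spk2"
print(identify(test_utt))           # identifies "spk2"
```

Because the decision compares likelihoods only among enrolled speakers, this is inherently a closed-set identifier, matching the task evaluated in the paper.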
Acknowledgments
This work has been supported by the Grant Agency of the Slovak Academy of Sciences (VEGA 2/0013/14 and VEGA 1/0090/16) and by the Ministry of Education of the Slovak Republic (KEGA 022STU-4/2014).
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Přibil, J., Přibilová, A. (2016). Comparison of Text-Independent Original Speaker Recognition from Emotionally Converted Speech. In: Esposito, A., et al. Recent Advances in Nonlinear Speech Processing. Smart Innovation, Systems and Technologies, vol 48. Springer, Cham. https://doi.org/10.1007/978-3-319-28109-4_14
DOI: https://doi.org/10.1007/978-3-319-28109-4_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28107-0
Online ISBN: 978-3-319-28109-4
eBook Packages: Engineering (R0)