Abstract
The paper describes an application of a classifier based on Gaussian mixture models (GMMs) to reverse identification of the original speaker from emotionally transformed speech in Czech and Slovak. We investigate whether the identification score produced by the GMM classifier depends on the type and structure of the speech features used. A comparison with results obtained on German and English sentences shows that the structure and balance of the speech database influence identification accuracy, whereas the language used has practically no effect. The evaluation experiments confirm that the developed text-independent GMM original-speaker identifier works for closed-set classification tasks.
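As a minimal illustration of the closed-set GMM identification scheme the abstract refers to, the sketch below fits one GMM per enrolled speaker and assigns a test utterance to the speaker whose model yields the highest log-likelihood. It uses scikit-learn's GaussianMixture on synthetic feature frames; the real system works with cepstral and prosodic features of Czech and Slovak speech, so the data, speaker names, and parameter choices here are illustrative assumptions, not the authors' actual setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-frame speech features (e.g. cepstral
# coefficients). Each speaker's frames come from a distinct distribution;
# a real front-end would extract these from recorded speech.
def make_frames(mean, n=300, dim=8):
    return rng.normal(loc=mean, scale=1.0, size=(n, dim))

train = {"spk1": make_frames(0.0), "spk2": make_frames(2.5)}

# Enrollment: fit one GMM per speaker on that speaker's training frames.
models = {
    spk: GaussianMixture(n_components=4, covariance_type="diag",
                         random_state=0).fit(X)
    for spk, X in train.items()
}

def identify(frames):
    # Closed-set decision: score the utterance's frames against every
    # enrolled model and return the speaker with the highest average
    # per-frame log-likelihood.
    scores = {spk: m.score(frames) for spk, m in models.items()}
    return max(scores, key=scores.get)

test_utt = make_frames(2.5, n=100)  # frames drawn like "spk2"
print(identify(test_utt))           # identifies "spk2"
```

Because the decision compares likelihoods only among enrolled speakers, this is inherently a closed-set identifier, matching the task evaluated in the paper.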
Acknowledgments
This work has been supported by the Grant Agency of the Slovak Academy of Sciences (VEGA 2/0013/14 and VEGA 1/0090/16) and by the Ministry of Education of the Slovak Republic (KEGA 022STU-4/2014).
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Přibil, J., Přibilová, A. (2016). Comparison of Text-Independent Original Speaker Recognition from Emotionally Converted Speech. In: Esposito, A., et al. Recent Advances in Nonlinear Speech Processing. Smart Innovation, Systems and Technologies, vol 48. Springer, Cham. https://doi.org/10.1007/978-3-319-28109-4_14
DOI: https://doi.org/10.1007/978-3-319-28109-4_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28107-0
Online ISBN: 978-3-319-28109-4
eBook Packages: Engineering (R0)