
Latest Advances in Computational Speech Analysis for Mobile Sensing

Chapter in: Digital Phenotyping and Mobile Sensing

Abstract

The human vocal apparatus is an intricate anatomical structure that affords us the ability to produce a large variety of acoustically rich sounds. As a result, any given speech signal carries an abundance of information about the speaker: both the intended message, i.e., the linguistic content, and insights into particular states and traits of the speaker, i.e., the paralinguistic content. In the field of computational speech analysis, substantial and ongoing research efforts aim to disentangle these facets so that each can be recognised robustly and accurately. Speaker states and traits of interest in such analysis include affect, depressive and other mood disorders, and autism spectrum conditions, to name but a few. Within this chapter, a selection of state-of-the-art speech analysis toolkits that enable this research is introduced, and their advantages and limitations with respect to mobile sensing are discussed. Ongoing challenges and possible future research directions arising from the identified limitations are also highlighted.
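
To give a concrete flavour of the paralinguistic feature extraction such toolkits provide, the short Python sketch below uses the openSMILE wrapper package (the toolkit linked in note 2) to compute a single utterance-level acoustic feature vector. It is not taken from the chapter; the audio file name is a placeholder and the choice of the eGeMAPS parameter set is an illustrative assumption.

    # Minimal sketch (not from the chapter): utterance-level acoustic feature
    # extraction with the openSMILE Python wrapper. "speech_sample.wav" is a
    # hypothetical placeholder file.
    import opensmile

    # Compute functionals (statistics over the whole recording) of the
    # extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS).
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.eGeMAPSv02,
        feature_level=opensmile.FeatureLevel.Functionals,
    )

    # Returns one fixed-length feature vector per file as a pandas DataFrame.
    features = smile.process_file("speech_sample.wav")
    print(features.shape)

Such fixed-length vectors are what a downstream recogniser of speaker states and traits, for example a support vector machine or a neural network, is typically trained on.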


Notes

  1. Nicholas Cummins is a co-developer of the DeepSpectrum and auDeep toolkits. Björn W. Schuller is a co-developer of all five toolkits.

  2. https://www.audeering.com/technology/opensmile/

  3. https://github.com/openXBOW/openXBOW

  4. https://github.com/DeepSpectrum/DeepSpectrum

  5. https://github.com/ethereon/caffe-tensorflow

  6. https://github.com/auDeep/auDeep

  7. https://github.com/end2you/end2you


Acknowledgements

This research has received funding from the Innovative Medicines Initiative 2 Joint Undertaking under grant agreement No. 115902. This joint undertaking receives support from the European Union’s Horizon 2020 research and innovation programme and EFPIA.

Author information


Corresponding author

Correspondence to Nicholas Cummins.


Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Cummins, N., Schuller, B.W. (2019). Latest Advances in Computational Speech Analysis for Mobile Sensing. In: Baumeister, H., Montag, C. (eds) Digital Phenotyping and Mobile Sensing. Studies in Neuroscience, Psychology and Behavioral Economics. Springer, Cham. https://doi.org/10.1007/978-3-030-31620-4_9

