
Analysing Speech for Clinical Applications

  • Conference paper
  • Statistical Language and Speech Processing (SLSP 2018)

Abstract

The boost in speech technologies that we have witnessed over the last decade has taken us from a state of the art in which correctly recognizing strings of words was a major target to one in which we aim well beyond words. We aim at extracting meaning, but also at extracting all the cues conveyed by the speech signal. In fact, we can estimate bio-relevant traits such as height, weight, gender, age, and physical and mental health. We can also estimate language, accent, emotional and personality traits, and even environmental cues. This wealth of information, which one can now extract with recent advances in machine learning, has motivated an exponentially growing number of speech-based applications that go well beyond transcribing what a speaker says. In particular, it has motivated many health-related applications, namely those aiming at non-invasive diagnosis and monitoring of diseases that affect speech.

Most of the recent work on speech-based diagnosis tools addresses the extraction of features and/or the development of sophisticated machine learning classifiers [5, 7, 12, 13, 14, 17]. The results have shown remarkable progress, boosted by several joint paralinguistic challenges, but most are obtained from limited training data acquired in controlled conditions.

This talk covers two emerging concerns related to this growing trend. One is the collection of large in-the-wild datasets and the effects of this extended, uncontrolled collection on the results [4]. The other is how diagnosis may be done without compromising patient privacy [18].

As a proof-of-concept, we will discuss these two aspects and show our results for two target diseases, Depression and Cold, a selection motivated by the availability of corresponding lab datasets distributed in paralinguistic challenges. The availability of these lab datasets allowed us to build a baseline system for each disease, using a simple neural network trained with common features that have not been optimized for either disease. Given the modular architecture adopted, each component of the system can be individually improved at a later stage, although the limited amount of data does not motivate us to exploit deeper networks.

Our mining effort has focused on video blogs (vlogs) that feature a single speaker who, at some point, admits to currently being affected by a given disease. Retrieving vlogs for a target disease involves not only a simple query (e.g., "depression vlog"), but also a post-filtering stage to exclude videos that do not match our target of first-person, present experiences (lectures, in particular, are relatively frequent). This filtering stage combines multimodal features automatically extracted from the video and its metadata, using mostly off-the-shelf tools.
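The post-filtering stage described above can be sketched as a simple rule-based function. This is an illustrative reconstruction, not the authors' actual pipeline: the cue lists, metadata fields, and thresholds are assumptions, and the real system combines richer multimodal features.

```python
# Hypothetical sketch of the post-filtering stage: combine metadata and
# transcript cues to keep only single-speaker, first-person, present-tense
# vlogs. All cue words and field names here are illustrative assumptions.

def is_first_person_vlog(metadata: dict, transcript: str, num_speakers: int) -> bool:
    # Exclude multi-speaker videos (interviews, panel discussions).
    if num_speakers != 1:
        return False
    # Exclude videos whose metadata suggests a lecture or news item,
    # the most frequent false positives mentioned in the text.
    title = metadata.get("title", "").lower()
    if any(w in title for w in ("lecture", "talk", "documentary", "news")):
        return False
    # Require a first-person, present-tense admission in the transcript.
    cues = ("i have", "i've been", "i am", "i'm")
    return any(c in transcript.lower() for c in cues)

print(is_first_person_vlog(
    {"title": "my depression vlog"},
    "So today I'm feeling awful, I have a cold...",
    num_speakers=1))  # → True
```

A production version would replace the keyword heuristics with classifiers over automatically extracted audio, video, and text features, but the overall query-then-filter structure is the same.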

We collected a large dataset for each target disease from YouTube, and manually labelled a small subset, which we named the in-the-Wild Speech Medical (WSM) corpus. Although our mining efforts used relatively simple techniques, relying mostly on existing toolkits, they proved effective. In the task of filtering videos containing these speech-affecting diseases, the best performing models achieved a precision of 88% and 93%, and a recall of 97% and 72%, for the Cold and Depression datasets, respectively.

We compared the performance of our baseline neural network classifiers, trained with data collected in controlled conditions, in tests with corresponding in-the-wild data. For the Cold datasets, the baseline neural network achieved an Unweighted Average Recall (UAR) of 66.9% on the controlled dataset, and 53.1% on the manually labelled subset of the WSM corpus. For the Depression datasets, the corresponding values were 60.6% and 54.8%, respectively (at interview level, the UAR increased to 61.9% for the vlog corpus). The performance degradation that we had anticipated for in-the-wild data may be due to greater variability both in recording conditions (e.g., microphone, noise) and in the effects of speech-altering diseases on the subjects' speech. Our current work with vlog datasets attempts to estimate the quality of the predicted labels of a very large set in an unsupervised way, using noisy models.
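UAR, the metric reported above, is the mean of the per-class recalls. Unlike accuracy, it is insensitive to class imbalance, which is why the paralinguistic challenges cited in the text use it. A minimal implementation:

```python
# Unweighted Average Recall (UAR): the mean of per-class recalls.
# Chance level is 1/num_classes regardless of class imbalance.

def uar(y_true, y_pred):
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # Recall for class c: correctly predicted c among all true c.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        support = sum(1 for t in y_true if t == c)
        recalls.append(tp / support)
    return sum(recalls) / len(recalls)

# Imbalanced toy example: predicting only the majority class yields
# 80% accuracy but only 50% UAR (chance level for two classes).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10
print(uar(y_true, y_pred))  # → 0.5
```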

The second aspect we addressed was patient privacy. Privacy is an emerging concern among users of voice-activated digital assistants, sparked by the awareness that such devices must always be in listening mode. Despite this growing concern, the potential for misuse of health-related cues in speech has not yet been fully recognized. This motivates the adoption of secure computation frameworks, in which cryptographic techniques are combined with state-of-the-art machine learning algorithms. Privacy in speech processing is an interdisciplinary topic, first applied to speaker verification, using Secure Multi-Party Computation and Secure Modular Hashing techniques [1, 15], and later to speech emotion recognition, also using hashing techniques [6]. The most recent efforts in privacy-preserving speech processing have followed the progress in secure machine learning, combining neural networks and Fully Homomorphic Encryption (FHE) [3, 8, 9].

In this work, we applied an encrypted neural network, following the FHE paradigm, to the problem of secure detection of pathological speech. This was done by developing an encrypted version of a neural network, trained with unencrypted data, in order to produce encrypted predictions of health-related labels. As a proof of concept, we used the same two target diseases mentioned above, and compared the performance of the simple neural network classifiers with their encrypted counterparts on datasets collected in controlled conditions. For the Cold dataset, the baseline neural network achieved a UAR of 66.9%, whereas the encrypted network achieved 66.7%. For the Depression dataset, the baseline value was 60.6%, whereas the encrypted network achieved 60.2% (67.9% at interview level). The slight difference in results showed the validity of our secure approach.
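The key constraint behind such encrypted networks is that FHE schemes evaluate only additions and multiplications, so non-polynomial activations (ReLU, sigmoid) must be replaced by low-degree polynomials. The sketch below illustrates this idea in the clear, using the square activation popularised by CryptoNets [8]; it is not the authors' implementation, and the weights and input are toy values.

```python
# Illustrative plaintext sketch of an FHE-compatible forward pass:
# every operation is an addition or a multiplication, so the same
# computation could be evaluated homomorphically on ciphertexts.

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def fhe_friendly_forward(x, W1, W2):
    # Hidden layer: linear map followed by squaring (one homomorphic
    # multiplication per unit, replacing a non-polynomial activation).
    h = [dot(x, row) ** 2 for row in W1]
    # Output layer: purely linear, hence trivially FHE-compatible.
    return [dot(h, row) for row in W2]

x = [1.0, -2.0, 0.5]                      # client-side features
W1 = [[0.2, 0.1, -0.3], [0.0, 0.4, 0.5]]  # 2 hidden units
W2 = [[1.0, -1.0], [0.5, 0.5]]            # 2 output scores
print(fhe_friendly_forward(x, W1, W2))
```

In the encrypted setting, `x` would arrive as ciphertexts and the same arithmetic would run under the FHE scheme, with the server never seeing the features or the predicted label.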

This approach relies on computing features on the client side before encryption, with only the inference stage computed in an encrypted setting. Ideally, an end-to-end approach would overcome this limitation, but combining convolutional neural networks with FHE imposes severe limitations on their size. Likewise, recurrent layers such as LSTMs (Long Short-Term Memory) require a number of operations too large for current FHE frameworks, making them computationally unfeasible as well.

FHE schemes, by construction, only work with integers, whilst neural networks work with real numbers. By using encoding methods to convert real weights to integers, we forgo an FHE batching technique that would allow us to compute several predictions at the same time using the same encrypted value. Recent advances in machine learning have pushed towards the “quantization” and “discretization” of neural networks, so that models occupy less space and operations consume less power. Some works have already implemented these techniques using homomorphic encryption, such as Binarized Neural Networks [10, 11, 16] and Discretized Neural Networks [2]. The talk will also cover our recent efforts in applying this type of approach to the detection of health-related cues in speech signals, while discretizing the network and maximizing the throughput of its encrypted counterpart.
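The real-to-integer encoding mentioned above is typically a fixed-point scheme: weights are scaled and rounded before encryption, and the scale is tracked through the computation. A minimal sketch, with an illustrative scale factor rather than any scheme's actual parameters:

```python
# Fixed-point encoding of real weights into the integers an FHE scheme
# operates on. The scale factor trades precision against plaintext
# magnitude; 2**10 here is purely illustrative.

SCALE = 2 ** 10

def encode(w: float) -> int:
    """Map a real weight to its fixed-point integer code."""
    return round(w * SCALE)

a, b = 0.75, -1.5
ea, eb = encode(a), encode(b)

# A homomorphic multiplication acts on the integer codes, so the scale
# compounds: the product lives at SCALE ** 2 and must be rescaled
# (or managed by the scheme) after decryption.
prod = ea * eb
print(prod / SCALE ** 2)  # → -1.125
```

Each multiplication deepens the compounded scale, which is one reason low-depth, quantized, or binarized networks fit FHE so much better than deep ones.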

More than presenting our recent work on these two aspects of speech analysis for medical applications, this talk intends to point to different directions for future work in these two relatively unexplored topics, which were by no means exhausted in this summary.

This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with references UID/CEC/50021/2013, and SFRH/BD/103402/2014.


References

  1. Boufounos, P., Rane, S.: Secure binary embeddings for privacy preserving nearest neighbors. In: International Workshop on Information Forensics and Security (WIFS) (2011)
  2. Bourse, F., Minelli, M., Minihold, M., Paillier, P.: Fast homomorphic evaluation of deep discretized neural networks. IACR Cryptology ePrint Archive 2017, 1114 (2017)
  3. Chabanne, H., de Wargny, A., Milgram, J., Morel, C., et al.: Privacy-preserving classification on deep neural network. IACR Cryptology ePrint Archive 2017, 35 (2017)
  4. Correia, J., Raj, B., Trancoso, I., Teixeira, F.: Mining multimodal repositories for speech affecting diseases. In: INTERSPEECH (2018)
  5. Cummins, N., Scherer, S., Krajewski, J., Schnieder, S., Epps, J., Quatieri, T.F.: A review of depression and suicide risk assessment using speech analysis. Speech Commun. 71, 10–49 (2015)
  6. Dias, M., Abad, A., Trancoso, I.: Exploring hashing and cryptonet based approaches for privacy-preserving speech emotion recognition. In: ICASSP. IEEE (2018)
  7. Dibazar, A.A., Narayanan, S., Berger, T.W.: Feature analysis for automatic detection of pathological speech. In: 24th Annual Conference and the Annual Fall Meeting of the Biomedical Engineering Society EMBS/BMES Conference, vol. 1, pp. 182–183. IEEE (2002)
  8. Gilad-Bachrach, R., Dowlin, N., Laine, K., et al.: CryptoNets: applying neural networks to encrypted data with high throughput and accuracy. In: ICML, JMLR Workshop and Conference Proceedings, vol. 48, pp. 201–210 (2016)
  9. Hesamifard, E., Takabi, H., Ghasemi, M.: CryptoDL: deep neural networks over encrypted data. CoRR abs/1711.05189 (2017)
  10. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29, pp. 4107–4115. Curran Associates, Inc., New York (2016)
  11. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Quantized neural networks: training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18, 187:1–187:30 (2017)
  12. López-de-Ipiña, K., et al.: On automatic diagnosis of Alzheimer's disease based on spontaneous speech analysis and emotional temperature. Cogn. Comput. 7(1), 44–55 (2015)
  13. López-de-Ipiña, K., et al.: On the selection of non-invasive methods based on speech analysis oriented to automatic Alzheimer disease diagnosis. Sensors 13(5), 6730–6745 (2013)
  14. Orozco-Arroyave, J.R., et al.: Characterization methods for the detection of multiple voice disorders: neurological, functional, and laryngeal diseases. IEEE J. Biomed. Health Inform. 19(6), 1820–1828 (2015)
  15. Pathak, M.A., Raj, B.: Privacy-preserving speaker verification and identification using Gaussian mixture models. IEEE Trans. Audio Speech Lang. Process. 21(2), 397–406 (2013)
  16. Sanyal, A., Kusner, M.J., Gascón, A., Kanade, V.: TAPAS: tricks to accelerate (encrypted) prediction as a service. CoRR abs/1806.03461 (2018)
  17. Schuller, B., et al.: The INTERSPEECH 2017 computational paralinguistics challenge: addressee, cold & snoring. In: INTERSPEECH (2017)
  18. Teixeira, F., Abad, A., Trancoso, I.: Patient privacy in paralinguistic tasks. In: INTERSPEECH (2018)


Author information

Correspondence to Isabel Trancoso.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Trancoso, I., Correia, J., Teixeira, F., Raj, B., Abad, A. (2018). Analysing Speech for Clinical Applications. In: Dutoit, T., Martín-Vide, C., Pironkov, G. (eds) Statistical Language and Speech Processing. SLSP 2018. Lecture Notes in Computer Science(), vol 11171. Springer, Cham. https://doi.org/10.1007/978-3-030-00810-9_1

  • DOI: https://doi.org/10.1007/978-3-030-00810-9_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00809-3

  • Online ISBN: 978-3-030-00810-9
