C2SI corpus: a database of speech disorder productions to assess intelligibility and quality of life in head and neck cancers


Within the framework of the Carcinologic Speech Severity Index (C2SI) INCa Project, we collected a large database of French speech recordings in order to validate Disorder Severity Indexes. Such a database will be useful for measuring the impact of oral and pharyngeal cavity cancer on speech production, and it will make it possible to assess patients’ quality of life after treatment. The database is composed of audio recordings from 134 sessions and associated metadata. Several levels of intelligibility and comprehensibility of speech functions have been evaluated, and acoustics and prosody have been assessed. Perceptual evaluation ratings from both naive and expert juries are being produced, and automatic analyses are being carried out. The goal is to provide speech therapists and physicians with objective tools that take into account the intelligibility and comprehensibility of patients who have received cancer treatment (surgery and/or radiotherapy and/or chemotherapy). The aim of this paper is to justify the need for such a corpus and to present its data collection. The C2SI corpus will be available to the scientific community through the Scientific Interest Group Parolothèque.

This work was supported by Grant 2014-135 from Institut National pour le CAncer (INCa) in 2014, under the “Sciences Humaines et Sociales, épidémiologie et Santé Publique” call, led by Pr Virginie Woisard at University Hospital of Toulouse, and by Grant ANR-18-CE45-0008 from The French National Research Agency in 2018 for the RUGBI project “Improving the measurement of intelligibility of pathological production disorders impaired speech”, led by Jérôme Farinas at IRIT. We thank the company Voxygen\(^{1}\) for providing us with their speech synthesis platform, which was necessary for producing the DAP corpus.

  • Speech intelligibility and comprehensibility
  • Quality of life assessment
  • Speech corpus
  • Pathological speech