Enhancement of esophageal speech obtained by a voice conversion technique using time dilated Fourier cepstra

  • Imen Ben Othmane
  • Joseph Di Martino
  • Kaïs Ouni


This paper presents a novel speaking-aid system for enhancing esophageal speech (ES). The method combines a voice conversion technique with a time dilation algorithm to improve the quality of esophageal speech. In the proposed system, a deep neural network (DNN) serves as a nonlinear mapping function for vocal tract vector transformation. The converted frames are then used to select realistic excitation and phase vectors from the target training space with a frame selection algorithm. Next, to preserve the identity of the esophageal speakers, we retain the source vocal tract features and apply a time dilation algorithm to them in order to reduce the unpleasant esophageal noises. Finally, the converted speech is reconstructed from the dilated source vocal tract frames and the predicted excitation and phase. DNN- and Gaussian mixture model (GMM)-based voice conversion systems were evaluated using objective and subjective measures; this experimental study also assessed the changes in speech quality and intelligibility of the transformed signals. Experimental results demonstrate that the proposed methods considerably improve the intelligibility and naturalness of the converted esophageal speech.
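The frame selection step sketched in the abstract can be illustrated with a minimal toy example. The sketch below is hypothetical (function name, feature dimensions, and data are invented for illustration; the paper's actual features are time dilated Fourier cepstra): for each converted vocal tract frame, it returns the excitation and phase values attached to the nearest frame, in the Euclidean sense, of the target training space.

```python
import numpy as np

def select_excitation_phase(converted, target_feats, target_excit, target_phase):
    """For each row of `converted`, return the excitation and phase of the
    nearest (Euclidean) frame in the target training space."""
    selected_excit, selected_phase = [], []
    for frame in converted:
        # Distance from this converted frame to every target training frame.
        dists = np.linalg.norm(target_feats - frame, axis=1)
        idx = int(np.argmin(dists))
        selected_excit.append(target_excit[idx])
        selected_phase.append(target_phase[idx])
    return np.array(selected_excit), np.array(selected_phase)

# Toy target training space: 3 frames with 2-dim vocal tract features,
# each paired with a made-up scalar excitation and phase value.
target_feats = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
target_excit = np.array([10.0, 20.0, 30.0])
target_phase = np.array([0.1, 0.2, 0.3])

# Two converted frames; each snaps to its nearest training frame.
converted = np.array([[0.9, 0.1], [0.1, 0.8]])
excit, phase = select_excitation_phase(converted, target_feats,
                                       target_excit, target_phase)
print(excit, phase)  # → [20. 30.] [0.2 0.3]
```

In the actual system, the selected excitation and phase vectors would be combined with the dilated source vocal tract frames to resynthesize the enhanced signal.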


Esophageal speech · Voice conversion · Deep neural networks · Time dilation algorithm · Noise reduction · Excitation and phase · Gaussian mixture model



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Research Unit Signals and Mechatronic Systems (SMS, UR13ES49), National Engineering School of Carthage (ENICarthage), University of Carthage, Carthage, Tunisia
  2. Loria - Laboratoire Lorrain de Recherche en Informatique et ses Applications, Vandœuvre-lès-Nancy, France
