Application of Feature Extraction in Text-to-Speech Processing

  • Václav Šebesta
  • Jana Tučková
Conference paper


Real-time speech synthesis with an unlimited vocabulary is a very complicated task in any language. Synthesizers usually work in the frequency domain, and the fundamental frequency F0 and the duration must be determined for every phoneme or diphone by conventional means based on linguistic rules [4]. Our goal is to minimize the difference between the synthesizer's output, which is usually rather monotonous, and natural human speech. For this reason, a special functional block for prosody control is included in the synthesizer. In our case, a multilayer artificial neural network (ANN) performs the prosody control: in this part of the synthesizer, the fundamental frequency is slightly modified so that the speech sounds as natural as possible.
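As a rough illustration of the idea (not the paper's actual network), the prosody-control block can be pictured as a small multilayer network that maps per-phoneme features to a gentle multiplicative correction of the baseline F0. All layer sizes, feature names, and the bounded-correction trick below are illustrative assumptions, not details from the paper:

```python
import numpy as np

# Hypothetical sketch of a one-hidden-layer MLP that turns per-phoneme
# features (e.g. position in phrase, stress flag, phoneme class) into a
# small multiplicative correction of the baseline F0. Sizes and the
# +/-20% correction bound are illustrative assumptions.
rng = np.random.default_rng(0)

N_FEATURES = 6   # per-phoneme input features (assumed)
N_HIDDEN = 8     # hidden units (assumed)

W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_FEATURES))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(scale=0.1, size=(1, N_HIDDEN))
b2 = np.zeros(1)

def f0_correction(x):
    """Forward pass: returns a multiplicative F0 correction near 1.0."""
    h = np.tanh(W1 @ x + b1)
    # tanh keeps the output bounded, so the correction stays in [0.8, 1.2]
    return 1.0 + 0.2 * np.tanh(W2 @ h + b2)[0]

baseline_f0 = 120.0  # Hz, a flat synthesizer contour (assumed value)
features = rng.normal(size=N_FEATURES)
corrected_f0 = baseline_f0 * f0_correction(features)
```

Training such a network (e.g. by backpropagation on natural-speech F0 contours) is what lets the synthesizer deviate from its monotonous rule-based contour.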

The number of input parameters for ANN training must generally be kept as small as possible to preserve the network's generalization ability. This paper describes an original method for determining the most important features (input parameters) for training an ANN for prosody control. The method is based on data mining in the database of training patterns using the GUHA method described in [2].
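GUHA itself is a hypothesis-generation method that searches the training database for well-supported elementary conjunctions of attributes; the paper's specifics are in [2]. As a loose stand-in that only conveys the general idea of keeping the most relevant inputs, one can rank candidate features by their absolute correlation with the target quantity (here, F0). This substitute technique and all names below are illustrative, not the authors' method:

```python
import numpy as np

# Illustrative stand-in for feature selection: rank candidate input
# features by |correlation| with the target. The paper's actual method
# (GUHA) mines supported hypotheses from the training database instead;
# this simple ranking only sketches the "keep the important inputs" idea.
def rank_features(X, y):
    """Return feature indices sorted by |corr(X[:, j], y)|, descending."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(range(X.shape[1]), key=lambda j: -scores[j])

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))          # 4 candidate features (synthetic)
y = 3.0 * X[:, 2] + 0.1 * rng.normal(size=200)  # feature 2 dominates
ranking = rank_features(X, y)          # feature 2 should rank first
```

In the paper's setting, the top-ranked features would become the reduced ANN input vector, shrinking the network and improving generalization.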






References

  [1] Tučková J., Šebesta V.: Prosody Modeling for a Text-to-Speech System by Artificial Neural Networks. Proc. IASTED Int. Conf. "Signal and Image Processing 2000", IASTED/ACTA Press, November 2000, Las Vegas, USA, pp. 312–317.
  [2] Hájek P., Sochorová A., Zvárová J.: GUHA for Personal Computers. Computational Statistics and Data Analysis, Vol. 19, 1995, North Holland, pp. 149–153.
  [3] Šebesta V.: Pruning of Neural Networks by Statistical Optimization. Proc. of the 6th School of Neural Networks, Theory and Applications, Microcomputer '94, Sedmihorky, Czech Rep., September 1994, pp. 209–214, ISBN 80-2140564-3.
  [4] Tučková J., Vích R.: Fundamental Frequency Control in Czech Text-to-Speech Synthesis. Proc. IASTED Int. Conference SIP'97, ISBN 0-88986-247-7, New Orleans, Louisiana, USA, December 1997, pp. 85–87.
  [5] Vích R.: Pitch Synchronous Linear Predictive Czech and Slovak Text-to-Speech Synthesis. Proc. of the 15th Internat. Congress on Acoustics, ICA'95, Trondheim, Norway, June 1995.
  [6] Sejnowski T. J., Rosenberg C. R.: NETtalk: A Parallel Network that Learns to Read Aloud. The Johns Hopkins University, Electrical Engineering and Computer Science, Technical Report JHU/EECS-86/01, 32 p.

Copyright information

© Springer-Verlag Wien 2001

Authors and Affiliations

  • Václav Šebesta (1)
  • Jana Tučková (2)
  1. Institute of Computer Science, Academy of Sciences of the Czech Republic, Czech Republic
  2. Faculty of Electrical Engineering, Czech Technical University, Czech Republic
