Text-to-speech (TTS) synthesis is the art of designing talking machines. Engineers often regard it as an easy task compared to speech recognition. It is indeed easier to build a crude first-attempt TTS system than to design a rudimentary speech recognizer.

Copyright information

© 2008 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Dutoit, T., Bozkurt, B. (2008). Speech Synthesis. In: Havelock, D., Kuwano, S., Vorländer, M. (eds) Handbook of Signal Processing in Acoustics. Springer, New York, NY. https://doi.org/10.1007/978-0-387-30441-0_30
