Speech Synthesis

Dutoit, Thierry; Bozkurt, Baris

doi:10.1007/978-0-387-30441-0_30

Text-to-speech (TTS) synthesis is the art of designing talking machines. Engineers often regard it as an easy task compared with speech recognition. It is indeed easier to build a poor first-attempt TTS system than to design even a rudimentary speech recognizer.

Author information

Authors and Affiliations

Faculté Polytechnique de Mons, Belgium
Thierry Dutoit
Izmir Institute of Technology (IYTE), Izmir, Turkey
Baris Bozkurt

Editor information

Editors and Affiliations

National Research Council Institute for Microstructural Sciences, Acoustics and Signal Processing Group, 1200 Montreal Road, Ottawa, ON K1A 0R6, Canada
David Havelock
Department of Environmental Psychology, Osaka University Graduate School of Human Sciences, 1-2 Yamadaoka, Suita, Osaka, Japan
Sonoko Kuwano
Institute of Technical Acoustics, RWTH Aachen University, Aachen, Germany
Michael Vorländer

About this chapter

Cite this chapter

Dutoit, T., Bozkurt, B. (2008). Speech Synthesis. In: Havelock, D., Kuwano, S., Vorländer, M. (eds) Handbook of Signal Processing in Acoustics. Springer, New York, NY. https://doi.org/10.1007/978-0-387-30441-0_30

DOI: https://doi.org/10.1007/978-0-387-30441-0_30
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-77698-9
Online ISBN: 978-0-387-30441-0
eBook Packages: Physics and Astronomy (R0)