Text-to-speech (TTS) synthesis is the art of designing talking machines. Engineers often see it as an easy task compared to speech recognition. It is indeed easier to build a crude first-attempt TTS system than to design a rudimentary speech recognizer.
© 2008 Springer Science+Business Media, LLC
Cite this chapter
Dutoit, T., Bozkurt, B. (2008). Speech Synthesis. In: Havelock, D., Kuwano, S., Vorländer, M. (eds) Handbook of Signal Processing in Acoustics. Springer, New York, NY. https://doi.org/10.1007/978-0-387-30441-0_30
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-77698-9
Online ISBN: 978-0-387-30441-0
eBook Packages: Physics and Astronomy (R0)