
Challenges in Speech Synthesis

  • David Suendermann
  • Harald Höge
  • Alan Black

Abstract

Like other speech- and language-processing disciplines such as speech recognition and machine translation, speech synthesis, the artificial production of human-like speech, has become very powerful over the last 10 years.

Keywords

Speech Recognition · Machine Translation · Speech Data · Speech Synthesis · Word Sequence

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. SpeechCycle, Inc., New York, USA
  2. Siemens Corporate Technology, Munich, Germany
  3. Carnegie Mellon University, Pittsburgh, USA
