Learning About Speech from Data: Beyond NETtalk
Speech synthesis is an emerging technology with a wide range of potential applications. In most such applications, the message to be spoken will be in the form of text input, so the main focus of development is text-to-speech (TTS) synthesis. Strongly influenced by the academic traditions of generative linguistics, early work on TTS systems took it as axiomatic that a knowledge-based approach was essential to successful implementation. Presumed theoretical constraints on the ability of humans to learn their native language were extended to machine learners, leading to the conclusion that it was futile to try to make useful ‘blank slate’ inferences about speech and language simply from exposure. This situation has changed dramatically in recent years with the easy availability of computers to act as machine learners and large databases to act as training resources. Many positive achievements in machine learning have demonstrated its usefulness in a range of natural language processing tasks, despite the negative assumptions of earlier times. Thus, contemporary speech synthesis relies heavily on data-driven techniques.
This chapter introduces and motivates the topic of data-driven speech synthesis, and outlines the concepts that will be encountered in the rest of the book. The main problems that any TTS system must solve are: automatic generation of pronunciation, prosodic adjustment, and synthesis of the final output speech. The first of these problems has been quite well studied and it is here that machine-learning techniques have been most obviously applied. Indeed, the problem of text-phoneme conversion (the ‘NETtalk’ problem) has become something of a benchmark in machine learning and, hence, we will have most to say on this topic. As the utility of data-driven methods becomes ever more widely accepted, however, attention is starting to turn to the use of these techniques in other areas of synthesis, most particularly modelling and generation of prosody, and the generation of the output speech itself.
Keywords: Hide Unit; Speech Synthesis; Central Letter; Natural Language Processing Task; Generative Linguistic