A corpus is a collection or body of linguistic data, organized in a manner that facilitates investigation of, and reference to, the data. By today’s standards, corpora are in machine-readable form. Dictionary publishers maintain corpora of citations and word uses, and researchers collect large corpora (millions of words) of texts of all kinds for many different purposes. Corpus linguistics is both a well-established discipline and an active research area (McEnery and Wilson, 1996). A growing subdiscipline focuses on spoken language (Leech et al., 1995). Spoken corpora are collections of spoken language, usually transcribed, such as monologues, interviews, conversations or task-oriented dialogues. This chapter focuses on the transcription, markup and coding of spoken dialogue corpora, emphasizing the representations, procedures and tools that are relevant to the design of interactive speech systems.
Keywords: User Test; Annotate Corpus; Document Type Definition; User Utterance; Customer Number