Corpus Handling

  • Niels Ole Bernsen
  • Laila Dybkjær
  • Hans Dybkjær

Abstract

A corpus is a collection or body of linguistic data, organized in a manner that facilitates investigation of, and reference to, the data. By today’s standards, corpora are in machine-readable form. Dictionary publishers maintain corpora of citations and word uses, and researchers collect huge corpora (millions of words) of texts of all kinds for many different purposes. Corpus linguistics is both a well-established discipline and an active research area (McEnery and Wilson, 1996). A growing subdiscipline focuses on spoken language (Leech et al., 1995). Spoken corpora are collections of spoken language, usually transcribed, such as monologues, interviews, conversations or task-oriented dialogues. This chapter focuses on the transcription, markup and coding of spoken dialogue corpora, emphasizing the representations, procedures and tools that are relevant to the design of interactive speech systems.

Keywords

  • User Test
  • Annotate Corpus
  • Document Type Definition
  • User Utterance
  • Customer Number

Copyright information

© Springer-Verlag London Limited 1998

Authors and Affiliations

  • Niels Ole Bernsen (1)
  • Laila Dybkjær (1)
  • Hans Dybkjær (2)

  1. Maersk Mc-Kinney Moller Institute for Production Technology, Odense University, Denmark
  2. Prolog Development Center A/S, Copenhagen, Denmark