Linguistic Resources, Development, and Evaluation of Text and Speech Systems
Over the past several decades, research and development of human language technology has been driven or hindered by the availability of data, and a number of organizations have arisen to address the demand for greater volumes of linguistic data in a wider variety of languages, with more sophisticated annotation and better quality. A great deal of the linguistic data available today results from common task technology evaluation programs that, at least as implemented in the United States, typically involve objective measures of system performance on a benchmark corpus that are compared with human performance over the same data. Data centres play an important role by distributing and archiving linguistic data, sometimes by collecting and annotating it, and even by coordinating the efforts of other organizations in its creation. Data planning depends upon the purpose of the project, the linguistic resources needed, the internal and external limitations on acquiring them, the availability of data, bandwidth and distribution requirements, available funding, the limits on human annotation, the timeline, and the details of the processing pipeline, including the ability to parallelize steps or the need to serialize them. Language resource creation includes planning, creation of a specification, collection, segmentation, annotation, quality assurance, preparation for use, distribution, adjudication, refinement, and extension. In preparation for publication, shared corpora are generally associated with metadata and documented to indicate the authors and annotators of the data, the volume and types of raw material included, the percentage annotated, the annotation specification, and the quality control measures adopted.
This chapter sketches the issues involved in identifying and evaluating existing language resources and in planning, creating, validating, and distributing new ones, especially those used for developing human language technologies. Specific examples are taken from the collection and annotation of conversational telephone speech and from the adjudication of corpora created to support information retrieval.
Keywords: Language resources; Data; Data centres; Common task evaluation; Specification; Collection; Segmentation; Annotation; Intellectual property rights; Informed consent; Conversational telephone speech; Human subject behaviour; Quality assurance; Inter-annotator agreement; Adjudication; Distribution