
Corpora for Topic Detection and Tracking

Chapter in: Topic Detection and Tracking

Part of the book series: The Information Retrieval Series (INRE, volume 12)

Abstract

The TDT corpora, developed to support the DARPA-sponsored program in Topic Detection and Tracking, combine data collected over a nine-month period from 8 English and 3 Chinese sources. The published corpora contain audio; reference text, including written news text and transcripts of the broadcast audio; boundary tables segmenting the broadcasts into stories; and relevance tables resulting from millions of human judgments. Sections of the corpora have undergone topic-story, first-story and story-link annotation. Both the TDT-2 and TDT-3 text corpora and the accompanying broadcast audio are now available from the Linguistic Data Consortium. This paper describes the raw material collected for the corpora, the annotation of that material to prepare it for research use and the formats in which it is distributed. Special attention is paid to the quality control measures developed for these data sets.

The Linguistic Data Consortium’s work in building the TDT-2 and TDT-3 corpora was supported in part by grant IRI-9528587 from the Information and Intelligent Systems division of the National Science Foundation.



© 2002 Springer Science+Business Media New York

About this chapter

Cite this chapter

Cieri, C., Strassel, S., Graff, D., Martey, N., Rennert, K., Liberman, M. (2002). Corpora for Topic Detection and Tracking. In: Allan, J. (ed.) Topic Detection and Tracking. The Information Retrieval Series, vol 12. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0933-2_3

  • DOI: https://doi.org/10.1007/978-1-4615-0933-2_3

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-5311-9

  • Online ISBN: 978-1-4615-0933-2

  • eBook Packages: Springer Book Archive
