
Topic Detection and Tracking: Event Clustering as a Basis for First Story Detection

Part of the book series: The Information Retrieval Series ((INRE,volume 7))

Abstract

Topic Detection and Tracking (TDT) is a new research area that investigates the organization of information by event rather than by subject. In this paper, we provide an overview of the TDT research program from its inception to the third phase, which is now underway. We also discuss our approach to two of the TDT problems in detail. For event clustering (Detection), we show that classic Information Retrieval clustering techniques can be modified slightly to provide effective solutions. For first story detection, we show that similar methods provide satisfactory results, although substantial work remains. In both cases, we explore solutions that model the temporal relationship between news stories. We also investigate the use of phrase extraction to capture the who, what, when, and where contained in news.
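
As a rough illustration of the kind of approach the abstract describes, the sketch below flags a story as a "first story" when no earlier story is sufficiently similar to it, with similarity discounted by the time gap between the two stories. This is not the authors' system: the term-frequency cosine measure, the linear time penalty, and the names `tf_vector`, `cosine`, `detect_first_stories`, `threshold`, and `decay` are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector over lowercased whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_first_stories(stories, threshold=0.3, decay=0.01):
    """Single-pass first story detection (illustrative assumption, not the paper's method).

    stories: list of (timestamp, text) pairs in arrival order.
    Returns a list of booleans, True where a story is flagged as a first story.
    """
    seen = []   # (timestamp, vector) for every story processed so far
    flags = []
    for ts, text in stories:
        vec = tf_vector(text)
        best = 0.0
        for old_ts, old_vec in seen:
            # Older stories count for less: subtract a penalty that grows with the time gap.
            sim = cosine(vec, old_vec) - decay * (ts - old_ts)
            best = max(best, sim)
        # No sufficiently similar earlier story means this one starts a new event.
        flags.append(best < threshold)
        seen.append((ts, vec))
    return flags


stories = [
    (0, "earthquake hits city"),
    (1, "earthquake hits city again"),
    (2, "election results announced"),
]
print(detect_first_stories(stories))
```

In this toy stream the first and third stories are flagged as new events, while the second is treated as a follow-up to the first. The same comparison, applied against cluster centroids rather than individual stories, would give a simple single-pass event clustering.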




Copyright information

© 2002 Kluwer Academic Publishers

About this chapter

Cite this chapter

Papka, R., Allan, J. (2002). Topic Detection and Tracking: Event Clustering as a Basis for First Story Detection. In: Croft, W.B. (ed.) Advances in Information Retrieval. The Information Retrieval Series, vol 7. Springer, Boston, MA. https://doi.org/10.1007/0-306-47019-5_4

  • DOI: https://doi.org/10.1007/0-306-47019-5_4

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-7923-7812-9

  • Online ISBN: 978-0-306-47019-6

  • eBook Packages: Springer Book Archive
