Topic Detection and Tracking: Event Clustering as a Basis for First Story Detection

Papka, Ron; Allan, James

doi:10.1007/0-306-47019-5_4

Part of the book series: The Information Retrieval Series (INRE, volume 7)

Abstract

Topic Detection and Tracking (TDT) is a new research area that investigates the organization of information by event rather than by subject. In this paper, we provide an overview of the TDT research program from its inception to the third phase that is now underway. We also discuss our approach to two of the TDT problems in detail. For event clustering (Detection), we show that classic Information Retrieval clustering techniques can be modified slightly to provide effective solutions. For first story detection, we show that similar methods provide satisfactory results, although substantial work remains. In both cases, we explore solutions that model the temporal relationship between news stories. We also investigate the use of phrase extraction to capture the who, what, when, and where contained in news.
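The abstract does not spell out the algorithm, so the following is only a minimal sketch of how clustering can double as first story detection, assuming a generic single-pass scheme: each incoming story is compared with existing clusters, and a story that matches none of them is flagged as a first story. The function names, the whitespace tokenizer, and the `threshold` and `time_penalty` parameters are hypothetical choices for illustration, not the authors' method. The time penalty is one simple way to model the temporal relationship between stories: the longer a cluster has been quiet, the more similar a story must be to join it.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_first_stories(stories, threshold=0.3, time_penalty=0.02):
    """Single-pass clustering over (timestamp, text) pairs in arrival order.

    Returns one (is_first_story, cluster_id) pair per story.
    `threshold` and `time_penalty` are illustrative assumptions: the
    similarity needed to join a cluster grows by `time_penalty` for every
    time unit since that cluster last received a story.
    """
    clusters = []  # each cluster: {"centroid": Counter, "last_seen": float}
    results = []
    for timestamp, text in stories:
        vector = Counter(text.lower().split())
        best_id, best_margin = None, float("-inf")
        for cluster_id, cluster in enumerate(clusters):
            age = timestamp - cluster["last_seen"]
            required = threshold + time_penalty * age
            margin = cosine(vector, cluster["centroid"]) - required
            if margin > best_margin:
                best_id, best_margin = cluster_id, margin
        if best_id is not None and best_margin >= 0.0:
            # Close enough to an existing event: merge into that cluster.
            clusters[best_id]["centroid"].update(vector)
            clusters[best_id]["last_seen"] = timestamp
            results.append((False, best_id))
        else:
            # No cluster is similar enough: treat this as a first story.
            clusters.append({"centroid": Counter(vector), "last_seen": timestamp})
            results.append((True, len(clusters) - 1))
    return results


if __name__ == "__main__":
    stream = [
        (0, "earthquake strikes city"),
        (1, "earthquake strikes city again"),
        (2, "election results announced"),
    ]
    for (timestamp, text), (is_first, cluster_id) in zip(stream, detect_first_stories(stream)):
        label = "FIRST STORY" if is_first else "follow-up"
        print(timestamp, cluster_id, label, text)
```

In this sketch the second story joins the first story's cluster, while the third starts a new one. A practical system would likely replace the raw term counts with weighted features and tune the threshold on held-out data.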



Author information

Authors and Affiliations

Dataware Technologies, 100 Venture Way, Hadley, MA, 01035
Ron Papka
Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst, MA, 01003
James Allan


Editor information

Editors and Affiliations

University of Massachusetts, Amherst
W. Bruce Croft


About this chapter

Cite this chapter

Papka, R., Allan, J. (2002). Topic Detection and Tracking: Event Clustering as a Basis for First Story Detection. In: Croft, W.B. (ed.) Advances in Information Retrieval. The Information Retrieval Series, vol 7. Springer, Boston, MA. https://doi.org/10.1007/0-306-47019-5_4


DOI: https://doi.org/10.1007/0-306-47019-5_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-7812-9
Online ISBN: 978-0-306-47019-6
eBook Packages: Springer Book Archive
