Abstract
Topic Detection and Tracking (TDT) is a new research area that investigates the organization of information by event rather than by subject. In this paper, we provide an overview of the TDT research program from its inception to the third phase, which is now underway. We also discuss our approach to two of the TDT problems in detail. For event clustering (Detection), we show that classic Information Retrieval clustering techniques can be modified slightly to provide effective solutions. For first story detection, we show that similar methods provide satisfactory results, although substantial work remains. In both cases, we explore solutions that model the temporal relationship between news stories. We also investigate the use of phrase extraction to capture the who, what, when, and where contained in news.
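To make the idea of clustering-based first story detection with a temporal component concrete, the following is a minimal sketch, not the system described in the chapter. It assumes raw term-frequency vectors, cosine similarity, and an exponential time penalty; the function names and the `threshold` and `decay` parameters are hypothetical choices made only for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_first_stories(stories, threshold=0.2, decay=0.01):
    """Flag stories that look like the first report of an event.

    stories: list of (day, text) pairs in chronological order.
    threshold, decay: illustrative values (assumptions, not from the chapter).
    Returns one boolean per story; True means "no sufficiently similar
    earlier story was found".
    """
    seen = []   # (day, vector) for every story processed so far
    flags = []
    for day, text in stories:
        vec = Counter(text.lower().split())
        best = 0.0
        for past_day, past_vec in seen:
            # Assumed temporal model: similarity to older stories is
            # discounted exponentially with the gap in days.
            sim = cosine(vec, past_vec) * math.exp(-decay * (day - past_day))
            best = max(best, sim)
        flags.append(best < threshold)
        seen.append((day, vec))
    return flags


stories = [
    (0, "earthquake hits city"),
    (1, "earthquake hits city again"),
    (2, "election results announced"),
]
flags = detect_first_stories(stories)
```

In this toy input the first and third stories share no terms with anything earlier, so they are flagged, while the second closely matches the first and is not. A realistic system would need better term weighting and tuned thresholds.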
Copyright information
© 2002 Kluwer Academic Publishers
About this chapter
Cite this chapter
Papka, R., Allan, J. (2002). Topic Detection and Tracking: Event Clustering as a Basis for First Story Detection. In: Croft, W.B. (eds) Advances in Information Retrieval. The Information Retrieval Series, vol 7. Springer, Boston, MA. https://doi.org/10.1007/0-306-47019-5_4
DOI: https://doi.org/10.1007/0-306-47019-5_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-7812-9
Online ISBN: 978-0-306-47019-6
eBook Packages: Springer Book Archive