Abstract
This work provides algorithms and heuristics to index text documents by determining the important topics in the documents. To index text documents, the work provides algorithms to generate topic candidates, determine their importance, detect similar and synonymous topics, and eliminate incoherent topics. The indexing algorithm uses topic frequency to determine both the existence and the importance of topics. Repeated phrases are topic candidates. For example, since the phrase ‘index text documents’ occurs three times in this abstract, the phrase is one of the topics of this abstract. It is shown that this method is more effective than either a simple word count model or approaches based on term weighting.
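The idea that repeated phrases serve as topic candidates can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify in detail. It assumes that a "phrase" is a contiguous word n-gram, that matching is case-insensitive, and that a phrase qualifies once it occurs at least twice. The function name `repeated_phrases`, its parameters, and the sample text are hypothetical. The sketch also ignores sentence boundaries and does not cover the later steps named in the abstract (detecting similar or synonymous topics, eliminating incoherent ones).

```python
import re
from collections import Counter


def repeated_phrases(text, min_len=2, max_len=4, min_count=2):
    """Return word n-grams (min_len..max_len words) occurring at least min_count times.

    Assumption: a topic candidate is any repeated contiguous word sequence.
    N-grams may span sentence boundaries in this simplified version.
    """
    # Lowercase and keep only alphanumeric tokens.
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    # Keep only phrases that repeat often enough to count as candidates.
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


# Hypothetical sample text, written so that one phrase repeats three times.
sample = (
    "We index text documents by topic. "
    "To index text documents, count phrases. "
    "The phrase index text documents repeats."
)

if __name__ == "__main__":
    # Print candidates from most to least frequent, longer phrases first on ties.
    candidates = repeated_phrases(sample)
    for phrase, count in sorted(candidates.items(), key=lambda kv: (-kv[1], -len(kv[0]))):
        print(count, phrase)
```

In this sketch the sub-phrases "index text" and "text documents" are reported alongside "index text documents". A fuller indexer would need a rule for merging or discarding such nested candidates, which is one reason frequency alone is only a starting point.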
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Butarbutar, M., McRoy, S. (2004). Indexing Text Documents Based on Topic Identification. In: Apostolico, A., Melucci, M. (eds) String Processing and Information Retrieval. SPIRE 2004. Lecture Notes in Computer Science, vol 3246. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30213-1_15
DOI: https://doi.org/10.1007/978-3-540-30213-1_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23210-0
Online ISBN: 978-3-540-30213-1
eBook Packages: Springer Book Archive