Indexing Text Documents Based on Topic Identification

  • Conference paper
String Processing and Information Retrieval (SPIRE 2004)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 3246))

Abstract

This work provides algorithms and heuristics to index text documents by determining important topics in the documents. To index text documents, the work provides algorithms to generate topic candidates, determine their importance, detect similar and synonymous topics, and eliminate incoherent topics. The indexing algorithm uses topic frequency to determine both the importance and the existence of topics. Repeated phrases are topic candidates. For example, since the phrase 'index text documents' occurs three times in this abstract, the phrase is one of the topics of this abstract. It is shown that this method is more effective than either a simple word count model or approaches based on term weighting.
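The repeated-phrase idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it only counts word n-grams and keeps those that repeat, and it ignores the synonym detection and coherence filtering the abstract also mentions. The function name `repeated_phrases`, its parameters `n` and `min_count`, and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter


def repeated_phrases(text, n=3, min_count=2):
    """Return word n-grams that occur at least min_count times, with their counts.

    Hypothetical helper: a frequency-only stand-in for topic candidate
    generation, not the method described in the paper.
    """
    # Lowercase and keep alphabetic tokens only, so punctuation and
    # quotation marks do not prevent otherwise identical phrases from matching.
    words = re.findall(r"[a-z]+", text.lower())
    # Slide a window of n words over the token list.
    # Simplification: windows may span sentence boundaries.
    ngrams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = Counter(ngrams)
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


sample = (
    "We index text documents by topic. "
    "To index text documents, we count phrases. "
    "The phrase index text documents repeats."
)
candidates = repeated_phrases(sample)
```

With this sample, the only trigram that repeats is "index text documents", which mirrors the example given in the abstract. A fuller treatment would likely vary the phrase length and merge near-duplicate phrases, which this sketch leaves out.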

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Butarbutar, M., McRoy, S. (2004). Indexing Text Documents Based on Topic Identification. In: Apostolico, A., Melucci, M. (eds) String Processing and Information Retrieval. SPIRE 2004. Lecture Notes in Computer Science, vol 3246. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30213-1_15

  • DOI: https://doi.org/10.1007/978-3-540-30213-1_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23210-0

  • Online ISBN: 978-3-540-30213-1

  • eBook Packages: Springer Book Archive
