Validation of Text Clustering Based on Document Contents

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 2123))

Abstract

This paper presents results from a new text clustering methodology. A prototype is a document of interest, or an extracted passage of interesting text. The prototype is matched against an existing document database or a monitored document flow. Our claim is that the methodology is capable of automatic content-based clustering using the information contained in the documents. To test this hypothesis, an experiment was designed using the Bible. Four translations were selected as test material: one Greek, one Latin, and two Finnish (from 1933/38 and 1992). The validation experiments were performed with a prototype version of the software application.
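The abstract does not describe how a prototype is compared with the documents, so the following is only a minimal sketch of the general idea of matching a prototype text against a collection. It uses plain bag-of-words cosine similarity as a stand-in, which is an assumption and not the method evaluated in the paper. The function names (`term_vector`, `cosine_similarity`, `rank_by_prototype`) and the toy documents are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text and count word occurrences (bag of words)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_prototype(prototype, documents):
    """Return (name, score) pairs sorted from most to least similar."""
    proto_vec = term_vector(prototype)
    scores = [
        (name, cosine_similarity(proto_vec, term_vector(text)))
        for name, text in documents.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data, not the corpus used in the paper.
documents = {
    "doc_a": "the shepherd led the flock to green pastures",
    "doc_b": "the merchant sold cloth in the market",
    "doc_c": "a shepherd watched the flock by night",
}
ranking = rank_by_prototype("shepherd and flock", documents)
```

In this toy run, `doc_b` shares no words with the prototype and therefore scores 0, while the two documents that mention the shepherd and the flock rank above it. Grouping documents by their closest prototype would be one simple way to turn such scores into clusters, though the paper's own procedure may differ.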



© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Toivonen, J., Visa, A., Vesanen, T., Back, B., Vanharanta, H. (2001). Validation of Text Clustering Based on Document Contents. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2001. Lecture Notes in Computer Science, vol 2123. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44596-X_15

  • DOI: https://doi.org/10.1007/3-540-44596-X_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42359-1

  • Online ISBN: 978-3-540-44596-8

  • eBook Packages: Springer Book Archive
