Skip to main content

Enhancing Text Clustering Performance Using Semantic Similarity

  • Conference paper
Enterprise Information Systems (ICEIS 2009)

Part of the book series: Lecture Notes in Business Information Processing ((LNBIP,volume 24))

Included in the following conference series:

Abstract

Text documents clustering can be challenging due to complex linguistics properties of the text documents. Most of clustering techniques are based on traditional bag of words to represent the documents. In such document representation, ambiguity, synonymy and semantic similarities may not be captured using traditional text mining techniques that are based on words and/or phrases frequencies in the text.

In this paper, we propose a semantic similarity based model to capture the semantic of the text. The proposed model in conjunction with lexical ontology solves the synonyms and hypernyms problems. It utilizes WordNet as an ontology and uses the adapted Lesk algorithm to examine and extract the relationships between terms. The proposed model reflects the relationships by the semantic weighs added to the term frequency weight to represent the semantic similarity between terms.

Experiments using the proposed semantic similarity based model in text clustering are conducted. The obtained results show promising performance improvements compared to the traditional vector space model as well as other existing methods that include semantic similarity measures in text clustering.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)

    Google Scholar 

  2. Hammouda, K., Kamel, M.: Efficient Phrase-based Document Indexing for Web Document Clustering. IEEE Transactions on Knowledge and Data Engineering 16, 1279–1296 (2004)

    Article  Google Scholar 

  3. Shehata, S., Karray, F., Kamel, M.: A Concept-Based Model for Enhancing Text Categorization. In: The 13th ACM SIGKDD international Conference on Knowledge Discovery and Data Mining, pp. 629–637 (2007)

    Google Scholar 

  4. Hotho, A., Staab, S., Stumme, G.: WordNet Improve Text Document Clustering. In: SIGIR 2003 Semantic Web Workshop, pp. 541–544 (2003)

    Google Scholar 

  5. Jing, L., Zhou, L., Ng, M., Huang, Z.: Ontology-based Distance Measure for Text Clustering. In: IAM SDM Workshop on Text Mining (2003)

    Google Scholar 

  6. Sedding, J., Dimitar, K.: WordNet-based Text Document Clustering. In: COLING 2004 3rd Workshop on Robust Methods in Analysis of Natural Language Data, pp. 104–113 (2004)

    Google Scholar 

  7. Wang, Y., Hodges, J.: Document Clustering with Semantic Analysis. In: The 39th Annual Hawaii International Conference on System Sciences, 2006. HICSS 2006, vol. 3, p. 54c (2006)

    Google Scholar 

  8. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge (1998)

    Google Scholar 

  9. Budanitsky, A., Hirst, G.: Evaluating WordNet-based Measures of Lexical Semantic Relatedness. Computational Linguistics 32, 13–47 (2006)

    Article  Google Scholar 

  10. Rada, R., Mili, H., Bickell, E., Blettner, B.: Development and Application of a Metric on Semantic Nets. IEEE Transactions on Systems, Man and Cybernetics 19, 17–30 (1989)

    Article  Google Scholar 

  11. Wu, Z., Palmer, M.: Verb Semantics and Lexical Selection. In: The 32nd Annual Meeting of the Association for Computational Linguistics, pp. 133–138 (1994)

    Google Scholar 

  12. Li, Y., Zuhair, A., McLean, D.: An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources. IEEE Transactions on Knowledge and Data Engineering 15, 871–882 (2003)

    Article  Google Scholar 

  13. Lord, P., Stevens, R., Brass, A., Goble, C.: Semantic Similarity Measures as Tools for Exploring the Gene Ontology. In: The 8th Pacific Symposium on Biocomputing, vol. 8, pp. 601–612 (1997)

    Google Scholar 

  14. Resnik, P.: Using Information Content to Evaluate Semantic Similarity in Taxonomy. In: The 14th international joint conference Artificial Intelligence, pp. 448–453 (1995)

    Google Scholar 

  15. Jiang, J., Conrath, D.: Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In: International Conference on Research in Computational Linguistics, pp. 19–33 (1997)

    Google Scholar 

  16. Tversky, A.: Features of Similarity. Psychological Review 84, 327–352 (1977)

    Article  Google Scholar 

  17. Porter, M.: An algorithm for Suffix Stripping. Program 14, 130–137 (1980)

    Google Scholar 

  18. Lesk, M.: Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone. In: The ACM SIG-DOC Conference, pp. 24–26 (1986)

    Google Scholar 

  19. Banerjee, S., Pedersen, T.: Extended Gloss Overlaps as a Measure of Semantic Relatedness. In: 8th International Joint Conference on Artificial Intelligence (IJCAI 2003), pp. 805–810 (2003)

    Google Scholar 

  20. Patwardhan, S., Banerjee, S., Pedersen, T.: Using Measures of Semantic Relatedness for Word Sense Disambiguation. In: Gelbukh, A. (ed.) CICLING 2003. LNCS, vol. 2588, pp. 241–257. Springer, Heidelberg (2003)

    Chapter  Google Scholar 

  21. Steinbach, M., Karypis, G., Kumar, V.: A Comparison of Document Clustering Techniques. In: KDD Workshop on Text Mining (2000)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gad, W.K., Kamel, M.S. (2009). Enhancing Text Clustering Performance Using Semantic Similarity. In: Filipe, J., Cordeiro, J. (eds) Enterprise Information Systems. ICEIS 2009. Lecture Notes in Business Information Processing, vol 24. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01347-8_28

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-01347-8_28

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-01346-1

  • Online ISBN: 978-3-642-01347-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics