Abstract
Text document clustering can be challenging because of the complex linguistic properties of text. Most clustering techniques represent documents with the traditional bag-of-words model. With this representation, traditional text mining techniques that rely on word and/or phrase frequencies may fail to capture ambiguity, synonymy, and semantic similarity.
In this paper, we propose a semantic similarity based model to capture the semantics of the text. Used in conjunction with a lexical ontology, the proposed model solves the synonym and hypernym problems. It uses WordNet as the ontology and the adapted Lesk algorithm to examine and extract the relationships between terms. The model reflects these relationships through semantic weights that are added to the term frequency weight to represent the semantic similarity between terms.
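As a rough illustration of adding semantic weights to term frequency weights, the following minimal Python sketch boosts each term's frequency by the frequencies of semantically related terms in the same document. It is a sketch under stated assumptions, not the paper's exact formulation: the combination rule (frequency plus similarity-scaled frequencies of related terms), the `threshold` parameter, the function names, and the hand-written `TOY_SIMILARITY` table are all hypothetical. In the proposed model the similarity scores would come from WordNet through the adapted Lesk measure rather than from a fixed table.

```python
from collections import Counter

# Hypothetical similarity scores in [0, 1]. These values are invented for
# illustration; they stand in for scores derived from WordNet glosses.
TOY_SIMILARITY = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("car", "vehicle")): 0.6,
    frozenset(("automobile", "vehicle")): 0.6,
}


def similarity(a, b):
    """Return the toy similarity of two terms (1.0 for identical terms)."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


def semantic_weights(tokens, threshold=0.5):
    """Add a semantic weight to each term's raw frequency.

    Assumed rule: weight(t) = tf(t) + sum of sim(t, u) * tf(u) over the other
    terms u in the document whose similarity to t reaches the threshold.
    """
    tf = Counter(tokens)
    weights = {}
    for term, freq in tf.items():
        boost = sum(
            similarity(term, other) * other_freq
            for other, other_freq in tf.items()
            if other != term and similarity(term, other) >= threshold
        )
        weights[term] = freq + boost
    return weights


doc = ["car", "car", "automobile", "vehicle", "banana"]
weights = semantic_weights(doc)
for term in sorted(weights, key=weights.get, reverse=True):
    print(f"{term}: {weights[term]:.2f}")
```

In this toy document the related terms "car", "automobile", and "vehicle" reinforce one another, while "banana" keeps its raw frequency because nothing in the document is similar to it.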
We conduct text clustering experiments using the proposed semantic similarity based model. The results show promising performance improvements over the traditional vector space model and over other existing methods that incorporate semantic similarity measures into text clustering.
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gad, W.K., Kamel, M.S. (2009). Enhancing Text Clustering Performance Using Semantic Similarity. In: Filipe, J., Cordeiro, J. (eds) Enterprise Information Systems. ICEIS 2009. Lecture Notes in Business Information Processing, vol 24. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01347-8_28
DOI: https://doi.org/10.1007/978-3-642-01347-8_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01346-1
Online ISBN: 978-3-642-01347-8
eBook Packages: Computer Science (R0)