Abstract
With the rapid growth of text documents, document clustering has become one of the main techniques for organizing large numbers of documents into a small number of meaningful clusters. However, document clustering still faces several challenges, such as high dimensionality, scalability, accuracy, meaningful cluster labels, and the extraction of semantics from text. To improve the quality of document clustering results, we propose an effective Fuzzy Frequent Itemset-based Document Clustering (F2IDC) approach that combines fuzzy association rule mining with the background knowledge embedded in WordNet. A term hierarchy generated from WordNet is used to discover fuzzy frequent itemsets, which serve as candidate cluster labels for grouping documents. We have conducted experiments to evaluate our approach on the Reuters-21578 dataset. The experimental results show that our proposed method achieves higher clustering accuracy than FIHC, HFTC, and UPGMA.
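The abstract does not describe the mining procedure itself, so the following is only a minimal sketch of the general idea of fuzzy frequent itemsets over hierarchy-enriched documents, not the authors' F2IDC algorithm. Everything in it is an assumption made for illustration: the toy `HYPERNYMS` map stands in for a WordNet-derived term hierarchy, the piecewise-linear `membership` function and its breakpoints are hypothetical, and fuzzy support is taken as the mean of the minimum membership across an itemset.

```python
from itertools import combinations

# Hypothetical toy hypernym map standing in for a WordNet-derived hierarchy.
HYPERNYMS = {"wheat": "grain", "corn": "grain", "oil": "fuel", "gas": "fuel"}

def enrich(doc):
    """Return a copy of a term-frequency map with hypernym counts added."""
    out = dict(doc)
    for term, freq in doc.items():
        parent = HYPERNYMS.get(term)
        if parent is not None:
            out[parent] = out.get(parent, 0) + freq
    return out

def membership(freq, region):
    """Assumed piecewise-linear membership of a frequency in Low/Mid/High."""
    if freq <= 0:
        return 0.0  # an absent term belongs to no fuzzy region
    if region == "Low":
        return max(0.0, min(1.0, (3 - freq) / 2))  # 1 at freq 1, 0 from freq 3
    if region == "Mid":
        return max(0.0, 1 - abs(freq - 3) / 2)     # peak at 3, 0 at 1 and 5
    return max(0.0, min(1.0, (freq - 3) / 2))      # High: 0 up to 3, 1 from 5

def fuzzy_support(docs, itemset):
    """Mean over documents of the minimum membership across (term, region) items."""
    total = 0.0
    for doc in docs:
        total += min(membership(doc.get(t, 0), r) for t, r in itemset)
    return total / len(docs)

def frequent_itemsets(docs, min_support, max_size=2):
    """Enumerate fuzzy itemsets up to max_size whose support meets min_support."""
    docs = [enrich(d) for d in docs]
    terms = sorted({t for d in docs for t in d})
    items = [(t, r) for t in terms for r in ("Low", "Mid", "High")]
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            if len({t for t, _ in combo}) < size:
                continue  # skip itemsets that use the same term twice
            support = fuzzy_support(docs, combo)
            if support >= min_support:
                result[combo] = support
    return result

if __name__ == "__main__":
    docs = [{"wheat": 5, "oil": 1}, {"corn": 5}, {"gas": 5}]
    for itemset, support in frequent_itemsets(docs, 0.5, max_size=1).items():
        print(itemset, round(support, 3))
```

In this toy corpus neither "wheat" nor "corn" is frequent on its own, but their shared hypernym "grain" is, which illustrates why a term hierarchy can surface more general candidate cluster labels. The brute-force enumeration is for readability only; a practical miner would prune candidates level by level.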
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chen, CL., Tseng, F.S.C., Liang, T. (2009). An Integration of Fuzzy Association Rules and WordNet for Document Clustering. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_16
DOI: https://doi.org/10.1007/978-3-642-01307-2_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01306-5
Online ISBN: 978-3-642-01307-2
eBook Packages: Computer Science (R0)