Frequent Itemset Based Hierarchical Document Clustering Using Wikipedia as External Knowledge

G.V.R., Kiran; Shankar, Ravi; Pudi, Vikram

doi:10.1007/978-3-642-15390-7_2

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6277)

Included in the following conference series:

International Conference on Knowledge-Based and Intelligent Information and Engineering Systems

Abstract

High dimensionality is a major challenge in document clustering. Some recent algorithms address this problem by using frequent itemsets for clustering, but most of them neglect the semantic relationships between words. Other algorithms capture these semantic relationships by drawing on external knowledge sources such as WordNet, MeSH, and Wikipedia, but do not handle the high dimensionality. In this paper we present an efficient solution that addresses both problems. We propose a hierarchical clustering algorithm based on closed frequent itemsets that uses Wikipedia as external knowledge to enhance the document representation. We evaluate our methods using F-Score on standard datasets and show that our results are better than those of existing approaches.
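To make the term "closed frequent itemsets" concrete, the following is a minimal sketch that treats each document as a set of terms and finds the itemsets that are frequent and closed (no proper superset has the same support). It is a brute-force, level-wise illustration and not the authors' algorithm; the toy documents, the function names `frequent_itemsets` and `closed_itemsets`, and the `min_support` value are all assumptions made for the example.

```python
from itertools import combinations


def frequent_itemsets(docs, min_support):
    """Return {itemset: support count} for every term set that occurs
    in at least min_support documents (level-wise, Apriori-style)."""
    transactions = [frozenset(d) for d in docs]

    # Level 1: count single terms.
    counts = {}
    for t in transactions:
        for term in t:
            key = frozenset([term])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    # Level k: join frequent (k-1)-itemsets and keep the frequent results.
    k = 2
    while current:
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == k}
        current = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                current[cand] = support
        frequent.update(current)
        k += 1
    return frequent


def closed_itemsets(frequent):
    """Keep only itemsets with no proper superset of equal support."""
    closed = {}
    for itemset, support in frequent.items():
        absorbed = any(itemset < other and support == other_support
                       for other, other_support in frequent.items())
        if not absorbed:
            closed[itemset] = support
    return closed


# Hypothetical toy corpus: each document is reduced to its set of terms.
docs = [
    ["cluster", "document", "itemset"],
    ["cluster", "document", "wikipedia"],
    ["cluster", "document", "itemset", "wikipedia"],
    ["wikipedia", "semantic"],
]

frequent = frequent_itemsets(docs, min_support=2)
closed = closed_itemsets(frequent)
for itemset, support in sorted(closed.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), support)
```

On this toy corpus, eleven itemsets are frequent but only four are closed, which illustrates why closed itemsets give a more compact set of candidate cluster labels than all frequent itemsets. A practical system would use a dedicated closed-itemset miner rather than this exhaustive check.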

Author information

Authors and Affiliations

International Institute of Information Technology, Hyderabad
Kiran G.V.R., Ravi Shankar & Vikram Pudi

Editor information

Editors and Affiliations

School of Engineering, The Parade, Cardiff University, CF24 3AA, Cardiff, UK
Rossitza Setchi
Dept. of Computer Science and Software Engineering, Buckingham Building, Lion Terrace, University of Portsmouth, PO1 3HE, Portsmouth, UK
Ivan Jordanov
KES International, 145-157 St. John Street, EC1V 4PY, London, UK
Robert J. Howlett
School of Electrical and Information Engineering, University of South Australia, Adelaide, Mawson Lakes Campus, 5095, SA, Australia
Lakhmi C. Jain

About this paper

Cite this paper

G.V.R., K., Shankar, R., Pudi, V. (2010). Frequent Itemset Based Hierarchical Document Clustering Using Wikipedia as External Knowledge. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems. KES 2010. Lecture Notes in Computer Science, vol 6277. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15390-7_2

DOI: https://doi.org/10.1007/978-3-642-15390-7_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15389-1
Online ISBN: 978-3-642-15390-7
eBook Packages: Computer Science, Computer Science (R0)