Clustering XML Documents Using Frequent Subtrees

Kutty, Sangeetha; Tran, Tien; Nayak, Richi; Li, Yuefeng

doi:10.1007/978-3-642-03761-0_45

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5631)

Included in the following conference series:

International Workshop of the Initiative for the Evaluation of XML Retrieval

Abstract

This paper presents an experimental study conducted over the INEX 2008 Document Mining Challenge corpus that uses both the structure and the content of XML documents to cluster them. Concise common substructures, known as closed frequent subtrees, are generated from the structural information of the XML documents. The closed frequent subtrees are then used to extract constrained content from the documents. A matrix containing the term distribution of the documents in the dataset is built from the extracted constrained content. The k-way clustering algorithm is applied to this matrix to obtain the required clusters. Despite the large number of documents in the INEX 2008 Wikipedia dataset, the proposed frequent subtree-based approach clustered the documents successfully. The approach significantly reduces the dimensionality of the terms used for clustering without much loss in accuracy.
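The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: as a loudly simplified assumption, root-to-element tag paths that occur in enough documents stand in for closed frequent subtrees, and a cosine similarity between rows stands in for the input a k-way clustering tool would consume. The function names (`tag_paths`, `frequent_paths`, `term_matrix`, `cosine`), the `min_support` parameter, and the toy documents are all hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(root):
    """Yield (path, text) for every element; path is the tag chain from the root."""
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        yield path, (node.text or "").strip()
        for child in node:
            stack.append((child, path + "/" + child.tag))


def frequent_paths(docs, min_support):
    """Paths present in at least min_support documents.

    Assumption: a simple stand-in for closed frequent subtrees.
    """
    support = Counter()
    for root in docs:
        # A set ensures each path counts once per document.
        support.update({path for path, _ in tag_paths(root)})
    return {path for path, count in support.items() if count >= min_support}


def term_matrix(docs, keep):
    """Build term-frequency rows using only text found under the kept paths."""
    rows = []
    for root in docs:
        terms = Counter()
        for path, text in tag_paths(root):
            if path in keep:
                terms.update(text.lower().split())
        rows.append(terms)
    vocab = sorted(set().union(*rows))
    return vocab, [[row[term] for term in vocab] for row in rows]


def cosine(a, b):
    """Cosine similarity between two term-frequency rows."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy corpus (hypothetical): the <note> element appears in one document only.
xml_docs = [
    "<article><title>xml mining</title><body>tree mining</body></article>",
    "<article><title>xml clustering</title><body>tree clustering</body>"
    "<note>draft</note></article>",
    "<article><title>cooking pasta</title><body>boil pasta</body></article>",
]
docs = [ET.fromstring(x) for x in xml_docs]
keep = frequent_paths(docs, min_support=3)
vocab, matrix = term_matrix(docs, keep)
```

In this toy run the infrequent `article/note` path is discarded, so its text never enters the vocabulary. That is the sense in which structure constrains content and shrinks the term space before clustering; the rows of `matrix` would then be passed to whatever clustering algorithm is in use.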

Author information

Authors and Affiliations

Faculty of Science and Technology, Queensland University of Technology, GPO Box 2434, Brisbane, Qld, 4001, Australia
Sangeetha Kutty, Tien Tran, Richi Nayak & Yuefeng Li

Editor information

Editors and Affiliations

Faculty of Science and Technology, Queensland University of Technology, GPO Box 2434, Brisbane, Qld, 4001, Australia
Shlomo Geva
Archives and Information Studies/Humanities, University of Amsterdam, Turfdraagsterpad 9, 1012 XT, Amsterdam, The Netherlands
Jaap Kamps
Department of Computer Science, University of Otago, P.O. Box 56, 9054, Dunedin, New Zealand
Andrew Trotman

About this paper

Cite this paper

Kutty, S., Tran, T., Nayak, R., Li, Y. (2009). Clustering XML Documents Using Frequent Subtrees. In: Geva, S., Kamps, J., Trotman, A. (eds) Advances in Focused Retrieval. INEX 2008. Lecture Notes in Computer Science, vol 5631. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03761-0_45

DOI: https://doi.org/10.1007/978-3-642-03761-0_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03760-3
Online ISBN: 978-3-642-03761-0
eBook Packages: Computer Science, Computer Science (R0)