Clustering Large Scale of XML Documents

Wang, Tong; Liu, Da-Xin; Lin, Xuan-Zuo; Sun, Wei; Ahmad, Gufran

doi:10.1007/11745693_44

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3947)

Included in the following conference series:

International Conference on Grid and Pervasive Computing

Abstract

Clustering can facilitate information retrieval. This paper addresses the problem of clustering a large number of XML documents. We propose the ICX algorithm, which uses a novel similarity metric based on quantitative paths. In our approach, each document is first represented by path sequences extracted from its XML tree. These sequences are then mapped to quantitative paths, from which the distance between documents can be computed with low complexity. Finally, the desired clusters are constructed using the ICX method with literal local search. Experimental results on XML documents obtained from DBLP show the effectiveness and good performance of the proposed techniques.
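The first step described in the abstract, representing a document by paths extracted from its XML tree, can be illustrated with a minimal sketch. The abstract does not define the quantitative-path mapping or the ICX procedure, so neither is reproduced here. The helper names `extract_paths` and `jaccard_distance` are hypothetical, the sample documents are invented for illustration, and the Jaccard distance over path sets is only a stand-in for the paper's metric.

```python
import xml.etree.ElementTree as ET


def extract_paths(xml_text):
    """Return the set of root-to-leaf tag paths of an XML document.

    Hypothetical helper: the paper speaks of path sequences; a set of
    path strings is a simplification assumed for this sketch.
    """
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        current = prefix + "/" + node.tag
        children = list(node)
        if not children:
            # A leaf element closes one root-to-leaf path.
            paths.add(current)
        for child in children:
            walk(child, current)

    walk(root, "")
    return paths


def jaccard_distance(paths_a, paths_b):
    """Stand-in distance (NOT the paper's quantitative-path metric)."""
    union = paths_a | paths_b
    if not union:
        return 0.0
    return 1.0 - len(paths_a & paths_b) / len(union)


# Invented sample documents, loosely shaped like bibliographic records.
doc_a = "<article><author/><title/><year/></article>"
doc_b = "<article><author/><title/><journal/></article>"
doc_c = "<book><editor/><publisher/></book>"

paths_a = extract_paths(doc_a)
paths_b = extract_paths(doc_b)
paths_c = extract_paths(doc_c)

print(sorted(paths_a))
print(jaccard_distance(paths_a, paths_b))
print(jaccard_distance(paths_a, paths_c))
```

In this sketch the two article-like documents share two of four distinct paths, while the book-like document shares none with them, so a clustering step built on such a distance would tend to group the first two together.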

Author information

Authors and Affiliations

Department of Computer Science and Technology, Harbin Engineering University, China
Tong Wang, Da-Xin Liu, Wei Sun & Gufran Ahmad
Northeast Agriculture University, Harbin, China
Xuan-Zuo Lin

Editor information

Editors and Affiliations

Department of Computer Science, National Tsing Hua University, 30013, Hsinchu, Taiwan
Yeh-Ching Chung
IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA
José E. Moreira

About this paper

Cite this paper

Wang, T., Liu, DX., Lin, XZ., Sun, W., Ahmad, G. (2006). Clustering Large Scale of XML Documents. In: Chung, YC., Moreira, J.E. (eds) Advances in Grid and Pervasive Computing. GPC 2006. Lecture Notes in Computer Science, vol 3947. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11745693_44

DOI: https://doi.org/10.1007/11745693_44
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33809-3
Online ISBN: 978-3-540-33810-9
eBook Packages: Computer Science, Computer Science (R0)
