Abstract
Clustering can facilitate information retrieval. This paper addresses the problem of clustering a large number of XML documents. We propose the ICX algorithm, which uses a novel similarity metric based on quantitative paths. In our approach, each document is first represented by path sequences extracted from its XML tree. These sequences are then mapped into quantitative paths, from which the distance between documents can be computed with low complexity. Finally, the desired clusters are constructed by the ICX method with literal local search. Experimental results on XML documents obtained from DBLP show the effectiveness and good performance of the proposed techniques.
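The first step of the pipeline described above, representing each document by paths extracted from its XML tree, might be sketched as follows. This is a minimal illustration, not the ICX implementation: the abstract does not define the quantitative-path mapping, so a plain Jaccard distance over path sets is swapped in as a placeholder, and the names `extract_paths` and `jaccard_distance` as well as the two sample documents are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def extract_paths(xml_text):
    """Return the set of root-to-leaf tag paths of an XML document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        current = prefix + "/" + node.tag
        children = list(node)
        if not children:
            # A leaf element closes one root-to-leaf path.
            paths.add(current)
        for child in children:
            walk(child, current)

    walk(root, "")
    return paths


def jaccard_distance(paths_a, paths_b):
    """Placeholder distance (NOT the paper's metric): 1 - |A & B| / |A | B|."""
    union = paths_a | paths_b
    if not union:
        return 0.0
    return 1.0 - len(paths_a & paths_b) / len(union)


# Hypothetical DBLP-like records, invented for illustration only.
doc_a = "<article><author/><title/><year/></article>"
doc_b = "<article><author/><title/><journal/></article>"

paths_a = extract_paths(doc_a)
paths_b = extract_paths(doc_b)
distance = jaccard_distance(paths_a, paths_b)
```

Here the two records share two of four distinct paths, so the placeholder distance is 0.5. A pairwise distance of this kind is what a clustering step would consume; the paper's own metric is designed to make that computation cheap for large collections.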
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, T., Liu, DX., Lin, XZ., Sun, W., Ahmad, G. (2006). Clustering Large Scale of XML Documents. In: Chung, YC., Moreira, J.E. (eds) Advances in Grid and Pervasive Computing. GPC 2006. Lecture Notes in Computer Science, vol 3947. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11745693_44
DOI: https://doi.org/10.1007/11745693_44
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33809-3
Online ISBN: 978-3-540-33810-9
eBook Packages: Computer Science, Computer Science (R0)