Abstract
XML has become a popular format for representing semi-structured data and a standard for data exchange over the web, owing to its applicability across a wide range of applications. XML documents therefore form an important data mining domain. In this paper, we propose a new XML document clustering technique based on a sequential pattern mining algorithm. Our approach first extracts representative structures, in the form of frequent patterns, from schemaless XML documents using a sequential pattern mining algorithm. Then, unlike most previous document clustering methods, we apply a clustering algorithm for transactional data that does not require a measure of pairwise similarity, treating each XML document as a transaction and the frequent structures extracted from the documents as the items of that transaction. We evaluated our clustering algorithm by comparing it with previous methods. The experimental results show that the proposed method is effective in terms of performance and produces clusters with higher cohesion.
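The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the document encoding, the mining algorithm, or the clustering criterion, so everything concrete here is an assumption made for illustration: documents are flattened to pre-order tag sequences, frequent subsequences are found with a simple PrefixSpan-style pattern growth, and transactions are grouped in a single pass with a CLOPE-style gain that rewards clusters whose items overlap, without ever computing a pairwise similarity. All function names and the parameters min_support, max_len and repulsion are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def tag_sequence(xml_text):
    """Flatten a document into its element tags in document (pre-order) order."""
    return [el.tag for el in ET.fromstring(xml_text).iter()]


def frequent_sequences(sequences, min_support, max_len=3):
    """PrefixSpan-style growth (assumed miner): {pattern: support} for
    subsequences, gaps allowed, found in at least min_support sequences."""
    results = {}

    def grow(prefix, projected):
        # projected holds (sequence index, first position after the prefix match)
        support = defaultdict(set)
        for sid, start in projected:
            for tag in sequences[sid][start:]:
                support[tag].add(sid)
        for tag, sids in support.items():
            if len(sids) < min_support:
                continue
            pattern = prefix + (tag,)
            results[pattern] = len(sids)
            if len(pattern) < max_len:
                next_projected = []
                for sid, start in projected:
                    if tag in sequences[sid][start:]:
                        pos = sequences[sid].index(tag, start)
                        next_projected.append((sid, pos + 1))
                grow(pattern, next_projected)

    grow((), [(i, 0) for i in range(len(sequences))])
    return results


def is_subsequence(pattern, seq):
    it = iter(seq)
    return all(tag in it for tag in pattern)


def to_transactions(sequences, patterns):
    """One transaction per document: the set of frequent patterns it contains."""
    return [{p for p in patterns if is_subsequence(p, s)} for s in sequences]


def add_gain(cluster, items, r):
    """CLOPE-style change in size * n / width**r when items join the cluster."""
    occ, n = cluster
    size, width = sum(occ.values()), len(occ)
    old = size * n / width ** r if n else 0.0
    new_width = max(1, len(set(occ) | items))
    return (size + len(items)) * (n + 1) / new_width ** r - old


def cluster_transactions(transactions, repulsion=2.0):
    """Single assignment pass (no refinement): each transaction joins the
    existing or new cluster with the largest gain."""
    clusters, labels = [], []
    for items in transactions:
        candidates = clusters + [(Counter(), 0)]
        gains = [add_gain(c, items, repulsion) for c in candidates]
        best = max(range(len(gains)), key=gains.__getitem__)
        if best == len(clusters):
            clusters.append((Counter(), 0))
        occ, n = clusters[best]
        occ.update(items)
        clusters[best] = (occ, n + 1)
        labels.append(best)
    return labels


# Toy input, invented for this example only.
docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><year/></book>",
    "<cd><artist/><track/></cd>",
    "<cd><artist/><track/><track/></cd>",
]
sequences = [tag_sequence(d) for d in docs]
patterns = frequent_sequences(sequences, min_support=2)
labels = cluster_transactions(to_transactions(sequences, patterns))
print(labels)
```

The design point this sketch tries to show is that cluster membership is decided from per-cluster item statistics alone, so the cost of placing a document does not depend on comparing it with every other document. A fuller treatment would iterate the assignment pass until no document changes cluster.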
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hwang, J.H., Ryu, K.H. (2005). A New Sequential Mining Approach to XML Document Clustering*. In: Zhang, Y., Tanaka, K., Yu, J.X., Wang, S., Li, M. (eds) Web Technologies Research and Development - APWeb 2005. APWeb 2005. Lecture Notes in Computer Science, vol 3399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31849-1_27
DOI: https://doi.org/10.1007/978-3-540-31849-1_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25207-8
Online ISBN: 978-3-540-31849-1
eBook Packages: Computer Science, Computer Science (R0)