A New Sequential Mining Approach to XML Document Clustering

Hwang, Jeong Hee; Ryu, Keun Ho

doi:10.1007/978-3-540-31849-1_27

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3399)

Included in the following conference series:

Asia-Pacific Web Conference

Abstract

XML has recently become a popular format for representing semi-structured data and a standard for data exchange over the web, because it applies to a wide range of applications. XML documents therefore form an important data mining domain. In this paper, we propose a new XML document clustering technique based on a sequential pattern mining algorithm. Our approach first extracts representative structures, in the form of frequent patterns, from schemaless XML documents by using a sequential pattern mining algorithm. Then, unlike most previous document clustering methods, we apply a clustering algorithm for transactional data that needs no measure of pairwise similarity, treating each XML document as a transaction and the frequent structures extracted from the documents as the items of that transaction. We evaluated our clustering algorithm by comparing it with previous methods. The experimental results show that the proposed method is effective in performance and produces clusters with higher cluster cohesion.
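The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it is hypothetical. It rests on two loudly stated assumptions: (1) the "frequent structures" are approximated by contiguous tag subsequences of root-to-leaf paths that occur in at least `min_support` documents, which is a simplification of general sequential pattern mining; (2) the transactional clustering step uses a greedy, single-pass criterion in the spirit of CLOPE (cluster size times item occurrences, divided by the number of distinct items raised to a repulsion parameter `r`), since the abstract does not say which algorithm the paper uses. No pairwise document similarity is computed at any point.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the set of root-to-leaf tag paths of one XML document."""
    paths = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        children = list(node)
        if not children:
            paths.add(path)
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return paths


def frequent_structures(docs_paths, min_support):
    """Turn each document into a transaction of frequent structures.

    Assumption: a "structure" is a contiguous tag subsequence (length >= 2)
    of a root-to-leaf path; it is frequent if it occurs in at least
    `min_support` documents.
    """
    per_doc, counts = [], Counter()
    for paths in docs_paths:
        subs = {p[i:j] for p in paths
                for i in range(len(p))
                for j in range(i + 2, len(p) + 1)}
        per_doc.append(subs)
        counts.update(subs)  # each structure counted once per document
    frequent = {s for s, c in counts.items() if c >= min_support}
    return [subs & frequent for subs in per_doc]


def gain_if_added(cluster, items, r):
    """Change in size * occurrences / width**r if `items` joins `cluster`."""
    width_old = len(cluster["occ"])
    width_new = len(set(cluster["occ"]) | items)
    if width_new == 0:  # empty transaction into an empty cluster
        return 0.0
    old = cluster["size"] * cluster["n"] / width_old ** r if width_old else 0.0
    new = (cluster["size"] + len(items)) * (cluster["n"] + 1) / width_new ** r
    return new - old


def cluster_transactions(transactions, r=2.0):
    """Greedy single pass: put each transaction where the gain is largest."""
    clusters, labels = [], []
    for items in transactions:
        # Candidates are all existing clusters plus one fresh, empty cluster.
        candidates = clusters + [{"occ": Counter(), "n": 0, "size": 0}]
        gains = [gain_if_added(c, items, r) for c in candidates]
        best = max(range(len(candidates)), key=lambda i: gains[i])
        if best == len(clusters):
            clusters.append(candidates[-1])
        chosen = clusters[best]
        chosen["occ"].update(items)
        chosen["n"] += 1
        chosen["size"] += len(items)
        labels.append(best)
    return labels


# Toy input: two book-like and two person-like documents (made-up data).
docs = [
    "<book><title/><author><name/></author></book>",
    "<book><title/><author><name/></author><year/></book>",
    "<person><name/><address><city/></address></person>",
    "<person><name/><address><city/><zip/></address></person>",
]
transactions = frequent_structures([tag_paths(d) for d in docs], min_support=2)
labels = cluster_transactions(transactions)
print(labels)  # [0, 0, 1, 1]
```

In this toy run the structures that appear in only one document (for example the `year` and `zip` branches) fall below the support threshold and are dropped, so documents with the same frequent structures end up in the same cluster. The repulsion parameter `r` controls how strongly a cluster resists absorbing transactions that introduce new items; the value 2.0 is an arbitrary choice for the example.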

Author information

Authors and Affiliations

Database Laboratory, Chungbuk National University, Korea
Jeong Hee Hwang & Keun Ho Ryu

Authors

Jeong Hee Hwang
Keun Ho Ryu

Editor information

Editors and Affiliations

Victoria University, Australia
Yanchun Zhang
University of Kyoto, Japan
Katsumi Tanaka
Chinese University of Hong Kong, Hong Kong, China
Jeffrey Xu Yu
Key Laboratory of Data Engineering and Knowledge Engineering, Renmin University of China, MOE, 100872, Beijing, P.R. China
Shan Wang
Department of Computer Science and Engineering, Shanghai Jiao Tong University, 80 Dongchuan Road, 200240, Shanghai, China
Minglu Li

About this paper

Cite this paper

Hwang, J.H., Ryu, K.H. (2005). A New Sequential Mining Approach to XML Document Clustering. In: Zhang, Y., Tanaka, K., Yu, J.X., Wang, S., Li, M. (eds) Web Technologies Research and Development - APWeb 2005. APWeb 2005. Lecture Notes in Computer Science, vol 3399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31849-1_27

DOI: https://doi.org/10.1007/978-3-540-31849-1_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25207-8
Online ISBN: 978-3-540-31849-1
eBook Packages: Computer Science, Computer Science (R0)
