XML Document Clustering Using Structure-Preserving Flat Representation of XML Content and Structure

Hadzic, Fedja; Hecker, Michael; Tagarelli, Andrea

doi:10.1007/978-3-642-25856-5_30

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7121)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

With the increasing use of XML in many domains, XML document clustering has been a central research topic in semistructured data management and mining. Due to the semistructured nature of XML data, the clustering problem becomes particularly challenging, mainly because structural similarity measures specifically designed to deal with tree/graph-shaped data can be quite expensive. Specialized clustering techniques are being developed to account for this difficulty; however, most of them still assume that XML documents are represented using a semistructured data model. In this paper, we take a simpler approach whereby XML structural aspects are extracted from the documents to generate a flat data format to which well-established clustering methods can be directly applied. Hence, the expensive process of tree/graph data mining is avoided, while the structural properties are still preserved. Our experimental evaluation, which uses a number of real-world datasets and compares against existing structural clustering methods, demonstrates the significance of our approach.
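The general idea of turning tree-shaped documents into a flat table can be illustrated with a minimal sketch. This is not the structure-preserving representation proposed in the paper, whose details are not given in the abstract; it assumes a much simpler encoding in which each column is a root-to-node tag path and each cell counts how often that path occurs in a document. The function names `path_counts`, `flatten`, and `cosine`, and the three sample documents, are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def path_counts(xml_text):
    """Count every root-to-node tag path in one XML document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        counts[path] += 1
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return counts


def flatten(docs):
    """Return (columns, rows): one fixed-width row of path counts per document."""
    per_doc = [path_counts(d) for d in docs]
    columns = sorted(set().union(*per_doc))
    rows = [[c[col] for col in columns] for c in per_doc]
    return columns, rows


def cosine(a, b):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Assumed toy collection: two structurally similar documents and one outlier.
docs = [
    "<book><title/><author/><author/></book>",
    "<book><title/><author/></book>",
    "<log><entry><time/></entry></log>",
]
columns, rows = flatten(docs)
```

Once the documents are rows of a fixed-width table, any vector-based clustering method can consume them without tree or graph matching. In this toy collection, the two `book` documents have a high cosine similarity, while the `log` document shares no path with them.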

Author information

Authors and Affiliations

Digital Ecosystems and Business Intelligence Institute, Curtin University, Australia
Fedja Hadzic & Michael Hecker
Dept. of Electronics, Computer and Systems Sciences, University of Calabria, Italy
Andrea Tagarelli

Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Jie Tang & Jianyong Wang
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, SAR, China
Irwin King
Faculty of Engineering and Information Technology, University of Technology, 2007, Sydney, NSW, Australia
Ling Chen

About this paper

Cite this paper

Hadzic, F., Hecker, M., Tagarelli, A. (2011). XML Document Clustering Using Structure-Preserving Flat Representation of XML Content and Structure. In: Tang, J., King, I., Chen, L., Wang, J. (eds) Advanced Data Mining and Applications. ADMA 2011. Lecture Notes in Computer Science, vol 7121. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25856-5_30

DOI: https://doi.org/10.1007/978-3-642-25856-5_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25855-8
Online ISBN: 978-3-642-25856-5
eBook Packages: Computer Science, Computer Science (R0)