Abstract
With the increasing use of XML in many domains, XML document clustering has been a central research topic in semistructured data management and mining. Due to the semistructured nature of XML data, the clustering problem becomes particularly challenging, mainly because structural similarity measures specifically designed to deal with tree/graph-shaped data can be quite expensive. Specialized clustering techniques are being developed to account for this difficulty, however most of them still assume that XML documents are represented using a semistructured data model. In this paper we take a simpler approach whereby XML structural aspects are extracted from the documents to generate a flat data format to which well-established clustering methods can be directly applied. Hence, the expensive process of tree/graph data mining is avoided, while the structural properties are still preserved. Our experimental evaluation using a number of real world datasets and comparing with existing structural clustering methods, has demonstrated the significance of our approach.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Aggarwal, C.C., Ta, N., Wang, J., Feng, J., Zaki, M.: XProj: a framework for projected structural clustering of XML documents. In: Proc. ACM KDD Conf., pp. 46–55 (2007)
Bille, P.: A survey on tree edit distance and related problems. Theoretical Computer Science 337(1–3), 217–239 (2005)
Costa, G., Manco, G., Ortale, R., Tagarelli, A.: A Tree-Based Approach to Clustering XML Documents by Structure. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) PKDD 2004. LNCS (LNAI), vol. 3202, pp. 137–148. Springer, Heidelberg (2004)
Dalamagas, T., Cheng, T., Winkel, K.-J., Sellis, T.K.: A methodology for clustering XML documents by structure. Information Systems 31(3) (2006)
Doucet, A., Lehtonen, M.: Unsupervised Classification of Text-Centric XML Document Collections. In: Fuhr, N., Lalmas, M., Trotman, A. (eds.) INEX 2006. LNCS, vol. 4518, pp. 497–509. Springer, Heidelberg (2007)
Hadzic, F.: A Structure Preserving Flat Data Format Representation for Tree-Structured Data. In: Proc. PAKDD Workshops (QIME 2011), Springer, Heidelberg (2011)
Hadzic, F., Tan, H., Dillon, T.S.: Mining of Data with Complex Structures, 1st edn. SCI, vol. 333. Springer, Heidelberg (2011)
Joshi, S., Agrawal, N., Krishnapuram, R., Negi, S.: A bag of paths model for measuring structural similarity in Web documents. In: Proc. ACM KDD Conf., pp. 577–582 (2003)
Karypis, G.: CLUTO - Software for Clustering High-Dimensional Datasets (2002/2007), http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download
Kutty, S., Nayak, R., Li, Y.: HCX: an efficient hybrid clustering approach for XML documents. In: Proc. ACM Symposium on Document Engineering, pp. 94–97 (2009)
Kutty, S., Nayak, R., Li, Y.: XML Documents Clustering using a Tensor Space Model. In: Huang, J.Z., Cao, L., Srivastava, J. (eds.) PAKDD 2011, Part I. LNCS, vol. 6634, pp. 488–499. Springer, Heidelberg (2011)
Lian, W., Cheung, D.W.-L., Mamoulis, N., Yiu, S.-M.: An Efficient and Scalable Algorithm for Clustering XML Documents by Structure. IEEE Transactions on Knowledge Data Engineering 16(1), 82–96 (2004)
Nierman, A., Jagadish, H.V.: Evaluating Structural Similarity in XML Documents. In: Proc. WebDB Workshop, pp. 61–66 (2002)
Punin, J.R., Krishnamoorthy, M.S., Zaki, M.J.: LOGML: Log Markup Language for Web Usage Mining. In: Kohavi, R., Masand, B., Spiliopoulou, M., Srivastava, J. (eds.) WebKDD 2001. LNCS (LNAI), vol. 2356, pp. 88–112. Springer, Heidelberg (2002)
Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In: Proc. KDD Workshop on Text Mining (2000)
Tagarelli, A., Greco, S.: Semantic clustering of XML documents. ACM Transactions on Information Systems 28(1) (2010)
Yao, J.T., Varde, A., Rundensteiner, E., Fahrenholz, S.: XML Based Markup Languages for Specific Domains. In: Web-based Support Systems. Advanced Information and Knowledge Processing, pp. 215–238. Springer, London (2010)
Yoon, J.P., Raghavan, V., Chakilam, V., Kerschberg, L.: BitCube: A Three-Dimensional Bitmap Indexing for XML Documents. Journal of Intelligent Information Systems 17(2–3), 241–254 (2001)
Zaki, M.J.: Efficiently Mining Frequent Trees in a Forest: Algorithms and Applications. IEEE Transactions on Knowledge and Data Engineering 17(8), 1021–1035 (2005)
Zhao, Y., Karypis, G.: Empirical and Theoretical Comparison of Selected Criterion Functions for Document Clustering. Machine Learning 55(3), 311–331 (2004)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hadzic, F., Hecker, M., Tagarelli, A. (2011). XML Document Clustering Using Structure-Preserving Flat Representation of XML Content and Structure. In: Tang, J., King, I., Chen, L., Wang, J. (eds) Advanced Data Mining and Applications. ADMA 2011. Lecture Notes in Computer Science(), vol 7121. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25856-5_30
Download citation
DOI: https://doi.org/10.1007/978-3-642-25856-5_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25855-8
Online ISBN: 978-3-642-25856-5
eBook Packages: Computer ScienceComputer Science (R0)