Abstract
In this paper, we present a framework for classifying XML documents based on their structure/content similarity. First, we propose an algorithm for computing the edit distance between an ordered labeled tree and a regular hedge grammar. The new edit distance gives a more precise measure of structural similarity than existing distance metrics in the literature. Second, we study schema extraction from XML documents and give an effective solution based on the minimum description length (MDL) principle. Our schema extraction method allows a trade-off between schema simplicity and precision based on the user’s specification. Third, we discuss the classification of XML documents, including their representation based on both structure and content. The efficacy and efficiency of our methodology have been tested using the data sets from the XML Mining Challenge.
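To make the notion of structural edit distance concrete, the following is a minimal sketch of a simpler, related measure: a top-down (Selkow-style) edit distance between two ordered labeled trees with unit costs. It is not the tree-to-grammar algorithm of the paper, which compares a document tree against a regular hedge grammar. The names `Node`, `size`, and `tree_distance` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """An ordered labeled tree node, e.g. one XML element (hypothetical type)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    """Top-down edit distance with unit costs (an assumed cost model).

    Allowed operations: relabel a node, or insert/delete a whole subtree,
    where a subtree costs one unit per node it contains.
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)

    # d[i][j]: cost of turning the first i children of a into the first j of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])  # delete subtree
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])  # insert subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),
                d[i][j - 1] + size(b.children[j - 1]),
                d[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


doc1 = Node("article", [Node("title"), Node("section", [Node("para")])])
doc2 = Node("article", [Node("title"),
                        Node("section", [Node("para"), Node("para")])])
doc3 = Node("book", [Node("title")])

print(tree_distance(doc1, doc2))  # one extra "para" element
print(tree_distance(doc1, doc3))  # root relabel plus a deleted "section" subtree
```

In this sketch, documents with similar element structure get a small distance. Measuring the distance from a document to a grammar, as the abstract describes, instead asks how far the document is from the nearest tree that the schema accepts.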
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xing, G., Guo, J., Xia, Z. (2007). Classifying XML Documents Based on Structure/Content Similarity. In: Fuhr, N., Lalmas, M., Trotman, A. (eds) Comparative Evaluation of XML Information Retrieval Systems. INEX 2006. Lecture Notes in Computer Science, vol 4518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73888-6_42
DOI: https://doi.org/10.1007/978-3-540-73888-6_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73887-9
Online ISBN: 978-3-540-73888-6
eBook Packages: Computer Science, Computer Science (R0)