Abstract
The Web is increasingly used as a source of information for learning, so information on the Web should be organized so that stakeholders can use it efficiently. Most of this information is available in the form of XML documents. Grouping (clustering) XML documents improves the effectiveness of information retrieval, and computing XML document similarity is a crucial task in clustering. In this paper we propose a novel method that computes the semantic structural similarity of XML documents by merging similar paths. The XML documents to be compared are represented by all of their root-to-leaf paths, and paths are compared using a newly developed path matching algorithm. Similarity scores are assigned for exact, partial, and contained-in matches. For a partial match, merge operations are used: the insertion of a new child (or descendants), a new parent (or ancestors), or both, and the creation of reference edges. The more merge operations required, the more dissimilar the paths. Based on a similarity threshold, the paths of XML documents are merged and placed in the same cluster, which avoids pairwise similarity computations. The matching process also ensures the semantic structural similarity of paths; that is, two XML paths may order their hierarchy differently yet be semantically similar. The proposed method shows improved clustering accuracy.
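The path representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names (`root_to_leaf_paths`, `is_subsequence`, `match_type`) are hypothetical, "contained in" is assumed here to mean that one path is an ordered subsequence of the other, and "partial" is assumed to mean that two paths merely share some element names. Similarity scores, merge operations, and reference edges are not modeled.

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(xml_text):
    """Return the set of root-to-leaf tag paths of an XML document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        current = prefix + (node.tag,)
        children = list(node)
        if not children:          # a leaf closes one complete path
            paths.add(current)
        for child in children:
            walk(child, current)

    walk(root, ())
    return paths


def is_subsequence(short, long):
    """True if the tags of `short` occur in `long` in the same order."""
    remaining = iter(long)
    return all(tag in remaining for tag in short)


def match_type(p, q):
    """Classify two paths (assumed definitions, see the note above)."""
    if p == q:
        return "exact"
    if is_subsequence(p, q) or is_subsequence(q, p):
        return "contained"
    if set(p) & set(q):           # shared tags, but order or content differs
        return "partial"
    return "none"


doc = "<book><title/><author><name/></author></book>"
paths = root_to_leaf_paths(doc)
kind = match_type(("book", "author", "name"), ("author", "book", "name"))
```

In this sketch the two paths in the last line contain the same element names in a different hierarchy order, so they fall into the "partial" category, which is the case where the paper applies its merge operations.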
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Periakaruppan, R., Nadarajan, R. (2015). Clustering XML Documents for Web Based Learning. In: Chiu, D., et al. Advances in Web-Based Learning – ICWL 2013 Workshops. ICWL 2013. Lecture Notes in Computer Science, vol 8390. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46315-4_24
DOI: https://doi.org/10.1007/978-3-662-46315-4_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-46314-7
Online ISBN: 978-3-662-46315-4
eBook Packages: Computer Science, Computer Science (R0)