Abstract
Semi-structured texts, including XML and HTML documents, are a basic information format on the Internet and the World Wide Web. A semi-structured text has two aspects: its text content values and its tree-organized structure. Usually, the same text content under different structures implies different objects, so the structural similarity of semi-structured texts is key to searching, indexing, retrieving, querying, or comparing information in web pages. We present a Wavelet Transform Based Structural Similarity Model (WTBSSM) that measures the structural similarity of semi-structured texts quickly and compresses the structural information into a short vector, with the aim of developing an efficient index system for semi-structured texts. This paper introduces the Binary Encoding Method, which converts a semi-structured text into a {-1, 1} sequence. The resulting text structure signals are then decomposed by the Discrete Wavelet Transform (DWT) to obtain the approximation coefficients, which are only half the length of the original signals. Finally, structural similarity is measured by the Euclidean distance between approximation coefficients. The experimental results show that, using a half or a quarter of the information, WTBSSM keeps almost the same distance distribution as the direct distance between the original signals. Comparison with a method that uses shortened DWT coefficients suggests that WTBSSM performs better.
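The three steps named in the abstract (binary encoding, wavelet decomposition, Euclidean distance) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: opening tags are mapped to +1 and closing tags to -1, the Haar wavelet is used, and shorter signals are zero-padded so that two texts can be compared. The function names `encode_structure`, `haar_approximation`, and `structural_distance` are hypothetical.

```python
import math
import re

# Matches element tags only; comments and processing instructions are skipped.
TAG = re.compile(r"<(/?)([A-Za-z][\w:.-]*)[^>]*?(/?)>")


def encode_structure(text):
    """Convert markup into a {-1, 1} sequence.

    Assumed encoding (not given in the abstract): an opening tag
    becomes +1, a closing tag becomes -1, and a self-closing tag
    contributes the pair +1, -1.
    """
    signal = []
    for match in TAG.finditer(text):
        closing, _name, self_closing = match.groups()
        if closing:
            signal.append(-1)
        elif self_closing:
            signal.extend([1, -1])
        else:
            signal.append(1)
    return signal


def haar_approximation(signal, levels=1):
    """Return Haar approximation coefficients after `levels` decompositions.

    Each level halves the length, so levels=1 keeps a half and
    levels=2 keeps a quarter of the original signal length.
    Odd-length inputs are zero-padded (an assumption of this sketch).
    """
    coeffs = [float(x) for x in signal]
    for _ in range(levels):
        if len(coeffs) % 2:
            coeffs.append(0.0)
        coeffs = [
            (coeffs[i] + coeffs[i + 1]) / math.sqrt(2)
            for i in range(0, len(coeffs), 2)
        ]
    return coeffs


def structural_distance(text_a, text_b, levels=1):
    """Euclidean distance between the approximation coefficients of two texts."""
    a, b = encode_structure(text_a), encode_structure(text_b)
    n = max(len(a), len(b))
    a += [0] * (n - len(a))  # zero-pad the shorter signal (assumption)
    b += [0] * (n - len(b))
    ca = haar_approximation(a, levels)
    cb = haar_approximation(b, levels)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


if __name__ == "__main__":
    nested = "<a><b></b></a>"
    flat = "<a></a><b></b>"
    print(encode_structure(nested))           # structure signal of the nested text
    print(structural_distance(nested, flat))  # distance on half-length coefficients
```

Two texts with the same tags but different nesting produce different signals, which is the property the model relies on. A wavelet library could replace the hand-written Haar step; it is written out here only to keep the sketch dependency-free.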
© 2012 Springer-Verlag GmbH Berlin Heidelberg
About this chapter
Cite this chapter
Su, J., Bao, J. (2012). A Wavelet Transform Based Structural Similarity Model for Semi-structured Texts. In: Tan, H. (eds) Knowledge Discovery and Data Mining. Advances in Intelligent and Soft Computing, vol 135. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27708-5_22
DOI: https://doi.org/10.1007/978-3-642-27708-5_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27707-8
Online ISBN: 978-3-642-27708-5
eBook Packages: Engineering, Engineering (R0)