A Wavelet Transform Based Structural Similarity Model for Semi-structured Texts

Su, Jie; Bao, Junpeng

doi:10.1007/978-3-642-27708-5_22

Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 135)

Abstract

Semi-structured texts, including XML and HTML texts, are a basic information format on the Internet and the World Wide Web. A semi-structured text has two aspects: its text content values and its tree-organized structure. Usually, the same text content organized in different structures implies different objects, so the structural similarity of semi-structured texts is essential for searching, indexing, retrieving, querying, or comparing information in web pages. We present a Wavelet Transform Based Structural Similarity Model (WTBSSM) that measures the structural similarity of semi-structured texts quickly and compresses the structural information into a short vector, with the aim of developing an efficient index system for semi-structured texts. This paper introduces the Binary Encoding Method to convert a semi-structured text into a {-1, 1} sequence. The resulting text structure signal is then decomposed by the Discrete Wavelet Transform (DWT) to obtain the approximation coefficients, which are only half the length of the original signal. Finally, the structural similarity is measured by the Euclidean distance between approximation coefficients. The experimental results show that WTBSSM keeps almost the same distance distribution as the direct distance between the original signals while using only a half or a quarter of the information. Comparisons with a method based on shortened DWT coefficients suggest that WTBSSM performs better.
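
The pipeline described in the abstract (encode, decompose, compare) can be illustrated with a minimal Python sketch. The abstract does not spell out the details of the Binary Encoding Method or which wavelet is used, so everything concrete below is an assumption made for illustration: opening tags are mapped to +1 and closing tags to -1, the Haar wavelet is used for the DWT, and shorter signals are zero-padded. The function names `encode_structure`, `haar_approximation`, and `structural_distance` are hypothetical and do not come from the paper.

```python
import math
import re

# Matches opening, closing, and self-closing tags; skips comments and <?...?>.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w:.-]*)[^<>]*?(/?)>")


def encode_structure(text):
    """Convert markup into a {-1, 1} sequence.

    ASSUMPTION: opening tag -> +1, closing tag -> -1, self-closing tag -> +1, -1.
    The paper's actual Binary Encoding Method may differ.
    """
    signal = []
    for closing, _name, self_closing in TAG_RE.findall(text):
        if closing:
            signal.append(-1.0)
        elif self_closing:
            signal.extend([1.0, -1.0])
        else:
            signal.append(1.0)
    return signal


def haar_approximation(signal, levels=1):
    """Return Haar DWT approximation coefficients after `levels` decompositions.

    Each level halves the length; odd-length inputs are zero-padded (assumption).
    """
    coeffs = list(signal)
    for _ in range(levels):
        if len(coeffs) % 2:
            coeffs.append(0.0)
        coeffs = [
            (coeffs[i] + coeffs[i + 1]) / math.sqrt(2)
            for i in range(0, len(coeffs), 2)
        ]
    return coeffs


def structural_distance(text_a, text_b, levels=1):
    """Euclidean distance between the approximation coefficients of two texts."""
    a = encode_structure(text_a)
    b = encode_structure(text_b)
    # Zero-pad the shorter signal so both have equal length (assumption).
    n = max(len(a), len(b))
    a += [0.0] * (n - len(a))
    b += [0.0] * (n - len(b))
    return math.dist(haar_approximation(a, levels), haar_approximation(b, levels))


if __name__ == "__main__":
    nested = "<a><b></b></a>"
    flat = "<a></a><b></b>"
    print(encode_structure(nested))           # structure signal of the nested text
    print(structural_distance(nested, flat))  # distance on half-length vectors
```

With `levels=1` the compared vectors are half the length of the structure signals, and with `levels=2` a quarter, which corresponds to the "half or a quarter of information" setting mentioned in the abstract. Because the Haar transform used here is orthonormal, the distance computed on approximation coefficients never exceeds the distance between the full signals.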

Author information

Authors and Affiliations

Department of Computer Science & Technology, Xi’an Jiaotong University, Xi’an, 710049, P.R. China
Jie Su & Junpeng Bao

Editor information

Editors and Affiliations

School of Electronic Engineering, Wuhan Institute of Technology, Lvting yajing 10-3-102, Wuhan, 430079, People's Republic of China
Honghua Tan

Cite this chapter

Su, J., Bao, J. (2012). A Wavelet Transform Based Structural Similarity Model for Semi-structured Texts. In: Tan, H. (eds) Knowledge Discovery and Data Mining. Advances in Intelligent and Soft Computing, vol 135. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27708-5_22

DOI: https://doi.org/10.1007/978-3-642-27708-5_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27707-8
Online ISBN: 978-3-642-27708-5
eBook Packages: Engineering, Engineering (R0)
