A Wavelet Transform Based Structural Similarity Model for Semi-structured Texts

  • Chapter
Knowledge Discovery and Data Mining

Part of the book series: Advances in Intelligent and Soft Computing ((AINSC,volume 135))

Abstract

Semi-structured texts, including XML and HTML documents, are a basic information format on the Internet and the World Wide Web. A semi-structured text has two aspects: its text content values and its tree-organized structure. The same text content arranged in different structures usually denotes different objects, so structural similarity is essential for searching, indexing, retrieving, querying, and comparing information in web pages. We present a Wavelet Transform Based Structural Similarity Model (WTBSSM) that measures the structural similarity of semi-structured texts quickly and compresses the structural information into a short vector, with the aim of building an efficient index system for semi-structured texts. The paper introduces the Binary Encoding Method, which converts a semi-structured text into a {-1, 1} sequence. The resulting structure signal is then decomposed with the Discrete Wavelet Transform (DWT) to obtain the approximation coefficients, which are only half the length of the original signal. Finally, structural similarity is measured by the Euclidean distance between approximation coefficients. The experimental results show that, using only a half or a quarter of the information, WTBSSM keeps almost the same distance distribution as the direct distance between the original signals. A comparison with a method based on shortened DWT coefficients suggests that WTBSSM performs better.
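The three steps named in the abstract (encode, decompose, compare) can be illustrated with a minimal sketch. The abstract does not spell out the Binary Encoding Method or the wavelet family, so the details below are assumptions made for illustration only: each opening tag is mapped to +1 and each closing tag to -1, the Haar wavelet is used, and shorter signals are zero-padded. The function names `encode_structure`, `haar_approximation`, and `structural_distance` are hypothetical and do not come from the chapter.

```python
import math
import re

# Matches opening, closing, and self-closing tags; tag names are ignored.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w:.-]*)[^>]*?(/?)>")


def encode_structure(text):
    """Assumed encoding: opening tag -> +1, closing tag -> -1.

    A self-closing tag is treated as an opening tag immediately
    followed by its closing tag.
    """
    signal = []
    for match in TAG_RE.finditer(text):
        closing, _name, self_closing = match.groups()
        if closing:
            signal.append(-1)
        elif self_closing:
            signal.extend([1, -1])
        else:
            signal.append(1)
    return signal


def haar_approximation(signal, levels=1):
    """Return Haar approximation coefficients after `levels` decompositions.

    Each level halves the length (an odd-length input is zero-padded first).
    """
    coeffs = [float(x) for x in signal]
    for _ in range(levels):
        if len(coeffs) % 2:
            coeffs.append(0.0)
        coeffs = [
            (coeffs[i] + coeffs[i + 1]) / math.sqrt(2)
            for i in range(0, len(coeffs), 2)
        ]
    return coeffs


def structural_distance(sig_a, sig_b, levels=1):
    """Euclidean distance between the approximation coefficients.

    The shorter signal is zero-padded (an assumption of this sketch).
    """
    n = max(len(sig_a), len(sig_b))
    padded_a = list(sig_a) + [0] * (n - len(sig_a))
    padded_b = list(sig_b) + [0] * (n - len(sig_b))
    ca = haar_approximation(padded_a, levels)
    cb = haar_approximation(padded_b, levels)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


if __name__ == "__main__":
    doc_a = "<a><b></b><c></c></a>"
    doc_b = "<a><b></b></a>"
    sig_a = encode_structure(doc_a)
    sig_b = encode_structure(doc_b)
    print(sig_a, sig_b)
    print(structural_distance(sig_a, sig_b, levels=1))
```

With `levels=1` the compared vectors are half the length of the encoded signals, and with `levels=2` roughly a quarter, which corresponds to the "half or a quarter of information" mentioned in the abstract. Because the orthonormal Haar approximation discards the detail coefficients, the distance computed here never exceeds the direct Euclidean distance between the padded signals, and two different structures can map to the same approximation.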

Copyright information

© 2012 Springer-Verlag GmbH Berlin Heidelberg

About this chapter

Cite this chapter

Su, J., Bao, J. (2012). A Wavelet Transform Based Structural Similarity Model for Semi-structured Texts. In: Tan, H. (eds) Knowledge Discovery and Data Mining. Advances in Intelligent and Soft Computing, vol 135. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27708-5_22

  • DOI: https://doi.org/10.1007/978-3-642-27708-5_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-27707-8

  • Online ISBN: 978-3-642-27708-5

  • eBook Packages: Engineering, Engineering (R0)
