Block-Based Similarity Search on the Web Using Manifold-Ranking

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 4255))

Abstract

Similarity search on the web aims to find web pages similar to a query page and return them as a ranked list. The popular approach is to calculate the pairwise similarity between web pages using the Cosine measure and then rank the pages by their similarity to the query page. In this paper, we propose a novel similarity search approach that re-ranks the initially retrieved web pages by manifold-ranking of page blocks. First, web pages are segmented into semantic blocks with the VIPS algorithm. Second, the blocks are assigned ranking scores by the manifold-ranking algorithm. Finally, web pages are re-ranked according to overall retrieval scores obtained by fusing the ranking scores of their blocks. The proposed approach evaluates web page similarity at the finer granularity of the page block rather than at the traditionally coarse granularity of the whole web page. Moreover, it exploits the intrinsic global manifold structure of the blocks to rank them more appropriately. Experimental results on the ODP data demonstrate that the proposed approach significantly outperforms the popular Cosine measure. The semantic block is also shown to be a better unit than the whole web page in the manifold-ranking process.
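
The three steps above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the blocks have already been segmented (VIPS is not reproduced here), uses whitespace-tokenized term-frequency vectors, applies one common form of the manifold-ranking iteration, f ← αSf + (1 − α)y, and fuses block scores by taking each page's best block. The function names, the value of `alpha`, the fusion rule, and the toy data are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def manifold_rank(vectors, query_idx, alpha=0.5, iters=100):
    """Iterate f <- alpha * S f + (1 - alpha) * y for a fixed number of steps."""
    n = len(vectors)
    # Affinity matrix W with a zero diagonal (no self-reinforcement).
    W = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    # Symmetric normalisation S = D^{-1/2} W D^{-1/2}.
    deg = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] and deg[j] else 0.0
          for j in range(n)] for i in range(n)]
    # y marks the query blocks; scores spread from them along the graph.
    y = [1.0 if i in query_idx else 0.0 for i in range(n)]
    f = y[:]
    for _ in range(iters):
        f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f


def rank_pages(blocks, query_page, alpha=0.5):
    """blocks: list of (page_id, block_text). Returns candidate pages, best first."""
    vectors = [Counter(text.lower().split()) for _, text in blocks]
    query_idx = {i for i, (page, _) in enumerate(blocks) if page == query_page}
    scores = manifold_rank(vectors, query_idx, alpha)
    page_score = {}
    for (page, _), s in zip(blocks, scores):
        if page != query_page:
            # Assumed fusion rule: a page scores as well as its best block.
            page_score[page] = max(page_score.get(page, 0.0), s)
    return sorted(page_score, key=page_score.get, reverse=True)


# Toy data: page "a" shares a topical block with the query page "q";
# page "b" shares only boilerplate-like terms with page "a".
blocks = [
    ("q", "manifold ranking of web page blocks"),
    ("a", "ranking web page blocks on a manifold"),
    ("a", "copyright contact privacy"),
    ("b", "cheap flights and hotel deals"),
    ("b", "contact privacy terms"),
]
ranking = rank_pages(blocks, "q")
print(ranking)
```

In this toy run, page "a" is ranked above page "b" because only one of its blocks is connected to the query block in the similarity graph; its boilerplate block does not dilute the score, which is the intuition behind scoring at block granularity.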

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wan, X., Yang, J., Xiao, J. (2006). Block-Based Similarity Search on the Web Using Manifold-Ranking. In: Aberer, K., Peng, Z., Rundensteiner, E.A., Zhang, Y., Li, X. (eds) Web Information Systems – WISE 2006. WISE 2006. Lecture Notes in Computer Science, vol 4255. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11912873_9

  • DOI: https://doi.org/10.1007/11912873_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-48105-8

  • Online ISBN: 978-3-540-48107-2

  • eBook Packages: Computer Science, Computer Science (R0)
