Part of the book series: Atlantis Ambient and Pervasive Intelligence ((ATLANTISAPI,volume 2))

Abstract

The existing architecture of the World Wide Web uses URLs to identify web pages. Web crawlers rely on URL normalization to identify equivalent URLs, that is, URLs that link to the same web page. In standard URL normalization, URLs are transformed syntactically into a canonical form; URLs that become identical are treated as equivalent, and the duplicates are eliminated to avoid redundant crawling. Nevertheless, it is common to encounter equivalent URLs that are syntactically different even after normalization. Redundant web pages linked by such URLs are downloaded and processed unnecessarily. In this chapter, we propose to reduce the processing of redundant web pages by using URL signatures, which are constructed from the body text of the web pages. A URL signature is constructed by hashing the body text of a web page with the Message-Digest algorithm 5 (MD5). Web pages that share an identical signature are considered redundant and are not processed further by the web crawler. The experimental results show that the proposed method reduces the processing of redundant web pages by 11.43%, compared with only 3.02% for the standard URL normalization mechanism, at the cost of 0.54% false positives.
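The two mechanisms compared in the abstract can be sketched as follows. This is a minimal illustration, not the chapter's implementation: the function names `normalize_url`, `url_signature`, and `is_redundant` are hypothetical, the normalization covers only a few common syntactic rules, and the body-text extraction (dropping tags, scripts, and styles, then collapsing whitespace) is an assumption, since the abstract does not say how the body text is obtained.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit


class _BodyText(HTMLParser):
    """Collects text inside <body>, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def normalize_url(url):
    """Syntactic normalization (assumed subset of rules): lowercase the
    scheme and host, drop the default port, use "/" for an empty path,
    and remove the fragment."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port is not None and parts.port != default_port:
        host = f"{host}:{parts.port}"
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))


def url_signature(html):
    """MD5 digest of the page's body text, as a hex string.
    Whitespace is collapsed first (an assumption of this sketch)."""
    parser = _BodyText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def is_redundant(html, seen_signatures):
    """Return True if a page with the same signature was already seen;
    otherwise record the signature and return False."""
    signature = url_signature(html)
    if signature in seen_signatures:
        return True
    seen_signatures.add(signature)
    return False
```

In a crawler loop, `normalize_url` would be applied before a URL is queued, and `is_redundant` after a page is downloaded but before it is parsed or indexed. Because the signature depends only on the body text, two distinct pages with identical body text are treated as redundant, which is one way the false positives mentioned in the abstract can arise.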

Correspondence to Lay-Ki Soon.

© 2010 Atlantis Press/World Scientific

About this chapter

Cite this chapter

Soon, LK., Lee, S.H. (2010). Reducing Redundant Web Crawling Using URL Signatures. In: Web-Based Information Technologies and Distributed Systems. Atlantis Ambient and Pervasive Intelligence, vol 2. Atlantis Press. https://doi.org/10.2991/978-94-91216-32-9_6
