Reducing Redundant Web Crawling Using URL Signatures

Soon, Lay-Ki; Lee, Sang Ho

doi:10.2991/978-94-91216-32-9_6

Part of the book series: Atlantis Ambient and Pervasive Intelligence (ATLANTISAPI, volume 2)

Abstract

The existing architecture of the WWW uses URLs to identify web pages. Web crawlers rely on URL normalization to identify equivalent URLs, that is, URLs that link to the same web page. In standard URL normalization, URLs are transformed syntactically into a canonical form, and URLs whose canonical forms are identical are treated as equivalent and eliminated to avoid redundant crawling. Nevertheless, it is common to encounter equivalent URLs that are syntactically different. The redundant web pages linked by such URLs are downloaded and processed unnecessarily. In this chapter, we propose to reduce the processing of redundant web pages by using URL signatures. A URL signature is constructed by hashing the body text of a web page with the Message-Digest algorithm 5 (MD5). Web pages that share identical signatures are considered redundant and are not processed further by the web crawler. The experimental results show that the proposed method reduces the processing of redundant web pages by 11.43%, compared with only 3.02% for the standard URL normalization mechanism, at the cost of 0.54% false positives.
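
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `BodyTextExtractor`, `body_text`, `url_signature`, and `SignatureFilter` are hypothetical, and the choices to skip `<script>` and `<style>` content and to collapse whitespace before hashing are assumptions made for illustration, since the abstract only states that the body text is hashed with MD5.

```python
import hashlib
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text found inside <body>, skipping <script> and <style> (assumption)."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.parts.append(data)


def body_text(html):
    """Return the body text of an HTML page with whitespace collapsed (assumption)."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def url_signature(html):
    """MD5 digest of the page's body text, as a 32-character hex string."""
    return hashlib.md5(body_text(html).encode("utf-8")).hexdigest()


class SignatureFilter:
    """Remembers signatures of pages seen so far (hypothetical helper)."""

    def __init__(self):
        self.seen = {}  # signature -> first URL that produced it

    def is_redundant(self, url, html):
        """True if a page with the same body-text signature was already seen."""
        sig = url_signature(html)
        if sig in self.seen:
            return True
        self.seen[sig] = url
        return False


if __name__ == "__main__":
    crawl_filter = SignatureFilter()
    page = "<html><body><p>Hello world</p></body></html>"
    # Two syntactically different URLs that serve the same body text.
    print(crawl_filter.is_redundant("http://example.com/", page))
    print(crawl_filter.is_redundant("http://example.com/index.html", page))
```

In this sketch, a crawler would call `is_redundant` after downloading a page and skip further processing when it returns `True`. Two genuinely different pages with identical body text would also be flagged, which is the kind of false positive the abstract reports.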

Author information

Authors and Affiliations

School of Computing, Soongsil University, Seoul, Korea
Lay-Ki Soon & Sang Ho Lee

Corresponding author

Correspondence to Lay-Ki Soon.

About this chapter

Cite this chapter

Soon, LK., Lee, S.H. (2010). Reducing Redundant Web Crawling Using URL Signatures. In: Web-Based Information Technologies and Distributed Systems. Atlantis Ambient and Pervasive Intelligence, vol 2. Atlantis Press. https://doi.org/10.2991/978-94-91216-32-9_6

DOI: https://doi.org/10.2991/978-94-91216-32-9_6
Publisher Name: Atlantis Press
Online ISBN: 978-94-91216-32-9
eBook Packages: Computer Science (R0)