Abstract
The existing architecture of the WWW uses URLs to identify web pages. Web crawlers rely on URL normalization to identify equivalent URLs, that is, URLs that link to the same web page. In standard URL normalization, URLs are transformed syntactically into a canonical form; URLs with the same canonical form are considered equivalent, and the duplicates are eliminated to avoid redundant crawling. Nevertheless, it is common to encounter equivalent URLs that are syntactically different. Redundant web pages linked by such syntactically different yet equivalent URLs are downloaded and processed unnecessarily. In this chapter, we propose to reduce the processing of redundant web pages by using URL signatures, which are constructed from the body texts of the web pages. A URL signature is constructed by hashing the body text of a web page using Message-Digest Algorithm 5 (MD5). Web pages that share identical signatures are considered redundant and are not processed further by web crawlers. The experimental results show that our proposed method reduces the processing of redundant web pages by 11.43%, compared with only 3.02% for the standard URL normalization mechanism, at the cost of 0.54% false positives.
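The signature idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (`body_text`, `url_signature`, `filter_redundant`), the use of the standard-library `html.parser` to extract the body text, the skipping of `script` and `style` content, and the whitespace normalization are all assumptions made for illustration. The abstract only states that the body text is hashed with MD5 and that pages with identical signatures are treated as redundant.

```python
import hashlib
from html.parser import HTMLParser


class _BodyTextExtractor(HTMLParser):
    """Collects text found inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._in_body = False
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_body and self._skip_depth == 0:
            self.chunks.append(data)


def body_text(html):
    """Return the body text with runs of whitespace collapsed (an assumption)."""
    parser = _BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def url_signature(html):
    """MD5 hex digest of the page's body text."""
    return hashlib.md5(body_text(html).encode("utf-8")).hexdigest()


def filter_redundant(pages):
    """Keep the first URL seen for each signature.

    `pages` is an iterable of (url, html) pairs that were already downloaded.
    """
    seen = set()
    unique = []
    for url, html in pages:
        signature = url_signature(html)
        if signature in seen:
            continue  # same body text as an earlier page: treat as redundant
        seen.add(signature)
        unique.append(url)
    return unique
```

In this sketch, two pages fetched from syntactically different URLs yield the same signature whenever their extracted body texts match, so only the first is passed on for further processing. Two genuinely distinct pages that happen to share the same body text would also be merged, which is one way false positives of the kind reported in the abstract can arise.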
Copyright information
© 2010 Atlantis Press/World Scientific
About this chapter
Cite this chapter
Soon, LK., Lee, S.H. (2010). Reducing Redundant Web Crawling Using URL Signatures. In: Web-Based Information Technologies and Distributed Systems. Atlantis Ambient and Pervasive Intelligence, vol 2. Atlantis Press. https://doi.org/10.2991/978-94-91216-32-9_6
DOI: https://doi.org/10.2991/978-94-91216-32-9_6
Publisher Name: Atlantis Press
Online ISBN: 978-94-91216-32-9
eBook Packages: Computer Science, Computer Science (R0)