Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 98)

Abstract

Uniform Resource Locator (URL) ordering algorithms are used by Web crawlers to determine the order in which to download pages from the Web. Current approaches to URL ordering based on link structure are expensive, miss many good pages, or both, particularly in social network environments. In this paper, we present a novel URL ordering system that relies on cooperation between crawlers and Web servers, based on file system and Web log information. In particular, we develop algorithms based on file timestamps and on internal and external counts from the Web logs. By using this change and popularity information for URL ordering, we are able to retrieve high-quality pages earlier in the crawl while avoiding requests for pages that are unchanged or no longer available. We perform our experiments on two data sets using the Web logs from university and CiteSeer websites. On these data sets, we achieve statistically significant improvements of 57.2% and 65.7% in the ordering of the high-quality pages (as indicated by Google’s PageRank) over a breadth-first search crawl, while also increasing the number of unique pages gathered by skipping unchanged or deleted pages.
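The abstract does not give the algorithms themselves, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a cooperating server exposes, for each URL, a file timestamp plus two Web log counts, here read as hits referred from within the site (internal) and from other sites (external). The `PageInfo` record, the `order_urls` function, and the `external_weight` parameter are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class PageInfo:
    """Hypothetical per-URL record a cooperating Web server could expose."""
    url: str
    last_modified: Optional[float]  # file timestamp (epoch seconds); None if the file is gone
    internal_count: int             # assumed: log hits referred from the same site
    external_count: int             # assumed: log hits referred from other sites


def order_urls(pages: Iterable[PageInfo],
               last_crawl: float,
               external_weight: float = 2.0) -> List[str]:
    """Return URLs worth fetching, most popular first.

    Pages that were deleted or not modified since `last_crawl` are skipped.
    The scoring formula and the default weight are illustrative assumptions.
    """
    # Change information: keep only pages that still exist and have changed.
    candidates = [
        p for p in pages
        if p.last_modified is not None and p.last_modified > last_crawl
    ]

    # Popularity information: combine the two log counts into one score.
    def score(p: PageInfo) -> float:
        return p.internal_count + external_weight * p.external_count

    # sorted() is stable, so ties keep the order in which the server listed them.
    return [p.url for p in sorted(candidates, key=score, reverse=True)]


sample = [
    PageInfo("/a.html", 200.0, 10, 1),   # changed, score 12
    PageInfo("/b.html", 50.0, 500, 90),  # unchanged since last crawl, skipped
    PageInfo("/c.html", None, 40, 5),    # deleted, skipped
    PageInfo("/d.html", 150.0, 3, 8),    # changed, score 19
]
print(order_urls(sample, last_crawl=100.0))  # ['/d.html', '/a.html']
```

In this sketch the popular but unchanged page `/b.html` is never requested, which mirrors the abstract's point that change information saves requests while popularity information decides what to fetch first.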




Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Chandramouli, A., Gauch, S., Eno, J. (2012). A Cooperative Approach to Web Crawler URL Ordering. In: Hippe, Z.S., Kulikowski, J.L., Mroczek, T. (eds) Human – Computer Systems Interaction: Backgrounds and Applications 2. Advances in Intelligent and Soft Computing, vol 98. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23187-2_22


  • DOI: https://doi.org/10.1007/978-3-642-23187-2_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23186-5

  • Online ISBN: 978-3-642-23187-2

  • eBook Packages: Engineering, Engineering (R0)
