A Parallel Crawling Schema Using Dynamic Partition

  • Shoubin Dong
  • Xiaofeng Lu
  • Ling Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3036)

Abstract

Parallel crawling is a key issue for search engines. In this paper we propose a parallel crawling schema based on dynamic partition, in order to fully utilize the available resources and achieve the best possible load balance. The crawling schema is evaluated using parallel metrics and load-balance performance. A prototype system built on Grid middleware has been constructed to demonstrate its efficiency and flexibility.

Keywords

Load Balance, Hash Table, Central Database, Download Time, Static Partition
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Shoubin Dong (1)
  • Xiaofeng Lu (1)
  • Ling Zhang (1)
  1. Network Research Center, South China University of Technology, Guangzhou, China