Mercator: A scalable, extensible Web crawler

Abstract

This paper describes Mercator, a scalable, extensible Web crawler written entirely in Java. Scalable Web crawlers are an important component of many Web services, but their design is not well documented in the literature. We enumerate the major components of any scalable Web crawler, comment on alternatives and tradeoffs in their design, and describe the particular components used in Mercator. We also describe Mercator's support for extensibility and customizability. Finally, we comment on Mercator's performance, which we have found to be comparable to that of other crawlers for which performance numbers have been published.

About this article

Cite this article

Heydon, A., Najork, M. Mercator: A scalable, extensible Web crawler. World Wide Web 2, 219–229 (1999). https://doi.org/10.1023/A:1019213109274

Keywords

  • Performance Number
  • Computing Profession