What’s Changed? Measuring Document Change in Web Crawling for Search Engines

  • Halil Ali
  • Hugh E. Williams
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2857)

Abstract

To provide fast, scalable search facilities, web search engines store collections locally. The collections are gathered by crawling the Web. A problem with crawling is determining when to revisit resources because they have changed: stale documents contribute towards poor search results, while unnecessary refreshing is expensive. However, some changes — such as in images, advertisements, and headers — are unlikely to affect query results. In this paper, we investigate measures for determining whether documents have changed and should be recrawled. We show that content-based measures are more effective than the traditional approach of using HTTP headers. Refreshing based on HTTP headers typically recrawls 16% of the collection each day, but users do not retrieve the majority of refreshed documents. In contrast, refreshing documents when more than twenty words change recrawls 22% of the collection but updates documents more effectively. We conclude that our simple measures are an effective component of a web crawling strategy.
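As a rough illustration of the content-based rule mentioned in the abstract (refresh a document when more than twenty words change), the following minimal Python sketch compares two stored versions of a document. The abstract does not say how changed words are counted, so the tokenization and the bag-of-words difference used here are assumptions, and the names `word_counts`, `words_changed`, and `should_recrawl` are hypothetical. It is not the authors' implementation.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of lowercase letters or digits.
# The paper may tokenize differently (e.g. after stripping markup).
WORD_RE = re.compile(r"[a-z0-9]+")


def word_counts(text: str) -> Counter:
    """Return a bag-of-words count for a document's text."""
    return Counter(WORD_RE.findall(text.lower()))


def words_changed(old: str, new: str) -> int:
    """Count words removed from the old version plus words added in the new one.

    Assumption: change is measured as a multiset difference, so word
    order is ignored and a substituted word counts twice (one removal,
    one addition).
    """
    old_c, new_c = word_counts(old), word_counts(new)
    removed = old_c - new_c  # Counter subtraction keeps positive counts only
    added = new_c - old_c
    return sum(removed.values()) + sum(added.values())


def should_recrawl(old: str, new: str, threshold: int = 20) -> bool:
    """Flag a document for refresh when more than `threshold` words changed."""
    return words_changed(old, new) > threshold
```

Under these assumptions, a page whose only change is a single replaced word would not be refreshed, while a page with a substantially rewritten body would be.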

Keywords

Search Engine, User Agent, Document Version, Document Content, International World Wide
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Halil Ali (1)
  • Hugh E. Williams (1)
  1. School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
