Learning to Schedule Webpage Updates Using Genetic Programming

  • Conference paper
String Processing and Information Retrieval (SPIRE 2013)

Abstract

A key challenge in designing a freshness-oriented scheduling policy is estimating the likelihood that a previously crawled webpage has been modified on the web. This estimate defines the order in which pages should be revisited, and can be exploited to reduce the cost of monitoring crawled webpages to keep their stored versions up to date. We present a novel approach that generates score functions producing accurate rankings of pages by their probability of having been modified since they were last crawled. We propose a flexible framework that uses genetic programming to evolve these score functions. We present a thorough experimental evaluation of the benefits of our framework over five state-of-the-art baselines.
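The general idea of evolving a score function can be illustrated with a small sketch. This is not the authors' framework: the per-page features (`n`, `x`, `t`), the fitness measure (fraction of the top-ranked pages that really changed), the synthetic data, and all function names are assumptions made for illustration. For brevity it uses only mutation and elitism, and omits the crossover operator that genetic programming normally includes.

```python
import math
import operator
import random

# Hypothetical features: n = past visits, x = visits on which a change
# was detected, t = days since the last visit.
FEATURES = ("n", "x", "t")

def protected_div(a, b):
    """Division that returns 1.0 instead of failing on a near-zero divisor."""
    return a / b if abs(b) > 1e-9 else 1.0

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": protected_div}

def random_tree(rng, depth=3):
    """Build a random expression tree: (op, left, right), a feature name, or a constant."""
    if depth <= 0 or rng.random() < 0.3:
        return rng.choice(FEATURES) if rng.random() < 0.8 else round(rng.uniform(0, 2), 2)
    return (rng.choice(list(OPS)), random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, page):
    """Compute the score that an expression tree assigns to one page."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, page), evaluate(right, page))
    return page[tree] if isinstance(tree, str) else tree

def fitness(tree, pages, budget):
    """Fraction of the `budget` highest-scored pages that actually changed."""
    ranked = sorted(pages, key=lambda p: evaluate(tree, p), reverse=True)
    return sum(p["changed"] for p in ranked[:budget]) / budget

def mutate(tree, rng, max_depth=4):
    """Replace one randomly chosen subtree, keeping total depth within max_depth."""
    if max_depth <= 0 or not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng, max(0, min(2, max_depth)))
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, rng, max_depth - 1), right)
    return (op, left, mutate(right, rng, max_depth - 1))

def evolve(pages, budget, rng, pop_size=30, generations=15):
    """Mutation-only evolution with elitism: the best fifth survives each round."""
    population = [random_tree(rng) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=lambda tr: fitness(tr, pages, budget), reverse=True)
        elite = scored[: pop_size // 5]
        population = elite + [mutate(rng.choice(elite), rng) for _ in range(pop_size - len(elite))]
    return max(population, key=lambda tr: fitness(tr, pages, budget))

def synthetic_pages(rng, count=200):
    """Generate toy pages with a hidden change rate (purely synthetic data)."""
    pages = []
    for _ in range(count):
        rate = rng.uniform(0.01, 1.0)
        n = rng.randint(5, 30)
        x = sum(rng.random() < rate for _ in range(n))
        t = rng.uniform(0.5, 5.0)
        changed = rng.random() < 1 - math.exp(-rate * t)
        pages.append({"n": n, "x": x, "t": t, "changed": changed})
    return pages

rng = random.Random(0)
pages = synthetic_pages(rng)
best = evolve(pages, budget=40, rng=rng)
print(best, fitness(best, pages, budget=40))
```

The evolved tree is then used as a ranking function: pages with the highest scores are revisited first, within whatever crawl budget is available.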

The original version of this chapter was revised: The copyright line was incorrect. This has been corrected. The Erratum to this chapter is available at DOI: 10.1007/978-3-319-02432-5_33

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Santos, A.S.R., Ziviani, N., Almeida, J., Carvalho, C.R., de Moura, E.S., da Silva, A.S. (2013). Learning to Schedule Webpage Updates Using Genetic Programming. In: Kurland, O., Lewenstein, M., Porat, E. (eds) String Processing and Information Retrieval. SPIRE 2013. Lecture Notes in Computer Science, vol 8214. Springer, Cham. https://doi.org/10.1007/978-3-319-02432-5_30

  • DOI: https://doi.org/10.1007/978-3-319-02432-5_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-02431-8

  • Online ISBN: 978-3-319-02432-5

  • eBook Packages: Computer Science, Computer Science (R0)
