Decomposition-Based Optimization of Reload Strategies in the World Wide Web

  • Conference paper
Web Information Systems Engineering – WISE 2005 (WISE 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3806)

Abstract

Web sites, Web pages, and the data on pages are available only for specific periods of time and, from a client’s point of view, are deleted afterwards. An important task in retrieving information from the Web is to consider Web information over the course of time. Different strategies, such as push and pull strategies, may be applied to this task. Since push services are usually not available, pull strategies have to be devised that optimize the retrieved information with respect to the age of the retrieved data and its completeness. In this article we present a new procedure that optimizes the data retrieved from Web pages by page decomposition. Using an automatic wrapper induction technique, a page is decomposed into functional segments. Each segment is treated as an independent component in the analysis of the time behavior of the page. Based on this decomposition we present a new component-based download strategy. Applying this method to Web pages shows that, for a fraction of Web data, the freshness of retrieved data may be improved significantly compared to traditional methods.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kukulenz, D. (2005). Decomposition-Based Optimization of Reload Strategies in the World Wide Web. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, JY., Sheng, Q.Z. (eds) Web Information Systems Engineering – WISE 2005. WISE 2005. Lecture Notes in Computer Science, vol 3806. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11581062_6

  • DOI: https://doi.org/10.1007/11581062_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-30017-5

  • Online ISBN: 978-3-540-32286-3

  • eBook Packages: Computer Science; Computer Science (R0)
