Abstract
Web sites, Web pages and the data on pages are available only for specific periods of time and are deleted afterwards from a client’s point of view. An important task in order to retrieve information from the Web is to consider Web information in the course of time. Different strategies like push and pull strategies may be applied for this task. Since push services are usually not available, pull strategies have to be conceived in order to optimize the retrieved information with respect to the age of retrieved data and its completeness. In this article we present a new procedure to optimize retrieved data from Web pages by page decomposition. By deploying an automatic Wrapper induction technique a page is decomposed into functional segments. Each segment is considered as an independent component for the analysis of the time behavior of the page. Based on this decomposition we present a new component-based download strategy. By applying this method to Web pages it is shown that for a fraction of Web data the freshness of retrieved data may be improved significantly compared to traditional methods.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., Raghavan, S.: Searching the web. ACM Trans. Inter. Tech. 1(1), 2–43 (2001)
Arasu, A., Garcia-Molina, H., University, S.: Extracting structured data from web pages. In: SIGMOD 2003: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pp. 337–348. ACM Press, New York (2003)
Cho, J., Garcia-Molina, H.: Estimating frequency of change. ACM Trans. Inter. Tech. 3(3), 256–290 (2003)
Coffman, E., Liu, Z., Weber, R.R.: Optimal robot scheduling for web search engines. Journal of Scheduling 1(1), 15–29 (1998)
Crescenzi, V., Mecca, G.: Automatic information extraction from large websites. J. ACM 51(5), 731–779 (2004)
Grumbach, S., Mecca, G.: In search of the lost schema. In: Beeri, C., Bruneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 314–331. Springer, Heidelberg (1998)
Kendall, J.E., Kendall, K.E.: Information delivery systems: an exploration of web pull and push technologies. Commun. AIS 1(4es), 1–43 (1999)
Kukulenz, D.: Capturing web dynamics by regular approximation. In: Zhou, X., Su, S., Papazoglou, M.P., Orlowska, M.E., Jeffery, K. (eds.) WISE 2004. LNCS, vol. 3306, pp. 528–540. Springer, Heidelberg (2004)
Laender, A., Ribeiro-Neto, B., Silva, A., Teixeira, J.: A brief survey of web data extraction tools. SIGMOD Record (June 2002)
Olston, C., Widom, J.: Best-effort cache synchronization with source cooperation. In: Proceedings of SIGMOD, May 2002, pp. 73–84 (2002)
Pandey, S., Ramamritham, K., Chakrabarti, S.: Monitoring the dynamic web to respond to continuous queries. In: WWW 2003: Proc. of the 12th int. conf. on World Wide Web, pp. 659–668. ACM Press, New York (2003)
Sharaf, M.A., Labrinidis, A., Chrysanthis, P.K., Pruhs, K.: Freshness-aware scheduling of continuous queries in the dynamic web. In: 8th Int. Workshop on the Web and Databases (WebDB 2005), Baltimore, Maryland, pp. 73–78 (2005)
Wolf, J.L., Squillante, M.S., Yu, P.S., Sethuraman, J., Ozsen, L.: Optimal crawling strategies for web search engines. In: Proceedings of the eleventh international conference on World Wide Web, pp. 136–147. ACM Press, New York (2002)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kukulenz, D. (2005). Decomposition-Based Optimization of Reload Strategies in the World Wide Web. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, JY., Sheng, Q.Z. (eds) Web Information Systems Engineering – WISE 2005. WISE 2005. Lecture Notes in Computer Science, vol 3806. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11581062_6
Download citation
DOI: https://doi.org/10.1007/11581062_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30017-5
Online ISBN: 978-3-540-32286-3
eBook Packages: Computer ScienceComputer Science (R0)