Abstract
Web sites are dynamic in nature, with content and structure changing over time. Many pages on the Web are produced by content management systems (CMSs) such as WordPress, vBulletin, or phpBB. Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding both the CMS the site is based on (leading to suboptimal crawling strategies) and whatever structured content the pages contain (resulting in page-level archives whose content is hard to exploit). In this paper we present an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications (e.g., the pages served by a CMS). Because the AAH is aware of the Web application currently being crawled, it is able to refine the list of URLs to process and to extend the archive with semantic information about extracted content. To deal with possible changes in the structure of Web applications, our AAH includes an adaptation module that makes crawling resilient to small changes in the structure of a Web site. We show the value of our approach by comparing the output and efficiency of the AAH with those of regular Web crawlers, including in the presence of structure change.
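The URL refinement step mentioned in the abstract could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `AppProfile` type, the `detect_app` and `refine_urls` helpers, the detection marker, and the URL patterns are all hypothetical and chosen only to show the idea of filtering a crawl frontier once the Web application has been recognized.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AppProfile:
    """Hypothetical description of one kind of Web application."""
    name: str
    detect: re.Pattern  # matched against page HTML to recognize the application
    follow: re.Pattern  # URLs assumed to carry archivable content
    skip: re.Pattern    # URLs assumed to be duplicates or action links


# Illustrative patterns only; they are assumptions, not taken from the paper.
PROFILES = [
    AppProfile(
        name="wordpress",
        detect=re.compile(r'<meta name="generator" content="WordPress', re.I),
        follow=re.compile(r"/(\d{4}/\d{2}/[^/?]+/?|page/\d+/?)$"),
        skip=re.compile(r"[?&](replytocom|share)=|/wp-login\.php|/feed/?$"),
    ),
]


def detect_app(html: str) -> Optional[AppProfile]:
    """Return the first profile whose marker appears in the page, if any."""
    for profile in PROFILES:
        if profile.detect.search(html):
            return profile
    return None


def refine_urls(html: str, urls: list[str]) -> list[str]:
    """Keep only URLs the detected application profile considers useful.

    When no application is recognized, all URLs are kept, which mimics
    the behaviour of a regular, application-unaware crawler.
    """
    profile = detect_app(html)
    if profile is None:
        return list(urls)
    return [
        u for u in urls
        if profile.follow.search(u) and not profile.skip.search(u)
    ]
```

In this sketch, a page recognized as WordPress keeps its post and pagination links while login, feed, and reply links are dropped; an unrecognized page passes all of its links through unchanged.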
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Faheem, M., Senellart, P. (2013). Intelligent and Adaptive Crawling of Web Applications for Web Archiving. In: Daniel, F., Dolog, P., Li, Q. (eds) Web Engineering. ICWE 2013. Lecture Notes in Computer Science, vol 7977. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39200-9_26
DOI: https://doi.org/10.1007/978-3-642-39200-9_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39199-6
Online ISBN: 978-3-642-39200-9
eBook Packages: Computer Science, Computer Science (R0)