Abstract
Today's crawlers cannot reach most of the information on the Web. A large amount of valuable information is “hidden” behind the query forms of online databases, or is generated dynamically by client-side technologies such as JavaScript. This portion of the Web is usually known as the Deep Web or the Hidden Web. We have built DeepBot, a prototype hidden-web crawler able to access such content. DeepBot receives as input a set of domain definitions, each one describing a specific data-collecting task, and automatically identifies and learns to execute queries on the forms relevant to them. In this paper we describe the techniques employed for building DeepBot and report the experimental results obtained when testing it on several real-world data collection tasks.
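To make the idea of a domain definition more concrete, the following is a minimal sketch of how a form might be judged relevant to a data-collecting task. It is not DeepBot's actual algorithm, which the abstract does not specify. The class names, the alias lists, the use of Python's difflib for text similarity, and both thresholds are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


@dataclass
class Attribute:
    """One attribute of a domain, e.g. 'title' for a book-search task (hypothetical)."""
    name: str
    aliases: List[str] = field(default_factory=list)


@dataclass
class DomainDefinition:
    """Hypothetical domain definition: a task name plus the attributes it cares about."""
    name: str
    attributes: List[Attribute]
    threshold: float = 0.5  # assumed cut-off for calling a form relevant


def similarity(a: str, b: str) -> float:
    """Case-insensitive text similarity in [0, 1] (difflib stands in for any metric)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def best_attribute(label: str, domain: DomainDefinition) -> Tuple[Optional[str], float]:
    """Return the domain attribute whose name or alias is closest to a form label."""
    best_name, best_score = None, 0.0
    for attr in domain.attributes:
        for text in [attr.name, *attr.aliases]:
            score = similarity(label, text)
            if score > best_score:
                best_name, best_score = attr.name, score
    return best_name, best_score


def form_relevance(field_labels: List[str], domain: DomainDefinition,
                   min_label_score: float = 0.8) -> float:
    """Fraction of domain attributes matched by at least one form field label."""
    if not domain.attributes:
        return 0.0
    matched = set()
    for label in field_labels:
        name, score = best_attribute(label, domain)
        if name is not None and score >= min_label_score:
            matched.add(name)
    return len(matched) / len(domain.attributes)


def is_relevant(field_labels: List[str], domain: DomainDefinition) -> bool:
    """A form is treated as relevant when enough domain attributes are covered."""
    return form_relevance(field_labels, domain) >= domain.threshold


# Example: a hypothetical 'books' domain checked against two forms.
books = DomainDefinition(
    name="books",
    attributes=[
        Attribute("title", aliases=["book title"]),
        Attribute("author", aliases=["writer"]),
        Attribute("isbn"),
    ],
)
search_form = ["Title", "Author", "Price range"]
login_form = ["Username", "Password"]
print(is_relevant(search_form, books), is_relevant(login_form, books))
```

In this sketch a form is scored by how many domain attributes its field labels cover, so a login form shares no labels with the book domain and is skipped, while a search form covering two of three attributes passes the assumed threshold. A real crawler would also need to learn how to fill in and submit the matched fields, which is outside the scope of this illustration.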
This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730.
© 2007 Springer-Verlag Berlin Heidelberg
Cite this paper
Álvarez, M., Raposo, J., Pan, A., Cacheda, F., Bellas, F., Carneiro, V. (2007). Crawling the Content Hidden Behind Web Forms. In: Gervasi, O., Gavrilova, M.L. (eds) Computational Science and Its Applications – ICCSA 2007. ICCSA 2007. Lecture Notes in Computer Science, vol 4706. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74477-1_31
DOI: https://doi.org/10.1007/978-3-540-74477-1_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74475-7
Online ISBN: 978-3-540-74477-1
eBook Packages: Computer Science, Computer Science (R0)