Topic-Sensitive Hidden-Web Crawling

Liakos, Panagiotis; Ntoulas, Alexandros

doi:10.1007/978-3-642-35063-4_39

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7651)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

A constantly growing amount of high-quality information is stored in pages coming from the Hidden Web. Such pages are accessible only through a query interface that a Hidden-Web site provides and may span a variety of topics.

To provide centralized access to the Hidden Web, previous work has focused on query generation techniques that aim to download all the content of a given Hidden-Web site at minimum cost. In certain settings, however, we are interested in downloading only a specific part of such a site. For example, a user of a news database may want to retrieve only sports articles and no politics articles. In this case, we need to make the best use of our resources by downloading only the portion of the Hidden-Web site that we are interested in.

In this paper, we study how to build a topically focused Hidden-Web crawler that can autonomously extract topic-specific pages from the Hidden Web by searching only the subset that is related to the corresponding category. To this end, we present query generation techniques that take into account the topic that we are interested in. We propose a number of different crawling policies and experimentally evaluate them with data from two popular sites.
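To make the idea of topic-aware query generation concrete, the following is a minimal sketch and not the policies proposed in the paper, whose details are not given in this abstract. It assumes a toy in-memory site (`SITE`), a stand-in query interface (`search`), and a deliberately naive relevance test (`is_on_topic`); all of these names and the scoring rule are hypothetical. At each step the crawler picks, as its next query, the not-yet-issued keyword that appears in the most on-topic documents downloaded so far.

```python
from collections import Counter

# Hypothetical toy "Hidden-Web site": documents reachable only via keyword search.
SITE = {
    1: "football match goal striker league",
    2: "tennis match serve ace tournament",
    3: "election vote parliament minister",
    4: "league striker transfer club goal",
    5: "minister budget vote tax",
    6: "tournament final ace champion tennis",
}

# Assumed description of the topic of interest (sports), as a set of terms.
TOPIC = {"football", "tennis", "goal", "league", "tournament"}


def search(keyword):
    """Stand-in for the site's query interface: ids of documents containing keyword."""
    return {doc_id for doc_id, text in SITE.items() if keyword in text.split()}


def is_on_topic(text, topic_terms):
    """Naive relevance test (an assumption): the document shares a term with the topic."""
    return bool(topic_terms & set(text.split()))


def crawl(seed, topic_terms, max_queries):
    """Issue up to max_queries keyword queries, choosing each next keyword greedily."""
    downloaded = {}  # doc_id -> text
    issued = []      # keywords already sent to the site
    keyword = seed
    for _ in range(max_queries):
        issued.append(keyword)
        for doc_id in search(keyword):
            downloaded[doc_id] = SITE[doc_id]
        # Score unused keywords by the number of on-topic downloaded documents
        # that contain them; off-topic documents contribute no candidates.
        scores = Counter()
        for text in downloaded.values():
            if is_on_topic(text, topic_terms):
                scores.update(set(text.split()) - set(issued))
        if not scores:
            break
        # Highest score wins; ties are broken alphabetically for determinism.
        keyword = min(scores, key=lambda w: (-scores[w], w))
    return issued, downloaded


if __name__ == "__main__":
    queries, docs = crawl("match", TOPIC, max_queries=4)
    print("queries issued:", queries)
    print("documents downloaded:", sorted(docs))
```

Because candidate keywords are drawn only from on-topic documents, the politics documents in this toy site are never requested. The scoring rule favors terms that are common in pages already downloaded, so it can re-fetch known pages; a practical policy would also estimate how many new on-topic pages a query is likely to return.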

Partially supported by PIRG06-GA-2009-256603.

Author information

Authors and Affiliations

National and Kapodistrian University of Athens, Greece
Panagiotis Liakos & Alexandros Ntoulas
Zynga, San Francisco, USA
Alexandros Ntoulas

Editor information

Editors and Affiliations

School of Computer Science, Fudan University, 825 Zhangheng Rd., Shanghai, 201203, China
X. Sean Wang
Department of Computer Science, College of Engineering, Science and Engineering Offices, The University of Illinois at Chicago, 851 South Morgan Street (M/C 152), 60607-7053, Chicago, Illinois, USA
Isabel Cruz
Department of Informatics and Telecommunications, University of Athens, GR15784, Ilisia, Athens, Greece
Alex Delis
Centre for Applied Informatics, Victoria University, PO Box 14428, 8001, Melbourne, VIC, Australia
Guangyan Huang

About this paper

Cite this paper

Liakos, P., Ntoulas, A. (2012). Topic-Sensitive Hidden-Web Crawling. In: Wang, X.S., Cruz, I., Delis, A., Huang, G. (eds) Web Information Systems Engineering - WISE 2012. WISE 2012. Lecture Notes in Computer Science, vol 7651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35063-4_39

DOI: https://doi.org/10.1007/978-3-642-35063-4_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35062-7
Online ISBN: 978-3-642-35063-4
eBook Packages: Computer Science, Computer Science (R0)
