A Novel Design of Hidden Web Crawler Using Reinforcement Learning Based Agents

  • Conference paper
Advanced Parallel Processing Technologies (APPT 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4847)

Abstract

An ever-increasing amount of information on the Web is available only through search interfaces: users must type a set of keywords into a search form to access pages from certain Web sites. These pages are often referred to as the Hidden Web or the Deep Web. Because no static links point to Hidden Web pages, search engines cannot discover and index them and therefore do not return them in results. However, according to recent studies, the content provided by many Hidden Web sites is often of very high quality and can be extremely valuable to many users. This paper discusses the design of ALAC, a Hidden Web crawler that can autonomously discover pages from the Hidden Web. A theoretical framework is presented to investigate the resource discovery problem, and a crawling strategy is proposed for identifying Hidden Web sites automatically. The crawler design employs agents driven by reinforcement learning. A prototype is evaluated experimentally, and the results are promising: ALAC found 567 searchable forms after searching 3450 pages, which supports the effectiveness of the policy.
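The abstract does not give ALAC's algorithm, so the following is only a minimal illustrative sketch of the general idea it describes, not the authors' method. It assumes a page earns a reward of 1 when it contains a form with a free-text input, and that an agent keeps a running value estimate per link feature. The names `FormFinder`, `reward`, `update_q`, and `harvest_rate`, the reward definition, and the learning rate are all hypothetical. The last function simply divides the two figures reported in the abstract.

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Counts <form> elements that contain at least one free-text input.

    Assumption: "searchable form" is approximated as any form with a
    text or search input. This also matches login forms, so a real
    crawler would need a stricter classifier.
    """

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.has_text_input = False
        self.searchable_forms = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
            self.has_text_input = False
        elif self.in_form and tag == "input":
            # An <input> without a type attribute defaults to type="text".
            if (attrs.get("type") or "text").lower() in ("text", "search"):
                self.has_text_input = True

    def handle_endtag(self, tag):
        if tag == "form" and self.in_form:
            if self.has_text_input:
                self.searchable_forms += 1
            self.in_form = False


def reward(html):
    """Hypothetical reward: 1.0 if the page holds a searchable form, else 0.0."""
    finder = FormFinder()
    finder.feed(html)
    finder.close()
    return 1.0 if finder.searchable_forms > 0 else 0.0


def update_q(q, link_feature, r, alpha=0.5):
    """Move the value estimate for a link feature toward the observed reward.

    This is a plain running-average update with no bootstrapping; it is
    a simplification chosen for illustration.
    """
    old = q.get(link_feature, 0.0)
    q[link_feature] = old + alpha * (r - old)
    return q[link_feature]


def harvest_rate(forms_found, pages_visited):
    """Fraction of visited pages that yielded a searchable form."""
    return forms_found / pages_visited


if __name__ == "__main__":
    page = '<form action="/search"><input type="text" name="q"></form>'
    q_values = {}
    print(update_q(q_values, "anchor contains 'search'", reward(page)))
    print(round(harvest_rate(567, 3450), 3))
```

With the figures from the abstract, 567 forms over 3450 pages corresponds to a harvest rate of roughly 16.4%. An agent following this sketch would prefer links whose features have accumulated higher value estimates.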



Editor information

Ming Xu, Yinwei Zhan, Jiannong Cao, Yijun Liu

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Akilandeswari, J., Gopalan, N.P. (2007). A Novel Design of Hidden Web Crawler Using Reinforcement Learning Based Agents. In: Xu, M., Zhan, Y., Cao, J., Liu, Y. (eds) Advanced Parallel Processing Technologies. APPT 2007. Lecture Notes in Computer Science, vol 4847. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76837-1_47

  • DOI: https://doi.org/10.1007/978-3-540-76837-1_47

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-76836-4

  • Online ISBN: 978-3-540-76837-1

  • eBook Packages: Computer Science, Computer Science (R0)
