Intelligent Rule-Based Deep Web Crawler



This chapter discusses the architecture specification of a deep web crawler. The crawler has an indexer capable of fetching a large number of documents from both the surface web and the deep web. Documents from the deep web are fetched based on rules, in which the core and allied fields of the forms play an important role. Based on the domain and the nature of the FORM in HTML pages, the functional dependencies between the fields are determined and the core and allied fields are identified. An SVM classifier is used to classify each rule as most preferable, least preferable, or mutually exclusive. Documents are fetched using the most preferable fields in the FORM. The fetched documents are indexed, and the same architecture is scaled to support distributed functionality with the help of web services. The architecture processes a large number of documents with an encouraging coverage rate and a lower fetching time. The retrieval performance of the crawler is compared with that of the Google retrieval system, and the proposed architecture is found to achieve similar retrieval precision.
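
The rule-driven fetching step can be illustrated with a minimal sketch. It assumes each rule is a set of form fields carrying one of the three labels named above; the labels are assigned by hand here, standing in for the output of the SVM classifier, which is not implemented. The names `Rule`, `RuleLabel`, `select_rules`, and `build_query_urls`, the field names, and the `example.org` URL are all hypothetical and do not come from the chapter.

```python
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlencode


class RuleLabel(Enum):
    MOST_PREFERABLE = "most preferable"
    LEAST_PREFERABLE = "least preferable"
    MUTUALLY_EXCLUSIVE = "mutually exclusive"


@dataclass(frozen=True)
class Rule:
    fields: frozenset  # form fields that the rule fills in together
    label: RuleLabel   # hand-assigned here; assumed to come from a classifier


def select_rules(rules):
    """Keep only the rules labeled most preferable."""
    return [r for r in rules if r.label is RuleLabel.MOST_PREFERABLE]


def build_query_urls(action_url, rules, values):
    """Build one GET URL per most preferable rule that has all its values."""
    urls = []
    for rule in select_rules(rules):
        # Skip a rule if any of its fields has no candidate value.
        if not all(name in values for name in rule.fields):
            continue
        # Sort field names so the generated URL is deterministic.
        params = {name: values[name] for name in sorted(rule.fields)}
        urls.append(f"{action_url}?{urlencode(params)}")
    return urls


# Hypothetical rules for a book-search form.
rules = [
    Rule(frozenset({"title"}), RuleLabel.MOST_PREFERABLE),
    Rule(frozenset({"title", "author"}), RuleLabel.MOST_PREFERABLE),
    Rule(frozenset({"language"}), RuleLabel.LEAST_PREFERABLE),
    Rule(frozenset({"isbn", "title"}), RuleLabel.MUTUALLY_EXCLUSIVE),
]
values = {"title": "crawler", "author": "cho", "language": "en"}
urls = build_query_urls("https://example.org/search", rules, values)
for url in urls:
    print(url)
```

Only the two rules labeled most preferable produce form submissions; the least preferable and mutually exclusive rules are skipped, which is one plausible reading of how such labels could limit the number of fetches.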


Keywords: Deep web crawler, Indexer, Rules, Hidden web data



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Department of Computer Science and Engineering, Dayananda Sagar University, Bangalore, India
  2. Department of Computer Science and Engineering, SRM University AP, Amaravati, India
