Smart Focused Web Crawler for Hidden Web

  • Conference paper
Information and Communication Technology for Competitive Strategies

Abstract

A huge amount of useful data is buried in the hidden web, which becomes accessible only when users fill in and submit forms. Web crawlers can reach this data only by interacting with web-based search forms, so traditional search engines cannot efficiently search and index these deep (hidden) web pages. Retrieving data with high accuracy and coverage from the hidden web is a challenging task. Focused crawling ensures that each retrieved document belongs to the particular subject of interest. The proposed architecture, a smart focused web crawler for the hidden web, is based on XML parsing of web pages: it first finds hidden web pages and learns their features. Term frequency–inverse document frequency (TF-IDF) will be used to build a classifier that finds relevant pages using a completely automatic adaptive learning technique. This system is expected to increase the coverage and accuracy of retrieved web pages. For distributed processing, the MapReduce framework of Hadoop will be used.
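
To make the TF-IDF relevance idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a small set of seed pages that describe the topic, a smoothed IDF formula, cosine similarity against the averaged seed vector, and a fixed threshold of 0.2. All function names, the sample texts, and the threshold are hypothetical and chosen only for illustration; the adaptive learning and MapReduce parts of the proposed system are not modelled here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def build_idf(docs):
    """Smoothed inverse document frequency over a small corpus (assumed formula)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_centroid(seed_docs, idf):
    """Average the TF-IDF vectors of the seed pages that define the topic."""
    centroid = Counter()
    for doc in seed_docs:
        for term, weight in tfidf_vector(doc, idf).items():
            centroid[term] += weight / len(seed_docs)
    return dict(centroid)


def is_relevant(page_text, centroid, idf, threshold=0.2):
    """Hypothetical relevance test: similarity to the topic centroid >= threshold."""
    return cosine(tfidf_vector(page_text, idf), centroid) >= threshold


# Illustrative data only: two on-topic seed pages and one off-topic page.
seeds = [
    "search cheap flights airline tickets",
    "book airline tickets and flights online",
]
background = ["latest football scores and match results"]
idf = build_idf(seeds + background)
centroid = topic_centroid(seeds, idf)

on_topic = is_relevant("cheap airline flights search form", centroid, idf)
off_topic = is_relevant("football match results today", centroid, idf)
```

In a focused crawler, a test of this kind would decide whether the links and forms on a fetched page are worth following. The threshold trades coverage against accuracy and would need tuning on real data.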

Author information

Corresponding author

Correspondence to Sawroop Kaur.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Kaur, S., Geetha, G. (2019). Smart Focused Web Crawler for Hidden Web. In: Fong, S., Akashe, S., Mahalle, P. (eds) Information and Communication Technology for Competitive Strategies. Lecture Notes in Networks and Systems, vol 40. Springer, Singapore. https://doi.org/10.1007/978-981-13-0586-3_42

  • DOI: https://doi.org/10.1007/978-981-13-0586-3_42

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-0585-6

  • Online ISBN: 978-981-13-0586-3

  • eBook Packages: Engineering, Engineering (R0)
