Abstract
A huge amount of useful data is buried in the hidden web, which becomes accessible only when users fill in and submit forms. Web crawlers can reach this data only by interacting with web-based search forms, and traditional search engines cannot efficiently search and index these deep or hidden web pages. Retrieving hidden web data with high accuracy and coverage is therefore a challenging task. Focused crawling restricts the crawl to documents that belong to a particular topic. The proposed architecture, a smart focused web crawler for the hidden web, is based on XML parsing of web pages: it first finds hidden web pages and learns their features. Term frequency–inverse document frequency (TF-IDF) will be used to build a classifier that finds relevant pages, using a completely automatic adaptive learning technique. This system is expected to increase the coverage and accuracy of retrieved web pages. For distributed processing, the MapReduce framework of Hadoop will be used.
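To make the TF-IDF relevance step concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a topic description and each candidate page are available as plain text, and it substitutes a simple cosine-similarity ranking for the trained classifier mentioned in the abstract. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_pages`) and the smoothed IDF formula are illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (tf * smoothed idf)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty pages
        vectors.append({
            # Smoothed idf (an assumption): log((1 + N) / (1 + df)) + 1.
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pages(topic, pages):
    """Return (page_index, score) pairs, most relevant page first."""
    vectors = tfidf_vectors([topic] + pages)
    topic_vec, page_vecs = vectors[0], vectors[1:]
    scores = [(i, cosine(topic_vec, v)) for i, v in enumerate(page_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, a focused crawler could call `rank_pages` on the text of fetched pages and follow only those whose score exceeds a chosen threshold. In the distributed setting the abstract describes, the per-page scoring would be the natural unit of work to parallelize.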
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kaur, S., Geetha, G. (2019). Smart Focused Web Crawler for Hidden Web. In: Fong, S., Akashe, S., Mahalle, P. (eds) Information and Communication Technology for Competitive Strategies. Lecture Notes in Networks and Systems, vol 40. Springer, Singapore. https://doi.org/10.1007/978-981-13-0586-3_42
DOI: https://doi.org/10.1007/978-981-13-0586-3_42
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0585-6
Online ISBN: 978-981-13-0586-3
eBook Packages: Engineering, Engineering (R0)