Smart Focused Web Crawler for Hidden Web

Kaur, Sawroop; Geetha, G.

doi:10.1007/978-981-13-0586-3_42

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 40)

Abstract

A huge amount of useful data is buried under the layers of the hidden web and becomes accessible only when users fill in and submit forms. Web crawlers can access this data only by interacting with web-based search forms. Traditional search engines cannot efficiently search and index these deep, or hidden, web pages. Retrieving data with high accuracy and coverage in the hidden web is a challenging task. Focused crawling ensures that each document found belongs to the particular subject of interest. The proposed architecture, a smart focused web crawler for the hidden web, is based on XML parsing of web pages: it first finds the hidden web pages and learns their features. Term frequency–inverse document frequency (TF-IDF) will be used to build a classifier that finds relevant pages, using a completely automatic adaptive learning technique. This system will help increase the coverage and accuracy of retrieved web pages. For distributed processing, the MapReduce framework of Hadoop will be used.
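The abstract does not say how the TF-IDF classifier is constructed, so the following is only a minimal sketch of one plausible reading: pages are turned into TF-IDF vectors, and a candidate page counts as relevant when its cosine similarity to the centroid of known on-topic pages exceeds a threshold. The nearest-centroid rule, the smoothed IDF formula, the threshold value, the function names, and the toy training pages are all assumptions made for illustration and are not taken from the paper.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the training docs."""
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector; terms unseen during training are ignored."""
    tokens = tokenize(text)
    total = len(tokens) or 1
    return {t: (c / total) * idf[t] for t, c in Counter(tokens).items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Average of several sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def is_relevant(page_text, topic, idf, threshold=0.1):
    """Hypothetical decision rule: relevant if similar enough to the topic centroid."""
    return cosine(tfidf(page_text, idf), topic) >= threshold


# Toy training data (invented for this sketch): two on-topic and two off-topic pages.
on_topic = [
    "search used cars by make model and price",
    "find car listings enter make and year",
]
off_topic = [
    "weather forecast for the weekend",
    "football scores and league tables",
]

idf = build_idf(on_topic + off_topic)
topic = centroid([tfidf(doc, idf) for doc in on_topic])

print(is_relevant("search cars by make and price", topic, idf))
print(is_relevant("weekend weather forecast rain", topic, idf))
```

In a focused crawler, a check like `is_relevant` would typically decide whether a fetched page is kept and whether its outgoing links are followed. An adaptive variant could recompute the centroid as newly accepted pages arrive, although how the paper's system adapts is not described in the abstract.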

Author information

Authors and Affiliations

Lovely Professional University, Phagwara, Punjab, India
Sawroop Kaur & G. Geetha

Corresponding author

Correspondence to Sawroop Kaur.

Editor information

Editors and Affiliations

Department of Computer and Information Science, University of Macau, Macau, China
Simon Fong
Department of Electronics and Communication Engineering, ITM University, Gwalior, India
Shyam Akashe
Smt. Kashibai Navale College of Engineering, Pune, India
Parikshit N. Mahalle

Cite this paper

Kaur, S., Geetha, G. (2019). Smart Focused Web Crawler for Hidden Web. In: Fong, S., Akashe, S., Mahalle, P. (eds) Information and Communication Technology for Competitive Strategies. Lecture Notes in Networks and Systems, vol 40. Springer, Singapore. https://doi.org/10.1007/978-981-13-0586-3_42

DOI: https://doi.org/10.1007/978-981-13-0586-3_42
Published: 31 August 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0585-6
Online ISBN: 978-981-13-0586-3
eBook Packages: Engineering, Engineering (R0)
