Abstract
The rapid expansion of the World Wide Web poses exceptional scaling challenges for traditional crawlers and search engines. Web crawlers continuously traverse the Web to detect pages that have been added to or removed from it. Because the Web is dynamic and constantly growing, it is difficult to filter out irrelevant pages and to predict which links lead to high-quality pages. Since a crawler is just a computer program, it cannot judge by itself how pertinent a Web page is. In this paper, a method of efficient focused crawling is implemented to enhance the quality of Web navigation. We compute a score for each unvisited URL from several factors, such as its description in the Google search engine and the relevancy of its anchor text, and we compute the similarity of that description to the given query or topic keywords. The relevancy score is calculated using the vector space model (VSM). The URL queue is optimized by removing duplicate links and pages with similar content.
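The scoring and queue optimization described above can be sketched as follows. This is a minimal illustration and not the chapter's implementation: the raw term-frequency weighting, the equal weights `w_anchor` and `w_desc`, the trailing-slash URL normalization, the `threshold` value, and the names `link_score` and `UrlQueue` are all assumptions made for the example.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def cosine_similarity(text_a, text_b):
    """VSM cosine similarity over raw term-frequency vectors (assumed weighting)."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_score(topic, anchor_text, description, w_anchor=0.5, w_desc=0.5):
    """Combine anchor-text and description relevancy; the weights are hypothetical."""
    return (w_anchor * cosine_similarity(topic, anchor_text)
            + w_desc * cosine_similarity(topic, description))


class UrlQueue:
    """Priority queue of unvisited URLs that skips duplicate links and
    links whose description is too similar to one already queued."""

    def __init__(self, threshold=0.9):
        self._heap = []          # entries are (-score, url): highest score pops first
        self._seen = set()       # normalized URLs already queued
        self._contents = []      # descriptions already queued
        self.threshold = threshold

    def push(self, url, score, content=""):
        key = url.rstrip("/")    # crude normalization, for illustration only
        if key in self._seen:
            return False         # duplicate link
        if content and any(cosine_similarity(content, c) >= self.threshold
                           for c in self._contents):
            return False         # near-duplicate content
        self._seen.add(key)
        if content:
            self._contents.append(content)
        heapq.heappush(self._heap, (-score, url))
        return True

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

In this sketch a crawler would call `link_score` for every extracted link, `push` the result, and `pop` the most promising URL next. A production crawler would need proper URL normalization (see RFC 3986) and a term weighting such as TF-IDF instead of raw counts.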
Copyright information
© 2015 Springer India
About this chapter
Cite this chapter
Rawat, S. (2015). Focused Crawling: An Approach for URL Queue Optimization Using Link Score. In: Patnaik, S., Li, X., Yang, YM. (eds) Recent Development in Wireless Sensor and Ad-hoc Networks. Signals and Communication Technology. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2129-6_9
DOI: https://doi.org/10.1007/978-81-322-2129-6_9
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2128-9
Online ISBN: 978-81-322-2129-6
eBook Packages: Engineering, Engineering (R0)