Focused Crawling: An Approach for URL Queue Optimization Using Link Score

Rawat, Sunita

doi:10.1007/978-81-322-2129-6_9

Chapter
First Online: 01 January 2014

Part of the book series: Signals and Communication Technology (SCT)

Abstract

The rapid expansion of the World Wide Web poses exceptional scaling challenges for traditional crawlers and search engines. Web crawlers continuously crawl the Web to detect pages that have been added or removed. Because the Web is dynamic and growing, it is difficult to filter out irrelevant pages and to predict which links lead to high-quality pages. A crawler is only a computer program, so it cannot by itself judge how relevant a Web page is. This paper implements an efficient focused crawling method to improve the quality of Web navigation. The score of each unvisited URL is computed from several factors, including the relevance of its anchor text and the similarity between its description in the Google search engine and the given query or topic keywords. The relevance score is calculated using the vector space model (VSM). The URL queue is optimized on the basis of duplicate links and content similarity.
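
The scoring and queueing ideas in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the raw term-frequency weighting, the equal weighting of anchor text and description, the URL normalization, and the names `cosine_similarity`, `link_score`, and `Frontier` are all assumptions made for illustration. Only duplicate-link removal is shown; the content-similarity part of the queue optimization is left out.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the raw term-frequency vectors of two texts."""
    vec_a, vec_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as unrelated to everything
    return dot / (norm_a * norm_b)


def link_score(query, anchor_text, description, anchor_weight=0.5):
    """Hypothetical weighted mix of anchor-text and description relevance."""
    return (anchor_weight * cosine_similarity(query, anchor_text)
            + (1 - anchor_weight) * cosine_similarity(query, description))


class Frontier:
    """Priority queue of unvisited URLs that ignores duplicate links."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        # Crude normalization (an assumption): drop the fragment and any
        # trailing slash so trivially different spellings count as one link.
        key = url.split("#", 1)[0].rstrip("/")
        if key in self._seen:
            return False
        self._seen.add(key)
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(self._heap, (-score, url))
        return True

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score
```

A crawler built along these lines would call `link_score` for every link extracted from a fetched page, `push` the result into the frontier, and `pop` the highest-scoring URL to fetch next.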

Author information

Authors and Affiliations

Department of Computer Engineering, RCPIT, Shirpur, Dhule, India
Sunita Rawat

Corresponding author

Correspondence to Sunita Rawat.

Editor information

Editors and Affiliations

Department of Computer Science and Engin, SOA University, Bhubaneswar, Odisha, India
Srikanta Patnaik
Electronics and Computer Engg Technology, Indiana State University, Indiana, Indiana, USA
Xiaolong Li
School of Electronic Engineering, Kumoh National Institute of Technology, Gumi, Korea, Republic of (South Korea)
Yeon-Mo Yang

About this chapter

Cite this chapter

Rawat, S. (2015). Focused Crawling: An Approach for URL Queue Optimization Using Link Score. In: Patnaik, S., Li, X., Yang, YM. (eds) Recent Development in Wireless Sensor and Ad-hoc Networks. Signals and Communication Technology. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2129-6_9

DOI: https://doi.org/10.1007/978-81-322-2129-6_9
Published: 02 December 2014
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2128-9
Online ISBN: 978-81-322-2129-6
eBook Packages: Engineering, Engineering (R0)
