Focused Crawling: An Approach for URL Queue Optimization Using Link Score

Part of the book series: Signals and Communication Technology (SCT)

Abstract

The rapid expansion of the World Wide Web poses exceptional scaling challenges for traditional crawlers and search engines. Web crawlers continuously crawl the Web to track pages that have been added to or removed from it. Because of the dynamic and growing nature of the Web, it is difficult to deal with irrelevant pages and to predict which links lead to high-quality pages. Since a crawler is just a computer program, it cannot decide by itself how pertinent a Web page is. In this paper, a method of efficient focused crawling is implemented to enhance the quality of Web navigation. We compute the score of each unvisited URL from several factors, such as its description in the Google search engine and the relevancy of its anchor text, and we compute the similarity of that description to the given query or topic keywords. The relevancy score is calculated using the vector space model (VSM). Queue optimization is performed on the basis of duplicate links and content similarity.
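The scoring and queueing ideas in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the function and class names, the equal weights for anchor text and description, and the URL normalization rules are all assumptions made for illustration, and the search-engine description is simply passed in as a string. Only duplicate-link filtering is shown; pruning by content similarity is left out.

```python
import heapq
import math
import re
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def tf_vector(text):
    """Bag-of-words term-frequency vector over lowercased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors (VSM)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_score(query, anchor_text, description, w_anchor=0.5, w_desc=0.5):
    """Weighted sum of anchor-text and description similarity to the query.

    The weights are hypothetical; the source does not state how factors combine.
    """
    q = tf_vector(query)
    return (w_anchor * cosine_similarity(q, tf_vector(anchor_text))
            + w_desc * cosine_similarity(q, tf_vector(description)))


def normalize_url(url):
    """Assumed normalization: lowercase scheme/host, drop fragment and trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class Frontier:
    """URL queue that pops the highest-scoring link and rejects duplicate links."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        key = normalize_url(url)
        if key in self._seen:
            return False  # duplicate link, not queued again
        self._seen.add(key)
        heapq.heappush(self._heap, (-score, key))  # negate: heapq is a min-heap
        return True

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score
```

In use, each link extracted from a fetched page would be scored with `link_score` and pushed onto the `Frontier`; the crawler then pops the best-scoring unvisited URL next. A production crawler would typically use TF-IDF weights rather than raw term frequencies and a fuller URL normalization scheme.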


Corresponding author

Correspondence to Sunita Rawat.

Copyright information

© 2015 Springer India

About this chapter

Cite this chapter

Rawat, S. (2015). Focused Crawling: An Approach for URL Queue Optimization Using Link Score. In: Patnaik, S., Li, X., Yang, YM. (eds) Recent Development in Wireless Sensor and Ad-hoc Networks. Signals and Communication Technology. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2129-6_9

  • DOI: https://doi.org/10.1007/978-81-322-2129-6_9

  • Publisher Name: Springer, New Delhi

  • Print ISBN: 978-81-322-2128-9

  • Online ISBN: 978-81-322-2129-6

  • eBook Packages: Engineering, Engineering (R0)
