Finding Potential Seeds through Rank Aggregation of Web Searches

  • Rajendra Prasath
  • Pinar Öztürk
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6744)

Abstract

This paper presents a potential seed selection algorithm for web crawlers using a gain-share scoring approach. Initially, we consider a set of arbitrarily chosen tourism queries. Each query is submitted to N selected commercial search engines (SEs); the top m search results from each SE are obtained, and each of these m results is manually evaluated and assigned a relevance score. For each of the m results, a gain-share score is computed using its hyperlink structure across the N ranked lists. A gain score is computed for each link present in each of the m results, and a portion of that gain score is propagated to the share score of the corresponding result. The updated share scores of the m results determine the potential set of seed URLs for web crawling. Experimental results on tourism-related web data illustrate the effectiveness of the proposed seed selection algorithm.
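
The abstract does not give the scoring formulas, so the following Python sketch is only one plausible reading of the gain-share idea, not the authors' algorithm. It assumes that a link's gain is the sum of the relevance scores of the results that contain it, and that a result's share is its own relevance (weighted by how many of the N lists it appears in) plus a fraction `alpha` of the gains of its links. The names `select_seeds`, `ranked_lists`, `relevance`, `outlinks`, `alpha`, and `k` are hypothetical and introduced here for illustration.

```python
from collections import defaultdict


def select_seeds(ranked_lists, relevance, outlinks, alpha=0.5, k=2):
    """Return the k highest-scoring (url, share) pairs as candidate seeds.

    ranked_lists: N lists, each holding the top-m result URLs of one engine.
    relevance:    dict mapping a result URL to its manual relevance score.
    outlinks:     dict mapping a result URL to the set of links it contains.
    alpha:        ASSUMED fraction of each link's gain passed to the share.
    """
    # Every distinct result returned by any of the N engines.
    results = {url for ranked in ranked_lists for url in ranked}

    # ASSUMPTION: a link gains the relevance of each result that contains it.
    gain = defaultdict(float)
    for url in results:
        for link in outlinks.get(url, ()):
            gain[link] += relevance.get(url, 0.0)

    # ASSUMPTION: share = relevance * number of lists containing the result,
    # plus alpha times the total gain of the links found in that result.
    share = {}
    for url in results:
        appearances = sum(url in ranked for ranked in ranked_lists)
        link_gain = sum(gain[link] for link in outlinks.get(url, ()))
        share[url] = relevance.get(url, 0.0) * appearances + alpha * link_gain

    # Highest share first; ties are broken alphabetically for determinism.
    ordered = sorted(share.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:k]
```

Under these assumptions, a result that several engines return and whose links are shared with other relevant results rises to the top of the seed list. Other definitions of gain and share would fit the abstract equally well.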

Keywords

Web Crawlers, Seed Selection, Link Data, Relevant Judgment

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Rajendra Prasath (1)
  • Pinar Öztürk (1)
  1. Department of Computer and Information Science (IDI), Norwegian University of Science and Technology (NTNU), Trondheim, Norway
