New Generation Computing, Volume 36, Issue 2, pp 95–118

Efficient Topical Focused Crawling Through Neighborhood Feature

  • Tanaphol Suebchua
  • Bundit Manaskasemsak
  • Arnon Rungsawang
  • Hayato Yamana
Research Paper

Abstract

A focused web crawler is an essential tool for gathering domain-specific data for national web corpora, vertical search engines, and similar applications, since it is more efficient than general breadth-first or depth-first crawlers. The central problem in focused crawling is to prioritize the unvisited web pages in the crawling frontier and then crawl them in order of priority. The most common feature for prioritizing an unvisited web page, adopted in many focused crawling studies, is the relevancy of its source web pages, i.e., the pages that link to it. However, this feature is limited: when an unvisited page has few source pages, its relevancy cannot be estimated correctly. To solve this problem and enhance the efficiency of focused web crawlers, we propose a new feature, called the “neighborhood feature”. It allows additional already-downloaded web pages to be used when estimating the priority of a target web page. These additional pages consist of web pages located in the same directory as the target web page and web pages whose directory paths are similar to that of the target web page. Our experimental results show that the enhanced focused crawlers outperform both the crawlers that do not use the neighborhood feature and state-of-the-art focused crawlers, including the HMM crawler.
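The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so the shared-prefix similarity measure, the similarity-weighted average, the `min_similarity` threshold, the `alpha` mixing weight, and all function names below are assumptions made only for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def directory_of(url):
    """Return (host, list of directory segments) for a URL."""
    parts = urlsplit(url)
    directory = posixpath.dirname(parts.path)
    segments = [s for s in directory.split("/") if s]
    return parts.netloc.lower(), segments


def path_similarity(url_a, url_b):
    """Assumed measure: shared leading directory segments / longer path length."""
    host_a, seg_a = directory_of(url_a)
    host_b, seg_b = directory_of(url_b)
    if host_a != host_b:
        return 0.0
    if not seg_a and not seg_b:
        return 1.0  # both pages sit in the site's root directory
    shared = 0
    for a, b in zip(seg_a, seg_b):
        if a != b:
            break
        shared += 1
    return shared / max(len(seg_a), len(seg_b))


def neighborhood_score(target_url, downloaded, min_similarity=0.5):
    """Similarity-weighted mean relevance of already-downloaded neighbors.

    `downloaded` maps URL -> relevance in [0, 1]. Returns None when no
    downloaded page is similar enough to count as a neighbor.
    """
    weighted, total = 0.0, 0.0
    for url, relevance in downloaded.items():
        sim = path_similarity(target_url, url)
        if sim >= min_similarity:
            weighted += sim * relevance
            total += sim
    return weighted / total if total else None


def priority(target_url, source_relevances, downloaded, alpha=0.5):
    """Mix the source-page feature with the (hypothetical) neighborhood score."""
    source = (sum(source_relevances) / len(source_relevances)
              if source_relevances else None)
    neighbors = neighborhood_score(target_url, downloaded)
    if source is None and neighbors is None:
        return 0.0
    if neighbors is None:
        return source
    if source is None:
        return neighbors
    return alpha * source + (1 - alpha) * neighbors


# Example data (made up): relevance of pages the crawler already downloaded.
downloaded = {
    "http://example.com/recipes/thai/soup.html": 1.0,
    "http://example.com/recipes/italian/pasta.html": 0.5,
    "http://example.com/news/today.html": 0.0,
}
target = "http://example.com/recipes/thai/curry.html"
print(priority(target, [0.5], downloaded))
```

In this sketch a target page with no known source pages still receives a priority from its neighbors, which is the situation the abstract identifies as the weakness of the source-page feature alone.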

Keywords

  • Focused crawler
  • Domain-specific dataset
  • Vertical search engine
  • Web archive

Copyright information

© Ohmsha, Ltd. and Springer Japan KK, part of Springer Nature 2017

Authors and Affiliations

  1. Waseda University, Tokyo, Japan
  2. Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand
