Recombination Operators in Genetic Algorithm-Based Crawler: Study and Experimental Appraisal

  • Huynh Thi Thanh Binh
  • Ha Minh Long
  • Tran Duc Khanh
Conference paper
Part of the Studies in Computational Intelligence book series (SCI, volume 457)


A focused crawler traverses the web and selects pages relevant to a predefined topic. While crawling, it is difficult to identify relevant pages and to predict which links lead to high-quality pages. This paper proposes a topical crawler for Vietnamese web pages that uses a greedy heuristic and genetic algorithms. The genetic-algorithm-based crawler applies different recombination operators to improve crawling performance. We tested our algorithms on websites of the Vietnamese newspaper VnExpress. Experimental results show the efficiency and viability of our approach.
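
The abstract does not say which recombination operators the crawler uses or how a chromosome is encoded. As a minimal sketch of what a recombination operator does, the Python below shows two textbook operators, one-point and uniform crossover, applied to a hypothetical chromosome encoded as a fixed-length list of topic keywords. The encoding, the function names, and the example keywords are assumptions made for illustration and are not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical encoding: a chromosome is a fixed-length list of topic keywords.
Chromosome = List[str]


def one_point_crossover(
    a: Sequence[str], b: Sequence[str], cut: int
) -> Tuple[Chromosome, Chromosome]:
    """Swap the tails of two equal-length parents after position `cut`."""
    if len(a) != len(b):
        raise ValueError("parents must have the same length")
    if not 0 <= cut <= len(a):
        raise ValueError("cut must lie within the chromosome")
    child1 = list(a[:cut]) + list(b[cut:])
    child2 = list(b[:cut]) + list(a[cut:])
    return child1, child2


def uniform_crossover(
    a: Sequence[str],
    b: Sequence[str],
    rng: random.Random,
    swap_prob: float = 0.5,
) -> Tuple[Chromosome, Chromosome]:
    """Swap each gene position independently with probability `swap_prob`."""
    if len(a) != len(b):
        raise ValueError("parents must have the same length")
    child1, child2 = list(a), list(b)
    for i in range(len(a)):
        if rng.random() < swap_prob:
            child1[i], child2[i] = child2[i], child1[i]
    return child1, child2


# Illustrative parents (made-up keywords, not data from the paper).
parent_a = ["football", "league", "goal", "coach"]
parent_b = ["match", "score", "team", "player"]

kids = one_point_crossover(parent_a, parent_b, cut=2)
# kids == (["football", "league", "team", "player"],
#          ["match", "score", "goal", "coach"])

mixed = uniform_crossover(parent_a, parent_b, random.Random(0))
```

Both operators keep the children the same length as the parents and only rearrange genes that already exist in the population; in a crawler setting, a fitness function scoring page relevance would then decide which children survive.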


Keywords: Genetic Algorithms; Focused Crawler; Keyword; Vietnamese Word Segmentation





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Huynh Thi Thanh Binh (1)
  • Ha Minh Long (1)
  • Tran Duc Khanh (1)
  1. School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi, Vietnam
