Towards Intelligent Web Crawling – A Theme Weight and Bayesian Page Rank Based Approach
With the rapid development of the Internet, the web crawler has become one of the key technologies for automatically obtaining information from designated sites. Traditional web crawlers have exposed several problems: low content accuracy, caused by overly simple filtering conditions for crawling themes, and low efficiency, caused by content duplication and long webpage update times. To address these problems, we propose TBPR (Theme weight and Bayesian Page Rank based crawler), which adopts a multi-queue model to achieve high efficiency and reduce content redundancy. TBPR further introduces a theme weight model to accurately classify web pages into the user's crawl concept, and a Bayesian Page Rank model containing two novel factors to increase content accuracy. Our experiments apply TBPR to real-world web content and demonstrate its accuracy and efficiency.
Keywords: Web crawler, Multithread, Theme weight, Bayesian Page Rank
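The abstract names a theme weight model and a multi-queue model without giving their details, so the following is only a minimal sketch of how such components might fit together, not the authors' implementation. The term weights, the `theme_score` function, the `MultiQueueFrontier` class, the two-queue split, and the `threshold` value are all assumptions made for illustration; the Bayesian Page Rank model and its two factors are not specified here and are left out.

```python
from collections import deque

# Hypothetical theme vocabulary: term -> weight (illustrative values only).
THEME_WEIGHTS = {"crawler": 3.0, "pagerank": 2.0, "web": 1.0}


def theme_score(text, weights=THEME_WEIGHTS):
    """Sum the weights of theme terms in the text, divided by the word count."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(weights.get(w, 0.0) for w in words) / len(words)


class MultiQueueFrontier:
    """Assumed design: two FIFO queues (high and low relevance) and a seen-set."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed cut-off between the two queues
        self.high = deque()
        self.low = deque()
        self.seen = set()

    def push(self, url, score):
        """Enqueue a URL once; return False if it was already seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        queue = self.high if score >= self.threshold else self.low
        queue.append(url)
        return True

    def pop(self):
        """Return the next URL, draining the high-relevance queue first."""
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None
```

In this sketch the seen-set stands in for the redundancy reduction the abstract mentions, and serving the high-relevance queue first stands in for theme-driven prioritization; a real crawler would also need fetching, parsing, and thread-safe queue access.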