Harvesting Forum Pages from Seed Sites

  • Luciano Barbosa
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10360)


Web forums are rich sources of conversational content. Many applications, such as opinion mining and question answering, can benefit greatly from mining and exploring this content. A key step towards making it more easily available is to collect the conversational pages on forum sites, the so-called thread pages. In this paper, we propose a two-step crawling solution to the problem of collecting thread pages at large scale. First, since thread pages are located within forum sites, we propose an inter-site crawler that locates forum sites on the Web. To do so, the inter-site crawler focuses on the Web graph neighbourhood of forum sites and explores the content patterns of the links in this region to guide its visitation policy. Next, to collect thread pages within the discovered forum sites, we propose an intra-site crawler that finds thread pages by learning the context of the links that lead to them, and detects them using their content and structural features. Experimental results demonstrate that both the inter-site and the intra-site crawlers are effective and outperform their baselines.
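
The two-step idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy Web graph, all URLs, the keyword sets, and the function names are hypothetical, simple keyword scoring stands in for the learned link and page classifiers, and only out-links are followed.

```python
import heapq
from urllib.parse import urlparse

# Toy in-memory "Web": URL -> (page text, [(anchor text, target URL), ...]).
# All URLs and page contents are invented for illustration.
TOY_WEB = {
    "http://seed.example/": ("welcome to our forum index", [
        ("community forum for cooks", "http://cooks.example/"),
        ("buy shoes", "http://shop.example/"),
    ]),
    "http://cooks.example/": ("forum boards threads members", [
        ("topic: best knife?", "http://cooks.example/t/1"),
        ("about us", "http://cooks.example/about"),
    ]),
    "http://cooks.example/t/1": ("post by ann reply post by bob reply", []),
    "http://cooks.example/about": ("we are a small team", []),
    "http://shop.example/": ("shoes on sale", []),
}

# Hypothetical keyword sets standing in for learned link-context models.
FORUM_WORDS = {"forum", "board", "boards", "community", "threads"}
THREAD_LINK_WORDS = {"topic", "thread", "re"}


def score(text, vocabulary):
    """Fraction of tokens in `text` that belong to `vocabulary`."""
    tokens = text.lower().replace(":", " ").split()
    return sum(t in vocabulary for t in tokens) / max(len(tokens), 1)


def site_of(url):
    return urlparse(url).netloc


def is_forum_site(text):
    # Stand-in for a trained forum-site classifier.
    return score(text, FORUM_WORDS) >= 0.5


def is_thread_page(text):
    # Stand-in for content/structural features: repeated post blocks.
    return text.count("post by") >= 2


def crawl(start_urls, link_vocabulary, allow, budget=10):
    """Best-first crawl: the highest-scoring link is visited first."""
    frontier = [(0.0, url) for url in start_urls]
    heapq.heapify(frontier)
    seen, visited = set(start_urls), []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB.get(url, ("", []))
        visited.append((url, text))
        for anchor, target in links:
            if target not in seen and allow(url, target):
                seen.add(target)
                # heapq is a min-heap, so the score is negated.
                heapq.heappush(frontier, (-score(anchor, link_vocabulary), target))
    return visited


def find_forum_sites(seeds):
    """Step 1 (inter-site): follow cross-site links, keep forum sites."""
    pages = crawl(seeds, FORUM_WORDS, lambda s, t: site_of(s) != site_of(t))
    return [url for url, text in pages if is_forum_site(text)]


def find_thread_pages(forum_url):
    """Step 2 (intra-site): stay inside one site, keep thread pages."""
    pages = crawl([forum_url], THREAD_LINK_WORDS,
                  lambda s, t: site_of(s) == site_of(t))
    return [url for url, text in pages if is_thread_page(text)]


if __name__ == "__main__":
    for forum in find_forum_sites(["http://seed.example/"]):
        print(forum, find_thread_pages(forum))
```

Both steps share one best-first frontier and differ only in how links are scored and which links are allowed, which mirrors the split between locating forum sites and locating thread pages inside them.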


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Universidade Federal de Pernambuco, Recife, Brazil
