1 Introduction

A great variety of social data is available on the Web. Internet forums are a form of social data in which users hold conversations about particular subjects or topics. Forums exist on diverse topics such as movies, agriculture and health. Forums are also very popular: to give some numbers, as of September 2015, a large forum website, ConceptArt.org, had more than 376 thousand users and more than 8 billion posts, while another forum, Gaia Online, had about 26 million users and 2 billion messages (source: The Biggest Boards). This huge amount of diverse human-generated content is very helpful for a variety of applications such as opinion mining [11], question answering [17] and forum search [13].

To take advantage of such rich content, methods to collect and process forum data have been previously proposed [2, 5, 9, 15]. In this paper, we focus on the particular problem of collecting conversational pages of forums, also known as thread pages. Previous approaches in the area of forum crawling have mainly focused on collecting thread pages within forum sites [4, 7, 9, 16]. To avoid visiting unproductive regions of those sites, they learn regular expression patterns of URLs in the navigational paths that lead to thread pages, and use these patterns to guide the crawler’s visitation policy.

Fig. 1. Overview of our two-step strategy to collect thread pages from forums.

In this work, we aim to perform a broader harvest of thread pages on the Web. More specifically, we are interested not only in collecting thread pages within a particular forum site, as previous approaches do [4, 7, 9, 16], but also in gathering these pages from as many sites as possible. To this end, we propose a two-step approach that first finds forum sites on the Web and subsequently collects thread pages within those sites. Figure 1 gives an overview of our solution. For the first step, we propose the Inter-Site Crawler, which focuses on the Web neighbourhood of already known forum sites to discover new forum sites. Since not all links in the neighbourhood of forums lead to relevant information, we apply machine learning techniques to learn the patterns of the relevant links in this neighbourhood. Once forum sites are found, the next step is to collect thread pages within them. This is the role of the Intra-Site Crawler. Starting from the homepage of a given forum site, the Intra-Site Crawler navigates the link structure of the site to locate thread pages. To focus its visitation policy on promising regions of the site, the crawler explores the context of the link neighbourhood of thread pages using classifiers, instead of the regular expressions used by previous approaches [9, 14].

The remainder of the paper is organized as follows. Section 2 describes the Inter-Site Crawler in detail, and Sect. 3 the Intra-Site Crawler. In these sections, we also provide an experimental evaluation of these solutions. Finally, in Sect. 4, we conclude the paper.

2 Inter-Site Crawler

The goal of the Inter-Site Crawler is to discover forum sites on the Web. To avoid visiting unproductive regions of the Web, the Inter-Site Crawler must focus on the regions of the Web where forum sites are located (site discovery) and, to collect a high-quality set of forum sites at low cost, it needs to detect forum sites effectively and efficiently (site detection). In the remainder of this section, we explain these two tasks in detail.

Fig. 2. Word cloud created from a set of entry pages of forums.

2.1 Site Detection

The Site Detector is the component of the crawler responsible for identifying forum sites. Given a website, the Site Detector needs to perform this detection with high quality and low cost, i.e., visiting as few pages of the website as possible. Our strategy for detecting forum sites is based on two observations: (1) sites containing forums usually have an entry page to the forum content, which gives an overview of the current discussions in the forum; and (2) forum entry pages are either the initial page of the site or located at a shallow depth. Based on these observations, the crawler detects forum sites by performing a shallow crawl of the website, looking for a forum entry page if one exists.

A previous approach [9] proposed a heuristic method to identify the URL of the forum entry page given a forum site. We cannot apply this strategy directly to our site detection problem because we do not assume that the input of the algorithm is a forum site. In fact, given any site on the Web, we want to verify whether it is a forum site or not. Thus, to detect entry pages, we build a classifier based on the content of entry pages. Entry pages of forums usually share a common vocabulary. Figure 2 illustrates this with a word cloud built from a set of forum entry pages. Words such as "forum", "post" and "topics", which are not associated with a particular domain, have high frequency in this set. Based on this observation, we built a generic classifier, the Entry-Page Classifier, that uses as features the words in entry pages (positive examples) and in non-entry pages (negative examples) to detect entry pages of forums.
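
To make this idea concrete, the sketch below shows how such a content-based entry-page classifier can be built with a bag-of-words representation and a probabilistic SVM. It is a minimal illustration only: the page texts and labels are toy placeholders, not the data or exact configuration used in our experiments.

    # Minimal sketch of a content-based Entry-Page Classifier (toy data).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    pages = [
        "Forum index - General Discussion, 1204 topics, 15783 posts, last post by alice",
        "Community forums: announcements, new threads, active topics, members online",
        "Board index - view unanswered posts, view active topics, mark forums read",
        "Discussion forum home: subforums, threads, replies, latest post",
        "Our products and services - request a quote or contact our sales team",
        "Latest company news and press releases from this week",
        "About us: our history, our mission and our team",
        "Online store: shopping cart, checkout and shipping information",
    ]
    labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = forum entry page, 0 = non-entry page

    # Bag-of-words features plus a probabilistic SVM, so that a minimum
    # class likelihood can later be used to trade recall for precision.
    entry_page_clf = make_pipeline(
        TfidfVectorizer(lowercase=True, stop_words="english"),
        SVC(kernel="linear", probability=True),
    )
    entry_page_clf.fit(pages, labels)

    new_page = "Board index - 342 topics - post a new thread"
    print(entry_page_clf.predict_proba([new_page])[0][1])  # likelihood of entry page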

The Entry-Page Classifier is used in site detection as follows. Given the homepage of a website, the crawler verifies whether this page is an entry page to a forum using the Entry-Page Classifier. If so, the site is classified as a forum site. Otherwise, the pages linked from the homepage are checked by the classifier. To avoid downloading too many pages, only outlink pages whose links contain words indicative of forums, such as "forum" and "community", are visited. At the end of this process, a site is classified as a forum site if the Entry-Page Classifier considers relevant one of the pages visited during the shallow crawl.
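
The following sketch illustrates this shallow detection crawl, reusing the entry_page_clf sketched above. It is a simplification under several assumptions: the indicative-word list is illustrative, links are extracted with a naive regular expression, raw HTML is fed to the classifier (in practice the visible text of the page would be extracted first), and robots.txt handling and most error handling are omitted.

    # Sketch of the shallow site-detection crawl (simplified).
    import re
    from urllib.parse import urljoin
    import requests

    INDICATIVE = ("forum", "community", "board", "discussion")

    def is_forum_site(homepage_url, entry_page_clf, min_likelihood=0.5):
        html = requests.get(homepage_url, timeout=10).text
        # Check the homepage first, then outlinks whose URL contains an
        # indicative word.
        candidates = [homepage_url]
        for href in re.findall(r'href="([^"]+)"', html):
            url = urljoin(homepage_url, href)
            if any(word in url.lower() for word in INDICATIVE):
                candidates.append(url)
        for url in candidates:
            try:
                text = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            if entry_page_clf.predict_proba([text])[0][1] >= min_likelihood:
                return True  # an entry page was found: classify as forum site
        return False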

Fig. 3. Example of the bipartite graph used by the Inter-Site Crawler to locate forum sites (FS stands for forum site).

Algorithm 1. Site discovery strategy of the Inter-Site Crawler.

2.2 Site Discovery

To the best of our knowledge, no previous work has addressed the problem of locating forum sites on the Web. The main challenge in this task is that forum sites are sparsely distributed on the Web. As a result, a simple crawling strategy that randomly follows outlinks performs poorly at finding forum sites, as the results in Sect. 2.3 suggest. To find forum sites more efficiently, we propose a crawling strategy that focuses its visitation policy on the Web neighbourhood of forum sites. More specifically, the crawler explores the neighbourhood graph defined by the bipartite graph composed of the backlink pages (BPs) of URLs of forum sites and the pages pointed to by BPs (outlink pages), as shown in Fig. 3. The intuition behind this strategy is that a single backlink page might refer to many related pages (forum sites in our context). We call such a link-rich backlink page a hub page. Thus, once the crawler finds a hub page, it is only one step away from multiple forum sites. A previous work [1] used a similar strategy to locate multilingual sites on the Web.

The strategy works as presented in Algorithm 1. Initially, the user provides a set of seed URLs of forum sites. The crawler then retrieves the backlinks of these sites (line 9) using a backlink API available online, adding them to the backlink frontier (line 10). A backlink is selected from the backlink frontier (line 12) and the page it points to (backlink page) is downloaded (line 13). The outlinks of this backlink page are extracted and inserted into the outlink frontier (line 15). Next, a link from the outlink frontier is picked (line 6) and the Site Detector verifies whether it is a forum site or not (line 8). If so, the backlinks of this link are retrieved and added to the backlink frontier, and the process continues as described before. Notice that the crawler does not explore the outlinks of outlink pages, and only explores backlinks of forum sites detected by the Site Detector.
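
The sketch below restates this loop in code. It is a simplification, not a transcription of Algorithm 1: the backlink API, page fetching and the Site Detector are abstracted behind hypothetical callables, both frontiers are plain FIFO queues (whereas the actual crawler ranks them with the classifiers described next), and the line-number comments only indicate roughly where each step appears in the algorithm as described above.

    # Sketch of the bipartite-graph discovery loop (simplified).
    from collections import deque

    def discover_forum_sites(seeds, get_backlinks, fetch_outlinks,
                             site_detector, max_pages=200_000):
        forum_sites = set(seeds)
        backlink_frontier = deque()
        outlink_frontier = deque()
        for site in seeds:
            backlink_frontier.extend(get_backlinks(site))      # roughly lines 9-10
        visited = 0
        while visited < max_pages and (backlink_frontier or outlink_frontier):
            if outlink_frontier:
                url = outlink_frontier.popleft()                # roughly line 6
                visited += 1
                if site_detector(url):                          # roughly line 8
                    forum_sites.add(url)
                    # Only backlinks of detected forum sites are explored.
                    backlink_frontier.extend(get_backlinks(url))
            else:
                backlink = backlink_frontier.popleft()          # roughly line 12
                visited += 1
                # Download the backlink page and enqueue its outlinks;
                # outlinks of outlink pages are never explored.
                outlink_frontier.extend(fetch_outlinks(backlink))  # roughly lines 13-15
        return forum_sites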

Since backlinks in the backlink frontier do not necessarily lead to hub pages, and outlinks in the outlink frontier do not necessarily lead to forum sites, we apply machine learning techniques to rank the links in these frontiers. More specifically, the crawler uses the Backlink Classifier, which predicts the likelihood that a backlink b corresponds to a hub page, given features of b such as the tokens in the URL and in the title of the page. Likewise, the crawler uses the Outlink Classifier to predict the likelihood of a given outlink pointing to a forum site, based on the tokens in the URL, in the anchor and around the link. The two classifiers are built automatically during the crawling process. Initially, the crawler starts with no link prioritization. After a specified number of crawled pages, a learning iteration is performed by collecting the link neighbourhood of the links that point to relevant and non-relevant pages in each frontier. The result of this process is used as training data for the Backlink and Outlink Classifiers.
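
As an illustration of this link ranking, the sketch below builds an Outlink Classifier from the context of links (URL tokens, anchor text and surrounding words) and uses it to order a frontier. The feature extraction, the toy training examples and the choice of logistic regression as the learner are all illustrative assumptions; the Backlink Classifier would be built analogously from URL and page-title tokens.

    # Sketch of link prioritisation with an Outlink Classifier (toy data).
    import re
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def link_context(url, anchor, surrounding_text):
        # Tokens from the URL, the anchor and the words around the link
        # form a single bag-of-words "context" string.
        url_tokens = re.split(r"[/\.\-_?=&:]+", url.lower())
        return " ".join(url_tokens) + " " + anchor.lower() + " " + surrounding_text.lower()

    # Contexts of outlinks that did / did not lead to forum sites,
    # gathered in a previous learning iteration (toy examples).
    contexts = [
        link_context("http://example.org/forums/index.php", "visit our forum",
                     "join the discussion in our community board"),
        link_context("http://example.org/about/contact", "contact us",
                     "send us an email for more information"),
    ]
    labels = [1, 0]

    outlink_clf = make_pipeline(CountVectorizer(), LogisticRegression())
    outlink_clf.fit(contexts, labels)

    # Rank frontier links by the predicted likelihood of pointing to a forum site.
    frontier = [
        ("http://foo.com/community/", "community forum", "latest topics and posts"),
        ("http://foo.com/shop/", "online shop", "buy our products"),
    ]
    ranked = sorted(frontier,
                    key=lambda l: outlink_clf.predict_proba([link_context(*l)])[0][1],
                    reverse=True)
    print([url for url, _, _ in ranked])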

2.3 Experimental Evaluation

In this section, we evaluate the two tasks performed by the Inter-Site Crawler (site detection and site discovery) presented previously.

Site Detection. To measure the quality of the Site Detector, we manually labeled 380 Web pages (182 positive and 200 negative) from a variety of sites on the Web. Positive examples are entry pages to forums in these sites. We used two thirds of the data for training and one third for testing.

Table 1 presents the precision, recall, F1 and accuracy values for three different machine learning algorithms. The classifiers were created using the Weka package [8] with default parameter values. The numbers show that the Support Vector Machine (SVM) obtained the best results (precision = 0.855, recall = 0.791, F1 = 0.822, accuracy = 82.8%). We used the probabilistic SVM [12], since we are interested in the class likelihood of the instances. A threshold over the class likelihood can be used, for instance, as a filter to improve the precision of the Site Detector. This filter is very useful in this setting, in which the proportion of negative examples is much higher than that of positive ones. For this purpose, we varied the minimum likelihood for a page to be considered relevant from 0.5 to 0.9, as presented in Table 2. For each value, we measured precision, recall and F1. As expected, the minimum likelihood is directly proportional to the precision and inversely proportional to the recall. For instance, when the minimum likelihood is 0.9, i.e., only pages with likelihood higher than 0.9 are considered relevant by the detector, the precision is 0.96.
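
The effect of this threshold can be reproduced with a few lines of code. The sketch below uses made-up probabilities and labels purely to show the mechanics of sweeping the minimum likelihood; in our experiments the probabilities come from the probabilistic SVM and the metrics are those reported in Table 2.

    # Sketch of sweeping the minimum-likelihood threshold (toy predictions).
    from sklearn.metrics import precision_score, recall_score

    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_prob = [0.95, 0.80, 0.55, 0.92, 0.40, 0.65, 0.10, 0.51]

    for threshold in (0.5, 0.6, 0.7, 0.8, 0.9):
        y_pred = [1 if p >= threshold else 0 for p in y_prob]
        p = precision_score(y_true, y_pred, zero_division=0)
        r = recall_score(y_true, y_pred)
        print(f"min likelihood {threshold}: precision={p:.2f} recall={r:.2f}")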

Table 1. Results from different machine learning algorithms.
Table 2. Results of varying the minimum likelihood of a page being considered relevant by the entry page classifier.

Site Discovery. To evaluate our strategy to locate forum sites, we executed the following crawling configurations:

  • Forward Crawler (Forward): the forward crawler randomly follows forward links. Only out-of-site links are considered, i.e., links to internal pages of the sites are excluded from the crawl;

  • Bipartite-Graph Crawler (Bipartite): our strategy focusing on the bipartite graph composed of backlink and outlink pages, without any prioritization of links in the graph;

  • Classifier-Based Bipartite-Graph Crawler (Bipartite+ Classifiers): our strategy using classifiers to prioritize links in the bipartite graph.

Each configuration visited 200,000 pages. Only 2 forum sites were provided as seeds to start the crawl. The performance of the crawling strategies was measured by the total number of forum sites collected. The minimum likelihood used by the Site Detector to consider a site relevant was 0.9, since we are interested in obtaining a high-quality collection of forum sites.

Figure 4 presents the number of collected forum sites versus the number of visited pages for each approach. At the end of the crawling processes, the Bipartite + Classifiers approach collected the highest number of forum sites (17,429 sites). That is, from only 2 seed URLs, our best strategy discovered more than 17 K forum sites. The Bipartite crawler located considerably fewer sites, followed by the Forward crawler (427 sites). These results show that (1) the bipartite-graph strategy is in fact effective: the Bipartite crawler found 8 times more sites than the baseline (the Forward crawler); and (2) the classifiers (Backlink and Outlink Classifiers) used to prioritize the links in the bipartite graph greatly improved the performance of the bipartite crawler: the Bipartite + Classifiers crawler discovered 5 times more forum sites than the Bipartite crawler.

Fig. 4. Performance of the 3 strategies of inter-site crawling.

3 Intra-Site Crawler

The pages in forum sites that contain users' conversations are called thread pages. The goal of the Intra-Site Crawler is to locate thread pages in a given forum site. To achieve that, it needs to locate and detect thread pages in forum sites. For the first task, it is important that the crawler avoid unproductive regions of the sites that might not lead to thread pages. For the second, the crawler needs to perform a high-quality detection; otherwise the repository of thread pages collected by the crawler might have poor quality. We give further details about these tasks in the remainder of this section.

Fig. 5. Example of a context graph for thread pages (adapted from [6]).

3.1 Locating Thread Pages

Thread pages are only a subset of the pages in forum sites. In addition to them, forum sites might contain pages related to documentation, news, etc. The Intra-Site Crawler therefore needs to focus on the regions of the website where thread pages are located, in order to collect the maximum number of thread pages while visiting as few pages as possible. For that, the crawler explores the patterns of links inside forum websites using context graphs [6]. Figure 5 presents an example of a context graph with two levels. The thread page is located at the center of the context graph. The main assumption of context graphs is that links at the same distance from the thread page in this graph have similar contexts. For instance, in the site ubuntuforums.org, URLs that point to thread pages (one step away) have the string "showthread" in common, whereas URLs two steps away have the string "forumdisplay" in common. Context graphs have been used to locate pages in other contexts, such as pages on a given topic [6] or pages containing Web forms [3]. The Link Classifier is the component of the Intra-Site Crawler that leverages context graphs to locate thread pages. More specifically, the Link Classifier estimates the distance (number of edges) of a given link to a thread page based on its context. The context is composed of the tokens in the URL, the anchor and the words around the anchor, which are the features used by the classifier.

The Link Classifier is automatically built as the crawl progresses. Initially, the crawler starts with no link prioritization. After a specified number of crawled pages, the visited links that led to the thread pages collected so far, which compose the context graph, are used as training data to build the Link Classifier.
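
The sketch below illustrates such a Link Classifier: a multiclass learner that estimates the distance of a link to a thread page from the link's context. The training contexts are toy examples loosely inspired by the ubuntuforums.org patterns mentioned above, and the choice of a naive Bayes learner is an illustrative assumption, not the configuration used in our experiments.

    # Sketch of the Link Classifier estimating distance to a thread page.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def context(url, anchor, around):
        # Tokens from the URL, the anchor and the words around the anchor.
        return " ".join([url.replace("/", " ").replace("?", " "),
                         anchor, around]).lower()

    train_contexts = [
        context("showthread.php?t=123", "Wireless stopped working", "replies views last post"),
        context("showthread.php?t=456", "Boot problem after update", "started by views"),
        context("forumdisplay.php?f=33", "Networking and Wireless", "sub-forums threads posts"),
        context("forumdisplay.php?f=12", "Installation and Upgrades", "threads posts last post"),
    ]
    distances = [1, 1, 2, 2]  # 1 = points to a thread page, 2 = two steps away

    link_clf = make_pipeline(CountVectorizer(), MultinomialNB())
    link_clf.fit(train_contexts, distances)

    # Links predicted to be closer to thread pages are crawled first.
    new_link = context("showthread.php?t=789", "Screen flickering", "replies views")
    print(link_clf.predict([new_link])[0])  # expected: 1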

Fig. 6. Examples of an index page and a thread page.

3.2 Detecting Thread Pages

In order to collect a high-quality set of thread pages from a given site, the Intra-Site Crawler needs to perform an effective thread-page detection. Thread pages are composed of user posts (see Fig. 6(a)). Posts usually appear within records, which contain, in addition to the posts, meta-information such as the user who posted the message, the date of posting, etc. In a previous work [2], we implemented a method to extract records from forum pages. Records also appear in other types of pages in forum sites, such as the ones that point to thread pages (a.k.a. index pages). Figure 6(b) presents an example of an index page.

Based on these observations, our thread-page detection relies on record extraction to obtain features for classification. Only pages with extracted records are considered for classification. Thus, given a forum page, the first step of the detection is to extract records from the page. If records are extracted, the thread classification is applied over the records to verify whether the page is a thread page or not. Records of index pages have layouts that differ from those of records of thread pages, as can be seen in Fig. 6. For instance, records in thread pages usually have longer texts than records in index pages; and records in index pages contain internal links to thread pages, which is not always the case for records of thread pages. Based on that, and similarly to [9], we employ the following set of features based on the layout of the records (a feature-extraction sketch follows the list):

  • Average amount of date and user information across all records in the page. To detect date and user information in a record, we use detectors based on regular expressions [2];

  • Average size, variance and noise-to-signal ratio (standard deviation/average) of the texts in the records;

  • Average number of images, internal links and external links.
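
The sketch below shows one way these layout features could be computed from extracted records. The record representation and the date/user regular expressions are simplified stand-ins for the record extractor and detectors of [2], and the code assumes a non-empty list of records.

    # Sketch of turning extracted records into layout features (simplified).
    import re
    from statistics import mean, pstdev, pvariance

    DATE_RE = re.compile(r"\b\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b")          # e.g. 12/03/2015
    USER_RE = re.compile(r"\b(posted by|joined:|member since)\b", re.I)  # crude user cue

    def page_features(records, site_host):
        # Each record is assumed to be a dict:
        # {"text": str, "links": [str, ...], "images": int}
        texts = [r["text"] for r in records]
        sizes = [len(t) for t in texts]
        avg_size = mean(sizes)
        internal = [sum(site_host in l for l in r["links"]) for r in records]
        external = [sum(site_host not in l for l in r["links"]) for r in records]
        return {
            "avg_dates": mean(len(DATE_RE.findall(t)) for t in texts),
            "avg_users": mean(len(USER_RE.findall(t)) for t in texts),
            "avg_text_size": avg_size,
            "text_size_variance": pvariance(sizes),
            "noise_to_signal": pstdev(sizes) / avg_size if avg_size else 0.0,
            "avg_images": mean(r["images"] for r in records),
            "avg_internal_links": mean(internal),
            "avg_external_links": mean(external),
        }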

Fig. 7. Performance of the 3 strategies of intra-site crawling.

3.3 Experimental Evaluation

In this section, we evaluate the thread-page detection and discovery presented previously.

Thread-Page Detection. To build the data set, we labeled 1,280 instances (50% positive and 50% negative) and split them into 722 for training and 558 for testing. We tried different machine learning techniques available in the Weka package [8] with their default values. The multi-layer perceptron obtained the best results: accuracy = 92.2%, precision = 0.91, recall = 0.87 and F-measure = 0.89.

Thread-Page Discovery. For the problem of thread-page discovery, we implemented three different link prioritization strategies:

  • Baseline: the baseline randomly follows the internal outlinks of the pages;

  • REGEX-based: the regex-based strategy follows links using regular expressions learned from URLs of thread pages, as proposed by [10] and used by Jiang et al. [9] in their forum crawler;

  • Classifier-based: the classifier-based approach is the one proposed in this paper for the Intra-Site Crawler, which builds a Link Classifier learned from the context graph, as described previously. Probabilistic SVM [12] was the learning algorithm used by the Link Classifier, created with the default SVM parameter values provided in Weka [8].

The three approaches collected thread pages within 586 forum sites, and each one visited a total of 100,000 pages of these sites. We evaluated the approaches based on the number of thread pages collected by each approach, as identified by the thread-page detector. The minimum likelihood used by the thread-page detector to consider a page a thread page was 0.8. Both the REGEX-based and the Classifier-based crawlers start with the same link prioritization as the baseline. The learning process that creates the regular expressions, for the REGEX-based crawler, and the Link Classifier, for the Classifier-based crawler, was performed every 20,000 visited pages.

Figure 7 presents the results for the three strategies. As expected, up to 20,000 visited pages all approaches behaved similarly. After that point, the approaches showed different performance, since the Classifier-based and REGEX-based approaches started running their learning processes. At the end of the crawling process, our approach outperformed the other two: the Classifier-based crawler collected 13,063 thread pages, whereas the REGEX-based crawler collected 10,483 and the Baseline 7,486. A possible reason why our strategy obtained better results than the REGEX-based one is that the REGEX-based crawler learns patterns only from the URLs of relevant pages, whereas our strategy uses machine learning to leverage patterns not only from URLs but also from the anchors of the links and the context around them.

4 Conclusion

We presented in this paper a two-step crawling approach to collect conversational pages on the Web. First, the Inter-Site Crawler discovers forum sites from seed sites. For that, it restricts its crawl to the link neighbourhood composed of the backlink pages of forum sites and the outlink pages of those backlink pages. To prioritize links in this graph, we apply machine learning techniques. The URLs of the discovered forum sites are then provided to the Intra-Site Crawler, which collects the conversational pages within these sites. To locate those pages more efficiently, it uses a link classifier that explores the context of the links around conversational pages. Our experimental evaluation shows that our crawling approaches harvested high-quality collections and are more efficient than the baselines. As future work, we plan to apply this strategy to specific domains, instead of using it in a generic manner as presented in this paper.