1 Introduction

The Internet has become one of the main channels for accessing news, and news websites therefore produce a great amount of news content for users’ daily demands. With the fast development of front-end techniques, programmers can use Cascading Style Sheets (CSS) and JavaScript to develop increasingly complicated web pages, so extracting the main content from such highly heterogeneous web pages becomes increasingly challenging. In addition to the news content, a news web page commonly contains many irrelevant texts known as noises, such as advertisements, navigation panels, and comments. Studies and applications such as news topic detection and tracking require news content that has been processed and stored. Extracting the news content automatically is therefore important for massive news information management, retrieval, analysis, and integration.

Some online content extraction methods provide theoretical support for this paper, such as CEPR [1], CETR [2], and CETD [3]. These methods are efficient and concise, and require neither training nor pre-processing. The non-content features (e.g. tag path, text layout, DOM structure, hyperlink) used by these methods are easy to obtain from an HTML page. However, the online extraction methods mentioned above do not pay much attention to the topics of the news content and rely only on non-content features, so they have difficulty recognizing noise whose external characteristics are similar to those of the news content.

In this paper, we present Content Extraction based on DS Theory (CEDST), an efficient and accurate news content extraction method. CEDST combines the topic feature and non-content features to improve the extraction performance. The contributions of this method are as follows: (1) The recognition accuracy of news content extraction is improved by introducing the topic feature. (2) Maximal compatibility blocks are used to generate topics from text nodes without linguistic analysis, therefore our method can easily be applied to websites in different languages by replacing only the word segmentation method. (3) DS theory has the ability to represent and quantify uncertainties, which allows the features to be combined in a principled way.

2 Related Work

Web pages are very heterogeneous and there are no rigid guidelines on how to build HTML pages and how to declare their implicit structure [4]. HTML tags without semantic information bring many difficulties to content extraction. A precise and fully automated extraction method should therefore be restricted to a specific domain, such as news extraction [1, 4], data record extraction [5, 6], e-commerce information extraction [7, 8], or title extraction [9]. In this paper, we aim at extracting entire news articles from HTML pages automatically and efficiently.

Content extraction for HTML pages has been researched for more than a decade. Most of the early studies on content extraction are rule-based methods: users write extraction rules in a specific language on an extraction system which assists in generating wrappers quickly, such as TSIMMIS [10] and W4F [11]. In order to reduce the manual steps, semi-automatic methods [12, 13] have been developed. Semi-automatic methods need users to identify regions of interest on web pages, and then use inductive or heuristic algorithms to generate extraction rules. Although rule-based methods extract content accurately, they still require substantial manual work from users.

Numerous methods for automatic content extraction have been presented, but each method has its drawbacks. The template-based methods [4, 6] assume that web pages in the same cluster share the same template. These methods can filter out noises from web pages automatically, but any change to a website may invalidate the templates, which then need to be trained again. Deng et al. [14] proposed a classical vision-based method named VIPS, which segments a web page into visually grouped blocks based on the DOM tree. Song et al. [15] proposed a method to rank blocks based on VIPS. The biggest disadvantage of the vision-based methods is the high consumption of computing resources and time, because these methods need to render the HTML page and retrieve all the CSS files related to the page. Gali et al. [9] proposed a linguistics-based method which can extract titles from web pages, but this method cannot be put into general use, because it requires manual effort and domain knowledge to build the part-of-speech (POS) pattern trees. CEPR [1], CETR [2], and CETD [3] are online Web content extraction methods without training or preprocessing; inspired by these methods, we propose a novel method named CEDST for news content extraction.

Fig. 1. Process of extracting the news content in the red solid frame. (Color figure online)

3 News Content Extraction Method

An HTML page can be parsed into a Document Object Model (DOM) tree. We define a leaf node that contains text as a text node. The process of our news content extraction method is shown in Fig. 1 and consists of the following steps: (1) Parsing the HTML page into a DOM tree: computing the values of the non-content features for each text node of the DOM tree. (2) Text node traversal: traversing the text nodes of the DOM tree in a pre-order manner. (3) Topic generation: using maximal compatibility blocks to generate topics from text nodes, and then computing the topic feature of each text node. (4) Feature fusion by DS theory: combining the non-content features and the topic feature by DS theory to obtain the fusion value of each text node. (5) Content extraction: after smoothing the fusion values, we compute a threshold with the Otsu algorithm to distinguish the news content from the noises.
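As a minimal sketch of steps (1) and (2), the following snippet parses a page and collects text nodes in pre-order; it assumes the lxml library (the paper does not prescribe a particular parser), and the helper name collect_text_nodes is illustrative:

```python
# Sketch of steps (1)-(2): parse an HTML page into a DOM tree and collect
# leaf elements that carry visible text, in pre-order (document order).
from lxml import html

SKIP_TAGS = {"script", "style", "noscript"}

def collect_text_nodes(page_source):
    """Return leaf elements with non-empty text, in pre-order."""
    root = html.fromstring(page_source)
    text_nodes = []
    for element in root.iter():          # lxml iterates the tree in pre-order
        if not isinstance(element.tag, str) or element.tag in SKIP_TAGS:
            continue                      # skip comments, scripts, styles
        if len(element) == 0:             # leaf node
            text = (element.text or "").strip()
            if text:
                text_nodes.append(element)
    return text_nodes
```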

3.1 Features Fusion by DS Theory

Dempster-Shafer (DS) theory [16] is a framework for reasoning under uncertainty; its ability to represent and quantify uncertainty is its key advantage. We use DS theory to combine the features of a text node and calculate the probability that the text node belongs to the news content. Features are converted into pieces of evidence by basic mass assignment (BMA) [17]. In the news extraction domain, the frame of discernment for a text node is \(\varTheta =\left\{ news,\sim news\right\} \). The BMA function is m: \(2^{\varTheta }\rightarrow [0,1]\), where \(2^{\varTheta }\) is the power set \(\{\emptyset ,\{news\},\{\sim news\},\varTheta \}\). The BMA function should satisfy two conditions: \(m(\emptyset )=0\) and \(\sum \nolimits _{U \subseteq 2^{\varTheta }}^{} m(U)=1\). We use the labels “news” and “\(\sim news\)” to denote the positive and negative status respectively. If the feature \(f_{i}\) is positive, i.e. it supports the text node belonging to the news content, the feature can only assign probability to the label “news”. The BMA formula is as follows:

$$\begin{aligned} {\left. \begin{array}{ccc} m_{f_{i}}(\{news\})=\alpha _{f_{i}}\times h_{f_{i}} \\ m_{f_{i}}(\{{\sim } news\})=0\\ m_{f_{i}}(\varTheta )=1-m_{f_{i}}(\{news\}) \end{array} \right\} } \end{aligned}$$
(1)

Otherwise, the feature is negative, which can only assign the probability to the label “\(\sim news\)”. The BMA formula is as follows:

$$\begin{aligned} { \left. \begin{array}{ccc} m_{f_{i}}(\{news\})=0 \\ m_{f_{i}}(\{{\sim } news\})=\beta _{f_{i}}\times h_{f_{i}}\\ m_{f_{i}}(\varTheta )=1-m_{f_{i}}(\{\sim news\}) \end{array} \right\} } \end{aligned}$$
(2)

\(h_{f_{i}}\) is the feature value normalized between 0 and 1. \(\alpha _{f_{i}}\) and \(\beta _{f_{i}}\) are feature weights assigned between 0 and 1; we set them close to 1 in order to prevent the normalization constant K of Eq. (5) from becoming zero. All the pieces of evidence obtained from BMA can be combined by using Eq. (3). Given the BMA functions \(m_{f_{1}}\), \(m_{f_{2}} \ldots m_{f_{n}}\) derived from the features of a text node, the fusion function, denoted as \(\oplus \) in Fig. 1, is as follows:

$$\begin{aligned} (m_{f_{1}} \oplus m_{f_{2}})(A)=\frac{1}{K}\sum \nolimits _{B \cap C = A}^{} m_{f_{1}}(B)\times m_{f_{2}}(C),\quad A \ne \emptyset \end{aligned}$$
(3)
$$\begin{aligned} (m_{f_{1}} \oplus m_{f_{2}})(\emptyset )=0 \end{aligned}$$
(4)
$$\begin{aligned} K=\sum \nolimits _{B \cap C \ne \emptyset }^{} m_{f_{1}}(B)\times m_{f_{2}}(C) \end{aligned}$$
(5)

\((m_{f_{1}} \oplus m_{f_{2}} \cdots \oplus m_{f_{n}})(\{news\})\) is the fusion value which captures the probability that a text node belongs to news content. All the features \(f_{i}\) need to be converted into pieces of evidence which assign the probability to \(\{\{news\},\{\sim news\},\varTheta \}\) in the BMA formula \(m_{f_{i}}\). Note that the positive and negative features assign the BMA in different manners.
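To make the fusion concrete, the following is a minimal sketch of the BMA construction and Dempster's combination rule for the two-hypothesis frame \(\{news,\sim news\}\); the weight value and the feature values in the example are illustrative, not taken from the paper:

```python
# BMA of Eqs. (1)-(2) and pairwise combination of Eqs. (3)-(5) over the
# focal sets "news", "not_news" and "theta" (the whole frame).
def bma(feature_value, positive=True, weight=0.99):
    """Convert a normalized feature value in [0, 1] into a BMA."""
    mass = weight * feature_value
    if positive:
        return {"news": mass, "not_news": 0.0, "theta": 1.0 - mass}
    return {"news": 0.0, "not_news": mass, "theta": 1.0 - mass}

def combine(m1, m2):
    """Dempster's rule for two BMAs on {news, ~news}."""
    # Conflict: mass assigned to contradictory focal sets.
    conflict = m1["news"] * m2["not_news"] + m1["not_news"] * m2["news"]
    k = 1.0 - conflict                      # normalization constant K
    news = (m1["news"] * m2["news"]
            + m1["news"] * m2["theta"]
            + m1["theta"] * m2["news"]) / k
    not_news = (m1["not_news"] * m2["not_news"]
                + m1["not_news"] * m2["theta"]
                + m1["theta"] * m2["not_news"]) / k
    theta = (m1["theta"] * m2["theta"]) / k
    return {"news": news, "not_news": not_news, "theta": theta}

# Example: fuse a positive topic feature with a negative hyperlink feature.
fused = combine(bma(0.8, positive=True), bma(0.3, positive=False))
print(fused["news"])   # probability mass supporting "news"
```

Because the rule is associative, the n features of a text node can be folded into a single fusion value by applying combine repeatedly.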

3.2 Topic Generation

Figure 2 shows the process of topic generation. Firstly, each text node is transformed into a set of keywords through word segmentation, stop-word removal, and keyword generation by TextRank [18]. Maximal compatibility blocks are then utilized to generate topics from the keyword sets. A compatibility relation is a binary relation that satisfies reflexivity and symmetry, so it can be denoted as a lower triangular matrix as in Fig. 3. The compatibility relation \(\varvec{R}\) on keywords is a pair-wise relation between any two words which co-occur more than \(\gamma \) times in the same text nodes. We set \(\gamma =2\), which will be discussed in Sect. 4.3.

Fig. 2. Process of topic generation

Definition 1

(maximal compatibility block). Let \(\varvec{U}\) be the set of all keywords. If \(\varvec{W}\subseteq \varvec{U}\) and any \(x,y \in \varvec{W}\) satisfy the relation \(x\varvec{R}y\), then \(\varvec{W}\) is a compatibility block. If there does not exist \(k \in \varvec{U}-\varvec{W}\) that can be added to the compatibility block \(\varvec{W}\), then \(\varvec{W}\) is a maximal compatibility block.

Fig. 3. Process of solving maximal compatibility blocks

The process of solving maximal compatibility blocks is shown in Fig. 3. Each node represents a keyword and the line between two nodes represents the compatibility relation \(\varvec{R}\). The numbers assigned to the lower triangular matrix are the co-occurrence frequencies of any two words which co-occur more than \(\gamma \) times in the same text nodes. Topics are the maximal compatibility blocks generated from left to right on the lower triangular matrix. As shown in the dotted box below the matrix in Fig. 3, node n has the relation \(\varvec{R}\) with nodes x, y, z, k, and nodes x, y, z belong to an existing compatibility block, so node n and x, y, z can be combined into a bigger compatibility block. After scanning all the nodes on the triangular matrix, all the compatibility blocks, which appear as complete subgraphs (cliques) in the figure, can be found. After removing the compatibility blocks that are covered by a bigger compatibility block, the rest are the maximal compatibility blocks. The time complexity of topic generation is \(O(N*E*R)\), where N, E and R are the numbers of keywords, existing compatibility blocks and related text nodes respectively. The compatibility relation matrix in Fig. 3 is a sparse matrix, so the maximal compatibility blocks can be generated quickly.
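The sketch below illustrates this block-growing scan under some assumptions: the co-occurrence counts are stored in a map keyed by keyword pairs, and the function names are hypothetical. It follows the greedy procedure of Fig. 3 rather than a general maximal-clique algorithm:

```python
# Greedy generation of maximal compatibility blocks from pairwise
# co-occurrence counts. `relation` maps frozenset({x, y}) -> count;
# pairs with count > gamma are compatible.
def maximal_compatibility_blocks(keywords, relation, gamma=2):
    def compatible(a, b):
        return relation.get(frozenset((a, b)), 0) > gamma

    blocks = []                                   # list of keyword sets (cliques)
    for word in keywords:                         # scan nodes left to right
        grown = []
        for block in blocks:
            linked = {w for w in block if compatible(word, w)}
            if linked == block:                   # word is related to the whole block
                grown.append(block | {word})
            elif linked:                          # word only extends part of the block
                grown.append(linked | {word})
        blocks.extend(grown)
        blocks.append({word})
    # A block strictly covered by a bigger block is not maximal.
    maximal = []
    for b in blocks:
        if not any(b < other for other in blocks) and b not in maximal:
            maximal.append(b)
    return maximal
```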

Topics may belong to the news content, advertisements, recommendations, etc. A text node whose topics have higher weights is more likely to belong to the news content. The topic weight formula is as follows:

$$\begin{aligned} tw(topic)= \sum \nolimits _{n \in \{relate(x,y)|x,y \in ~topic\}}^{} n \end{aligned}$$
(6)

where relate(x, y) is the co-occurrence frequency of words x and y recorded on the lower triangular matrix in Fig. 3. The value of the topic feature is as follows:

$$\begin{aligned} h_{topic}(text)=\frac{max(\{tw(m)|m \in ~topic(text)\})}{max(\{tw(n)|n \in ~topics\})} \end{aligned}$$
(7)

where topic(text) is the set of topics generated from the text, and topics is the set of all the topics generated from the web page. The greater the \(h_{topic}\), the more likely the text belongs to the news content. The feature value \(h_{topic}\) is positive, so it is put into Eq. (1) to obtain the mass function \(m_{topic}\).
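For illustration, a small sketch of Eqs. (6) and (7), assuming the same co-occurrence map and topic sets produced above; the helper names are illustrative:

```python
# Topic weight (Eq. (6)) and topic feature (Eq. (7)).
from itertools import combinations

def topic_weight(topic, relation):
    """Sum of co-occurrence counts over all word pairs in the topic."""
    return sum(relation.get(frozenset(pair), 0)
               for pair in combinations(sorted(topic), 2))

def topic_feature(text_topics, page_topics, relation):
    """Best topic weight of the text, normalized by the page maximum."""
    if not text_topics or not page_topics:
        return 0.0
    best_text = max(topic_weight(t, relation) for t in text_topics)
    best_page = max(topic_weight(t, relation) for t in page_topics)
    return best_text / best_page if best_page else 0.0
```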

3.3 Non-content Features

Non-content features focus on the external characteristics of text nodes. For example, the news content commonly contains long texts with a simple format. After observing a large number of news websites and considering the features mentioned in [1] and [3], we design the following features to evaluate the probability of a text node belonging to the news content.

Text Cluster Feature. In general, the news content consists of continuous text nodes which are commonly appended to a few parent nodes. The more long-text nodes are appended to the same parent node, the more likely these text nodes belong to the news content. The feature value is as follows:

$$\begin{aligned} h_{cluster}(text)=\frac{\sum \nolimits _{t \in sb(text)}^{} size(t)}{max(\{\sum \nolimits _{w \in sb(n)}^{}size(w)|n \in TextNodes\})} \end{aligned}$$
(8)

where TextNodes denotes the set of all the text nodes in a web page, and sb(text) is the set containing text and its sibling text nodes. The function \(size(\cdot )\) counts the total number of words in a text node. The text cluster feature is positive, so the feature value \(h_{cluster}\) is put into Eq. (1) to obtain \(m_{cluster}\).

Text Variance Feature. Paragraphs of the news content are commonly written with various lengths, whereas noise generally consists of brief and uniform sentences. If the variance of the text lengths of a text node and its sibling text nodes is high, the text node is more likely to belong to the news content. The feature value is as follows:

$$\begin{aligned} h_{var}(text)=\frac{var(sb(text))}{max(\{var(sb(n))|n \in TextNodes\})} \end{aligned}$$
(9)

where var(sb(text)) is the variance of the text lengths of the text and its siblings. The variance feature \(h_{var}\) is positive, so it is put into Eq. (1) to obtain \(m_{var}\).

Hyperlink Feature. Noises such as advertisements and navigation panels commonly have a high ratio of words nested in hyperlink anchors, because these noises aim at leading users to other web pages. The feature value is as follows:

$$\begin{aligned} h_{href}(text)=\frac{size(href(text))}{size(text)} \end{aligned}$$
(10)

where href(text) is the set of words nested in hyperlink anchors of the text. The more words of a text are nested in hyperlink anchors, the less likely the text belongs to the news content. \(h_{href}\) is negative, so it should be put into Eq. (2) to obtain \(m_{href}\).

After transforming all the feature values into evidence by using BMA, these features can be combined by using the DS fusion rule (Eqs. (3)-(5)). The fusion value is denoted as \((m_{topic}\oplus m_{cluster}\oplus m_{var}\oplus m_{href})(\{news\})\), which is the probability of the text node belonging to the news content.
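The following is one possible sketch of Eqs. (8)-(10) on lxml elements; the helper names and the use of text_content()/getparent() reflect an assumed implementation rather than the authors' code:

```python
# Non-content features for one text node, given the list `text_nodes`
# collected earlier (lxml leaf elements).
from statistics import pvariance

def size(element):
    """Number of words in a text node."""
    return len(element.text_content().split())

def sibling_text_nodes(element, text_nodes):
    parent = element.getparent()
    return [n for n in text_nodes if n.getparent() is parent]

def cluster_feature(element, text_nodes):          # Eq. (8), positive
    own = sum(size(t) for t in sibling_text_nodes(element, text_nodes))
    best = max(sum(size(t) for t in sibling_text_nodes(n, text_nodes))
               for n in text_nodes)
    return own / best if best else 0.0

def variance_feature(element, text_nodes):         # Eq. (9), positive
    def var(nodes):
        lengths = [size(t) for t in nodes]
        return pvariance(lengths) if lengths else 0.0
    own = var(sibling_text_nodes(element, text_nodes))
    best = max(var(sibling_text_nodes(n, text_nodes)) for n in text_nodes)
    return own / best if best else 0.0

def hyperlink_feature(element):                     # Eq. (10), negative
    linked = sum(len(a.text_content().split()) for a in element.iter("a"))
    total = size(element)
    return linked / total if total else 0.0
```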

3.4 Content Extraction

Gaussian Smoothing. After combining all the features by DS theory, we obtain the fusion values of all the text nodes on a web page. These preliminary fusion values might miss some special texts of the news content, such as short texts that do not carry news topics. Although these short texts belong to the news content, their fusion values are low, so they may be lost without smoothing. We use the one-dimensional Gaussian smoothing mentioned in [1, 2] to solve this problem. Figure 4 shows the fusion values of a news web page of CNN, where the histograms present the fusion values of the text nodes before and after smoothing. The numbers below the abscissa axis represent text nodes traversed in a pre-order manner. Text nodes of the news content with low fusion values are smoothed to higher values. We see that between text nodes 310 and 345, most of the text nodes with high fusion values are identified as news content.
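A minimal sketch of such one-dimensional Gaussian smoothing over the fusion values in document (pre-order) order follows; the kernel width and radius are illustrative choices, since the paper does not report the values used:

```python
# One-dimensional Gaussian smoothing of a list of fusion values.
import math

def gaussian_smooth(values, sigma=2.0, radius=3):
    kernel = [math.exp(-(i * i) / (2 * sigma * sigma))
              for i in range(-radius, radius + 1)]
    total = sum(kernel)
    kernel = [k / total for k in kernel]            # normalize the kernel
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        for j, k in enumerate(kernel, start=-radius):
            idx = min(max(i + j, 0), len(values) - 1)  # clamp at the borders
            acc += k * values[idx]
        smoothed.append(acc)
    return smoothed
```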

Fig. 4. Fusion values before and after Gaussian smoothing in a news web page of CNN

Threshold Segmentation. Based on the fusion values of the histogram after smoothing, we use the Otsu algorithm to compute a threshold which divides the text nodes into two categories; the category with the higher fusion values is the news content. Given a threshold t, \(w_{0}\) and \(w_{1}\) are the proportions of the two categories separated by t, and \(u_{0}\) and \(u_{1}\) are the average fusion values of these two categories.

$$\begin{aligned} g = w_{0}(u_{0}-\mu )^2 + w_{1}(u_{1}-\mu )^2 \end{aligned}$$
(11)
$$\begin{aligned} \mu = w_{0}\times u_{0} + w_{1}\times u_{1} \end{aligned}$$
(12)

where \(\mu \) is the global average and g is the objective function. We calculate the objective function g using t from 0 to 1 with a step size of 0.1; the Otsu algorithm selects the threshold t that maximizes g. Using the Otsu algorithm, the threshold of Fig. 4(b) is worked out to be 0.3.
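A sketch of this threshold search, scanning t from 0 to 1 with step 0.1 and keeping the t that maximizes the between-class variance g of Eqs. (11)-(12):

```python
# Otsu-style threshold search over smoothed fusion values in [0, 1].
def otsu_threshold(values, step=0.1):
    best_t, best_g = 0.0, -1.0
    n = len(values)
    t = 0.0
    while t <= 1.0:
        low = [v for v in values if v <= t]
        high = [v for v in values if v > t]
        if low and high:
            w0, w1 = len(low) / n, len(high) / n
            u0, u1 = sum(low) / len(low), sum(high) / len(high)
            mu = w0 * u0 + w1 * u1                   # global average, Eq. (12)
            g = w0 * (u0 - mu) ** 2 + w1 * (u1 - mu) ** 2   # Eq. (11)
            if g > best_g:
                best_t, best_g = t, g
        t = round(t + step, 10)                      # avoid float drift
    return best_t
```

Text nodes whose smoothed fusion value exceeds the returned threshold are kept as news content.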

4 Experimental Results

4.1 Performance Metrics

In this paper, precision, recall and \(F_{1}\)-score are used to evaluate and compare the performance of different content extraction methods. \(N_{e}\) represents the text nodes extracted from a web page and \(N_{l}\) represents the manually labeled text nodes. (Note that in CETR [2], \(N_{e}\) and \(N_{l}\) represent the extracted and hand-labeled lines respectively.) Precision (P), Recall (R) and \(F_{1}\)-score (\(F_{1}\)) are as follows:

$$\begin{aligned} P = \frac{\sum \nolimits _{t \in N_{e} \cap N_{l}}{}size(t)}{\sum \nolimits _{w \in N_{e}}{}size(w)},\ R=\frac{\sum \nolimits _{t \in N_{e} \cap N_{l}}{}size(t)}{\sum \nolimits _{w \in N_{l}}{}size(w)},\ F_{1}=\frac{2PR}{P+R} \end{aligned}$$
(13)
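A brief sketch of Eq. (13), assuming extracted and labeled are sets of text nodes and size counts the words in a node as defined earlier:

```python
# Word-weighted precision, recall and F1 over extracted vs. labeled nodes.
def evaluate(extracted, labeled, size):
    overlap = sum(size(t) for t in extracted & labeled)
    p = overlap / sum(size(t) for t in extracted) if extracted else 0.0
    r = overlap / sum(size(t) for t in labeled) if labeled else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```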

4.2 Method Evaluation

The experimental data contains two kinds of data sets. (1) News: this data set contains news web pages from four Chinese and five English news websites: Xinhuanet, Phoenix News Media (Ifeng), 163 News, People, Freep, CNN, NY Post, Yahoo! News, and BBC. Each website contributes 100 news web pages chosen randomly. (2) CleanEval: this corpus contains the Chinese and English data sets (CleanEval-zh and CleanEval-en) from the CleanEval competition mentioned in CEPR [1] and CETR [2]. CleanEval contains various kinds of web pages, but our goal is to extract entire articles from news-like web pages, hence we choose the web pages whose structures are similar to news web pages, such as forums and blogs, to join the experiment. The manually labeled result of each web page is restricted to the entire article.

Table 1. Recall(R), Precision(P), \(F_{1}\)-score(\(F_{1}\)) of each method on different news websites, the highest values are in bold.

Tables 1 and 2 show the news extraction results of our method and the comparison methods on the different data sets. CEDST is the method proposed in this paper. CEDST-NC uses the fusion value \((m_{cluster}\oplus m_{var}\oplus m_{href})(\{news\})\), i.e. it combines only the non-content features mentioned in this paper. CEDST-TF only uses the topic feature \(m_{topic}(\{news\})\) to recognize the news content. CEPR and CETR are content extraction methods with ideas similar to ours. CETR is a classical online content extraction method which offers a framework to extract content without training and preprocessing. CEPR is an excellent method which aims at extracting news content automatically and accurately.

Table 1 shows that CEDST is 1.19% higher than CEPR and 7.15% higher than CETR on average \(F_{1}\)-score, which shows that our method outperforms the comparison methods in most cases. CETR performs best on average recall, but worst on average precision. Taking a web page of the Xinhuanet news website as an example, it has long abstracts of recommended articles under the news content. CETR cannot distinguish these long text noises from the news content, because it relies on tag ratios which only consider the text length and the number of tags. CEPR is a brief and efficient method which performs stably on all websites. Although CEPR makes full use of non-content features, CEDST outperforms CEPR on precision by considering the topics of the news content. Especially on Xinhuanet, People and Freep, CEDST distinguishes the long text noise with the topic feature, while CEPR depends only on the fusion values after smoothing. Since such long text noises do not relate to the main topics of the news content, they are easily identified as news content by the comparison methods. CEDST outperforms CEDST-NC, which means that introducing the topic feature improves the precision noticeably, because the non-content features are weak in describing noise formatted like the news content. Although CEDST-TF achieves high precision, its recall is low, because it only extracts the key paragraphs of the news content and loses the paragraphs without the main topics. Therefore, CEDST balances the non-content features and the topic feature, and achieves the best extraction performance.

Table 2 shows the extraction results on CleanEval. Our method is 2.95% higher than CEPR and 4.74% higher than CETR on average \(F_{1}\)-score, which indicates that our method is more suitable for extracting entire articles from web pages. CEDST achieves great progress on Chinese web pages, because the news topics in Chinese web pages are easier to capture. Besides news web pages, CleanEval contains many different types of web pages, such as blogs and forums. The result indicates that our method is robust and can be widely used on various web pages with main articles.

Table 2. Recall(R), Precision(P), \(F_{1}\)-score(\(F_{1}\)) of each method on CleanEval, the highest values are in bold.

Despite the many advantages of our algorithm, there are some weaknesses in dealing with extreme circumstances. For example, CEDST performs noticeably worse on NY Post, because a recommendation may appear many times with identical texts in a web page of NY Post. The topic generation method may assign high weights to the topics of such recommendations, therefore these recommendation noises may be identified as news content due to high topic feature values.

Table 3 shows that our method is slower than CEPR and CETR in execution time, because the time complexity of topic generation is approximately \(O(n^3)\), which is the most time-consuming part of our method. Although we need more time to construct the topic feature when extracting content from a news web page, our method is the most accurate. CETR extracts content more slowly than CEPR, because CETR uses K-means clustering to distinguish the news content from the noises, while CEPR calculates a simple threshold to segment the news content with time complexity O(1). CEPR compresses the execution time of the content extraction process, but sacrifices accuracy for speed and simplicity. For example, CEPR calculates the fusion values of text nodes with the same tag path simultaneously, but loses accuracy when the tag path of noise is the same as that of the news content.

Table 3. Average execution time for each method to extract a news web page
Fig. 5. The extraction performance of CEDST with different \(\gamma \) thresholds

4.3 Parameter Setting

The threshold \(\gamma \) mentioned in Sect. 3.2 is used to adjust the compatibility relation among keywords. If the co-occurrence frequency of two keywords is higher than the threshold \(\gamma \), the relation is recorded in the lower triangular matrix shown in Fig. 3. The threshold \(\gamma \) determines the generation of maximal compatibility blocks, which greatly impacts the performance of CEDST. Figure 5 shows the tradeoff between recall and precision. When \(\gamma <2\), the recall is high, but the precision is low. When \(\gamma >2\), the recall decreases as \(\gamma \) increases, and when \(\gamma >3\), the precision decreases too. \(\gamma =2\) is the best choice for news content extraction, achieving the highest \(F_{1}\).

5 Conclusion

In this paper, we have proposed a method named CEDST for extracting news content from news web pages based on DS theory. Considering that most existing online extraction methods only use non-content features to recognize the news content, we combine the topic feature and non-content features, achieving the best performance among the comparison methods. The proposed method uses maximal compatibility blocks to generate topics from text nodes without complicated linguistic analysis, therefore our method can easily be applied to websites in different languages. DS theory is a framework for reasoning under uncertainty; with the BMA framework, we combine all the features to obtain fusion values, which are the probabilities of text nodes belonging to the news content. The experimental results show that CEDST outperforms the other methods in most cases and performs robustly on various news websites in different languages. CEDST is a concise and efficient method which can extract news content automatically and accurately.