
1 Introduction

The development of the Internet has brought a sharp increase in the number of news websites, and the Web has become a popular platform for broadcasting news. News aggregators are websites that collect news from various sources and provide an aggregated view of events taking place all over the world. Unfortunately, a critical issue of news aggregation systems is that the large number of news items published daily makes it hard for readers to find the ones relevant to their particular interests. A possible solution to this problem is the use of recommender systems, as they can traverse the space of choices and predict the potential usefulness of news for each reader.

There has been much research on news recommendation methods based on some similarity measure, either similarity between news items, known as Global Recommendation Systems (GRS), or similarity between readers' personal interests and news items, known as Personal Recommendation Systems (PRS) [2, 5]. In a GRS, the recommended news items are those with the highest similarity to the news item the reader is currently reading. In a PRS, on the other hand, the recommended news items are those with the highest similarity to the reader's personal interests, which are modeled from the history of items the reader has read. Collaborative filtering (CF) is a widely applied technique in PRS development. With the explosion of news on the Web, designing novel approaches that recommend news items closer and more relevant to readers is still a matter of concern. In this research, we focus on proposing a news recommendation method following the global recommendation system model, enhancing results from existing works.

The most important task in developing a GRS is to build a model that calculates similarity between news items. Recent research on measuring news similarity centers on two prominent approaches: content-based similarity and semantic-based similarity. In the content-based approach, similarity is calculated from statistics of the vocabulary appearing in the content of the news items, and almost all recommended news items focus only on the subject that the target news item is about. In contrast, in the semantic-based approach [1], similarity is usually computed with the help of an available knowledge base, which is exploited to capture semantic relationships between elements appearing in the news items. Recommended news items therefore tend to cover a broader range of subjects than in the content-based approach. Both approaches have weaknesses that limit their effectiveness in news recommendation. Our approach is a hybrid one in the sense that it combines content-based and semantic-based recommendation. Concretely, the similarity of two news items is a linear combination of their content-based similarity and their semantic-based similarity. The experimental results indicate that this combination yields more effective recommendations than using either measure separately.

This work is part of the development of the news aggregation system BKSport [11], which is based on Semantic Web technology and aims to effectively handle the large amount of sports news gathered from various sources on the Internet. It therefore inherits results obtained in our previous research, such as the ontology and knowledge base in the sports domain and the methods for named entity recognition and for extracting semantic relationships between entities in the news.

The rest of the paper is organized as follows. Section 2 describes previous work related to measuring semantic similarity between news items. Section 3 presents our proposed method in more detail. In Sect. 4, we present the experiments and the evaluation we performed using an implementation of the proposed recommender. Finally, the advantages and disadvantages of the method, as well as corrective measures and future research lines, are discussed in Sect. 5.

2 Related Work

Traditionally, many content-based recommenders [7, 9] use term extraction methods like TF-IDF (Term Frequency-Inverse Document Frequency [10]) in conjunction with the cosine similarity measure to compare two documents. TF-IDF measures the importance of a word in a document based on its frequency of occurrence in the document and across the entire document dataset (or corpus). After the TF-IDF value of each word in a document is calculated, this metric is combined with the cosine or Jaccard measure to calculate the similarity between two documents.

The TF-IDF value of a word appearing in a document is calculated by the following formula:

$$ TF\text{-}IDF_{ij} = TF_{ij} \times IDF_{i} $$

In which:

$$ TF_{ij} = \frac{n_{ij}}{\sum_{k} n_{kj}} \quad \text{and} \quad IDF_{i} = \log\frac{|D|}{|\{ d : t_{i} \in d\} |} $$

where \( n_{ij} \) is the number of occurrences of word \( i \) in document \( j \), \( |D| \) is the total number of documents in the dataset, and \( |\{ d : t_{i} \in d\} | \) is the number of documents containing the term \( t_{i} \).

Each document is then represented as an \( N \)-dimensional vector \( V_{i} \) (where \( N \) is the size of the dictionary), in which the value of each element is the TF-IDF value of the corresponding word. If a word in the dictionary does not appear in the document, the value of the corresponding element is 0.
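For illustration, the following minimal Python sketch (not the implementation of any cited system; function and variable names are ours) computes TF-IDF vectors for a small tokenized corpus and compares documents with the cosine measure:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Compute a sparse TF-IDF vector (word -> weight) for each tokenized document."""
    n_docs = len(docs)
    df = Counter(word for doc in docs for word in set(doc))   # document frequency of each word
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        vectors.append({w: (c / total) * math.log(n_docs / df[w])
                        for w, c in counts.items()})
    return vectors

def cosine(v1, v2):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v1[w] * v2.get(w, 0.0) for w in v1)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

docs = [["messi", "scores", "for", "barcelona"],
        ["suarez", "scores", "for", "barcelona"],
        ["benzema", "plays", "for", "real", "madrid"]]
v = tf_idf_vectors(docs)
print(cosine(v[0], v[1]), cosine(v[0], v[2]))   # the first pair is more similar
```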

In the semantic-based approach, previous studies have exploited relationships between the components of news items to calculate semantic similarity. In the study carried out by Batet et al. [4], a measure based on exploiting the taxonomical structure of a biomedical ontology is proposed for determining the semantic similarity between word pairs. The method proposed by Capelle et al. [6] exploits the similarity between components (words or named entities) of two news items to calculate the similarity between the items. To measure the similarity between two components, their method relies on:

  • The WordNet dictionary tree when the components are words, denoted by \( sim_{SS} \).

  • The PMI measure when the components are named entities, denoted by \( sim_{Bing} \). This measure is based on the statistical frequency of occurrence of the components and on their co-occurrence.

The final formula combines the two measures \( sim_{SS} \) and \( sim_{Bing} \) to calculate the semantic similarity between two news items as follows (\( \alpha \) is a correction parameter):

$$ sim_{BingSS} = \alpha \times sim_{Bing} + \left( {1 - \alpha } \right) \times sim_{SS} $$

Also exploiting relationships between the components of two news items, Frasincar et al. [8] presented a number of semantic-based news recommendation methods. Similar to Capelle et al. [6], their work aims at a personalized recommendation system: a user profile is built from the news items the reader has read, and calculating the similarity between a user profile and a news item is done in the same way as calculating the similarity between two news items. The methods presented in that research use an ontology and knowledge base to exploit semantic relationships between concepts, which are classes in the ontology. Experiments showed that Ranked Semantic Recommendation 2 is the most effective among them. However, it still has certain limitations, which we point out in the following sections together with a method to overcome them.

3 Similarity Between News Items

There are two main approaches to calculating similarity between news items: content-based and semantic-based. Each approach has its own advantages and disadvantages. We combine the two by combining a content-based similarity measure with a semantic-based similarity measure, with the expectation of overcoming the limitations of each approach and making recommendation more effective.

3.1 Semantic-Based Similarity

To calculate semantic similarity, we exploit the mutual semantic relations between components of news items. These relations are determined based on the ontology and knowledge base that we have built. We extract and analyze the following components of each news item: entities, types of entities, and semantic annotations. The next sections present how these components are exploited in calculating semantic similarity between news items.

3.1.1 Semantic Relation Between Entities

Specifically, in order to exploit relations between entities for calculating similarity between news items, we extend the Ranked Semantic Recommendation 2 method proposed by Frasincar et al. [8]. In this method, the authors also used an ontology and knowledge base to exploit the relations between entities. However, the method has some limitations:

  • It only considers direct relations between entities, not indirect relations.

  • It does not consider the importance of entities according to the positions in which they appear in the news item (title, description, etc.).

To overcome these limitations, in Sect. 3.1.1.1 we present a method to calculate the relation weight between entities based on the ontology and knowledge base. In addition, we combine it with a statistical method based on the co-occurrence of entities in the same news items when determining the relation weight between entities, which is presented in Sect. 3.1.1.2. Finally, in Sect. 3.1.1.3 we present the method that uses these relation weights to determine the semantic similarity between news items.

3.1.1.1 Relation Weight Between Entities Based on Ontology and Knowledge Base

Aleman-Meza et al. [3] presented methods to rank Semantic Associations based on the Semantic Path between two entities, which can be used to determine the relation weight between entities. Specifically, they define Semantic Association and Semantic Path as follows:

Definition: if two entities \( e_{1} \) and \( e_{n} \) can be connected by one or more sequences \( e_{1} ,P_{1} ,e_{2} ,P_{2} ,e_{3} ,P_{3} , \ldots ,e_{n - 1} ,P_{n - 1} ,e_{n} \) in an RDF graph, where the \( e_{i} \), \( 1 \le i \le n \), are entities and the \( P_{j} \), \( 1 \le j \le n - 1 \), are relations in the ontology, then we say that there exists a semantic relation between \( e_{1} \) and \( e_{n} \).

The sequence \( e_{1} ,P_{1} ,e_{2} ,P_{2} ,e_{3} ,P_{3} , \ldots ,e_{n - 1} ,P_{n - 1} ,e_{n} \) is called a Semantic Path.

For example, in the knowledge base, we have:

  • <Lionel-Messi> <playFor> <Barcelona-FC>.

  • <Luis-Suarez> <playFor> <Barcelona-FC>.

Then, there exists a semantic path between two entities Lionel Messi and Luis Suarez as follows:

<Lionel-Messi> → <playFor> → <Barcelona-FC> ← <playFor> ← <Luis-Suarez>

As a result, there exists a semantic relation between Lionel Messi and Luis Suarez.

Based on the properties of a semantic path, we define a path rank value that expresses the relation weight between the two entities at its ends. Because there might be multiple semantic paths between two entities, we take the highest path rank value as the relation weight. Aleman-Meza et al. [3] used four characteristics of a semantic path to calculate its rank, corresponding to the four following weights:

  • Subsumption Weight: based on the structure of the ontology, a component weight is determined for each component (predicate and entity) in the path, from which the weight of the whole path is calculated.

  • Path Length Weight: based on the length of the path.

  • Context Weight: based on determining which region of the ontology each component of the path belongs to. Each region in the ontology has a separate weight depending on the user's interests.

  • Trust Weight: based on the weights of the properties in the ontology.

Applying this to news recommendation in football, we found that the Path Length Weight and the Trust Weight are the two meaningful and appropriate weights. For this reason, we only use these two weights to determine the path rank of a semantic path.

Path Length Weight.

The length of a semantic path \( e_{1} ,P_{1} ,e_{2} ,P_{2} ,e_{3} ,P_{3} , \ldots ,e_{n - 1} ,P_{n - 1} ,e_{n} \) is the number of entities and relations in the path (excluding \( e_{1} \) and \( e_{n} \)). Intuitively, when two entities are related only indirectly, the more intermediate entities and relations there are between them, the lower their similarity. Consequently, the path rank of a semantic path must be inversely proportional to the length of that path.

The Path Length Weight is defined in [3] as below:

$$ W_{length} = \frac{1}{{length_{path} }} $$

In which \( length_{path} \) is the length of the semantic path.

For example, we have two semantic paths:

  • \( P_{1} \): <Lionel-Messi> → <playFor> → <Barcelona-FC> → <competeIn> → <La-Liga> ← <competeIn> ← <Real-Madrid> ← <playFor> ← <Karim-Benzema>

  • \( P_{2} \): <Lionel-Messi> → <playFor> → <Barcelona-FC> ← <playFor> ← <Luis-Suarez>

\( P_{1} \) has a length of 7, so we obtain:

$$ W_{length} (P_{1} ) = \frac{1}{{length_{path} }} = \frac{1}{7} $$

\( P_{2} \) has a length of 3, so we obtain:

$$ W_{length} (P_{2} ) = \frac{1}{{length_{path} }} = \frac{1}{3} $$

From this, we can see that the similarity between Lionel Messi and Luis Suarez is higher than that between Lionel Messi and Karim Benzema.

Path Relation Weight.

There are many different relations defined in the ontology. Each relation carries a different meaning and therefore also represents a different relation weight between entities. Some relations express a close association, others a loose one. For example, consider the two triplets in the knowledge base below:

  • <Luis-Enrique> <managerOf> <Barcelona-FC>.

  • <Luis-Suarez> <playFor> <Barcelona-FC>.

Here, there are two relations, <managerOf> and <playFor>. The relation <managerOf> expresses a closer association than <playFor>, because each team has only one manager at a given time but may have many players. Therefore, we assign <managerOf> a higher weight than <playFor>. For this reason, from the above triplets we conclude that <Barcelona-FC> has a higher similarity with <Luis-Enrique> than with <Luis-Suarez>.

The weights of relations are in the range \( (0, 1] \). The Path Relation Weight of an overall path \( P \) is defined in [3] as below:

$$ W_{predicate} = \mathop \prod \limits_{p \in path} w_{p} $$

Relation Weight Between Two Entities Based on Ontology and Knowledge Base.

Combining the two weights \( W_{length} \) and \( W_{predicate} \) with a pair of coefficients \( \alpha_{wl} \) and \( \alpha_{wp} \), we define the path rank of a semantic path as below:

$$ W_{path} = \frac{{W_{length} \times \alpha_{wl} + W_{predicate} \times \alpha_{wp} }}{{\alpha_{wl} + \alpha_{wp} }} $$

The value \( W_{path} \) in the above formula is also the similarity value between the two entities based on the ontology and knowledge base.
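As an illustration of how \( W_{path} \) can be computed, the sketch below assumes a semantic path is given as an alternating list of entities and relations and that the relation weights \( w_{p} \) are supplied as a dictionary; the names and example weights are ours, not part of the original system:

```python
def path_rank(path, relation_weights, alpha_wl=1.0, alpha_wp=1.0):
    """Path rank of a semantic path given as [e1, p1, e2, p2, ..., en].

    relation_weights maps a predicate name to its weight in (0, 1].
    """
    predicates = path[1::2]                 # relations P_1 ... P_{n-1}
    length = len(path) - 2                  # entities and relations, excluding e_1 and e_n
    w_length = 1.0 / length                 # Path Length Weight
    w_predicate = 1.0
    for p in predicates:                    # Path Relation Weight: product of relation weights
        w_predicate *= relation_weights[p]
    return (w_length * alpha_wl + w_predicate * alpha_wp) / (alpha_wl + alpha_wp)


# Illustrative relation weights (hypothetical values, cf. Sect. 4.2).
weights = {"playFor": 0.6, "competeIn": 0.5, "managerOf": 0.8}
p2 = ["Lionel-Messi", "playFor", "Barcelona-FC", "playFor", "Luis-Suarez"]
print(path_rank(p2, weights))               # length 3, predicate weight 0.6 * 0.6 = 0.36
```

When several semantic paths connect the same pair of entities, the highest path rank is taken as their relation weight.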

3.1.1.2 Relation Weight Between Entities Based on Statistics of Co-occurrence in the Same News Items

Following the idea of Capelle et al. on the PMI measure [6], if two entities co-occur in the same news items many times, they have a high similarity to each other. We count the co-occurrences of named entity pairs in a dataset of football news to calculate the PMI weights. The formula is defined as below:

$$ W_{PMI} \left( {e_{1} ,e_{2} } \right) = log\frac{{\frac{{c(e_{1} ,e_{2} )}}{N}}}{{\frac{{c(e_{1} )}}{N} \times \frac{{c(e_{2} )}}{N}}} $$

In which:

  • \( N \) is the number of news items in the dataset.

  • \( c(e_{1} ,e_{2} ) \) is the number of news items in the dataset in which the two entities \( e_{1} \) and \( e_{2} \) co-occur.

  • \( c(e_{1} ) \) is the number of news items in the dataset containing entity \( e_{1} \), and \( c(e_{2} ) \) is the number of news items in the dataset containing entity \( e_{2} \).

As such, for any entity pair we have two relation weights: the weight \( W_{path} \) (calculated from semantic paths) and the weight \( W_{PMI} \) (calculated from co-occurrence statistics of entity pairs). Before combining these two weights, we normalize each of them as below:

$$ w_{new} = \frac{{w_{old} - MIN}}{MAX - MIN} $$

In which \( MAX \) and \( MIN \) are respectively the maximum and minimum values among the weights \( w \).

Finally, we combine these two normalized values with a pair of coefficients \( \beta_{path} \) and \( \beta_{PMI} \) to calculate the similarity of each entity pair as below:

$$ Similarity_{entity} \left( {e_{1} ,e_{2} } \right) = \frac{{W_{path} \times \beta_{path} + W_{PMI} \times \beta_{PMI} }}{{\beta_{path} + \beta_{PMI} }} $$

By convention, when \( e_{1} \equiv e_{2} \) then \( Similarity_{entity} \left( {e_{1} ,e_{2} } \right) = 1 \).
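A minimal sketch of this step is given below, assuming the co-occurrence counts are already available; the function names are ours, not part of the original system:

```python
import math

def pmi_weight(c_e1, c_e2, c_both, n_docs):
    """W_PMI of an entity pair from co-occurrence counts over the news dataset."""
    return math.log((c_both / n_docs) / ((c_e1 / n_docs) * (c_e2 / n_docs)))

def min_max(values):
    """Normalize a list of weights into [0, 1]; if all values are equal, map them to 1 (assumption)."""
    lo, hi = min(values), max(values)
    return [1.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def entity_similarity(w_path, w_pmi, beta_path=1.0, beta_pmi=1.0):
    """Combine the normalized path-based and PMI-based weights of one entity pair."""
    return (w_path * beta_path + w_pmi * beta_pmi) / (beta_path + beta_pmi)
```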

3.1.1.3 Method for Calculating Similarity Between News Items Based on Relations Between Entities

First, we define the set of entities related to an entity \( r \) as the set of entities whose similarity with \( r \) is greater than 0, denoted as below:

$$ R(r) = \{ r_{1} ,r_{2} ,r_{3} , \ldots ,r_{n} \} $$

Suppose there is a news item A; the set of named entities recognized in news item A is denoted as below:

$$ A = \{ a_{1} ,a_{2} ,a_{3} , \ldots ,a_{m} \} $$

For each entity \( a_{i} \) in set A, we build the set of entities related to \( a_{i} \), \( R\left( {a_{i} } \right) = \{ a_{i1} ,a_{i2} ,a_{i3} , \ldots ,a_{ik} \} \). Taking the union of all sets \( R(a_{i} ) \) (\( i:1 \to m \)), we obtain the set of all entities not included in A but related to A:

$$ R = \mathop {\bigcup }\limits_{i:1 \to m} R(a_{i} ) $$

Finally, we take the union of the two sets A and R to obtain the set \( A_{R} \), called the expansion set of news item A:

$$ A_{R} = A \cup R $$

In the next step, we calculate a ranking value for each entity in the set \( A_{R} \). Each ranking value characterizes the relevance of the corresponding entity to news item A. These ranking values should satisfy the following properties:

  • (1) The more times an entity appears in the news item, the greater that entity’s ranking value.

  • (2) The more entities in the news item an entity is related to, the greater that entity’s ranking value.

  • (3) The ranking value also depends on the positions at which the entity appears in the news item.

Regarding property (3), an entity can appear in different positions of the news item: title, description, bolder-text (bold text, image titles, etc.) and content. We assign importance weights to these positions, respectively, such that:

$$ W_{title} > W_{description} > W_{boldertext} > W_{content} $$

To calculate the ranking value for each entity in the set \( A_{R} \), based on the Ranked Semantic Recommendation 2 technique [8], we represent the entities in a matrix in which the first row contains the entities of the set \( A_{R} \) and the first column contains the entities of the set A. The matrix takes the following form:

 

$$ \begin{array}{c|cccc} & \varvec{e}_{1} & \varvec{e}_{2} & \cdots & \varvec{e}_{\varvec{q}} \\ \hline \varvec{a}_{1} & h_{11} & h_{12} & \cdots & h_{1q} \\ \varvec{a}_{2} & h_{21} & h_{22} & \cdots & h_{2q} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \varvec{a}_{\varvec{m}} & h_{m1} & h_{m2} & \cdots & h_{mq} \\ \end{array} $$

In above matrix, we calculate the value \( h_{ij} \) as below:

$$ h_{ij} = similarity(a_{i} ,e_{j} ) \times WE(a_{i} ) $$

In which \( WE(a_{i} ) \) is the importance weight of the entity \( a_{i} \) in the news item. This weight is calculated as follows: suppose \( a_{i} \) is an entity appearing in the news item, and \( N_{title} ,N_{description} ,N_{boldertext} ,N_{content} \) are respectively the numbers of occurrences of \( a_{i} \) in the title, description, bolder-text and content of the news item. We define the importance weight of entity \( a_{i} \) as below:

$$ \begin{aligned} WE\left( {a_{i} } \right) & = N_{title} \times W_{title} + N_{description} \times W_{description} \\ & \quad + N_{boldertext} \times W_{boldertext} + N_{content} \times W_{content} \end{aligned} $$

Finally, following the formula defined in [8], the ranking weight of each entity \( e_{j} \) in the set \( A_{R} \) is calculated by:

$$ Rank\left( {e_{j} } \right) = \mathop \sum \limits_{i = 1 }^{m} h_{ij} $$

Let \( V_{A} \) be the vector containing the \( Rank(e_{j} ) \) values calculated above. We normalize the value of each element of \( V_{A} \) into the range [0, 1] with the following formula:

$$ v_{i} = \frac{{v_{i} - MIN}}{MAX - MIN} $$

In which MAX and MIN are respectively the maximum and minimum values of the elements in vector \( V_{A} \). If \( MAX = MIN \ne 0 \), then \( v_{i} = 1 \) for every \( i \).

As a result, performing all the steps above yields a vector for each news item. The final step is to calculate the similarity between any two news items based on their vectors.

Suppose we have two news items A, B and two corresponding vectors \( V_{A} \), \( V_{B} \). Because these two vectors can have different numbers of dimensions, we define the similarity between \( V_{A} \) and \( V_{B} \) (and hence between the two news items A and B) as a variation of the cosine similarity, as below:

$$ similarity_{based-entity} \left( {A,B} \right) = cosine\left( {V_{A} ,V_{B} } \right) = \frac{\sum_{e_{a} \in A,\, e_{b} \in B,\, e_{a} \equiv e_{b}} v_{a} \times v_{b}}{\sqrt{\sum_{e_{a} \in A} v_{a}^{2}} \times \sqrt{\sum_{e_{b} \in B} v_{b}^{2}}} $$

In which \( v_{a} ,v_{b} \) are respectively the values \( Rank\left( {e_{a} } \right) \) and \( Rank(e_{b} ) \) in the vectors \( V_{A} \), \( V_{B} \).
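The construction of the ranking vector and the cosine variation can be sketched as follows; this is one possible reading of the formulas above, with \( Similarity_{entity} \) and \( WE \) passed in as callables and vectors stored as entity-to-value dictionaries (names are ours):

```python
def rank_vector(entities_in_item, expansion_set, similarity, WE):
    """Normalized ranking vector V_A: maps each entity e_j in A_R to Rank(e_j).

    entities_in_item: entities a_i recognized in the news item (set A)
    expansion_set:    A plus all entities related to some a_i (set A_R)
    similarity(a, e): Similarity_entity between two entities
    WE(a):            importance weight of entity a in the news item
    """
    ranks = {e: sum(similarity(a, e) * WE(a) for a in entities_in_item)
             for e in expansion_set}
    lo, hi = min(ranks.values()), max(ranks.values())
    # Min-max normalization into [0, 1]; if MAX = MIN, every value becomes 1.
    return {e: 1.0 if hi == lo else (r - lo) / (hi - lo) for e, r in ranks.items()}

def entity_based_similarity(v_a, v_b):
    """Cosine variation over two rank vectors that may cover different entities."""
    dot = sum(v_a[e] * v_b[e] for e in v_a.keys() & v_b.keys())
    norm_a = sum(x * x for x in v_a.values()) ** 0.5
    norm_b = sum(x * x for x in v_b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```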

3.1.2 Types of Entities Appearing in the News Items

A reader who is interested in a subject is likely to also be interested in other subjects of the same type. For example, a reader who is reading news about football teams tends to continue reading other news items about football teams rather than news items about players or stadiums. Therefore, if two news items are similar in the types of entities they contain, their similarity will be higher (Fig. 1).

Fig. 1. An example of similarity between news items based on the types of entities in the news

In the ontology, each named entity defined in the knowledge base belongs to a certain class, and these classes can be regarded as entity types. For example, the two entities Lionel Messi and Luis Suarez in the knowledge base have the same type because they both belong to the class FootballPlayer; however, neither has the same type as the entity Barcelona-FC, which belongs to the class FootballTeam.

Statistics of the entity types appearing in a news item are collected similarly to statistics of entities. Two different entities can be of the same type, and the positions at which entities appear also affect the association weight between an entity type and the corresponding news item. These weights are calculated based on the appearance frequency and the appearance positions of the entities of each type. Suppose we calculate the association weight of entity type \( C \) for a news item \( A \). Given that the \( c_{i} \) are the entities of class \( C \) appearing in news item \( A \), we define the association weight of entity type \( C \) with news item \( A \) as below:

$$ WC\left( C \right) = \sum\limits_{i} WE(c_{i} ) $$

We build a vector for each news item with the \( WC \) weights as elements, similar to building the entity-based vector in Sect. 3.1.1.3. The elements of each vector are normalized before applying the variation of the cosine formula used in Sect. 3.1.1.3. This value is denoted by \( similarity_{based-type} \).
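A short sketch of the type-based weights, assuming a mapping from each entity to its ontology class and the \( WE \) weights from Sect. 3.1.1.3 (names are ours):

```python
from collections import defaultdict

def type_weights(entities_in_item, entity_class, WE):
    """Association weight WC(C) of each entity type C appearing in a news item."""
    wc = defaultdict(float)
    for e in entities_in_item:
        wc[entity_class[e]] += WE(e)   # sum the WE weights of the entities of class C
    return dict(wc)
```

The resulting weights are normalized and compared with the same cosine variation as in the entity-based case.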

3.1.3 Semantic Annotations of the News Items

Semantic annotations here are triplets of the form <subject> <predicate> <object>, in which the subject and object are entities. These semantic annotations also play an important role because they partly represent the content that the news item is about (Fig. 2).

Fig. 2. An example of similarity between news items based on the semantic annotations of the news

A news item may contain many triplets, and a triplet may appear several times. Triplets that appear several times in a news item are important triplets, reflecting the main content that the news item mentions. Moreover, the positions at which these triplets appear in the news item also express their importance; the importance of the positions (title, description, bolder-text, content) is the same as presented in the previous section. The more triplets two news items have in common, the higher their similarity.

For each triplet, let \( N_{title} ,N_{description} ,N_{boldertext} ,N_{content} \) be respectively the numbers of occurrences of the triplet in the title, description, bolder-text and content. We use the same formula as the one for calculating the importance weight of entities in Sect. 3.1.1.3 to compute the importance weight \( WT \) of each triplet in the news item. We then represent these weights as the elements of a vector and use the normalization formula to bring them into the range [0, 1]. To calculate the similarity between news items based on semantic annotations, we use the variation of the cosine formula described in Sect. 3.1.1.3 to compute the distance between the two vectors. This value is denoted by \( similarity_{based-annotation} \).

Thus, we use three component similarities to determine the semantic similarity between news items, based on the following factors:

  • Relations between named entities,

  • Types of entity in the news items,

  • Semantic annotations of the news items.

Each of these three component similarities plays a different role in determining the semantic similarity between news items. We combine them to determine the final semantic similarity value, using a set of three parameters \( \theta_{entity} ,\theta_{annotation} ,\theta_{type} \) to express the importance of each component. We define the final formula for calculating the semantic similarity between two news items as below:

$$ \begin{aligned} Similarity_{semantic} \left( {A,B} \right) & = similarity_{based-entity} \left( {A,B} \right) \times \theta_{entity} \\ & \quad + similarity_{based-annotation} \left( {A,B} \right) \times \theta_{annotation} \\ & \quad + similarity_{based-type} \left( {A,B} \right) \times \theta_{type} \end{aligned} $$

3.2 Content-Based Similarity

A news recommendation method that uses only the semantic similarity proposed above may encounter some problems, such as:

  • Insufficient or incorrect identification of named entities that appear in the news item.

  • Insufficient semantic annotations of the news item.

These limitations are caused by the limited information in the ontology and knowledge base. This is unavoidable, since constructing an ontology and knowledge base must be done manually or semi-automatically and requires considerable effort. Furthermore, real-world knowledge evolves, for example when new players arrive or players change clubs, which makes it difficult to keep the knowledge base up to date.

To overcome these limitations, we combine the proposed semantic similarity with the content similarity of two news items.

In this section we describe the content-based similarity, which is computed using the TF-IDF weights of words in the news item combined with the cosine measure.

Words with high TF-IDF weights are often important words, reflecting the main content of the news item, so we are only interested in words with high TF-IDF weights. The steps to build the set of important words of a news item are:

  • Step 1: Eliminate stop words. Stop words are words that do not contribute to representing the content of the news, such as “a”, “an”, “the”, etc.

  • Step 2: Standardize words into their base (infinitive) form. Verbs and nouns often occur in many different forms depending on the context, although they express the same meaning, for example “make”, “makes” and “made”; we change them into their base form.

  • Step 3: Calculate the TF-IDF value for each word in the news item (after standardization in Step 2).

  • Step 4: Sort the words and select the top words with the highest TF-IDF values based on a defined threshold.

After the above steps, we obtain a set of words with the highest TF-IDF values. We represent the news item as a vector containing the values \( v_{k} \), the TF-IDF values of the words in this set. The similarity between two news items A and B, with two important-word sets \( S_{A} ,S_{B} \) and two corresponding vectors \( V_{A} , V_{B} \), is calculated with a variation of the cosine formula as below:

$$ Similarity_{TF-IDF} \left( {A,B} \right) = \frac{\sum_{t_{a} \in S_{A},\, t_{b} \in S_{B},\, t_{a} \equiv t_{b}} v_{a} \times v_{b}}{\sqrt{\sum_{t_{a} \in S_{A}} v_{a}^{2}} \times \sqrt{\sum_{t_{b} \in S_{B}} v_{b}^{2}}} $$

In which:

  • \( t_{a} , t_{b} \) are corresponding words in two sets \( S_{A} ,S_{B} . \)

  • \( v_{a} ,v_{b} \) are TF-IDF values of words \( t_{a} ,t_{b} \).
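The following sketch approximates Steps 1, 3 and 4 with scikit-learn's TfidfVectorizer (an assumption on our part; Step 2, standardization into base form, is omitted) and then applies the cosine variation over the two keyword sets:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_k_keyword_vectors(texts, k=20):
    """Build, for each news item, a dict of its k highest-weighted TF-IDF words."""
    vectorizer = TfidfVectorizer(stop_words="english")   # Step 1: drop English stop words
    tfidf = vectorizer.fit_transform(texts)              # Step 3: TF-IDF matrix (items x vocabulary)
    vocab = vectorizer.get_feature_names_out()
    vectors = []
    for row in tfidf.toarray():
        top = np.argsort(row)[::-1][:k]                  # Step 4: indices of the k largest weights
        vectors.append({vocab[i]: row[i] for i in top if row[i] > 0})
    return vectors

def content_similarity(v_a, v_b):
    """Cosine variation over the two important-word sets S_A and S_B."""
    dot = sum(v_a[t] * v_b[t] for t in v_a.keys() & v_b.keys())
    norm_a = sum(x * x for x in v_a.values()) ** 0.5
    norm_b = sum(x * x for x in v_b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```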

3.3 News Recommendation Algorithm with Combined Similarity

To combine the semantic similarity \( Similarity_{semantic} \) with the content similarity \( Similarity_{TF-IDF} \) of two news items, we use the pair of weights \( \gamma_{semantic} \) and \( \gamma_{content} \). We define the combination formula as below:

$$ Similarity_{combined} \left( {A,B} \right) = Similarity_{semantic} \left( {A,B} \right) \times \gamma_{semantic} + Similarity_{TF-IDF} \left( {A,B} \right) \times \gamma_{content} $$

The news recommendation algorithm is as follows:

Input: a target news item A and a set C of N candidate news items.

Output: the set of k news items with the highest combined similarity with A.

  • Step 1: Identify named entities and make semantic annotations for news item A and the candidate news items in set C.

  • Step 2: Build the set of words with the highest TF-IDF weights for news item A and for the news items in set C.

  • Step 3: For each news item \( C_{i} \) in set C, take the following steps:

    • Step 3.1: Calculate \( Similarity_{based-entity} (A,C_{i} ) \)

    • Step 3.2: Calculate \( Similarity_{based-annotation} (A,C_{i} ) \)

    • Step 3.3: Calculate \( Similarity_{based-type} (A,C_{i} ) \)

    • Step 3.4: Calculate \( Similarity_{semantic} \left( {A,C_{i} } \right) \) based on the results of Steps 3.1, 3.2 and 3.3.

    • Step 3.5: Calculate \( Similarity_{TF-IDF} (A,C_{i} ) \)

    • Step 3.6: Calculate \( Similarity_{combined} \left( {A,C_{i} } \right) \) based on the results of Steps 3.4 and 3.5.

  • Step 4: Sort the news items \( C_{i} \) in descending order of \( Similarity_{combined} (A, C_{i} ) \).

  • Step 5: Take the top k news items of the list sorted in Step 4 as recommendations for news item A.

Assume that \( n_{t} \) is the average number of tokens in a news item and \( n \) is the number of news items in the candidate set C. In Step 1, the complexity of named entity recognition and semantic annotation of one news item is \( O(n_{c} n_{t} ) \), where \( n_{c} \) is the total number of classes, entities and properties in the ontology and knowledge base. Therefore, for the \( n \) news items in C and the news item A, the time complexity of Step 1 is \( O(n n_{c} n_{t} ) \). Step 2 transforms \( n + 1 \) news items into TF-IDF vectors. As the IDF values of all tokens in the dictionary are computed before running the algorithm, the time complexity of transforming one news item into a TF-IDF vector equals the time complexity of calculating the TF values of all its tokens, \( O(n_{t} ) \); consequently, the complexity of Step 2 is \( O(n n_{t} ) \). Step 3 is repeated \( n \) times, once for each element of C. Steps 3.1 to 3.6 are essentially multiplications of pairs of vectors, so the time complexity of each iteration is \( O(n_{t} ) \) and the time complexity of Step 3 is \( O(n n_{t} ) \). The time complexity of the sorting in Step 4 is \( O(n\log n) \). As a result, the time complexity of the proposed algorithm is \( O(n n_{c} n_{t} + n\log n) \).
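The core of the algorithm (Steps 3-5) can be sketched as follows, with the semantic and content similarity measures passed in as callables; the \( \gamma \) defaults follow the values used in the experiments (Sect. 4.2), and the function names are ours:

```python
def recommend(target, candidates, k,
              semantic_similarity, content_similarity,
              gamma_semantic=1.0, gamma_content=2.0):
    """Return the k candidate news items with the highest combined similarity to target."""
    scored = []
    for item in candidates:
        combined = (semantic_similarity(target, item) * gamma_semantic
                    + content_similarity(target, item) * gamma_content)
        scored.append((combined, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)   # Step 4: sort in descending order
    return [item for _, item in scored[:k]]               # Step 5: take the top k items
```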

4 Experiment and Evaluation

4.1 Experiment Scenario

The goal of this section is to evaluate and compare the effectiveness of three news recommendation methods:

  • Only use semantic similarity between news items.

  • Only use content similarity between news items.

  • Combine both of the above similarities.

The evaluation of the different methods is performed by measuring precision. Because we have not yet built an online system, we use an offline evaluation method. For the offline evaluation, we choose N = 100 news items (denoted as set \( A \)) from a number of well-known sports websites such as http://www.skysports.com/, http://www.espnfcasia.com/ and http://sports.yahoo.com/, and then ask collaborators to rate whether each news item is relevant or non-relevant to each other news item. As a result, we obtain an experiment dataset in which each news item \( A_{i} \) has \( K_{{A_{i} }} \) (\( 0 \le K_{{A_{i} }} \le N - 1 \)) related news items and (\( N - 1 - K_{{A_{i} }} \)) unrelated news items. We run each of the methods above separately for each news item \( A_{i} \) in set \( A \), generate the \( K_{{A_{i} }} \) news items with the highest similarity to it, and compare them with the \( K_{{A_{i} }} \) news items that the collaborators identified in the experiment dataset. For example, consider news item \( A_{1} \): if the collaborators find 5 news items among the remaining 99 that are related to \( A_{1} \), then the algorithm also generates 5 corresponding news items, which are compared with the 5 news items identified by the collaborators.

Notation:

  • \( TP_{{A_{i} }} \) is the number of news items that the algorithm precisely recommends for news item \( A_{i} \).

  • \( FP_{{A_{i} }} \) is the number of news items that the algorithm imprecisely recommends for news item \( A_{i} \).

  • \( FN_{{A_{i} }} \) is the number of related news items that the algorithm does not recommend for news item \( A_{i} \).

We define precision for a news item \( A_{i} \), using the following formula:

$$ precision\left( {A_{i} } \right) = \frac{{TP_{{A_{i} }} }}{{TP_{{A_{i} }} + FP_{{A_{i} }} }} = \frac{{TP_{{A_{i} }} }}{{K_{{A_{i} }} }} $$

Following the way we implement the evaluation, we obtain \( FP_{{A_{i} }} = FN_{{A_{i} }} \), and therefore \( precision\left( {A_{i} } \right) = recall(A_{i} ) \). For this reason, we only consider \( precision \) to evaluate the above methods. Finally, we define the final precision of a method as the average of the precisions over all \( N \) news items in the experiment dataset.

$$ Precision(A) = \frac{{\mathop \sum \nolimits_{{A_{i} \in A}} precision(A_{i} )}}{N} $$
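A sketch of this offline evaluation loop, assuming a recommender callable and the collaborators' relevance judgments stored as sets (names are ours):

```python
def average_precision(recommend, items, relevant):
    """Average precision over the experiment dataset.

    relevant[a]     : set of news items judged relevant to item a
    recommend(a, k) : the k items with the highest combined similarity to a
    """
    precisions = []
    for a in items:
        k = len(relevant[a])
        if k == 0:
            continue                        # no relevant items: precision(A_i) is undefined
        recommended = recommend(a, k)
        tp = len(set(recommended) & relevant[a])
        precisions.append(tp / k)           # precision(A_i) = TP_{A_i} / K_{A_i}
    return sum(precisions) / len(precisions) if precisions else 0.0
```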

4.2 Experiment Parameters

Certain parameters are employed to determine the importance of the components when these components are combined. In this experiment, we set the parameter values entirely based on our own judgment. For instance:

  • The weights \( w_{p} \) of relations in the ontology used to calculate \( W_{path} \) were assigned based on our perception of the relevance of each relation: \( w_{managerOf} = 0.8, w_{playFor} = 0.6, w_{stadiumOf} = 0.5, \ldots \)

  • \( \gamma_{semantic} \) and \( \gamma_{content} \) are the two parameters used when combining the semantic and content similarity measures between news items. As we consider the importance of content similarity to be higher than that of semantic similarity in news recommendation, we choose \( \gamma_{semantic} = 1, \gamma_{content} = 2 \).

4.3 Experiment Results and Evaluation

After running the three methods separately on the set \( A \) of 100 news items, following the experiment scenario presented in Sect. 4.1, we obtain the precision of each method, as shown in Table 1.

Table 1. News recommendation precision in each of the three settings

Assessment of Experiment Results.

Table 1 indicates that, for the experiment dataset \( A \) of 100 news items, the semantic-based recommendation method is not as precise as the content-based recommendation method, while combining the content-based and semantic-based similarity measures brings the best results. This can be explained as follows:

  • When only the semantic-based similarity is used (semantic-based approach), the result depends mainly on the entities in the news items. Therefore, in some cases the algorithm recommends news items about the relevant entities but on a completely different topic, which some collaborators regard as irrelevant.

  • Following the content-based approach, the topic of a recommended news item is usually quite close to that of the target news item. However, this method cannot expand the topic. If we have two news items about the Barcelona club, one about the team’s style of play and the other about transfers of the club’s players, the content-based approach will determine that the similarity of these news items is low.

  • When the content-based similarity and semantic-based similarity are combined, the recommendations overcome the limitations of each separate measure, leading to more effective recommendation.

5 Conclusions and Future Work

In this research, we presented a recommendation method based on the combination of the content-based similarity and the semantic-based similarity of news items. The semantic-based measure is calculated from the semantic relations among objects. It enables the recommender not only to suggest news items on a similar topic or revolving around a key object of the target news item, but also to recommend news items about other objects that have a semantic relation with the objects in the target news item. However, this similarity measure focuses mainly on the entities and does not consider the context mentioned in the news item. The content-based measure overcomes this weakness by extracting from the news item the words with the highest TF-IDF values, which characterize the main context mentioned in the news item.

We evaluated and compared the precision of the proposed method and of the recommendation methods that use only either measure separately. The experimental results showed that the combination of the two similarities promotes the effectiveness of both and overcomes the weaknesses of each, ultimately leading to better recommendations. However, the proposed method still has some limitations, such as its dependency on the adequacy of the knowledge base and ontology. Determining the weights in such a way that the combination of the measures achieves the highest effectiveness is also a difficult problem that remains to be solved.