1 Introduction

Heterogeneous information networks are an emerging kind of graph data that involve multiple types of nodes and relations. Figure 1(a) shows a bibliographic network that contains three types of nodes (Paper, Author and Venue) and four types of relations among them: the Cooperate relation between Authors, the Write relation between Author and Paper, the Cite relation between Papers, and the Publish relation between Paper and Venue. In addition, each node type has a set of attributes, i.e., Paper(ID, Title, Topic, Keywords), Author(ID, Institute, Field, Location) and Venue(Name, Year). Values of some attributes are given in Fig. 1(b).

Fig. 1. Bibliographic network

Fig. 2. Aggregation results

Aggregation allows users to observe and model data along different dimensions and to perform drill-down, roll-up and other OLAP operations. We investigate the problem of aggregating multi-type nodes and relations in heterogeneous information networks. Below we give two aggregation queries over the bibliographic network. In both queries the aggregate function is COUNT.

Query 1. Aggregate on Paper nodes and the Cite relation; the selected attribute of Paper is Topic.

Query 2. Aggregate on Paper and Author nodes and the Write relation; the selected attribute of Paper is Topic, and the selected attribute of Author is Location.

Figure 2(a) displays the aggregate result of query 1. Paper nodes in the same aggregate node have the same value of Topic and the same Cite relations with other aggregate nodes. Nodes 9 and 10 are not in the same aggregate node, because node 9 cites a ‘DM’ paper while node 10 does not. Figure 2(b) gives the aggregate result of query 2. Author nodes in the same aggregate node have the same value of Location and the same Write relations with aggregate Paper nodes.

From the two examples, we can see flexible aggregation on multi-type nodes and relations is meaningful. The main contributions of this paper are:

  1. The flexible aggregation problem for heterogeneous information networks is proposed, which aggregates multi-type nodes and relations;

  2. A novel function based on graph entropy is proposed, which effectively measures structural similarity with regard to different types of relations;

  3. An efficient two-phase aggregation algorithm is proposed, consisting of informational aggregation and structural aggregation;

  4. Experiments demonstrate the effectiveness and efficiency of the algorithm.

2 Preliminaries

Definition 1

Heterogeneous Information Network. A heterogeneous information network is defined as a directed graph \(G=(V,E,T,R,\phi _{V},\phi _{E},A,D,\phi _{A})\), where V is the node set and \(E\subseteq {V\times {V}}\) is the edge set. T is the set of node types, and R is the set of edge types. \(\phi _{V}: V\longrightarrow {T}\) is the node type mapping function and \(\phi _{E}: E\longrightarrow {R}\) is the edge type mapping function. A is the attribute set of nodes and D is the domain of A. \(\phi _{A}:T\longrightarrow 2^{A}\) is the mapping function from each node type to its set of attributes.

Definition 2

Graph Projection. Given selected node types \(Q=\{T_{1},T_{2},\ldots ,T_{l}\}\), \(Q\subseteq T\), and selected edge types \(L=\{R_{1},R_{2},\ldots ,R_{k}\}\), \(L\subseteq R\), the graph projection of G on Q and L is a graph \(G_{pj}=(V_{pj},E_{pj})\), where \(V_{pj}=\{v|v\in V, \phi _{V}(v) \in Q\}\) is the node set and \(E_{pj}\) is the edge set such that, \(\forall u,v \in V_{pj}\), \((u,v)\in E_{pj}\) iff \((u,v)\in E\) and \(\phi _{E}(u,v)\in L\).
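
The projection in Definition 2 can be illustrated with a minimal Python sketch. The dictionary-based representation (`node_type` for \(\phi _{V}\), `edge_type` for \(\phi _{E}\)), the function name `project` and the toy data are assumptions made for illustration only; they are not taken from the paper or from Fig. 1.

```python
from typing import Dict, Set, Tuple

Node = str
Edge = Tuple[Node, Node]

def project(node_type: Dict[Node, str],
            edge_type: Dict[Edge, str],
            Q: Set[str],
            L: Set[str]) -> Tuple[Set[Node], Set[Edge]]:
    """Return (V_pj, E_pj) for selected node types Q and edge types L."""
    # V_pj keeps the nodes whose type is selected.
    V_pj = {v for v, t in node_type.items() if t in Q}
    # E_pj keeps edges of a selected type whose endpoints are both in V_pj.
    E_pj = {(u, v) for (u, v), r in edge_type.items()
            if r in L and u in V_pj and v in V_pj}
    return V_pj, E_pj

# Hypothetical toy bibliographic network.
node_type = {"p1": "Paper", "p2": "Paper", "a1": "Author", "v1": "Venue"}
edge_type = {("p1", "p2"): "Cite", ("a1", "p1"): "Write", ("p1", "v1"): "Publish"}

V_pj, E_pj = project(node_type, edge_type, Q={"Paper"}, L={"Cite"})
```

Keying `edge_type` by the ordered pair assumes at most one edge per pair of nodes, which matches \(E\subseteq V\times V\) in Definition 1.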

Given an attribute set \(S=\{A_{1},A_{2},\ldots ,A_{k}\}\), \(S\subseteq A\), for \(\forall u,v \in V\), if \(A_{i}(u)=A_{i}(v)\) for all \(1 \le i\le k\), then we say \(S(u)=S(v)\).

Definition 3

Graph Partition. Given selected attributes \(S=\{S_{1},S_{2},\ldots ,S_{l}\}\) of Q, where \(S_i{\subseteq }\phi _{A}(T_{i})\), the partition of \(G_{pj}\) w.r.t. Q, S and L is a set of graphs \(G_{p}=\{G_{1},G_{2},\cdots ,G_{m}\}\), satisfying:

  1. For \(\forall G_{i}\in G_{p}\), \(G_{i}={(V_{i},E_{i})}\) is a subgraph of \(G_{pj}\);

  2. \(\bigcup _{i=1}^{m}{V_{i}}=V_{pj}\);

  3. For \(\forall G_{i},G_{j}\in G_{p}\), \(i\ne j\), \(V_{i}\cap V_{j}=\emptyset \);

  4. For \(\forall u,w \in V_{i}\), \(\exists T_{j}\in Q\), \(\phi _{V}(u)=\phi _{V}(w)=T_{j}\) and \(S_{j}(u)=S_{j}(w)\);

  5. For \(\forall u,w \in V_{i}\) and \(\forall G_{j}\in G_{p}\), if \(\exists u^{'}\in V_{j}\), \((u,u^{'})\in E\), \(\phi _{E}((u,u^{'}))\in L\), then \(\exists w^{'}\in V_{j}\), \((w,w^{'})\in E\), \(\phi _{E}((w,w^{'}))\in L\);

  6. For \(\forall u,w \in V_{i}\), \((u,w)\in E_{i}\) iff \((u,w)\in E\) and \(\phi _{E}((u,w))\in L\).

Definition 4

Aggregate Graph. The aggregate graph of G on Q, S and L is a graph \(G_{c}=(V_{c},E_{c},f_{1},f_{2})\), where \(V_{c}\) is node set, \(E_{c}\) is edge set, \(f_{1}\) is aggregate function on \(V_{c}\) and \(f_{2}\) is aggregate function on \(E_{c}\), satisfying:

  1. \(|V_{c}|=|G_{p}|\);

  2. \(\forall a\in V_{c}\), a corresponds to a subgraph \(G_{a}\in G_{p}\);

  3. \(\forall a,b\in V_{c}\), a corresponds to a subgraph \(G_{a}\) and b corresponds to a subgraph \(G_{b}\), if \(a\ne b\), then \(G_{a}\ne G_{b}\);

  4. \(\forall a\in V_{c}\), a corresponds to a subgraph \(G_{a}\), \(f_{1}(a)=f_{1}(V_{a})\);

  5. \(\forall a,b\in V_{c}\), a corresponds to a subgraph \(G_{a}\) and b corresponds to a subgraph \(G_{b}\). \((a,b)\in E_{c}\) iff \(\exists u\in V_{a}\), \(w\in V_{b}\), \((u,w)\in E\) and \(\phi _{E}(u,w)\in L\);

  6. \(\forall (a,b)\in E_{c}\), a corresponds to a subgraph \(G_{a}\) and b corresponds to a subgraph \(G_{b}\). \(f_{2}((a,b))=f_{2}(\{(u,w)|u\in V_{a}, w\in V_{b}, (u,w)\in E, \phi _{E}(u,w)\in L\})\).

We call each \(a\in V_{c}\) an aggregate node and each \(e\in E_{c}\) an aggregate edge. Aggregate functions can be chosen freely, e.g., COUNT or AVERAGE.
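
As an illustration of Definition 4 with COUNT as both \(f_{1}\) and \(f_{2}\), the following Python sketch counts nodes per aggregate node and edges per ordered pair of aggregate nodes. The name `aggregate_count`, the `block_of` mapping from each node to its aggregate node, and the toy data are hypothetical. The sketch assumes that `edges` already contains only edges whose type is in L, and it does not check that the given partition satisfies Definition 3.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

def aggregate_count(block_of: Dict[str, Hashable],
                    edges: Iterable[Tuple[str, str]]):
    """COUNT aggregation: f1 counts nodes, f2 counts edges between blocks."""
    # f1: number of original nodes in each aggregate node.
    node_count = Counter(block_of.values())
    # f2: number of original edges from aggregate node a to aggregate node b.
    edge_count = Counter((block_of[u], block_of[w]) for u, w in edges
                         if u in block_of and w in block_of)
    return node_count, edge_count

# Hypothetical partition of three Paper nodes by Topic.
block_of = {"p1": "DB", "p2": "DB", "p3": "DM"}
edges = [("p1", "p3"), ("p2", "p3")]

node_count, edge_count = aggregate_count(block_of, edges)
```

Here the aggregate node "DB" has value 2, "DM" has value 1, and the aggregate edge from "DB" to "DM" has value 2.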

Graph entropy [1] has been widely used in graph mining, so we employ it to measure the structural consistency of nodes.

Definition 5

Graph Entropy. \(G_{c}=(V_{c},E_{c},f_{1},f_{2})\) is an aggregate graph of G on Q, S and L, for \(\forall a=(V_{a},E_{a}),b=(V_{b},E_{b})\in V_{c}\), the entropy from a to b is

$$\begin{aligned} H_{b}(a)= {\left\{ \begin{array}{ll} -\frac{|V_{b}(a)|}{|V_{a}|}\cdot \log _{2}\frac{|V_{b}(a)|}{|V_{a}|} &{} |V_{b}(a)|\ne 0\\ 0 &{} |V_{b}(a)|=0 \end{array}\right. } . \end{aligned}$$
(1)

where \(V_{b}(a)=\{v|v\in V_{a},\exists w\in V_{b},(v,w)\in E,\phi _{E}((v,w))\in L\}\).

The entropy of a is

$$\begin{aligned} H(a)=\sum _{b\in {V_{c}}}\lambda _{a,b}H_{b}(a). \end{aligned}$$
(2)

\(\lambda _{a,b}\) represents the weight of the relation from nodes in a to nodes in b.
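
Equations (1) and (2) can be sketched in Python as follows. The function names, the representation of aggregate nodes as sets of node identifiers, and the `weight` callback standing in for \(\lambda _{a,b}\) are assumptions for illustration; `edges` is assumed to hold only edges whose type is in L.

```python
import math
from typing import Callable, Dict, Set, Tuple

def entropy_to(V_a: Set[str], V_b: Set[str],
               edges: Set[Tuple[str, str]]) -> float:
    """H_b(a) from Eq. (1)."""
    # V_b(a): nodes of a with at least one edge into b.
    linked = {v for v in V_a if any((v, w) in edges for w in V_b)}
    if not linked:
        return 0.0
    p = len(linked) / len(V_a)
    return -p * math.log2(p)

def entropy(a: str, blocks: Dict[str, Set[str]],
            edges: Set[Tuple[str, str]],
            weight: Callable[[str, str], float]) -> float:
    """H(a) from Eq. (2); weight(a, b) plays the role of lambda_{a,b}."""
    return sum(weight(a, b) * entropy_to(blocks[a], blocks[b], edges)
               for b in blocks)

# Hypothetical example: one of two 'DB' papers cites the 'DM' paper.
blocks = {"DB": {"p9", "p10"}, "DM": {"p3"}}
edges = {("p9", "p3")}
h = entropy("DB", blocks, edges, weight=lambda a, b: 1.0)
```

In this example half of the nodes in "DB" link to "DM", so \(H_{DM}(DB)=-\frac{1}{2}\log _{2}\frac{1}{2}=0.5\). The entropy is 0 when either all or none of the nodes in a link to b, which is the structurally consistent case.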

Definition 6

C-function. The C-function of aggregate graph \(G_{c}\) is

$$\begin{aligned} F(G_{c})=\sum _{i=1}^{l}\alpha _{i}\sqrt{NUM_{T_{i}}}+\sum _{a\in V_{c}}H(a). \end{aligned}$$
(3)

where \(NUM_{T_{i}}=|\{a|a\in V_{c},\forall u\in V_{a},\phi _{V}(u)=T_{i}\}|\).

\(\alpha _{i}\) distinguishes the importance of different node types. Fewer aggregate nodes are easier for users to understand, but merging nodes increases graph entropy, so the C-function balances the two terms.
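
A small Python sketch of Eq. (3) makes the two terms explicit. The function name `c_function`, its dictionary arguments and the numbers in the example are hypothetical; the entropies \(H(a)\) are assumed to have been computed beforehand from Eq. (2).

```python
import math
from typing import Dict

def c_function(num_by_type: Dict[str, int],
               alpha: Dict[str, float],
               entropies: Dict[str, float]) -> float:
    """F(G_c) from Eq. (3).

    num_by_type[T_i] is NUM_{T_i}, the number of aggregate nodes of type T_i.
    entropies[a] is H(a) for each aggregate node a.
    """
    size_term = sum(alpha[t] * math.sqrt(n) for t, n in num_by_type.items())
    entropy_term = sum(entropies.values())
    return size_term + entropy_term

# Hypothetical aggregate graph: 4 Paper nodes and 9 Author nodes.
F = c_function(num_by_type={"Paper": 4, "Author": 9},
               alpha={"Paper": 1.0, "Author": 1.0},
               entropies={"a": 0.5, "b": 0.0})
```

With these values the size term is \(\sqrt{4}+\sqrt{9}=5\) and the entropy term is 0.5, giving \(F=5.5\).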

The flexible aggregation problem is defined as follows.

Input: A heterogeneous information network \(G=(V,E,T,R,\phi _{V},\phi _{E},A,D,\phi _{A})\), selected node types \(Q=\{T_{1},T_{2},\cdots ,T_{l}\}\), \(Q\subseteq T\), with selected attributes \(S=\{S_{1},S_{2},\ldots ,S_{l}\}\), \(S_i{\subseteq }\phi _{A}(T_{i})\), selected edge types \(L=\{R_{1},R_{2},\ldots ,R_{k}\}\), \(L\subseteq R\), and aggregate functions \(f_{1}\), \(f_{2}\);

Output: Aggregate graph \(G_{c}=(V_{c},E_{c},f_{1},f_{2})\).

Objective: Minimize \(F(G_{c})\).

3 Aggregation Algorithm

To distinguish nodes by both attribute semantics and structure, we design a two-phase aggregation: informational aggregation and structural aggregation.

Informational aggregation. This process guarantees that nodes aggregated together have the same type and attribute values. It partitions the nodes of the selected types according to their selected attributes.
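
Informational aggregation amounts to grouping nodes by their type and selected attribute values. The following Python sketch shows one way to do this; the function name, the dictionary-based inputs and the toy data are assumptions for illustration and are not the paper's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def informational_aggregation(
        node_type: Dict[str, str],
        attrs: Dict[str, Dict[str, str]],
        selected: Dict[str, List[str]]) -> Dict[Tuple, Set[str]]:
    """Group nodes of the selected types by (type, selected attribute values).

    selected maps each selected node type T_i to its attribute list S_i.
    An empty list puts all nodes of that type into a single group.
    """
    groups: Dict[Tuple, Set[str]] = defaultdict(set)
    for v, t in node_type.items():
        if t not in selected:
            continue  # node type not in Q
        key = (t, tuple(attrs[v][name] for name in selected[t]))
        groups[key].add(v)
    return dict(groups)

# Hypothetical data: group Paper nodes by Topic.
node_type = {"p1": "Paper", "p2": "Paper", "p3": "Paper", "a1": "Author"}
attrs = {"p1": {"Topic": "DB"}, "p2": {"Topic": "DB"},
         "p3": {"Topic": "DM"}, "a1": {"Location": "US"}}

groups = informational_aggregation(node_type, attrs, {"Paper": ["Topic"]})
```

The result contains one group for the 'DB' papers and one for the 'DM' paper; the Author node is ignored because its type is not selected.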

Structural aggregation. Building on the informational aggregation, we make nodes in the same aggregate node have similar structures, reducing the C-function by decreasing graph entropy. This process raises three challenges: which aggregate node to partition, what strategy to use for the partition, and when to stop iterating. We discuss each in turn.

Challenge 1. To decrease the C-function, we choose the aggregate node with the largest graph entropy, because the nodes in it have diverse structures. To improve the readability of aggregate graphs, among aggregate nodes with the same graph entropy we prefer the larger one. In each iteration, we therefore choose the aggregate node with the largest partition level, where the partition level of a is \(P(a)=\sqrt{|a|}\cdot {H(a)}\).

Challenge 2. To respond quickly, in each iteration we divide the chosen aggregate node a into two aggregate nodes according to whether its nodes have neighbors in t, where \(t=\arg {\max _{b}\{{\lambda _{a,b}H_{b}(a)}}\}\).

Challenge 3. In view of C-function minimization, the size of the aggregate graph should be moderate. The iteration does not terminate until the C-function reaches its first maximal value or the size of the aggregate graph exceeds a specified threshold.
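
The selection and splitting steps of Challenges 1 and 2 can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the paper's implementation: all weights \(\lambda _{a,b}\) are set to 1, aggregate nodes are stored as a list of node sets, `edges` holds only edges whose type is in L, and the loop stops only when no aggregate node has positive entropy or when a hypothetical `max_blocks` size limit is reached. The C-function-based stopping test of Challenge 3 is omitted.

```python
import math
from typing import List, Set, Tuple

Edges = Set[Tuple[str, str]]

def linked(V_a: Set[str], V_b: Set[str], edges: Edges) -> Set[str]:
    """V_b(a): nodes of V_a with at least one edge into V_b."""
    return {v for v in V_a if any((v, w) in edges for w in V_b)}

def h_to(V_a: Set[str], V_b: Set[str], edges: Edges) -> float:
    """H_b(a) from Eq. (1)."""
    p = len(linked(V_a, V_b, edges)) / len(V_a)
    return -p * math.log2(p) if p > 0 else 0.0

def split_once(blocks: List[Set[str]], edges: Edges) -> bool:
    """Split the block with the largest partition level P(a) in two."""
    def H(i: int) -> float:  # Eq. (2) with all weights equal to 1
        return sum(h_to(blocks[i], b, edges) for b in blocks)
    a = max(range(len(blocks)),
            key=lambda i: math.sqrt(len(blocks[i])) * H(i))
    if H(a) <= 0:
        return False  # every block is already structurally consistent
    t = max(range(len(blocks)),
            key=lambda j: h_to(blocks[a], blocks[j], edges))
    with_t = linked(blocks[a], blocks[t], edges)
    without_t = blocks[a] - with_t
    blocks[a] = with_t        # nodes that have a neighbor in t
    blocks.append(without_t)  # nodes that have no neighbor in t
    return True

def structural_aggregation(blocks: List[Set[str]], edges: Edges,
                           max_blocks: int) -> List[Set[str]]:
    """Repeat splitting until nothing qualifies or the size limit is hit."""
    blocks = [set(b) for b in blocks]  # work on a copy
    while len(blocks) < max_blocks and split_once(blocks, edges):
        pass
    return blocks

# Hypothetical example: p9 cites the 'DM' paper p3, p10 does not.
result = structural_aggregation([{"p9", "p10"}, {"p3"}],
                                {("p9", "p3")}, max_blocks=10)
```

In the example the block containing p9 and p10 has entropy 0.5 and is split on the 'DM' block, which separates p9 from p10 as in the discussion of Fig. 2(a). Because \(H_{t}(a)>0\) implies that some but not all nodes of a link to t, both halves of a split are non-empty.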

4 Experimental Evaluation

The Amazon network (see footnote 1) contains three types of nodes: Customer, Product and Category. Four types of relations exist: the CoPurchase relation between Products, the Purchase relation between Customer and Product, the Classify relation between Product and Category, and the Like relation between Customers. Each node type has a set of attributes: Customer(ID, Purchase times), Product(ID, Rank, Reviews) and Category(Name). The data set has 53,182 customers, 5,000 products and 4 categories. There are 147 CoPurchase edges, 77,997 Purchase edges, 7,231 Like edges and 5,000 Classify edges. We set \(\alpha _{customer}\), \(\alpha _{product}\) and \(\alpha _{category}\) to 1, \(\lambda _{{product}\leftrightarrow {product}}=15\), \(\lambda _{{customer}\leftrightarrow {product}}=2\), \(\lambda _{{product}\leftrightarrow {category}}=1\) and \(k=20\). Experiments are run on a Microsoft Windows 7 machine with a 3.1 GHz Intel Core i5-2400 CPU and 4 GB of main memory, using Microsoft Visual Studio 2010.

We compare our algorithm with the algorithm of [2]. We apply the compared algorithm to each node type separately, without taking structure into consideration.

Query 1. A set of node types \(Q=\{Product,Category\}\) with attributes \(S=\{S_{Product},S_{Category}\}\) and relations \(L=\{CoPurchase,Classify\}\), where \(S_{Product}=\emptyset \) and \(S_{Category}=\{Name\}\).

Fig. 3. Aggregate graph of query 1 by compared algorithm

Fig. 4. Aggregate graph of query 1 by our algorithm

Figure 3 shows the aggregate graph of query 1 produced by the compared algorithm. The values on nodes and edges are their aggregate values. Dotted lines represent edges whose function values are below 10, and solid lines represent the other edges. Books are the most numerous products, music ranks second, and DVDs and videos are the fewest. The co-purchased DVD products are not co-purchased with music and DVD. Figure 4 presents the aggregate result of query 1 produced by our algorithm, which gives a more detailed result than the compared algorithm. Books that are co-purchased with music products are unlikely to be co-purchased with DVDs and videos. Meanwhile, the co-purchased music products are not co-purchased with books, which may be co-purchased with DVDs and videos. Aggregate results become more interesting once structural information is considered.

Query 2. A set of node types \(Q=\{Customer,Product,Category\}\) with attributes \(S=\{S_{Customer},S_{Product},S_{Category}\}\) and relations \(L=\{CoPurchase,Classify,Purchase\}\), where \(S_{Customer}=\{\text{Purchase power}\}\), \(S_{Product}=\emptyset \) and \(S_{Category}=\{Name\}\).

Figure 5 compares the runtimes of the two algorithms on the two queries. The x-axis represents queries 1 and 2 and the y-axis shows the average runtime; each query is run 10 times and the running times are averaged. Previous work focuses only on the attributes of nodes, while our work also considers the structure of the network. In query 2, our algorithm costs 13.2 s more than the compared algorithm.

Fig. 5. Runtime

Fig. 6. C-function

Figure 6 compares the C-function values of the two algorithms. The x-axis represents queries 1 and 2 and the y-axis shows the C-function values. Although aggregation that includes structural aggregation costs more time, its C-function values are much smaller than those of the compared algorithm. The values of the compared algorithm increase, while those of our algorithm decrease, because graph entropy is reduced by taking structure into consideration.

5 Conclusions

We introduced the flexible aggregation problem on heterogeneous information networks. To aggregate efficiently, we proposed a two-phase aggregation algorithm that aggregates nodes with similar attributes and structures. Experimental results demonstrate that our algorithm can provide more accurate and implicit knowledge with a wealth of information.