Flexible Aggregation on Heterogeneous Information Networks

Yin, Dan; Gao, Hong; Zou, Zhaonian; Liu, Xianmin; Li, Jianzhong

doi:10.1007/978-3-319-22324-7_18

Conference paper
First Online: 01 January 2015

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9052)

Abstract

With the advent of heterogeneous information networks that consist of multi-type, interconnected nodes, such as bibliographic networks and knowledge graphs, it is important to study flexible aggregation in such networks. In this paper, we investigate the flexible aggregation problem on heterogeneous information networks, which is defined over multiple types of nodes and relations. We develop an efficient heuristic algorithm that aggregates in two phases: informational aggregation and structural aggregation. Extensive experiments on real-world data sets demonstrate the effectiveness and efficiency of the proposed algorithm.

1 Introduction

Heterogeneous information networks are a recently emerged kind of graph data that involve multiple types of nodes and relations. Figure 1(a) shows a bibliographic network that contains three types of nodes (Paper, Author and Venue) and four types of relations among them: the Cooperate relation between Authors, the Write relation between Author and Paper, the Cite relation between Papers, and the Publish relation between Paper and Venue. In addition, each node type has a set of attributes, i.e., Paper(ID, Title, Topic, Keywords), Author(ID, Institute, Field, Location) and Venue(Name, Year). The values of some attributes are given in Fig. 1(b).

Aggregation allows users to observe and model data along different dimensions and to perform drill-down, roll-up and other OLAP operations. We investigate the problem of aggregation over multiple types of nodes and relations in heterogeneous information networks. Below we give two aggregation queries over the bibliographic network; the aggregate function in both is COUNT.

Query 1. Aggregate on Paper nodes and the Cite relation; the selected attribute of Paper is Topic.

Query 2. Aggregate on Paper and Author nodes and the Write relation; the selected attribute of Paper is Topic, and the selected attribute of Author is Location.

Figure 2(a) displays the aggregate result of Query 1. Paper nodes in the same aggregate node have the same value of Topic and also have the same Cite relations with other aggregate nodes. Nodes 9 and 10 are not in the same aggregate node, because node 9 cites a ‘DM’ paper while node 10 does not. Figure 2(b) gives the aggregate result of Query 2. Author nodes in the same aggregate node have the same value of Location and the same Write relations with aggregate Paper nodes.

From the two examples, we can see that flexible aggregation on multi-type nodes and relations is meaningful. The main contributions of this paper are:

1. We propose the flexible aggregation problem for heterogeneous information networks, which can aggregate multi-type nodes and relations;
2. We propose a novel function based on graph entropy, which effectively measures structural similarity with regard to different types of relations;
3. We propose an efficient two-phase aggregation algorithm consisting of informational aggregation and structural aggregation;
4. Experiments demonstrate the effectiveness and efficiency of the algorithm.

2 Preliminaries

Definition 1

Heterogeneous Information Network. A heterogeneous information network is defined as a directed graph $G=(V,E,T,R,\phi _{V},\phi _{E},A,D,\phi _{A})$, where V is the node set and $E\subseteq {V\times {V}}$ is the edge set. T is the set of node types, and R is the set of edge types. $\phi _{V}: V\longrightarrow {T}$ is the node type mapping function and $\phi _{E}: E\longrightarrow {R}$ is the edge type mapping function. A is the attribute set of the nodes and D is the domain of A. $\phi _{A}:T\longrightarrow A$ is the mapping function from node types to attributes.

Definition 2

Graph Projection. Given selected node types $Q=\{T_{1},T_{2},\ldots ,T_{l}\}$, $Q\subseteq T$, and selected edge types $L=\{R_{1},R_{2},\ldots ,R_{k}\}$, $L\subseteq R$. The graph projection of G on Q and L is a graph $G_{pj}=(V_{pj},E_{pj})$, where $V_{pj}$ is node set $V_{pj}=\{v|v\in V, \phi _{V}(v) \in Q\}$, $E_{pj}$ is edge set, $\forall u,v \in V_{pj}$, $(u,v)\in E_{pj}$ iff $(u,v)\in E$ and $\phi _{E}(u,v)\in L$.
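As an illustration of Definition 2, the following minimal Python sketch computes a graph projection. The dictionary-based representation, the function name `project` and the toy data are assumptions made for this example and do not come from the paper.

```python
from typing import Dict, Hashable, Set, Tuple

Node = Hashable
Edge = Tuple[Node, Node]


def project(
    node_type: Dict[Node, str],
    edge_type: Dict[Edge, str],
    Q: Set[str],
    L: Set[str],
) -> Tuple[Set[Node], Set[Edge]]:
    """Return (V_pj, E_pj) for selected node types Q and edge types L."""
    # V_pj: nodes whose type is one of the selected types.
    v_pj = {v for v, t in node_type.items() if t in Q}
    # E_pj: edges of a selected type whose endpoints both lie in V_pj.
    e_pj = {
        (u, v)
        for (u, v), r in edge_type.items()
        if r in L and u in v_pj and v in v_pj
    }
    return v_pj, e_pj


# Toy bibliographic network (hypothetical data, not taken from Fig. 1).
node_type = {"p1": "Paper", "p2": "Paper", "a1": "Author", "v1": "Venue"}
edge_type = {
    ("p1", "p2"): "Cite",
    ("a1", "p1"): "Write",
    ("p1", "v1"): "Publish",
}
v_pj, e_pj = project(node_type, edge_type, Q={"Paper"}, L={"Cite"})
```

Storing edge types in a dictionary keyed by ordered node pairs assumes at most one edge per pair, which matches $E\subseteq V\times V$ in Definition 1.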

Given an attribute set $S=\{A_{1},A_{2},\ldots ,A_{k}\}$, $S\subseteq A$, for $\forall u,v \in V$, if $A_{i}(u)=A_{i}(v)$ ($1 \le i\le k$), then we say $S(u)=S(v)$.

Definition 3

Graph Partition. Given selected attributes $S=\{S_{1},S_{2},\ldots ,S_{l}\}$ of Q, where $S_i{\subseteq }\phi _{A}(T_{i})$. The partition of $G_{pj}$ w.r.t. Q, S and L is a set of graphs $G_{p}=\{G_{1},G_{2},\ldots ,G_{m}\}$, satisfying:

1. For $\forall G_{i}\in G_{p}$, $G_{i}={(V_{i},E_{i})}$, $G_{i}$ is a subgraph of $G_{pj}$;
2. $\bigcup _{i=1}^{m}{V_{i}}=V_{pj}$;
3. For $\forall G_{i},G_{j}\in G_{p}$, $i\ne j$, $V_{i}\cap V_{j}=\emptyset $;
4. For $\forall u,w \in G_{i}$, $\exists T_{j}$, $\phi _{V}(u)=\phi _{V}(w)=T_{j}$, $S_{j}(u)=S_{j}(w)$;
5. For $\forall u,w \in G_{i}$, for $\forall G_{j}$, if $\exists u^{'}\in G_{j}$, $(u,u^{'})\in E$, $\phi _{E}((u,u^{'}))\in L$, then $\exists w^{'}\in G_{j}$, $(w,w^{'})\in E$, $\phi _{E}((w,w^{'}))\in L$;
6. For $\forall u,w \in G_{i}$, $(u,w)\in E_{i}$ iff $(u,w)\in E$ and $\phi _{E}(u,w)\in L$.

Definition 4

Aggregate Graph. The aggregate graph of G on Q, S and L is a graph $G_{c}=(V_{c},E_{c},f_{1},f_{2})$, where $V_{c}$ is node set, $E_{c}$ is edge set, $f_{1}$ is aggregate function on $V_{c}$ and $f_{2}$ is aggregate function on $E_{c}$, satisfying:

1. $|V_{c}|=|G_{p}|$;
2. $\forall a\in V_{c}$, a corresponds to a subgraph $G_{a}\in G_{p}$;
3. $\forall a,b\in V_{c}$, a corresponds to a subgraph $G_{a}$ and b corresponds to a subgraph $G_{b}$, if $a\ne b$, then $G_{a}\ne G_{b}$;
4. $\forall a\in V_{c}$, a corresponds to a subgraph $G_{a}$, $f_{1}(a)=f_{1}(V_{a})$;
5. $\forall a,b\in V_{c}$, a corresponds to a subgraph $G_{a}$ and b corresponds to a subgraph $G_{b}$. $(a,b)\in E_{c}$ iff $\exists u\in V_{a}$, $w\in V_{b}$, $(u,w)\in E$ and $\phi _{E}(u,w)\in L$;
6. $\forall (a,b)\in E_{c}$, a corresponds to a subgraph $G_{a}$ and b corresponds to a subgraph $G_{b}$. $f_{2}((a,b))=f_{2}(\{(u,w)\mid u\in V_{a}, w\in V_{b}, (u,w)\in E \text{ and } \phi _{E}(u,w)\in L\})$.

For $\forall a\in V_{c}$, we call it aggregate node, and for $\forall e\in E_{c}$, we call it aggregate edge. Aggregate functions can be selected freely, e.g., COUNT, AVERAGE.
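To make Definition 4 concrete for the COUNT function, the sketch below derives aggregate node and edge values from a given partition. It assumes the partition is supplied as a mapping from each node to a block label and that the edge list is already restricted to the selected edge types; the name `aggregate_count` and the data are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

Node = Hashable


def aggregate_count(
    group_of: Dict[Node, str],
    edges: Iterable[Tuple[Node, Node]],
) -> Tuple[Counter, Counter]:
    """Compute COUNT values for aggregate nodes and aggregate edges.

    group_of maps each projected node to the label of its partition block;
    edges are the projected edges (already filtered to the edge types in L).
    """
    # f1(a): number of original nodes in block a.
    node_count = Counter(group_of.values())
    # f2((a, b)): number of original edges going from block a to block b.
    edge_count = Counter((group_of[u], group_of[w]) for u, w in edges)
    return node_count, edge_count


# Hypothetical partition of four papers by Topic, with Cite edges.
group_of = {"p1": "DB", "p2": "DB", "p3": "DM", "p4": "DM"}
edges = [("p1", "p3"), ("p2", "p3"), ("p2", "p4"), ("p3", "p4")]
node_count, edge_count = aggregate_count(group_of, edges)
```

An aggregate edge (a, b) exists exactly when its count is positive, which corresponds to condition 5; other aggregate functions such as AVERAGE would need the attribute values of the grouped nodes or edges rather than their number.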

Graph entropy [1] has been widely used in graph mining, so we employ it to measure the structural consistency of nodes.

Definition 5

Graph Entropy. $G_{c}=(V_{c},E_{c},f_{1},f_{2})$ is an aggregate graph of G on Q, S and L, for $\forall a=(V_{a},E_{a}),b=(V_{b},E_{b})\in V_{c}$, the entropy from a to b is

$$\begin{aligned} H_{b}(a)= {\left\{ \begin{array}{ll} -\frac{|V_{b}(a)|}{|V_{a}|}\cdot \log _{2}\frac{|V_{b}(a)|}{|V_{a}|} &{} |V_{b}(a)|\ne 0\\ 0 &{} |V_{b}(a)|=0 \end{array}\right. } \end{aligned}$$

(1)

where $V_{b}(a)=\{v|v\in V_{a},\exists w\in V_{b},(v,w)\in E,\phi _{E}((v,w))\in L\}$.

The entropy of a is

$$\begin{aligned} H(a)=\sum _{b\in {V_{c}}}\lambda _{a,b}H_{b}(a). \end{aligned}$$

(2)

$\lambda _{a,b}$ represents the weight of relation from nodes in a to nodes in b.
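Equations (1) and (2) can be evaluated directly. The following sketch assumes that aggregate nodes are given as sets of original nodes keyed by a label, that edges are already restricted to the selected types, and that the weights $\lambda _{a,b}$ are supplied as a function; all names and data are illustrative.

```python
import math
from typing import Callable, Dict, Hashable, List, Set, Tuple

Node = Hashable
Edge = Tuple[Node, Node]


def entropy_to(a: Set[Node], b: Set[Node], edges: List[Edge]) -> float:
    """H_b(a) of Eq. (1), from the fraction of nodes in a with an edge into b."""
    v_b_a = {u for u, w in edges if u in a and w in b}
    if not v_b_a:
        return 0.0
    p = len(v_b_a) / len(a)
    return -p * math.log2(p)


def entropy(
    a_label: str,
    blocks: Dict[str, Set[Node]],
    edges: List[Edge],
    weight: Callable[[str, str], float],
) -> float:
    """H(a) of Eq. (2): weighted sum of H_b(a) over all aggregate nodes b."""
    a = blocks[a_label]
    return sum(
        weight(a_label, b_label) * entropy_to(a, b, edges)
        for b_label, b in blocks.items()
    )


# Hypothetical data: two of the four DB papers cite the single DM paper.
blocks = {"DB": {"p1", "p2", "p3", "p4"}, "DM": {"p5"}}
edges = [("p1", "p5"), ("p2", "p5")]
h_db = entropy("DB", blocks, edges, weight=lambda a, b: 1.0)
```

In this example $|V_{DM}(DB)|/|V_{DB}|=1/2$, so $H_{DM}(DB)=-\frac{1}{2}\log _{2}\frac{1}{2}=0.5$. The entropy is zero when either none or all of the nodes in a link to b, that is, when the nodes in a behave uniformly towards b.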

Definition 6

C-function. The C-function of aggregate graph $G_{c}$ is

$$\begin{aligned} F(G_{c})=\sum _{i=1}^{l}\alpha _{i}\sqrt{NUM_{T_{i}}}+\sum _{a\in V_{c}}H(a). \end{aligned}$$

(3)

where $NUM_{T_{i}}=|\{a|a\in V_{c},\forall u\in V_{a},\phi _{V}(u)=T_{i}\}|$.

$\alpha _{i}$ distinguishes the importance of different types of nodes. An aggregate graph with fewer aggregate nodes is easier for users to understand, but its graph entropy is larger; the C-function balances these two terms.
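The trade-off expressed by Eq. (3) can be seen in a small numeric sketch. The function name `c_function`, the labels and the values below are assumptions chosen for illustration only.

```python
import math
from typing import Dict


def c_function(
    num_by_type: Dict[str, int],
    alpha: Dict[str, float],
    entropies: Dict[str, float],
) -> float:
    """F(G_c) of Eq. (3): weighted size term plus total graph entropy."""
    size_term = sum(alpha[t] * math.sqrt(n) for t, n in num_by_type.items())
    return size_term + sum(entropies.values())


alpha = {"Paper": 1.0}
# One aggregate Paper node whose members are structurally mixed (H = 0.5).
before = c_function({"Paper": 1}, alpha, {"DB": 0.5})
# The same node split into two structurally uniform nodes (H = 0 each).
after = c_function({"Paper": 2}, alpha, {"DB+": 0.0, "DB-": 0.0})
```

With $\alpha _{Paper}=1$ the split lowers F from 1.5 to about 1.414, so it is worthwhile. With $\alpha _{Paper}=2$ the same split would raise F from 2.5 to about 2.83, which shows how $\alpha _{i}$ controls how readily nodes of a type are split.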

The flexible aggregation problem is defined as follows.

Input: A heterogeneous information network $G=(V,E,T,R,\phi _{V},\phi _{E},A,D,\phi _{A})$, selected node types $Q=\{T_{1},T_{2},\ldots ,T_{l}\}$, $Q\subseteq T$, with selected attributes $S=\{S_{1},S_{2},\ldots ,S_{l}\}$, $S_i{\subseteq }\phi _{A}(T_{i})$, selected edge types $L=\{R_{1},R_{2},\ldots ,R_{k}\}$, $L\subseteq R$, and aggregate functions $f_{1}$, $f_{2}$;

Output: Aggregate graph $G_{c}=(V_{c},E_{c},f_{1},f_{2})$.

Objective: Minimize $F(G_{c})$.

3 Aggregation Algorithm

To distinguish nodes by both the semantics of their attributes and their structure, we design a two-phase aggregation: informational aggregation and structural aggregation.

Informational aggregation. This process guarantees that nodes aggregated together have the same type and the same attribute values. It partitions the nodes of the selected types according to their selected attributes.
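A minimal sketch of this phase is a group-by on node type and selected attribute values. The data layout (dictionaries for node types and attributes) and the name `informational_aggregation` are assumptions for this example, not the authors' implementation.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Node = Hashable
BlockKey = Tuple[str, Tuple[object, ...]]


def informational_aggregation(
    node_type: Dict[Node, str],
    attrs: Dict[Node, Dict[str, object]],
    selected: Dict[str, List[str]],
) -> Dict[BlockKey, List[Node]]:
    """Group nodes of the selected types by their selected attribute values.

    selected maps each chosen node type T_i to its attribute list S_i;
    an empty list puts all nodes of that type into a single block.
    """
    blocks: Dict[BlockKey, List[Node]] = defaultdict(list)
    for v, t in node_type.items():
        if t not in selected:
            continue  # node type is not in Q
        key = (t, tuple(attrs[v][name] for name in selected[t]))
        blocks[key].append(v)
    return dict(blocks)


# Hypothetical bibliographic data in the spirit of Query 2.
node_type = {"p1": "Paper", "p2": "Paper", "p3": "Paper", "a1": "Author", "v1": "Venue"}
attrs = {
    "p1": {"Topic": "DB"}, "p2": {"Topic": "DB"}, "p3": {"Topic": "DM"},
    "a1": {"Location": "Asia"}, "v1": {"Name": "X"},
}
selected = {"Paper": ["Topic"], "Author": ["Location"]}
blocks = informational_aggregation(node_type, attrs, selected)
```

Each resulting block satisfies condition 4 of Definition 3; condition 5, which concerns relations, is what the structural phase addresses.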

Structural aggregation. Starting from the result of informational aggregation, we make the nodes in the same aggregate node have similar structures. We can reduce the C-function by decreasing graph entropy. This process poses three challenges: which aggregate node to partition, what strategy to use for the partition, and when the iteration should stop. We discuss how to tackle each challenge.

Challenge 1. In order to decrease the C-function, we may choose the aggregate node with the largest graph entropy, because the nodes in it have diverse structures. To improve the readability of aggregate graphs, among aggregate nodes with the same graph entropy we prefer the one with the larger size. In each iteration, we therefore choose the aggregate node with the largest partition level, where the partition level of a is $P(a)=\sqrt{|a|}\cdot {H(a)}$.

Challenge 2. In order to respond quickly, in each iteration we divide the chosen aggregate node a into two aggregate nodes according to whether its nodes have neighbors in t, where $t=\arg {\max _{b}\{{\lambda _{a,b}H_{b}(a)}}\}$.

Challenge 3. In view of C-function minimization, the size of the aggregate graph should be moderate. The iteration does not terminate until the C-function reaches the first maximal value or the size of the aggregate graph exceeds a specific threshold.
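One iteration of the structural aggregation, following Challenges 1 and 2, might be sketched as below. This is an illustrative reading rather than the authors' implementation: it assumes non-negative weights, interprets the split as separating nodes that have a neighbor in t from those that do not, and leaves the stopping rule of Challenge 3 to the caller. The block labels with the suffixes "+" and "-" are hypothetical.

```python
import math
from typing import Callable, Dict, Hashable, List, Set, Tuple

Node = Hashable
Edge = Tuple[Node, Node]


def entropy_to(a: Set[Node], b: Set[Node], edges: List[Edge]) -> float:
    """H_b(a) of Eq. (1)."""
    linked = {u for u, w in edges if u in a and w in b}
    if not linked:
        return 0.0
    p = len(linked) / len(a)
    return -p * math.log2(p)


def split_step(
    blocks: Dict[str, Set[Node]],
    edges: List[Edge],
    weight: Callable[[str, str], float],
) -> Dict[str, Set[Node]]:
    """Perform one split; return the blocks unchanged if all entropies are zero."""

    def h(a: str) -> float:
        # H(a) of Eq. (2).
        return sum(
            weight(a, b) * entropy_to(blocks[a], blocks[b], edges) for b in blocks
        )

    # Challenge 1: choose the aggregate node with the largest partition level P(a).
    a = max(blocks, key=lambda x: math.sqrt(len(blocks[x])) * h(x))
    if h(a) == 0.0:
        return {name: set(nodes) for name, nodes in blocks.items()}
    # Challenge 2: choose the target t with the largest weighted entropy.
    t = max(
        blocks,
        key=lambda b: weight(a, b) * entropy_to(blocks[a], blocks[b], edges),
    )
    linked = {u for u, w in edges if u in blocks[a] and w in blocks[t]}
    result = {name: set(nodes) for name, nodes in blocks.items() if name != a}
    result[a + "+"] = linked              # nodes of a with a neighbor in t
    result[a + "-"] = blocks[a] - linked  # nodes of a without one
    return result


# Hypothetical data: two of the four DB papers cite the DM paper.
blocks = {"DB": {"p1", "p2", "p3", "p4"}, "DM": {"p5"}}
edges = [("p1", "p5"), ("p2", "p5")]
after_one = split_step(blocks, edges, weight=lambda a, b: 1.0)
```

On this data the DB node is split into the papers that cite a DM paper and those that do not, mirroring the separation of nodes 9 and 10 in Query 1. A second call leaves the blocks unchanged because every remaining entropy is zero.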

4 Experimental Evaluation

The Amazon network (see Note 1) contains three types of nodes: Customer, Product and Category. Four types of relations exist: the CoPurchase relation between Products, the Purchase relation between Customer and Product, the Classify relation between Product and Category, and the Like relation between Customers. Each node type has a set of attributes: Customer(ID, Purchase times), Product(ID, Rank, Reviews) and Category(Name). The data set has 53,182 customers, 5,000 products and 4 categories. There are 147 CoPurchase edges, 77,997 Purchase edges, 7,231 Like edges and 5,000 Classify edges. We set $\alpha _{customer}$, $\alpha _{product}$ and $\alpha _{category}$ to 1, $\lambda _{{product}\leftrightarrow {product}}=15$, $\lambda _{{customer}\leftrightarrow {product}}=2$, $\lambda _{{product}\leftrightarrow {category}}=1$ and $k=20$. Experiments were run on a Microsoft Windows 7 machine with an Intel Core i5-2400 CPU at 3.1 GHz and 4 GB of main memory, using Microsoft Visual Studio 2010.

We compare our algorithm with the algorithm of [2]. We apply the compared algorithm to each type of nodes separately, without taking structure into consideration.

Query 1. A set of node types $Q=\{Product,Category\}$ with attributes $S=\{S_{Product},S_{Category}\}$ and relations $L=\{CoPurchase,Classify\}$, where $S_{Product}=\emptyset $ and $S_{Category}=\{Name\}$.

Figure 3 shows the aggregate graph of Query 1 produced by the compared algorithm. The values on nodes and edges are their aggregate values. Dotted lines represent edges whose function values are below 10, and solid lines represent the other edges. Books are the most numerous products, music products rank second, and DVDs and videos are the fewest. The co-purchased DVD products are not co-purchased with music and DVD. Figure 4 presents the aggregate result of Query 1 produced by our algorithm, which gives a more detailed result than the compared algorithm. Books that are co-purchased with music products are unlikely to be co-purchased with DVDs and videos. Meanwhile, the co-purchased music products are not co-purchased with books, but may be co-purchased with DVDs and videos. The aggregate results become more interesting once structural information is considered.

Query 2. A set of node types $Q=\{Customer,Product,Category\}$ with attributes $S=\{S_{Customer},S_{Product},S_{Category}\}$ and relations $L=\{CoPurchase,Classify,Purchase\}$, where $S_{Customer}=\{Purchase power\}$, $S_{Product}=\emptyset $ and $S_{Category}=\{Name\}$.

Figure 5 compares the runtime of the two algorithms on the two queries. The x-axis represents Queries 1 and 2 and the y-axis shows the average runtime; each query is run 10 times and the running times are averaged. Previous work focuses only on the attributes of nodes, while our work also considers the structure of the network. In Query 2, our algorithm takes 13.2 s longer than the compared algorithm.

Figure 6 compares the C-function values of the two algorithms. The x-axis represents Queries 1 and 2 and the y-axis shows the values of the C-function. Although aggregation that includes structural aggregation costs more time, its C-function values are much smaller than those of the compared algorithm. The values of the compared algorithm increase while those of our algorithm decrease, because graph entropy is reduced by taking structure into consideration.

5 Conclusions

We introduced the flexible aggregation problem on heterogeneous information networks. To aggregate efficiently, we proposed a two-phase aggregation algorithm that aggregates nodes with similar attributes and structures. Experimental results demonstrate that our algorithm can provide more accurate and implicit knowledge with a wealth of information.

Notes

1. SNAP: http://snap.stanford.edu/data/.

References

1. Shetty, J., Adibi, J.: Discovering important nodes through graph entropy: the case of Enron email database. In: Proceedings of the 3rd International Workshop on Link Discovery, pp. 74–81. ACM (2005)
2. Zhao, P., Li, X., Xin, D., Han, J.: Graph cube: on warehousing and OLAP multidimensional networks. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pp. 853–864. ACM (2011)

Acknowledgement

This work is supported by the National Grand Fundamental Research 973 Program of China under grant 2012CB316200, the Key Program of National Natural Science Foundation of China under grant 60933001, the Major Program of National Natural Science Foundation of China under grant 61190115, the General Program of National Natural Science Foundation of China under grant 61173023.

Author information

Authors and Affiliations

Massive Data Computing Research Lab, Harbin Institute of Technology, Harbin, China
Dan Yin, Hong Gao, Zhaonian Zou, Xianmin Liu & Jianzhong Li

Corresponding author

Correspondence to Dan Yin.

Editor information

Editors and Affiliations

Soochow University, Suzhou, China
An Liu
Nagoya University, Nagoya, Japan
Yoshiharu Ishikawa
Wuhan University, Wuhan, China
Tieyun Qian
University of Hong Kong, Hong Kong, China
Sarana Nutanong
Monash University, Clayton, Victoria, Australia
Muhammad Aamir Cheema

About this paper

Cite this paper

Yin, D., Gao, H., Zou, Z., Liu, X., Li, J. (2015). Flexible Aggregation on Heterogeneous Information Networks. In: Liu, A., Ishikawa, Y., Qian, T., Nutanong, S., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9052. Springer, Cham. https://doi.org/10.1007/978-3-319-22324-7_18

DOI: https://doi.org/10.1007/978-3-319-22324-7_18
Published: 30 July 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22323-0
Online ISBN: 978-3-319-22324-7
eBook Packages: Computer Science (R0)
