1 Introduction

Recently, there has been a surge in applying network embedding to various tasks in network science, such as classification, clustering, link prediction, and community detection [5, 7, 12, 18]. Network embedding aims at learning a low-dimensional feature vector for each node that preserves the node’s structural characteristics [4, 7]. The majority of previously proposed network embedding models consider homogeneous networks, i.e. networks consisting of a single type of node and relation [7, 12, 16, 18]. However, the majority of real-world information networks and social networks are heterogeneous in nature, i.e. they consist of multiple types of nodes and relations [15]. For example, an academic bibliographic network may be represented using Author (A), Paper (P), and Venue (V) (conference/journal) as nodes, with different contextual relations such as Author-writes-Paper (AP) and Author-publishes-at-Venue (AV).

The majority of previous studies on mining heterogeneous networks [3, 14] exploit the meta-path [8], which is a sequence of relations between different node types. Further, symmetric meta-paths are capable of preserving heterogeneous proximity between the underlying nodes. For example, in a bibliographic network, the meta-path APA gives a proximity estimate between two authors collaborating on the same paper, whereas AVA represents proximity between two authors publishing at the same venue. While exploring a network, a meta-path defines the specific path the explorer should follow. Recently, meta-paths have been used to generate network embeddings [5, 6] and have been reported to obtain promising results for various applications in network mining such as node classification, link prediction, and clustering. In this paper, we systematically analyze the effectiveness of using meta-paths for generating network embeddings, specifically for bibliographic networks. Since a meta-path restricts exploration to the partial network it defines, it may lose some of the inherent network properties. Motivated by this, this paper attempts to understand the following two important issues that arise when meta-paths are used for generating network embeddings.

  1. Does a meta-path lose network information, which can degrade the network embedding performance?

  2. Are meta-path based embeddings independent of the end task?

To investigate the above-discussed problems, we evaluate embeddings generated from different types of meta-paths using three state-of-the-art embedding models, namely (i) Metapath2vec [5], (ii) Node2vec [7], and (iii) VERSE [18], on the Co-authorship prediction task and the Author’s research area classification task in the DBLP heterogeneous bibliographic network. The experimental observations indicate that meta-path based network embedding cannot be generalized to graph-based problems of diverse nature. Further, selecting suitable node types in the underlying heterogeneous network seems to be more important than considering different meta-paths for heterogeneous network embedding.

The rest of the paper is organized as follows. Section 2 presents some of the previous work on network embedding. Section 3 gives a brief description of heterogeneous networks, meta-paths, and network embedding. Section 4 describes the experimental setups and results. Finally, Sect. 5 concludes the paper.

2 Literature Survey

For network embedding, the majority of the initial studies attempt to map natural graph representations, such as the normalized adjacency matrix or the Laplacian matrix, to lower dimensions by using spectral graph theory [2, 10] and various non-linear dimensionality reduction techniques [1, 13, 17]. However, these models do not scale to large real-world networks because they rely on graph decomposition techniques, which require the whole matrix beforehand.

To overcome the above limitations, many network embedding models adopt a framework that first generates a neighborhood sample using a random walk or a proximity measure and then leverages it to learn the node embeddings using a skip-gram [9] based neural network model [7, 12, 16]. For example, Node2vec [7] uses a second-order random walk to generate the neighborhood samples and learns the node embeddings using the skip-gram model, whereas VERSE [18] preserves vertex-to-vertex similarity using Personalized PageRank [11] and then exploits a single-layer neural network to learn the embeddings.

All the above graph embedding models were proposed for homogeneous networks. More recently, Metapath2vec [5] was proposed for heterogeneous network embedding; it samples node neighborhoods using a random walk guided by a meta-path. In a similar direction, the study in [6] exploits the combined effect of different meta-paths of predefined length to generate node embeddings in a heterogeneous network.

3 Background Study

Definition 1 (Heterogeneous Network)

A Heterogeneous Network can be defined as a six-tuple \({<}N,E,N^{\tau }, E^{\tau }, \phi , \psi {>}\) where N is a set of nodes, E is a set of edges, \(N^{\tau }\) is a set of node types, \(E^{\tau }\) is a set of edge types, \(\phi :N \rightarrow N^{\tau }\) maps any node \(n \in N\) to a node type \(n^{\tau }\in N^{\tau }\), and \(\psi :E \rightarrow E^{\tau }\) maps any edge \(e \in E\) to an edge type \(e^{\tau }\in E^{\tau }\). A homogeneous network is a special case of a heterogeneous network in which the cardinalities of \(N^{\tau }\) and \(E^{\tau }\) are both equal to one, i.e. \(|N^{\tau }|= |E^{\tau }| = 1\).

Definition 2 (Meta-path)

Given a heterogeneous network G where \(N^{\tau } = \{n^{\tau }_{1}, n^{\tau }_{2}, \cdots , n^{\tau }_{l}\}\) and \(E^{\tau } = \{e^{\tau }_{1}, e^{\tau }_{2}, \cdots , e^{\tau }_{l-1}\}\), a meta-path \(\mathcal {P}_{(n^{\tau }_{1}, n^{\tau }_{l})}\) can be defined as an ordered sequence of edge types that must be traversed to reach node type \(n^{\tau }_{l}\) from node type \(n^{\tau }_{1}\), i.e. \(\mathcal {P}_{(n^{\tau }_{1}, n^{\tau }_{l})}\) = \(n^{\tau }_{1}\xrightarrow {e^{\tau }_{1}} n^{\tau }_{2}\xrightarrow {e^{\tau }_{2}} \cdots \xrightarrow {e^{\tau }_{l-1}} n^{\tau }_{l} \).
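As an illustration of Definitions 1 and 2, the following minimal Python sketch stores a toy heterogeneous network together with its type maps and checks whether a node sequence is an instance of a meta-path. The class name `HeteroNetwork`, the helper `follows_meta_path`, and the toy nodes are hypothetical and chosen only for this example; a meta-path is written here as its sequence of node types (e.g. "APA"), which assumes that each pair of node types is linked by at most one edge type.

```python
from dataclasses import dataclass


@dataclass
class HeteroNetwork:
    """Toy container for <N, E, N^tau, E^tau, phi, psi> (hypothetical helper)."""
    nodes: set        # N
    edges: set        # E, undirected edges stored as frozenset({u, v})
    node_type: dict   # phi: node -> node type
    edge_type: dict   # psi: edge -> edge type

    @property
    def node_types(self):
        # N^tau is recovered as the image of phi
        return set(self.node_type.values())

    @property
    def edge_types(self):
        # E^tau is recovered as the image of psi
        return set(self.edge_type.values())

    def is_homogeneous(self):
        # Special case |N^tau| = |E^tau| = 1
        return len(self.node_types) == 1 and len(self.edge_types) == 1


def follows_meta_path(net, path, meta_path):
    """True if `path` is a walk in `net` whose node types spell `meta_path`."""
    if len(path) != len(meta_path):
        return False
    types_ok = all(net.node_type[n] == t for n, t in zip(path, meta_path))
    edges_ok = all(frozenset((u, v)) in net.edges for u, v in zip(path, path[1:]))
    return types_ok and edges_ok


# Toy bibliographic network: two authors, one paper, one venue.
_edges = {
    frozenset(("a1", "p1")): "AP",
    frozenset(("a2", "p1")): "AP",
    frozenset(("p1", "v1")): "PV",
}
net = HeteroNetwork(
    nodes={"a1", "a2", "p1", "v1"},
    edges=set(_edges),
    node_type={"a1": "A", "a2": "A", "p1": "P", "v1": "V"},
    edge_type=_edges,
)
```

With this toy network, the sequence a1, p1, a2 is an instance of the meta-path APA, whereas a1, p1, v1 is not.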

3.1 Homogeneous Network Embedding

Following the popularity of the skip-gram based word2vec model proposed in [9] for generating word embeddings from a large corpus of sentences, the studies in [7, 12, 16] adapt skip-gram for network embedding. These network embedding frameworks exploit a random walk based sampling strategy to generate node sequences that capture a node’s neighborhood characteristics, much as a sentence captures the contextual relation between words. Formally, for a given network G(N, E), network embedding using the skip-gram model aims at maximizing the neighborhood probability for a given node:

$$\begin{aligned} \underset{\theta }{\arg \max } \sum _{n \in N}\sum _{c \in \mathcal {N}(n)} \log p(c|n;\theta ) \end{aligned}$$
(1)

where \(\mathcal {N}(n)\) gives the neighbors of n and \(p(c|n;\theta )\) is the conditional probability of observing neighbor node c for the given node n.

3.2 Heterogeneous Network Embedding

For a given heterogeneous network \(G(N,E,N^{\tau },E^{\tau })\), the skip-gram model defined in Eq. (1) can be transformed into the heterogeneous skip-gram model as follows [5]:

$$\begin{aligned} \underset{\theta }{\arg \max } \sum _{n \in N}\sum _{\tau \in N^{\tau }}\sum _{c_{\tau } \in \mathcal {N}_{\tau }(n)} \log p(c_{\tau }|n;\theta ) \end{aligned}$$
(2)

where \(\mathcal {N}_{\tau }(n)\) gives the neighbor nodes of n that are of the \(\tau ^{th}\) type. Furthermore, \(p(c_{\tau }|n;\theta )\) is defined using the softmax function, i.e. \(p(c_{\tau }|n;\theta )\) = \(\frac{\exp (X_{c_\tau }\cdot X_n)}{\sum _{u \in N}\exp (X_u \cdot X_n)}\), where \(X_n\) corresponds to the embedding vector of node n.
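To make Eq. (2) concrete, the sketch below evaluates the softmax probability and the heterogeneous skip-gram objective for a fixed set of toy embeddings. It only computes the value of the objective; it does not perform the optimization, and it uses the full softmax rather than any sampling-based approximation. The function names, the toy vectors, and the neighborhood dictionary are assumptions made for illustration.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def softmax_prob(X, c, n):
    """p(c | n) = exp(X_c . X_n) / sum_{u in N} exp(X_u . X_n)."""
    scores = {u: dot(X[u], X[n]) for u in X}
    m = max(scores.values())  # subtract the max for numerical stability
    denom = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[c] - m) / denom


def hetero_skipgram_objective(X, neighbors_by_type):
    """Sum over nodes n, node types tau, and c in N_tau(n) of log p(c | n)."""
    total = 0.0
    for n, by_type in neighbors_by_type.items():
        for _tau, neighbors in by_type.items():
            for c in neighbors:
                total += math.log(softmax_prob(X, c, n))
    return total


# Toy 2-dimensional embeddings (arbitrary values, for illustration only).
X = {"a1": [1.0, 0.0], "a2": [0.5, 0.5], "p1": [0.0, 1.0]}

# N_tau(n): neighbors of each node grouped by node type.
neighbors_by_type = {
    "a1": {"P": ["p1"]},
    "a2": {"P": ["p1"]},
    "p1": {"A": ["a1", "a2"]},
}

objective = hetero_skipgram_objective(X, neighbors_by_type)
```

Because every term is the logarithm of a probability, the objective is non-positive; training would adjust X to move it towards zero.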

3.3 Meta-path Based Heterogeneous Network Embedding

The meta-path based heterogeneous network embedding model exploits the heterogeneous skip-gram defined in Eq. (2). Further, random walks guided by meta-paths are used to generate neighborhood samples for all the nodes. In other words, the random walker traverses only the part of the heterogeneous network that is specific to the underlying meta-path. For example, Metapath2vec exploits the APVPA (or AVA) meta-path while generating random walk based node sequences [5].
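A meta-path guided random walk of this kind can be sketched as follows. This is a simplified illustration that picks uniformly among neighbors of the required type and assumes a symmetric meta-path given as a string of one-letter node types; it is not the Metapath2vec reference implementation, and the function name and toy graph are hypothetical.

```python
import random


def meta_path_walk(adj, node_type, start, meta_path, walk_length, rng):
    """Random walk that only steps to neighbors of the type the meta-path expects.

    A symmetric meta-path such as "APVPA" is repeated cyclically, so the
    expected type sequence is A, P, V, P, A, P, V, P, ...
    """
    if meta_path[0] != meta_path[-1]:
        raise ValueError("meta-path must be symmetric to be repeated")
    if node_type[start] != meta_path[0]:
        raise ValueError("start node does not match the first node type")
    cycle = meta_path[:-1]  # drop the repeated end type, e.g. "APVP"
    walk = [start]
    while len(walk) < walk_length:
        wanted = cycle[len(walk) % len(cycle)]
        candidates = [v for v in adj[walk[-1]] if node_type[v] == wanted]
        if not candidates:  # dead end: no neighbor of the required type
            break
        walk.append(rng.choice(candidates))
    return walk


# Toy bibliographic network with authors, papers, and one venue.
node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P", "v1": "V"}
adj = {
    "a1": ["p1"],
    "a2": ["p1", "p2"],
    "p1": ["a1", "a2", "v1"],
    "p2": ["a2", "v1"],
    "v1": ["p1", "p2"],
}

walk = meta_path_walk(adj, node_type, "a1", "APVPA", 9, random.Random(0))
```

The resulting node sequences play the role of sentences for the skip-gram model: nodes that co-occur within a window of the walk are treated as neighbors.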

While Metapath2vec has been proposed specifically for heterogeneous network embedding, the above-discussed meta-path based network embedding framework can easily be adopted by homogeneous network embedding methods by redefining the input network according to a specific meta-path. Therefore, this paper further exploits two homogeneous network embedding models, namely Node2vec [7] and VERSE [18], for meta-path based heterogeneous network embedding.

4 Experimental Setups and Analysis

4.1 Experimental Dataset

This paper uses the DBLP bibliographic dataset (reported in [19]), which covers publication information from 1968 to 2011. To generate network embeddings using different meta-paths and to evaluate the embedding performance over different applications, we divide the dataset into two parts: (i) from 1968 to 2008 for generating the network embeddings, and (ii) from 2009 to 2011 for evaluating the embeddings over different applications. This paper considers three types of nodes, namely (i) Author (A), (ii) Paper (P), and (iii) Venue (V), for constructing the various types of networks defined by different meta-paths. We construct the following four types of undirected networks from the DBLP 1968-2008 dataset.

  • AA: It is a homogeneous unweighted co-authorship network considering only Author node type. Two nodes are connected if they co-author a paper.

  • APA: It is a heterogeneous unweighted network considering Author and Paper node types. An author is connected to a paper if he/she is one of the authors of the paper.

  • AVA: It is a heterogeneous unweighted network considering Author and Venue node types. An author is connected to a venue if he/she published a paper in that venue. This network structure is similar to the structure considered in Metapath2vec [5].

  • All: It is a heterogeneous unweighted network considering all three types of nodes (Author, Paper, and Venue) and corresponding relationships between them.
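The construction of the four networks can be sketched as below for a toy list of publication records. The record layout (paper id, venue, list of authors) and the function name are assumptions made for this example. For the All network, the sketch assumes that the relationships are author-paper and paper-venue edges; whether direct author-venue edges are also included is not specified above.

```python
from itertools import combinations


def build_networks(papers):
    """Build unweighted, undirected edge sets for AA, APA, AVA, and All.

    `papers` is a list of (paper_id, venue, authors) records. Nodes are
    tagged with their type so that ids of different types cannot collide.
    """
    aa, apa, ava, all_net = set(), set(), set(), set()
    for paper_id, venue, authors in papers:
        unique_authors = sorted(set(authors))
        # AA: two authors are connected if they co-author a paper.
        for a, b in combinations(unique_authors, 2):
            aa.add((("A", a), ("A", b)))
        for a in unique_authors:
            # APA: an author is connected to each paper he/she wrote.
            apa.add((("A", a), ("P", paper_id)))
            # AVA: an author is connected to each venue he/she published in.
            ava.add((("A", a), ("V", venue)))
            # All (assumption): author-paper edges ...
            all_net.add((("A", a), ("P", paper_id)))
        # ... plus paper-venue edges.
        all_net.add((("P", paper_id), ("V", venue)))
    return {"AA": aa, "APA": apa, "AVA": ava, "All": all_net}


toy_papers = [
    ("p1", "v1", ["alice", "bob"]),
    ("p2", "v1", ["bob", "carol"]),
]
networks = build_networks(toy_papers)
```

Because edges are kept in sets, repeated collaborations or repeated publications at the same venue collapse into a single edge, which matches the unweighted setting.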

Table 1 shows the characteristics of these experimental networks.

Table 1. Characteristics of different networks constructed over DBLP data

4.2 Experimental Setups

As mentioned above, three popular, recently proposed network embedding models, namely (i) Metapath2vec [5], (ii) Node2vec [7], and (iii) VERSE [18], are considered for generating node embeddings. For all the models, we use the same hyper-parameter values as described in the original studies cited above. All the embedding results reported in this paper use 100-dimensional vectors. To investigate the performance of different meta-paths and their associated embeddings, we evaluate the embedding quality using the following two applications.

Co-authorship Prediction: Like the study in [18], we treat the Co-authorship prediction task as a classification problem, i.e. given a node pair, classify whether the pair has a co-author relation or not. To model it as a binary classification problem, we generate feature vectors representing node pairs using the Hadamard operator [7, 18]. To avoid possible bias of the embedding towards the target application, we use the DBLP 2009-2011 data (non-overlapping with the embedding dataset) for generating samples for the classification task. This sample contains 29,677 co-authorship relations and 18,457 authors. We use a random 80-20 split into training and test samples, evaluated with four different classifiers, namely Gaussian Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), and Logistic Regression (LR). To avoid over-fitting, the above setup has been repeated 10 times.
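The feature construction and the split described above can be sketched as follows. The Hadamard operator is the element-wise product of the two node embeddings. The helper names, the toy embeddings, and the seed are hypothetical, and the classifier itself is omitted; any of the four classifiers could be trained on the resulting feature vectors.

```python
import random


def hadamard(u, v):
    """Element-wise (Hadamard) product of two embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return [a * b for a, b in zip(u, v)]


def pair_features(embeddings, pairs):
    """One feature vector per node pair, built with the Hadamard operator."""
    return [hadamard(embeddings[a], embeddings[b]) for a, b in pairs]


def split_80_20(samples, rng):
    """Random 80-20 split into training and test samples."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = (len(shuffled) * 4) // 5
    return shuffled[:cut], shuffled[cut:]


# Toy 3-dimensional embeddings (the paper uses 100 dimensions).
embeddings = {"a1": [1.0, 2.0, 3.0], "a2": [4.0, 5.0, 6.0], "a3": [0.5, 0.0, -1.0]}
features = pair_features(embeddings, [("a1", "a2"), ("a1", "a3")])

# Repeating the split with different seeds mirrors the 10 repetitions.
splits = [split_80_20(range(10), random.Random(seed)) for seed in range(10)]
```

The Hadamard product keeps the pair representation at the same dimensionality as a single node embedding and is symmetric in the two nodes, which suits an undirected relation such as co-authorship.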

Research Area Classification: We next investigate the quality of the embeddings for predicting an author’s research area. For each author in DBLP 2009-2011, we identify (using the Field attribute in [19]) the area in which the author has the maximum number of publications and take it as the author’s class label. As for Co-authorship prediction, we use a random 80-20 split for all the classifiers and repeat the setup 10 times.
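The labelling rule can be sketched as below: each author receives the field in which he/she has the most publications. The record layout and field names are invented for this example. How ties are resolved is not stated above; this sketch simply keeps the field encountered first.

```python
from collections import Counter


def author_labels(papers):
    """Map each author to the field with his/her maximum publication count.

    `papers` is a list of (authors, field) records. Ties are broken in
    favour of the field seen first, which is an assumption of this sketch.
    """
    counts = {}
    for authors, field in papers:
        for author in authors:
            counts.setdefault(author, Counter())[field] += 1
    return {author: c.most_common(1)[0][0] for author, c in counts.items()}


toy_records = [
    (["alice", "bob"], "Databases"),
    (["alice"], "Databases"),
    (["alice", "carol"], "Machine Learning"),
    (["carol"], "Machine Learning"),
]
labels = author_labels(toy_records)
```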

4.3 Result and Discussion

Tables 2 and 3 present the Accuracy for Co-authorship prediction and Author’s research area classification, respectively, using the three network embedding models discussed above for all networks, i.e. AA, AVA, APA, and All. From Tables 2 and 3, it is observed that LR outperforms the other classifiers in 93% of the cases for Co-authorship prediction and in 75% of the cases for the Author’s research area classification task. Therefore, we select the LR Accuracy for further analysis.

Table 2. Accuracy for co-authorship prediction by classifiers for different networks, (\(\mathbf{Combine = Concat(Metapath2vec, Node2vec, VERSE)}\))
Table 3. Accuracy for author’s research area classification by classifiers for different networks, (\(\mathbf{Combine = Concat(Metapath2vec, Node2vec, VERSE)}\))

We first investigate whether meta-path based embedding loses information. It is evident from Tables 2 and 3 that almost all the models perform best when exploiting the All network and perform worse with the AA, APA, and AVA networks for both tasks, i.e. Co-authorship prediction and research area classification. Thus, it can be inferred that a meta-path alone may be a weak representation of the network because it does not incorporate the impact of other relational properties while capturing the node neighborhood.

Secondly, we investigate whether the same embedding responds coherently to different problems. From Tables 2 and 3, it is clearly visible that APA performs better than AVA for Co-authorship prediction, whereas AVA performs better than APA for classifying the Author’s research area. This observation holds for all the embedding techniques used in this study. Thus, meta-path based heterogeneous network embedding cannot be generalized across tasks of different nature.

The homogeneous network AA and the heterogeneous network APA preserve similar proximity, i.e. co-authorship between the underlying pair of authors. From Table 2, it is evident that AA performs better than APA for Co-authorship prediction in the majority of cases. However, for Author’s research area classification in Table 3, APA performs better than AA in almost all the scenarios. Thus, it can be inferred that meta-path based heterogeneous network embedding may perform differently (worse or better) compared to homogeneous network embedding when subjected to tasks of diverse nature.

Among all the embedding models, VERSE outperforms the others for almost all the networks and classifiers in both the Co-authorship prediction and research area classification tasks. This may be because, unlike Metapath2vec and Node2vec, VERSE exploits Personalized PageRank [11], which captures vertex-to-vertex similarity, while generating the neighborhood samples.

We further investigate combining all three embeddings (Metapath2vec, Node2vec, VERSE) by concatenating their feature vectors. From Tables 2 and 3, it is observed that the combined embedding always outperforms the individual embeddings for Co-authorship prediction and Author’s research area classification over all four networks.
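The Combine setting can be sketched as a per-node concatenation of the vectors produced by the three models. The function name and the toy 2-dimensional vectors are hypothetical; with the 100-dimensional embeddings used in this paper, the concatenated vector would have 300 dimensions. Restricting the result to nodes present in all three embeddings is an assumption of this sketch.

```python
def combine(*embeddings):
    """Concatenate the vectors of each node across several embedding models.

    Only nodes that appear in every embedding are kept (an assumption of
    this sketch); the order of the arguments fixes the order of the parts.
    """
    shared = set.intersection(*(set(e) for e in embeddings))
    return {n: [x for e in embeddings for x in e[n]] for n in shared}


# Toy 2-dimensional stand-ins for the three models' outputs.
metapath2vec_emb = {"a1": [0.1, 0.2], "a2": [0.3, 0.4]}
node2vec_emb = {"a1": [1.0, 2.0], "a2": [3.0, 4.0]}
verse_emb = {"a1": [-1.0, -2.0], "a2": [-3.0, -4.0]}

combined = combine(metapath2vec_emb, node2vec_emb, verse_emb)
```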

5 Conclusion

In this paper, we investigate the applicability of meta-paths in heterogeneous network embedding for the Co-authorship prediction and Author’s research area classification problems in the heterogeneous DBLP database. From various experimental results, we observe that, by using appropriate node types, the majority of the embedding methods outperform their counterparts that exploit meta-path based networks for both of the above-mentioned tasks. Further, it is also evident that exploiting past co-authorship relations or the APA meta-path yields better co-author prediction than the AVA meta-path, which exploits the author’s publication venue. On the other hand, the AVA meta-path contributes positively to the Author’s research area classification problem and performs better than the APA meta-path. Thus, for heterogeneous network embedding, one should carefully choose the node types, relation types, and meta-paths that best capture the network characteristics relevant to the underlying problem.