
1 Introduction

With the growing popularity of online social networks such as Weibo and Twitter, the problem of “information overload” [1] has emerged, and social media platforms are investing more effort in meeting users’ individualized demands through personalized services such as recommendation systems. A user profile, the representation that captures certain characteristics of an individual user [25], is the basis of recommendation systems [7] and precision marketing [2, 3]. As a result, user profiling methods, which help obtain accurate and effective user profiles, have drawn increasing attention from both industry and academia.

A straightforward way to infer user profiles is to leverage information from users’ activities, which requires users to be active. However, in many real-world applications a significant portion of users are passive: they keep following and reading but do not generate any content. As a result, label propagation methods for user profiling [4,5,6], which mainly use social network information rather than users’ activities, have been widely studied. To obtain richer and more accurate user profiles, many studies use multiple labels to describe users’ attributes or interests. Some studies assumed that labels are independent [5], so the associations among labels were ignored and some implicit label features remained hidden. Other studies [1, 8, 9] considered explicit associations among labels and achieved better performance. Besides explicit associations, there are implicit associations among labels that can make user profiles more accurate and comprehensive. Our previous work [10] leveraged these internal connections among labels, which we call implicit associations. However, that work considered only the relations among labels and did not jointly model the user and label semantic information derived from user-generated texts, relationships and label information, which is also important for user profiling.

To take advantage of this insight, we propose a graph convolutional network with implicit label associations (GCN-IA) for user profiling. A probability matrix is first designed to capture the implicit associations among labels for user representation. Then, we learn user embeddings and label embeddings jointly based on user-generated texts, relationships and label information. Finally, we perform multi-label classification on the learned user representations to predict the profiles of unlabeled users. The main contributions of this paper are summarized as follows:

  • Insight. We present a novel insight into combining implicit label associations with user semantic information and label semantic information. In online social networks, users’ personalized social and living habits give rise to implicit associations among labels. At the same time, the user and label information contained in user-generated texts, relationships and label information is significant for constructing user profiles.

  • Method. We propose GCN-IA, a graph convolutional network with implicit label associations, for user profiling. We first construct the social network graph from the relationships between users and design a probability matrix to record the implicit label associations, and then combine this probability matrix with the classical GCN method to embed user and label semantic information.

  • Evaluation. We evaluate GCN-IA on four real Weibo datasets of different sizes. The comparative experiments assess the accuracy and effectiveness of GCN-IA, and the results show that its performance is significantly better than that of several previous methods.

The rest of this paper is organized as follows. Sect. 2 briefly reviews related work. Sect. 3 describes the details of GCN-IA, and Sect. 4 presents the experiments and results. Finally, Sect. 5 concludes the paper and discusses future work.

2 Related Works

Label propagation has the advantages of linear complexity and a low requirement for labeled users, and the disadvantages of low accuracy and unstable propagation. Existing label propagation methods for user profiling fall into three categories. The first optimizes the label propagation process to obtain more stable and accurate profiles, the second propagates multiple labels through the social network structure to obtain more comprehensive user profiles, and the third applies deep learning methods such as GCN to infer multi-label user profiles.

2.1 Propagation Optimization

Label propagation has been optimized by leveraging more user attribute information, specifying the propagation direction and improving the propagation algorithm. Subelj et al. proposed a balanced propagation algorithm in which increasing propagation preferences decide the update order of certain nodes, so that randomness is counteracted by node balancers [14]. Ren et al. introduced a node importance measure based on degree and clustering coefficient information to guide the propagation direction [15]. Li et al. leveraged user attribute information and attribute similarity to increase the recall of user profiles [5]. Huang et al. redefined the label propagation process with a multi-source integration framework that considers content and network information jointly [16]. Explicit associations among labels have also been taken into consideration: Glenn et al. [1] introduced explicit association labels, and their results demonstrated the effectiveness of the method.

In our previous work, we introduced implicit association labels into multi-label propagation [10]. The method was proved to converge, ran faster than the traditional label propagation algorithm, and performed significantly better than the state-of-the-art method on Weibo datasets. However, that work [10] did not jointly learn user embeddings and label embeddings from user-generated texts, relationships and label information, which are important for user profiling.

2.2 Multi-label Propagation

Multi-label algorithms have been widely applied to obtain richer profiles. Gregory et al. proposed the COPRA algorithm, which extends the label and propagation step to more than one community so that each node can hold up to v labels [17]. Zhang et al. used social relationships to mine user interests and discovered potential interests with this approach [6]. Xie et al. recorded all historical labels from the multi-label propagation process, which makes the profile results more stable [18]. Wu et al. proposed balanced multi-label propagation by introducing a balanced belonging coefficient p, which improves the quality and stability of user profile results on top of COPRA [19].

The work above improves label propagation in different respects. However, it is still difficult to obtain accurate and comprehensive profiles because of the lack of input information and the complexity of community structures.

2.3 GCN Methods

GCN [20] is one of the most popular deep learning methods and can be understood simply as a feature extractor for graphs. By learning graph structure features through convolutional neural networks, GCN is widely used in node classification, graph classification, edge prediction and other research fields. GCN is a semi-supervised learning method that infers the classes of unknown nodes from the features of a small number of labeled nodes and the graph structure. Because this idea is highly similar to label propagation, it is natural to construct multi-label user profiles with GCN. Wu et al. proposed a social recommendation model based on GCN [21], in which both user embeddings and item embeddings are learned to study how users’ interests are affected by the diffusion process in social networks. William et al. [22] and Yao et al. [23] applied GCN to text classification and recommendation systems respectively, incorporating node labels and graph structure into GCN modeling. However, existing GCN-based methods rarely consider the implicit relationships between labels.

3 Methodology

3.1 Overview

This section focuses on improving graph convolutional networks (GCN) with implicit association labels. The goal of this paper is to learn user representations for the multi-label user profiling task by modeling user-generated texts and user relationships.

The overall architecture of GCN-IA is shown in Fig. 1. The model consists of three components: the Prior Knowledge Enhancement (PKE) module, the User Representation module and the Classification module. Similar to other graph-based methods, we formulate the social network as a heterogeneous graph. In this graph, nodes represent the users in the social network and edges represent users’ multiple relationships such as following, supporting and forwarding. First, the PKE module captures the implicit associations among labels for user representation. Then, the User Representation module learns user embeddings and label embeddings jointly based on user-generated texts, relationships and label information. Finally, the Classification module performs multi-label classification on the user representations to predict the profiles of unlabeled users.

Fig. 1. Overall architecture of GCN-IA.

3.2 Prior Knowledge Enhancement Module

Social networks are rich in knowledge. According to [10], implicit associations among labels are significant for user profiling. In this part, we introduce knowledge of the implicit associations among labels to capture the connections among users and their profile labels.

A prior knowledge probability matrix P is defined in Eq. (1). The higher the value of \( P_{ij} \), the higher the probability of propagation between labels \( l_{i} \) and \( l_{j} \).

$$ P_{ij} = \frac{\left| \left\{ t \mid t \in I \wedge \left( l_{i}, l_{j} \right) \subseteq t \right\} \right|}{\sum_{i = 0}^{m} \sum_{j = 0}^{m} \left| \left\{ t \mid t \in I \wedge \left( l_{i}, l_{j} \right) \subseteq t \right\} \right|} \tag{1} $$

Associations in social networks are complex because of uncertainty [12] or special events [13]. Therefore, we define the label collection I, whose elements are sampled by co-occurrence, cultural associations, event associations or custom associations, as shown in Eq. (2).

$$ I = I_{1} \cup I_{2} \cup I_{3} \cup \ldots \tag{2} $$

where \( I_{i} \left( i = 1, 2, 3, \ldots \right) \) denotes the interest label set of each user.
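As a minimal sketch of Eq. (1), P can be estimated by counting how many label sets in I contain each pair of labels and dividing by the total count. The function name `implicit_association_matrix` and the toy label sets are illustrative assumptions. The sketch also assumes that only pairs of distinct labels are counted, which the text does not specify.

```python
from itertools import combinations


def implicit_association_matrix(label_sets, labels):
    """Estimate P (Eq. 1) from label co-occurrence counts.

    Assumption: only pairs of distinct labels are counted, so the
    diagonal of P stays at zero.
    """
    index = {label: k for k, label in enumerate(labels)}
    m = len(labels)
    counts = [[0] * m for _ in range(m)]
    for t in label_sets:
        # Indices of the known labels that occur in this label set.
        present = sorted({index[label] for label in t if label in index})
        for i, j in combinations(present, 2):
            counts[i][j] += 1
            counts[j][i] += 1
    total = sum(sum(row) for row in counts)
    if total == 0:
        return [[0.0] * m for _ in range(m)]
    return [[c / total for c in row] for row in counts]


labels = ["Health", "Women", "Entertainment", "Tourism", "Society"]
toy_label_sets = [          # hypothetical interest label sets I_1, I_2, I_3
    {"Health", "Tourism"},
    {"Health", "Tourism", "Society"},
    {"Entertainment"},
]
P = implicit_association_matrix(toy_label_sets, labels)
print(P[0][3])  # association between Health and Tourism
```

In this toy example, Health and Tourism co-occur in two of the three label sets, so their entry is the largest in P, while labels that never co-occur with others keep a zero row.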

3.3 User Representation Module

Generally, the key idea of GCNs is to learn an iterative convolutional operation on graphs, where each convolutional operation generates the current node representations by aggregating the local neighbors from the previous layer. A GCN is a multilayer neural network that operates directly on a graph and induces embedding vectors of nodes based on the properties of their neighborhoods.

In the user representation module, we apply GCNs to embed users and profile labels into a vector space and to learn user and label representations jointly from user-generated content and social relationships. Specifically, the implicit associations are introduced as prior knowledge so that the GCN can model the associations among labels.

Formally, the model considers a social network \( G = \left( {V,E} \right) \), where V and E are the sets of nodes and edges, respectively. In our model, there are two types of nodes: user nodes and label nodes. The initial embeddings of the user nodes and label nodes, denoted as X, are obtained from the user names and their content via a pre-trained word2vec model.
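One simple way to turn word vectors into a row of X is to average the vectors of the tokens attached to a node. The text states only that a pre-trained word2vec model is used, so the averaging step, the helper `average_embedding` and the toy vector table below are assumptions made for illustration.

```python
import numpy as np


def average_embedding(tokens, word_vectors, dim):
    """Average the vectors of the known tokens; zeros if none is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Toy 2-dimensional vectors standing in for a pre-trained word2vec model.
toy_vectors = {
    "travel": np.array([1.0, 0.0]),
    "beach": np.array([0.0, 1.0]),
}
user_tokens = ["travel", "beach", "unseen_word"]  # user name and content tokens
label_tokens = ["travel"]                         # tokens of a label name

# One row of X per node: here one user node followed by one label node.
X = np.vstack([
    average_embedding(user_tokens, toy_vectors, dim=2),
    average_embedding(label_tokens, toy_vectors, dim=2),
])
print(X)
```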

We build edges among nodes based on user relationships (user-user edges), users’ profiles (user-label edges) and implicit associations among labels (label-label edges). We introduce the adjacency matrix A of G and its degree matrix D, where \( D_{ii} = \sum\nolimits_{j = 1, \ldots ,n} {A_{ij} } \). The diagonal elements of A are set to 1 because of self-loops. The weight of an edge between a user node and a label node is based on user profile information, as formulated in Eq. (3).

$$ A_{ij} = \begin{cases} 1 & \text{if user } i \text{ has label } j \\ 0 & \text{otherwise} \end{cases}, \quad \text{where } i \in \mathcal{U}_{gold},\ j \in \mathcal{C} \tag{3} $$

where \( \mathcal{U} \) is the set of all users in the social network, \( \mathcal{U}_{gold} \) denotes the labeled users, and \( \mathcal{C} \) is the set of user profile labels.

To utilize label co-occurrence information for knowledge enhancement, we calculate the weights between two label nodes as described in Sect. 3.2. The weights between two user nodes are defined in Eq. (4) according to user relationships.

$$ A_{ij} = \begin{cases} 1 \times \mathrm{Sim}(i, j) & \text{if } (u_{i}, u_{j}) \in \mathcal{R} \\ 0 & \text{otherwise} \end{cases}, \quad \text{where } i, j \in \mathcal{U} \tag{4} $$

where \( \mathcal{R} = \{ (u_{0}, u_{1}), (u_{1}, u_{3}), \ldots \} \) is the set of relations between users and \( \mathrm{Sim}(i, j) \) indicates the similarity between user i and user j, following [10]. The smaller the value, the closer the distance between the two users.
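The three edge types can be assembled into a single adjacency matrix, sketched below under several assumptions that the text does not fix: user nodes are indexed before label nodes, edges are treated as undirected, and `sim` is a placeholder callable because the similarity measure of [10] is not reproduced here. The name `build_adjacency` and the toy inputs are illustrative.

```python
import numpy as np


def build_adjacency(n_users, user_labels, relations, sim, P):
    """Assemble A with user nodes first, followed by label nodes."""
    P = np.asarray(P, dtype=float)
    n_labels = P.shape[0]
    n = n_users + n_labels
    A = np.zeros((n, n))
    # User-label edges (Eq. 3): weight 1 for each label of a labeled user.
    for u, c in user_labels:
        A[u, n_users + c] = A[n_users + c, u] = 1.0
    # User-user edges (Eq. 4): weight Sim(i, j) for each relation in R.
    for i, j in relations:
        A[i, j] = A[j, i] = sim(i, j)
    # Label-label edges: implicit association probabilities (Sect. 3.2).
    A[n_users:, n_users:] = P
    # Self-loops: the diagonal elements of A are set to 1.
    np.fill_diagonal(A, 1.0)
    return A


toy_P = [[0.0, 0.5], [0.5, 0.0]]       # toy label-label probabilities
A = build_adjacency(
    n_users=3,
    user_labels=[(0, 0), (1, 1)],      # (user index, label index) pairs
    relations=[(0, 1), (1, 2)],        # pairs in R
    sim=lambda i, j: 0.5,              # placeholder for Sim(i, j) from [10]
    P=toy_P,
)
print(A)
```

With three users and two labels, A is a 5 x 5 matrix whose lower-right 2 x 2 block holds the label-label weights.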

GCN stacks multiple convolutional operations to simulate message passing on graphs, so both the graph structure and the node attributes are leveraged. For a one-layer GCN, the new k-dimensional node feature matrix is computed as:

$$ L^{(1)} = \sigma \left( \tilde{A} X W_{0} \right) \tag{5} $$

where \( \tilde{A} = D^{-1/2} A D^{-1/2} \) is the symmetrically normalized adjacency matrix and \( W_{0} \) is a weight matrix. \( \sigma \left( \cdot \right) \) is an activation function, e.g. the ReLU function \( \sigma \left( x \right) = \max \left( {0,x} \right) \). The information propagation process is computed as in Eq. (6) by stacking multiple GCN layers.

$$ L^{(j + 1)} = \sigma \left( \tilde{A} L^{(j)} W_{j} \right) \tag{6} $$

where j denotes the layer number and \( L^{(0)} = X \).
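The propagation rule of Eqs. (5) and (6) can be sketched in a few lines of NumPy. The toy graph, the random features and weights, and the function names `normalize_adjacency` and `gcn_layer` are assumptions for illustration. The sketch shows only the forward computation, not the authors' implementation.

```python
import numpy as np


def normalize_adjacency(A):
    """Return D^{-1/2} A D^{-1/2}, with D_ii the row sums of A."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    # Scale row i by D_ii^{-1/2} and column j by D_jj^{-1/2}.
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]


def gcn_layer(A_norm, L, W):
    """One propagation step (Eq. 6) with ReLU as the activation."""
    return np.maximum(A_norm @ L @ W, 0.0)


# Toy graph: a path of three nodes with self-loops on the diagonal.
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
A_norm = normalize_adjacency(A)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))      # L^(0) = X, four input features per node
W0 = rng.normal(size=(4, 8))     # first-layer weights
W1 = rng.normal(size=(8, 2))     # second-layer weights

L1 = gcn_layer(A_norm, X, W0)    # Eq. (5)
L2 = gcn_layer(A_norm, L1, W1)   # Eq. (6) with j = 1
print(L2.shape)
```

Each layer mixes a node's features with those of its neighbors, so after two layers every node representation depends on its two-hop neighborhood.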

3.4 User Profile Prediction

The prediction of user profiles is regarded as a multi-class classification problem. After the above procedures, we obtain user representations from user-generated content and relationships. The node embeddings for user representation are fed into a softmax classifier, which projects the final representation into the target space of class probabilities:

$$ Z = p_{i} (c \mid \mathcal{R}, \mathcal{U}; \Theta) = \mathrm{softmax}\left( \tilde{A}\, \sigma \left( \tilde{A} X W_{0} \right) W_{1} \right) \tag{7} $$

Finally, the loss function is defined as the cross-entropy error over all labeled users as shown in Eq. (8).

$$ \mathcal{L} = - \sum\nolimits_{d \in y_{u}} \sum\nolimits_{f = 1}^{F} Y_{df} \ln Z_{df} \tag{8} $$

where \( y_{u} \) is the set of indices of the labeled users, F is the dimension of the output features, which equals the number of classes, and Y is the label indicator matrix. The weight parameters \( W_{0} \) and \( W_{1} \) can be trained via gradient descent.
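Eqs. (7) and (8) can be illustrated with a forward pass and a loss restricted to labeled nodes. This is a sketch under assumed toy inputs: the graph is a complete graph on three nodes, the weights are random, and the names `forward` and `masked_cross_entropy` are hypothetical. Training itself, i.e. the gradient descent step, is omitted.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with a max-shift for numerical stability."""
    shifted = x - x.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def forward(A_norm, X, W0, W1):
    """Two-layer GCN of Eq. (7): softmax(A ReLU(A X W0) W1)."""
    hidden = np.maximum(A_norm @ X @ W0, 0.0)
    return softmax(A_norm @ hidden @ W1)


def masked_cross_entropy(Z, Y, labeled_idx, eps=1e-12):
    """Cross-entropy of Eq. (8), summed over the labeled rows only."""
    Z = np.asarray(Z, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # eps guards against log(0) when a predicted probability underflows.
    return -np.sum(Y[labeled_idx] * np.log(Z[labeled_idx] + eps))


rng = np.random.default_rng(0)
A_norm = np.full((3, 3), 1.0 / 3.0)  # normalized complete graph with self-loops
X = rng.normal(size=(3, 4))
W0 = rng.normal(size=(4, 8))
W1 = rng.normal(size=(8, 5))         # F = 5 interest classes

Z = forward(A_norm, X, W0, W1)
Y = np.zeros((3, 5))
Y[0, 2] = 1.0                        # only node 0 is labeled, with class 2
loss = masked_cross_entropy(Z, Y, labeled_idx=[0])
print(Z.sum(axis=1), loss)
```

Because the sum in the loss runs only over the labeled indices, unlabeled users still influence the result through the propagation in the forward pass but contribute no loss terms of their own.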

4 Experiments

4.1 Dataset

Weibo is the largest social network platform in China. Following [10], we evaluate our method on Weibo datasets of different scales.

The datasets are sampled from different users at different times. We select five classes as users’ interest profiles: Health, Women, Entertainment, Tourism and Society.

The details of the datasets are illustrated in Table 1.

Table 1. The details of the datasets.

4.2 Comparisons and Evaluation Setting

To evaluate the performance of our method (GCN-IA), we compare it with existing methods, including a textual feature-based method and a relation feature-based method. In addition, to evaluate the contribution of implicit association labels to GCN, we compare GCN-IA with the classical GCN. The details of these baselines are as follows:

SVM [26] uses a support vector machine to construct user profiles based on user-generated content. In our experiment, we select the usernames and blogs of users to construct user representations based on textual features. The textual features are obtained via a pre-trained word2vec model.

MLP-IA [10] uses a multi-label propagation method to predict user profiles. It captures relationship information by constructing a probability transfer matrix. A user is collected as a labeled user if the account is marked with a “V”, which means that the identity has been verified by Weibo. According to the analysis of Jing et al. [24], these users are critical in the propagation.

In the experiments, we analyze the precision (P) and recall (R) of each method, which represent the accuracy and comprehensiveness of the user profiles, respectively. The F1-measure (F1) is the harmonic mean of precision and recall and summarizes the overall performance of a method.
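For concreteness, the metrics can be computed per interest class and then averaged into a macro-F1 score. The sketch below assumes that each user's true and predicted profiles are given as sets of labels. The function names and the toy data are illustrative and are not taken from the experiments.

```python
def precision_recall_f1(true_sets, pred_sets, label):
    """Precision, recall and F1 of one label over all users."""
    pairs = list(zip(true_sets, pred_sets))
    tp = sum(1 for t, p in pairs if label in t and label in p)
    fp = sum(1 for t, p in pairs if label not in t and label in p)
    fn = sum(1 for t, p in pairs if label in t and label not in p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


def macro_f1(true_sets, pred_sets, labels):
    """Unweighted mean of the per-label F1 scores."""
    scores = [precision_recall_f1(true_sets, pred_sets, l)[2] for l in labels]
    return sum(scores) / len(scores)


# Toy profiles for three users (hypothetical data).
true_sets = [{"Health"}, {"Health", "Tourism"}, {"Tourism"}]
pred_sets = [{"Health"}, {"Tourism"}, {"Health", "Tourism"}]

print(precision_recall_f1(true_sets, pred_sets, "Health"))  # (0.5, 0.5, 0.5)
print(macro_f1(true_sets, pred_sets, ["Health", "Tourism"]))  # 0.75
```

Macro averaging gives every interest class the same weight regardless of how many users carry it, so a weak class such as a rare interest affects the score as much as a frequent one.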

4.3 Results and Analysis

The experimental results are shown in Table 2. The results show that our method achieves a significant increase in macro-F1 on all datasets.

Table 2. Experimental results of user profile task.

Compared with the feature-based method, our model achieves a significant improvement. SVM performs poorly because it does not consider user relationships in the social network. It models only user-generated content, such as usernames and users’ blogs.

Compared with the relation-based method, our model achieves improvements on all datasets. In particular, on dataset 2# we improve macro-F1 by 13.83%. MLP-IA [10] establishes user profiles based on users’ relationships via label propagation. It suffers from not leveraging user-generated content, which contains semantic contextual features. Our model represents users based on both relationships and content information via the GCN module, which is more beneficial for the multi-label user profiling task.

Fig. 2. The results of each interest class in 4# dataset.

The results for each interest class on the 4# dataset are shown in Fig. 2. GCN-IA performs stably across all interest profiles, which demonstrates the robustness of our model.

As the results show, the performance on the Entertainment interest class is slightly weaker than that of the baselines. On Weibo, there are many blogs about entertainment. Fake information, including fake reviews and fake accounts created for specific purposes, exists in social networks and poses a great challenge for user profiling.

Our model constructs user profiles from both textual features and relational features. The results demonstrate that user relationships provide a beneficial signal for semantic feature extraction and that the two kinds of features reinforce each other.

5 Conclusion and Future Work

In this paper, we have studied user profiling with graph convolutional networks that embed implicit label associations, user information and label information. We proposed a method that utilizes the implicit associations among labels and then applies graph convolutional networks to embed the label and user information. On four real-world Weibo datasets, experimental results demonstrate that GCN-IA produces a significant improvement over several state-of-the-art methods.

In future work, we will incorporate more prior knowledge to achieve higher performance.