1 Introduction

With the success of deep learning on various tasks, a new family of methods has emerged, called Graph Neural Networks (GNNs), that achieves remarkable performance on graph-based data (Kipf and Welling 2017; Hamilton et al. 2017). A large number of GNN approaches have been proposed in the literature, mostly aiming to tackle node classification or link prediction problems (Veličković et al. 2017; Xu et al. 2019; Wu et al. 2019; Zhang and Chen 2018). The predictive ability of these models relies mostly on aggregating information from the direct neighborhood of the node or edge to be labeled. However, when labeled data is limited, the choice of which data to label can significantly affect the efficiency of GNNs (Zhu and Goldberg 2009; Sun et al. 2020).

In this direction, recent studies suggest that Active Learning (AL) is an effective technique to improve the models’ robustness (Aggarwal et al. 2014; Ren et al. 2021). Specifically, AL iteratively selects data to be labeled and used to train the model. In the AL literature, there are two main data selection strategies (Settles and Craven 2008): (i) uncertainty-based and (ii) distribution-based. The former chooses the most uncertain samples, based on the entropy of the learned model's predictions. The latter selects the samples that best represent the underlying distribution of training instances, e.g. using centrality measures such as PageRank. Both data selection approaches depend heavily on the initial set of labeled samples.

On the other hand, the recent rise of self-training (ST) showed that enlarging the label set with the most confident pseudo-labels generated by the model learned thus far can improve the representation learning of GNNs (Li et al. 2018; Wang et al. 2021). The core idea of self-training is to automatically augment the label set with unlabeled samples that the model can label with high confidence. The augmented label set is then used to train the final model. Nevertheless, a known issue of ST methods is that the predicted pseudo-labels may introduce noise and bias the learning process (Dai et al. 2021). Therefore, great care is needed in selecting the pseudo-labels.

Despite various advances in both AL and ST, there is limited research on combining the two approaches. Some attempts have been made in natural language processing (Kwak et al. 2022; Yu et al. 2022) and computer vision (Chan et al. 2021; Feng et al. 2021), while recently a framework that combines AL with ST has been introduced (Fazakis et al. 2019), using traditional machine learning methods (not graph-based learning). To the best of our knowledge, there is only one approach in the literature (Zhu et al. 2020) that investigates the combination of AL and self-supervision on graph data. These initial results seem encouraging and have been an important motivation for our work.

In particular, the study presented here addresses the problem of node classification with the use of label-efficient techniques. We suggest that combining AL with ST can reduce the labeling effort and improve the training process of GNNs. To achieve this, we first design an AL strategy which selects highly uncertain cases from a large unlabeled set of nodes. Additionally, we propose an ST technique which enables GNNs to expand their label set by incorporating pseudo-labels that are predicted with confidence by the model. In this way, the predicted labels of one iteration are used as training data in subsequent iterations. Unlike most active learning approaches, which rely solely on highly uncertain samples, our method also utilizes highly certain ones, through self-training. Our experiments show that the iterative application of these two steps can benefit the representation learning of GNNs.

Overall, the main contributions of the paper are the following:

  • We propose a new framework that combines AL with ST to reduce labeling cost and boost the performance of GNNs.

  • We introduce a simple but effective strategy to obtain reliable pseudo-labels in self-training.

  • We confirm through experimentation on several benchmark datasets the effectiveness of the proposed approach.

Notably, our proposed approach can be combined with various GNN backbone architectures, as well as different AL and ST methodologies.

2 Related work

2.1 Graph Neural Networks

One of the most popular GNN approaches is the Graph Convolution Network (GCN) (Kipf and Welling 2017), which performs an iterative propagation through a message passing strategy. In GCN, the node representations are produced by aggregating features (messages) from their neighboring nodes. Other notable GNN architectures include the Graph Attention Network (GAT) (Veličković et al. 2017), which uses a weighted aggregation design and a trainable attention mechanism, GraphSAGE (Hamilton et al. 2017), which introduces different aggregation functions, and FiLM (Brockschmidt 2020), which considers both the source and target nodes of each graph edge in the representation learning process. There are many more GNN models in the literature, and a more detailed overview can be found in recent surveys (Zhou et al. 2020, 2022).

2.2 Active learning

Active Learning (AL) (Aggarwal et al. 2014; Settles 2009) is a well-studied research area with applications in various domains, such as text mining (Schröder and Niekler 2020) and computer vision (Beluch et al. 2018). Several new approaches have been introduced in recent years, with the most successful ones utilizing selection criteria based on uncertainty (Yang et al. 2015; Zhu et al. 2008). A widely applied AL strategy is Query-by-Committee (QbC) (Settles 2009), which selects the most informative samples based on the votes of multiple models (i.e. a committee). Although QbC has been shown to be beneficial, building multiple models can be prohibitive in terms of computational resources. For a more detailed discussion on AL, readers can consult corresponding surveys (Ren et al. 2021; Zhan et al. 2022).

Recently, there has been an increased interest in applying AL to graph-structured data. Early work proposed various selection criteria based on the structure of the graph (Bilgic et al. 2010; Gu et al. 2013). Other work (Appice et al. 2018) introduced a new AL method for regression problems in graph data. More recently, hybrid approaches (Cai et al. 2017; Gao et al. 2018) proposed the linear combination of different heuristics, including information entropy, embedding representativeness and graph centrality, in order to select the most informative nodes to label. Furthermore, a policy network is proposed in Hu et al. (2020a), in order to sequentially select informative nodes using reinforcement learning. Unlike all these approaches, we enrich the AL process with a self-training strategy to further promote the use of unlabeled data during training.

2.3 Self-training GNNs

When the proportion of labeled nodes in a graph is small, the diffusion of supervision information by GNNs is limited. To address this limitation, several methods enhance GNNs through self-training, which extends the supervision by adding nodes that the current model can classify with high confidence, i.e. pseudo-labels (Dai et al. 2021; Wang et al. 2021; Li et al. 2018). These extra nodes may contain valuable local information which does not appear in the initial training set. As an example, Yang et al. (2021) proposed the Self-Enhanced GNN (SEG), which expands the labeled node set based on the predictions of the GNN, as trained until that point. Other work (Caron et al. 2018; Wang et al. 2019) also employs clustering to improve the quality of the pseudo-labels. Alternatively, the most confident samples can be selected (Li 2022) by constructing a new graph which consists of homogeneous and heterogeneous edges between labeled and unlabeled data.

A known issue of all of these methods is that the predicted pseudo-labels may introduce label noise, which biases the learning process. In Wang et al. (2021), the authors argue that the noise is due to the under-explored low-confidence samples and propose a weighted self-training strategy. Along the same lines, our approach utilizes low-confidence samples by combining self-training with active learning.

2.4 Hybrid methods

The idea of combining AL with other learning methods has been explored in several studies. Integrating AL with semi-supervised learning was proposed by Hao et al. (2020) and Xie et al. (2022) to effectively predict molecular properties and classify graphs, respectively. Other work (Yi et al. 2022) proposed an AL approach that utilizes self-supervised models on pretext tasks to achieve state-of-the-art performance on image classification and semantic segmentation. Focusing specifically on self-training, an AL strategy that incorporates pseudo-labels was proposed in Feng et al. (2021), demonstrating significant performance gains in the task of 3D pose estimation. The combination of AL and ST was also proposed by Chaplot et al. (2021), Kwak et al. (2022) and Yu et al. (2022) to improve image and text classification tasks, respectively.

More recently, in Zhu et al. (2020) AL was combined with contrastive learning (You et al. 2020). Initially, the original graph was augmented twice and the agreement between the augmented embeddings was maximized. Then, the produced node embeddings were used in AL by selecting the nodes that have the most similar embeddings to their neighbors. To the best of our knowledge, this is the only existing work that combined AL with a form of self-supervision, i.e. contrastive learning, for the task of node classification. In contrast, our work focuses on how to integrate AL with pseudo-labels.

3 Preliminaries

3.1 Notation

We consider an undirected graph \(G = (V,E,X)\), where \(V=\{v_1,v_2,..,v_{N} \}\) is the set of nodes and \(\mid V\mid = N\), E is the set of edges, where each \(e_{ij} \in E\) denotes an edge between nodes \(v_i\) and \(v_j\), and \(X=\{x_1,x_2,..,x_{N} \}\) indicates the node features with \(X \in \mathbb {R}^{N \times R}\), where R is the dimension of node features. Let us define \(A = [A_{ij}] \in \mathbb {R}^{N \times N}\) as the adjacency matrix of G, where \(A_{ij} = 1\) if \(e_{ij}\) exists and \(A_{ij} = 0\) otherwise. We denote the degree of a node v as \(d_v \in \mathbb {R}^{+}\) and D as the diagonal degree matrix of G, i.e. \(D_{ii}=\sum _j{A_{ij}}\). Also, each node in V is associated with a true label \(y_{i}\), and we use \(\mathcal {Y}=\{y_{1},y_{2},...,y_{N}\}\) to denote the true label vector. Let \(V_{L}=\{v_1,v_2,..,v_{L} \}\) be a set of labeled nodes, then \(V_{U}= V - V_{L}\) is the set of unlabeled nodes.

The learned representation matrix of G, at layer l of a GNN, is represented by \(H^{l}=\{h_{1}^{l},h_{2}^{l},..,h_{N}^{l} \}\) and \(H^{l} \in \mathbb {R}^{N \times R^l}\), where \(h_{v}^{l}\) is the representation vector of node v at layer l, and \(R^l\) is the dimension of the representation vector at layer l.

The layer-wise propagation operation of GCN (Kipf and Welling 2017) is modeled as follows:

$$\begin{aligned} H^{l+1} = \sigma (D^{\frac{-1}{2}}\hat{A}D^{\frac{-1}{2}}H^lW^l) \end{aligned}$$
(1)

where \(\hat{A}=A + I\), in order to add self-loops, \(W^l\) is the trainable weight matrix of layer l and \(\sigma\) is an activation function.
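As an illustration of Eq. (1), the following minimal sketch applies one propagation step to small nested-list matrices. The function names `matmul` and `gcn_layer` are hypothetical, ReLU stands in for the activation \(\sigma\), and the degrees are assumed to be computed from \(\hat{A}\) (i.e. including self-loops), as in Kipf and Welling (2017).

```python
import math


def matmul(P, Q):
    """Multiply two matrices stored as nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def gcn_layer(A, H, W):
    """Return ReLU(D^{-1/2} (A + I) D^{-1/2} H W) for nested-list matrices."""
    n = len(A)
    # A_hat = A + I adds a self-loop to every node.
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Assumption: degrees are computed from A_hat, as in Kipf and Welling (2017).
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in A_hat]
    A_norm = [[d_inv_sqrt[i] * A_hat[i][j] * d_inv_sqrt[j] for j in range(n)]
              for i in range(n)]
    Z = matmul(matmul(A_norm, H), W)
    # ReLU plays the role of the activation sigma.
    return [[max(0.0, z) for z in row] for row in Z]


# Two connected nodes with one scalar feature each and an identity weight.
out = gcn_layer([[0.0, 1.0], [1.0, 0.0]], [[1.0], [3.0]], [[1.0]])
```

In this toy graph every normalized entry equals 0.5, so each node ends up with the average of the two input features.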

3.2 Problem formulation

Given a small number of labeled samples \(\mathcal {X}_L=\{(v_i,x_i,y_i)\}^{L}_{i=1}\) and a large number of unlabeled samples \(\mathcal {X}_U=\{(v_j,x_j)\}^{U}_{j=1}\), our goal is to train a Graph Neural Network \(g(\mathcal {X},A;\theta ): \mathcal {X} \mapsto \mathcal {Y}\), through an iterative process of k rounds that utilizes active learning and self-training at each round. In particular, we aim to select the T most uncertain, as well as the B most confidently classified samples from \(\mathcal {X}_U\) to train a model \(g(\hat{\mathcal {X}},A;\theta )\) with \(\hat{\mathcal {X}}=\mathcal {X}_L \cup \{(v_t,x_t,y_t)\}^{T}_{t=1} \cup \{(v_b,x_b,y_b)\}^{B}_{b=1}\).

4 Methodology

In Fig. 1, we illustrate the proposed approach for the task of node classification. Given a graph, we first train a GNN model with the initial labeled data. The learned node representations are then used to calculate the uncertainty scores for all unlabeled samples. Then, the labels of the most uncertain samples are requested and added to the initial labeled data. The updated graph is used to train another GNN which generates new class probabilities for all unlabeled samples. These predictions are utilized to identify high-quality pseudo-labels, which will be integrated into the label set. The new graph is fed back to the active learning process, and so on. At the end of the predefined number of iterations, the whole graph is labeled by the trained GNN. Each of the two main steps is detailed in the following subsections.

Fig. 1 Overview of the proposed approach for the node classification task. The STAL framework operates in a sequential order, considering the outputs of preceding models

4.1 Active learning

The main objective of AL is to select a subset of unlabeled instances, the labels of which can improve the model’s performance. Typically, AL consists of k sampling rounds. At each round, T samples are selected from the unlabeled data, based on strategy \(\phi\). These samples, together with the other labeled instances, are used to re-train the model. In the context of this study, we investigate the following AL strategies:

Uncertainty A widely used strategy that selects the samples that are the most uncertain according to an information-based measure, such as entropy:

$$\begin{aligned} \phi _e(v_i) = -\sum _{c} P(y^{c}_{i} \mid x_i;g)\log P(y^{c}_{i}\mid x_i;g) \end{aligned}$$
(2)

where \(P(y^{c}_{i}\mid x_i;g)\) is the probability of \(v_i\) belonging to class c as predicted by the GNN model g.
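A minimal sketch of the entropy score for a single node, using the standard Shannon entropy \(-\sum _c p_c \log p_c\) with the natural logarithm. The function name `entropy_score` is hypothetical, and the input is assumed to be the class-probability vector that the model g outputs for that node.

```python
import math


def entropy_score(probs):
    """Shannon entropy (natural log) of one class-probability vector."""
    # Zero-probability classes contribute nothing (limit of p * log p as p -> 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = entropy_score([0.5, 0.5])   # maximally uncertain two-class prediction
certain = entropy_score([1.0, 0.0])   # fully confident prediction
```

The uniform prediction receives the largest possible score for two classes, whereas the fully confident prediction scores zero, so ranking by this score in descending order surfaces the most uncertain nodes first.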

Query-by-committee Instead of relying on the uncertainty sampling of a single model, QbC employs a committee of models \(C = \{g_1, g_2,...,g_C\}\). The samples causing the maximal disagreement among committee members are chosen:

$$\begin{aligned} \phi _{qbc}(v_i) = \sum _{j\ne r} \Vert \phi _e(v_i;g_j) - \phi _e(v_i;g_r) \Vert \end{aligned}$$
(3)

where \(\phi _e(v_i;g_j)\) is the entropy score of \(v_i\) based on the committee model \(g_j\).
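The disagreement score of Eq. (3) can be sketched as follows for a single node. The name `qbc_score` is hypothetical, and the sketch assumes that each unordered pair of committee members is counted once; summing over ordered pairs would simply double every score and leave the ranking unchanged.

```python
import math
from itertools import combinations


def entropy_score(probs):
    """Shannon entropy (natural log) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def qbc_score(committee_probs):
    """committee_probs holds one probability vector per committee member."""
    scores = [entropy_score(p) for p in committee_probs]
    # Assumption: every unordered pair of members contributes once.
    return sum(abs(a - b) for a, b in combinations(scores, 2))


disagreeing = qbc_score([[0.5, 0.5], [1.0, 0.0]])
agreeing = qbc_score([[0.7, 0.3], [0.7, 0.3], [0.7, 0.3]])
```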

AGE A more recent strategy that incorporates three different query sub-strategies: it combines uncertainty with the density of the node and its centrality. Specifically, it computes the entropy of the predicted label distribution, measures the distance between a node and its cluster center, and calculates the PageRank centrality. These criteria are linearly combined as:

$$\begin{aligned} \phi _{age}(v_i) = \alpha * \phi _e(v_i) + \beta * \phi _d(v_i) + \gamma * \phi _{PR}(v_i) \end{aligned}$$
(4)

where \(\phi _d(v_i)=1/(1+\Vert h^{l}_i - CC_i\Vert )\), \(\phi _{PR}(v_i)\) is the PageRank centrality of \(v_i\), \(\alpha +\beta +\gamma = 1\) and \(CC_i\) is the center of the cluster, in which \(v_i\) belongs, as defined in Cai et al. (2017).
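For concreteness, the linear combination of Eq. (4) can be sketched for one node as below. The function `age_score` and its argument names are hypothetical, and the entropy, PageRank value and cluster center are assumed to have been computed beforehand; any score normalization applied in Cai et al. (2017) is not reproduced here.

```python
import math


def age_score(entropy, embedding, cluster_center, pagerank, alpha, beta, gamma):
    """Combine uncertainty, density and centrality for a single node."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha, beta and gamma must sum to 1")
    # Density term: nodes close to their cluster center score close to 1.
    distance = math.dist(embedding, cluster_center)
    density = 1.0 / (1.0 + distance)
    return alpha * entropy + beta * density + gamma * pagerank


score = age_score(entropy=1.0, embedding=[0.0, 0.0], cluster_center=[3.0, 4.0],
                  pagerank=0.2, alpha=0.5, beta=0.3, gamma=0.2)
```

In this example the node lies at distance 5 from its cluster center, so the density term is 1/6 and the combined score is 0.5 + 0.05 + 0.04.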

Using these strategies, we select the most uncertain samples \(\mathcal {X}_T \subset \mathcal {X}_U\) on each AL round as:

$$\begin{aligned} \mathcal {X}_T=\{(v_i,x_i,\phi _i)\}^{T}_{i=1} ; \quad \phi _1>\phi _2>...>\phi _T \end{aligned}$$
(5)
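The selection step amounts to ranking the unlabeled nodes by their score and keeping the first T. A minimal sketch, with a hypothetical helper `select_top` and scores assumed to be stored in a dictionary keyed by node id:

```python
def select_top(scores, budget):
    """Return the ids of the `budget` highest-scoring nodes, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:budget]


# Node 1 is the most uncertain, followed by node 2.
queried = select_top({0: 0.1, 1: 0.9, 2: 0.5}, budget=2)
```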

Note that although such uncertainty-based strategies ignore graph-specific properties when selecting nodes for labeling, they remain strong baselines, even though their performance is often problem-dependent, as noted by Cai et al. (2017) and Shui et al. (2020).

4.2 Self-training with confident nodes

In many real-life cases, the number of labeled samples \(\mathcal {X}_L\) is relatively small, when compared to the number of unlabeled ones \(\mathcal {X}_U\), i.e. \(\mathcal {X}_U \gg \mathcal {X}_L\). Self-training addresses scarcity of labeled data by training a model on \(\mathcal {X}_L\) and using it to predict high-confidence pseudo-labels for some unlabeled nodes in \(\mathcal {X}_U\). These pseudo-labels are then used to augment the initial labeled set. The reliance of self-training on the generated pseudo-labels may introduce noise. Therefore, selecting reliable pseudo-labels is essential.

As shown in Fig. 1, our ST strategy begins with the completion of the AL step, at each iteration. Using the available labels, including the T samples labeled by AL, we train a new GNN model. This model produces a class probability vector for each node in \(\mathcal {X}_U\):

$$\begin{aligned} p(y_{i} \mid x_{i},A;\theta ) = g( {{\mathcal X}_{\mathcal L}},A;\theta ) \end{aligned}$$
(6)

A common approach to selecting pseudo-labels is to keep only the most confident of them, based on the probabilities p. Specifically, this approach retains a set \(\mathcal {X}_B \subset \mathcal {X}_U\) of unlabeled nodes whose highest estimated class probability exceeds a threshold \(\tau _u\), i.e. \(max(p_i) > \tau _u\).

We argue for the use of classification confidence, instead of the simple probability, in selecting reliable pseudo-labels. In particular, we measure classification confidence as the difference between the probabilities of the two most probable classes. To this end, we calculate the Euclidean distance between these probabilities, i.e. \(r_i = \sqrt{\left( p_{i}^{1} - p_{i}^{2}\right) ^2 } = \mid p_{i}^{1} - p_{i}^{2}\mid \), where \(p^{1}\) and \(p^2\) indicate the highest and second highest values of probabilities p respectively, and use \(r_i\) as the confidence score for node \(v_i\). Intuitively, the larger the distance, the more separable the classes are, and thus the more confident we are that the pseudo-labels are accurate. To obtain the most reliable pseudo-labels, we select the top-B nodes such that:

$$\begin{aligned} \mathcal {X}_B=\{(v_i,x_i,r_i)\}^{B}_{i=1} ; \quad r_1>r_2>...>r_B \end{aligned}$$
(7)
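The margin-based selection can be sketched as follows. The names `margin_confidence` and `select_pseudo_labels` are hypothetical, and the class probabilities are assumed to be given as one list per unlabeled node. Since the two probabilities are scalars, their Euclidean distance reduces to the difference between the largest and second-largest value.

```python
def margin_confidence(probs):
    """Gap between the highest and second-highest class probability."""
    first, second = sorted(probs, reverse=True)[:2]
    return first - second


def select_pseudo_labels(node_probs, budget):
    """Return (node, predicted class) pairs for the `budget` largest margins."""
    ranked = sorted(node_probs, key=lambda v: margin_confidence(node_probs[v]),
                    reverse=True)
    return [(v, node_probs[v].index(max(node_probs[v]))) for v in ranked[:budget]]


node_probs = {
    0: [0.60, 0.30, 0.10],   # margin 0.30
    1: [0.50, 0.45, 0.05],   # margin 0.05, two classes nearly tied
    2: [0.05, 0.90, 0.05],   # margin 0.85
}
pseudo = select_pseudo_labels(node_probs, budget=2)
```

Node 1 is skipped because its two leading classes are almost tied, which is exactly the situation the margin is intended to filter out.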

Note that the selected pseudo-labeled nodes are not considered labeled in subsequent AL iterations; i.e. pseudo-labeled nodes may be selected for labelling by AL.

4.3 Label prediction

With the help of the AL and ST strategies, we obtain the new augmented labeled set \(\hat{\mathcal{X}}\). Using these data, we train a GNN model to produce the new class probabilities. Finally, the model decides on the labels \(\mathcal {Y}\) of the nodes as:

$${\mathcal{Y}} = argmax(g({\hat{\mathcal{X}}},A;\hat{\theta }))$$
(8)

An overview of the STAL framework is given in Algorithm 1.

figure a
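The iterative procedure described in Sects. 4.1 to 4.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not a transcription of Algorithm 1: the interface is hypothetical (`train` takes a dictionary of labels and returns a `predict` function, and `oracle` returns the true label of a node), entropy is used as the AL strategy, and the pseudo-labels are assumed to be refreshed at every round rather than accumulated.

```python
import math


def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)


def margin(p):
    first, second = sorted(p, reverse=True)[:2]
    return first - second


def argmax(p):
    return max(range(len(p)), key=p.__getitem__)


def stal(nodes, labeled, oracle, train, rounds, T, B):
    """Hypothetical interface: train(labels) returns predict(node) -> probabilities."""
    labeled = dict(labeled)   # node -> label obtained from the annotator
    pseudo = {}               # node -> pseudo-label from the previous ST step
    for _ in range(rounds):
        # AL step: query the T most uncertain nodes that lack a true label.
        # Pseudo-labeled nodes stay in the pool and may still be queried.
        predict = train({**pseudo, **labeled})
        pool = [v for v in nodes if v not in labeled]
        for v in sorted(pool, key=lambda u: entropy(predict(u)), reverse=True)[:T]:
            labeled[v] = oracle(v)
        # ST step: retrain, then keep the B predictions with the largest margin.
        predict = train(labeled)
        pool = [v for v in nodes if v not in labeled]
        confident = sorted(pool, key=lambda u: margin(predict(u)), reverse=True)[:B]
        pseudo = {v: argmax(predict(v)) for v in confident}
    # Final model: true labels take precedence over pseudo-labels.
    predict = train({**pseudo, **labeled})
    return {v: argmax(predict(v)) for v in nodes}
```

Any GNN backbone can be plugged in through `train`, and swapping `entropy` for another scoring function changes the AL strategy without touching the loop.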

5 Experiments

In this section, we evaluate the proposed approach on the node classification task using four benchmark datasets. Furthermore, we conduct ablation experiments to gain insights on the various components of the method.

5.1 Evaluation protocol

We conducted experiments on four benchmark datasets. In particular, we used the public citation networks - Cora, Citeseer and Pubmed (Yang et al. 2016). We also used one larger dataset from the OGB archive (Hu et al. 2020b); namely the ogbn-arxiv. The statistics of the datasets are presented in Table 1.

Table 1 The details of the datasets

Regarding the evaluation methodology, we followed common practice in the node classification literature (Cai et al. 2017; Zhu et al. 2020). The nodes of each dataset were partitioned randomly into five folds. For each fold, we further randomly sampled 20 and 30 nodes per class as the training and validation set, respectively. For the assessment of AL, we start with 5 training samples per class and increase the size up to 20. Thus, we perform \(k=(20-5) \cdot |c|\) rounds, where \(|c|\) is the number of classes, and in each round a single unlabeled sample is selected to be added to the training set. The rest of the nodes are used as the test set. For each fold, we kept the model that performed best on the validation set and evaluated it on the held-out test set. Moreover, we set the number of pseudo-labels introduced by ST at each iteration to \(B = \frac{{\left| {{\mathcal{V}}_{{\mathcal{U}}} } \right|}}{2}\), where \(\left| {{\mathcal{V}}_{{\mathcal{U}}} } \right|\) is the number of unlabeled nodes, for all datasets. Finally, we calculated the average accuracy over the five random data folds.

Regarding the hyper-parameters of the methods included in the experiments, we performed grid search in the space presented in Table 2. The Adam optimizer (Kingma and Ba 2015) was used to minimize the cross-entropy losses with weight decay 5e\(-\)4 and epsilon value 1e\(-\)8 for all models. We used the GNN implementations of PyTorch Geometric (Fey and Lenssen 2019) and trained all models for the same number of epochs, in full-batch mode. Furthermore, we used the implementations of the OGB library for the node classification experimental set-up. We conducted our experiments using one Nvidia RTX A6000 GPU on an AMD Ryzen Threadripper PRO 3955WX CPU. Our code is available at https://github.com/nneinn/STAL.

Table 2 The search space of hyper-parameters for our experiments

5.2 Ablation study

In order to assess the importance of different features of the proposed method, we ran a set of experiments, using a GCN as the backbone model.

5.2.1 Variants of STAL

In particular, we evaluated the following baselines and variants of STAL:

  • GCN: A vanilla GCN trained with the full set of labels.

  • ST: A GCN model that utilizes the self-training strategy only.

  • AL\(_{\epsilon }\): An AL model that utilizes the uncertainty-based entropy strategy.

  • AL\(_{QbC}\): An AL model that utilizes the QbC entropy-based strategy, with a committee of five models.

  • AL\(_{AGE}\): An AL model that utilizes the AGE strategy only.

  • STAL\(_{\epsilon }\): STAL with the entropy strategy.

  • STAL\(_{QbC}\): STAL with the QbC entropy-based strategy, using a committee of five models.

  • STAL\(_{AGE}\): STAL with the AGE strategy.

  • STAL\(_{rev}\): A model similar to STAL\(_{\epsilon }\) except it applies the self-training strategy before the active learning selection.

The results for all these variants are shown in Table 3. The first observation is that active learning improves the vanilla GCN model. Secondly, the incorporation of self-training in STAL seems to improve the results further, with STAL\(_{AGE}\) and STAL\(_{QbC}\) achieving the best scores. This result confirms the value of pseudo-labels, when combined with AL. It is worth mentioning that ST, without the use of AL, did not yield significant improvements. Moreover, our results show that applying ST at the end of each AL round (STAL\(_{\epsilon }\)) instead of the beginning (STAL\(_{rev}\)), slightly improves the performance, and therefore, we use the former strategy in the rest of the experiments.

Table 3 The results of the STAL variants

Despite the higher accuracy of QbC in many cases, it can be computationally prohibitive, since it requires retraining multiple models at each iteration of the AL process. Besides, compared to the other strategies, the improvement in accuracy over the simpler AL\(_{\epsilon }\) approach is not significant. Therefore, we have excluded QbC from the rest of the experiments.

5.2.2 Quality and size of pseudo-labels

As already discussed in Sect. 2.3, the quality of the pseudo-labels can play a significant role in the downstream tasks. To develop a reliable selection strategy for pseudo-labels, we investigated the results of GNN models, focusing on their mistakes. We observed that GNNs tend to generate incorrect predictions when the top-two class probabilities \(p^1_i, p^2_i\) are close. As shown in Fig. 2a, in the majority of these cases the correct label corresponds to \(p^2_i\). This observation motivated the design of our ST selection strategy, as presented in Sect. 4.2.

Fig. 2 a The percentage of incorrect predictions where the second most probable class was the correct one. b The ratio of incorrect pseudo-labels by different selection strategies. Lower values indicate more accurate labeling. c The performance of STAL on node classification, when varying the number of pseudo-labels. d The performance of STAL with varied number k of AL rounds

In Fig. 2b, we compare this selection strategy against two baselines: (i) random selection of pseudo-labels, and (ii) selection of the top (most confident) pseudo-labels. Specifically, for each strategy we report the ratio of incorrect pseudo-labels over all pseudo-labels predicted. The proposed pseudo-label selection strategy makes noticeably fewer errors than the other two, leading to more robust self-training.

Additionally, we perform a series of experiments to examine how the number of pseudo-labels added to the labeled set affects the performance of node classification. In particular, we vary the number of pseudo-labels between 20% and 80% of all unlabeled nodes and report the average accuracy over five random runs. The results in Fig. 2c indicate that STAL handles the incorporation of pseudo-labels well, independent of their number. Therefore, in the rest of the experiments we opt for a conservative approach, using a small number of pseudo-labels.

5.2.3 Number of active learning rounds k

Finally, we study the role of the number of rounds k used in the AL process. It is worth noting that we have set k to be inversely related to the number of selected samples T at each AL round, i.e. \(T=\frac{(20-5) \cdot |c|}{k}\). Thus, T decreases as k increases and vice versa. When k is maximum (\(k=(20-5) \cdot |c|\)), T is minimum (\(T=1\)). Consequently, the total number of labeled samples remains the same, independent of the number of rounds k. As shown in Fig. 2d, STAL seems to benefit somewhat from large values of k, as was expected, but this comes at a higher manual cost. Therefore, for the rest of the experiments we use \(k=(20-5) \cdot |c|\).
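The relation between k and T can be checked with a few lines of arithmetic. The helper `samples_per_round` is hypothetical, the requirement that k divides the budget evenly is an assumption made for this sketch, and the example uses 7 classes, as in the standard Cora dataset.

```python
def samples_per_round(num_classes, rounds, start_per_class=5, end_per_class=20):
    """Return T, the number of nodes queried in each of `rounds` AL rounds."""
    budget = (end_per_class - start_per_class) * num_classes
    # Assumption: the number of rounds divides the labeling budget evenly.
    if budget % rounds != 0:
        raise ValueError("rounds must divide the total labeling budget")
    return budget // rounds


t_max_rounds = samples_per_round(num_classes=7, rounds=105)   # one node per round
t_few_rounds = samples_per_round(num_classes=7, rounds=15)    # seven nodes per round
```

In both settings the total number of queried nodes is 105, so only the granularity of the AL process changes.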

Table 4 Node classification results of all models

5.3 Results for different models

The proposed approach (STAL) can be used with various GNNs and different AL and ST strategies. To demonstrate this flexibility, we conducted experiments with three GNN models: GCN (Kipf and Welling 2017), SAGE (Hamilton et al. 2017) and GAT (Veličković et al. 2017).

The performance of all models, with and without STAL, is shown in Table 4. We observe that STAL improves node classification accuracy by 1–9.5 percentage points on the four datasets. Interestingly, the simple uncertainty-based strategy achieves the best overall score in two cases, although STAL\(_{AGE}\) performs slightly better overall.

Fig. 3 The performance of 3 different GNNs, with and without using STAL\(_{\epsilon }\), with varying number of labels

Figure 3 also presents the performance of the models with a varying number of training labels. As shown in the figure, STAL requires considerably fewer labels to reach the performance of the baseline models. In most cases, STAL seems to need just 30–60% of the number of labeled samples required by the baselines, which reduces the labeling effort while enhancing the models’ performance.

6 Conclusion and future work

In this paper, we have proposed STAL, a new approach that combines active learning (AL) with self-training (ST) to improve both label efficiency and performance of GNNs. AL is used to select highly uncertain nodes from a large unlabeled set. In combination with AL, we proposed a simple but efficient ST strategy to identify accurate pseudo-labels. Finally, we incorporated these techniques into a common framework that can be easily used with various AL strategies and GNN backbones. The experimental results verified the effectiveness of our approach in node classification, as well as the contribution of each feature of the proposed method to the final result. In addition, the experiments demonstrated the ability of STAL to reduce the labeling cost.

Still, a number of issues remain open to investigate in the future. In this paper we demonstrate the effectiveness of our method on various node classification datasets, acknowledging that there are other tasks, such as link prediction and graph classification, where our approach has not been tested yet. Therefore, further analysis is needed to validate whether the proposed approach can produce similar performance when applied to other downstream tasks. Moreover, in our experiments, we mainly focus on investigating how the combination of AL with ST can produce more accurate models. This improvement in accuracy comes at a higher computational cost due to multiple training rounds that are required by both AL and ST. Accordingly, STAL assumes a sequential architecture where models are trained separately and in a specific order. In the future, we would like to assess a joint methodology that would integrate all components into a single model and reduce its computational cost.