
1 Introduction

Alzheimer’s disease (AD) is the most common neurological disorder in the older population. There is overwhelming evidence in the literature that AD-related morphological patterns are observable by means of structural and diffusion MRI or PET [13]. However, abnormal morphological patterns are often subtle compared to the high inter-subject variation. Hence, sophisticated pattern recognition methods are in high demand to accurately identify individuals at different stages of AD progression.

Medical imaging applications often deal with high-dimensional data and usually only a small number of samples with ground-truth labels. Thus, it is very challenging to find a general model that works well for an entire set of data. Graph-based transductive learning (GTL) has therefore been investigated with great success in the medical imaging area [4, 5], since it can overcome the above difficulties by taking advantage of the data representation on unlabeled testing subjects. In current state-of-the-art methods, a graph is used to represent the subject-wise relationships. Specifically, each subject, regardless of being labeled or unlabeled, is treated as a graph node. Two subjects are connected by a graph link (i.e., an edge) if they have similar morphological patterns. Using these connections, the labels can be propagated throughout the graph until all latent labels are determined. Many label propagation strategies have been proposed to determine the latent labels of testing subjects based on the subject-wise relationships encoded in the graph [6].

The assumption of current methods is that the graph constructed in the observed feature domain represents the real data distribution and can be transferred to guide label propagation. However, this assumption usually does not hold, since morphological patterns are often highly complex and heterogeneous. Figure 1(a) shows the affinity matrix of 51 AD and 52 NC subjects using the ROI-based features extracted from each MR image, where red dots and blue dots denote high and low subject-wise similarities, respectively. Since the clinical data (e.g., MMSE and CDR scores [1]) are more closely related to the clinical labels, we use these clinical scores to construct another affinity matrix, as shown in Fig. 1(c). It is apparent that the data representations using structural image features and clinical scores are completely different. Thus, there is no guarantee that the graph learned from the affinity matrix in Fig. 1(a) can effectively guide the classification of AD and NC subjects. More critically, the affinity matrix using observed image features is not even necessarily optimal in the feature domain, due to possible imaging noise and outlier subjects. Many studies take advantage of multi-modal information to improve the discrimination power of transductive learning. However, the graphs from different modalities might differ as well, as shown by the affinity matrices using structural image features from MR images (Fig. 1(a)) and functional image features from PET images (Fig. 1(b)). Graph diffusion [5] was recently proposed to find a common graph. Unfortunately, as shown in Fig. 1, it is hard to find a combination of the graphs in Fig. 1(a) and (b) that leads to the graph in Fig. 1(c), which is more closely related to the final classification task.

Fig. 1. Affinity matrices using structural image features (a), functional image features (b), and clinical scores (c).

To solve these issues, we propose a progressive GTL (pGTL) method to learn the intrinsic data representation, which is eventually optimal for label propagation. Specifically, the intrinsic data representation is required to be (a) close to the subject-wise relationships constructed from image features extracted from different modalities, and (b) verified on the training data and guaranteed to be optimal for label classification. To that end, we simultaneously (1) refine the data representation (subject-wise graph) in the feature domain, (2) find the intrinsic data representation based on the graphs constructed on multi-modal imaging data and also the clinical labels of the entire subject set (including known labels on training subjects and tentatively determined labels on testing subjects), and (3) propagate the clinical labels from training subjects to testing subjects, following the latest learned intrinsic data representation. Promising classification results have been achieved in classifying 93 AD, 202 MCI, and 101 NC subjects, each with MR and PET images.

2 Methods

Suppose we have \( N \) subjects \( \left\{ {I_{1} , \ldots ,I_{P} ,I_{P + 1} , \ldots ,I_{N} } \right\} \), which sequentially consist of \( P \) training subjects and \( Q\left( { = N - P} \right) \) testing subjects. For the \( P \) training subjects, the clinical labels \( {\mathbf{F}}_{P} = \left[ {{\mathbf{f}}_{p} } \right]_{p = 1, \ldots ,P} \) are known, where each \( {\mathbf{f}}_{p} \in \left\{ {0,1} \right\}^{C} \) is a binary coding vector indicating the clinical label among \( C \) classes. Our goal is to jointly determine the latent labels for the \( Q \) testing subjects based on a set of continuous likelihood vectors \( {\mathbf{F}}_{Q} = \left[ {{\mathbf{f}}_{q} } \right]_{q = P + 1, \ldots ,N} \), where each element of vector \( {\mathbf{f}}_{q} \) indicates the likelihood of the \( q \)-th subject belonging to one of the \( C \) classes. For convenience, we concatenate \( {\mathbf{F}}_{P} \) and \( {\mathbf{F}}_{Q} \) into a single \( N \times C \) label matrix \( {\mathbf{F}} = [{\mathbf{F}}_{P} {\mathbf{F}}_{Q} ] \).
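To make this notation concrete, the following minimal sketch builds \( {\mathbf{F}} \) for a toy case. The helper name `build_label_matrix` and the toy labels are illustrative assumptions, not part of the original method description; the rows of \( {\mathbf{F}}_{P} \) are stacked above those of \( {\mathbf{F}}_{Q} \), and \( {\mathbf{F}}_{Q} \) starts at zero as in the initialization of Sect. 2.2.

```python
import numpy as np

def build_label_matrix(train_labels, num_test, num_classes):
    """Stack one-hot training labels F_P on top of a zero-initialised F_Q."""
    train_labels = np.asarray(train_labels)
    F_P = np.zeros((train_labels.size, num_classes))
    F_P[np.arange(train_labels.size), train_labels] = 1.0  # binary coding vectors
    F_Q = np.zeros((num_test, num_classes))                # latent labels, not known yet
    return np.vstack([F_P, F_Q])

# Toy example (hypothetical): P = 3 training subjects, Q = 2 testing subjects, C = 2 classes.
F = build_label_matrix([0, 1, 1], num_test=2, num_classes=2)
```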

2.1 Progressive Graph-Based Transductive Learning

Conventional Graph-Based Transductive Learning.

For clarity, we first extract single modality image features from each subject \( I_{i} \) \( (i = 1, \ldots ,N) \), denoted as \( {\mathbf{x}}_{i} \). In conventional GTL methods, the subject-wise relationships are computed based on feature similarity, which is encoded in an \( N \times N \) feature affinity matrix \( {\mathbf{S}} \). Each element \( s_{ij} \) \( (0 \le s_{ij} \le 1,i,j = 1, \ldots ,N) \) represents the feature affinity degree between \( {\mathbf{x}}_{i} \) and \( {\mathbf{x}}_{j} \). After constructing \( {\mathbf{S}} \) (based on feature similarity), conventional methods determine the latent label for each testing subject \( I_{q} \) by solving a classic graph learning problem:

$$ {\hat{\mathbf{F}}}_{Q} = \arg\min_{{\mathbf{F}}_{Q}} \sum\nolimits_{i,j = 1}^{N} {\left\| {{\mathbf{f}}_{i} - {\mathbf{f}}_{j} } \right\|_{2}^{2} s_{ij} } . $$
(1)

As shown in Fig. 1, the affinity matrix \( {\mathbf{S}} \) might not be strongly related to the intrinsic data representation in the label domain. Therefore, it is necessary to further design a graph based on the label matrix, rather than solely using the graph constructed from the features. However, the labels of the testing subjects are not determined yet. To solve this chicken-and-egg dilemma, we propose to construct a dynamic graph which progressively reflects the intrinsic data representation in the label domain.

Progressive Graph-Based Transductive Learning on Single Modality.

We propose three strategies to remedy the above issue. (1) We propose to gradually find an intrinsic data representation \( {\mathbf{T}} = \left[ {t_{ij} } \right]_{i,j = 1, \ldots ,N} \) which is more relevant than \( {\mathbf{S}} \) for guiding the label propagation in Eq. (1). (2) Since only the training images have known clinical labels, exclusively optimizing \( {\mathbf{T}} \) in the label domain is an ill-posed problem. Thus, we encourage the intrinsic data representation \( {\mathbf{T}} \) to also respect the affinity matrix \( {\mathbf{S}} \), since image features are available for all subjects in the feature domain. (3) In order to suppress possible noisy patterns and outlier subjects, we allow the intrinsic data representation \( {\mathbf{T}} \) to progressively refine the affinity matrix \( {\mathbf{S}} \) in the feature domain. In this way, the estimations of \( {\mathbf{S}} \) and \( {\mathbf{T}} \) are coupled, which yields a dynamic graph learning model with the following objective function:

$$ \begin{array}{*{20}c} {{ \arg }\,{ \hbox{min} }_{{{\mathbf{S}},{\mathbf{T}},{\mathbf{F}}}} \sum\nolimits_{i,j = 1}^{N} {\left\{ {\mu \left\| {{\mathbf{f}}_{i} - {\mathbf{f}}_{j} } \right\|_{2}^{2} t_{ij} + \left\| {{\mathbf{x}}_{i} - {\mathbf{x}}_{j} } \right\|_{2}^{2} s_{ij} + \lambda_{1} s_{ij}^{2} + \lambda_{2} \left\| {s_{ij} - t_{ij} } \right\|_{2}^{2} } \right\}} } \\ {s.t.\,0 \le s_{ij} \le 1, {\mathbf{s}}_{i}^{'} 1 = 1, 0 \le t_{ij} \le 1,\;{\mathbf{t}}_{i}^{'} 1 = 1, {\mathbf{F}} = \left[ {{\mathbf{F}}_{P} {\mathbf{F}}_{Q} } \right]\quad \quad \quad } \\ \end{array} $$
(2)

where \( \mu \) is a scalar balancing the data fitting terms from the two different domains (i.e., the first and second terms in Eq. (2)), and \( {\mathbf{s}}_{i} \in {\text{R}}^{N \times 1} \) and \( {\mathbf{t}}_{i} \in {\text{R}}^{N \times 1} \) are vectors whose \( j \)-th elements are \( s_{ij} \) and \( t_{ij} \), respectively. In order to avoid a trivial solution, an \( l_{2} \)-norm penalty is imposed on each element \( s_{ij} \) of the affinity matrix \( {\mathbf{S}} \). \( \lambda_{1} \) and \( \lambda_{2} \) are two scalars that control the strengths of the last two terms in Eq. (2).
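As a reading aid, the four terms of Eq. (2) can be evaluated directly for given \( {\mathbf{S}} \), \( {\mathbf{T}} \), and \( {\mathbf{F}} \). The sketch below is only an illustration: the function names `pairwise_sq_dists` and `pgtl_objective` are our own, and the constraints of Eq. (2) are assumed to be satisfied by the inputs rather than checked.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all pairs of rows of A."""
    sq = np.sum(A * A, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.maximum(D, 0.0)  # clip tiny negatives caused by round-off

def pgtl_objective(X, F, S, T, mu, lam1, lam2):
    """Value of the single-modality objective in Eq. (2) (constraints not checked)."""
    label_term = mu * np.sum(pairwise_sq_dists(F) * T)   # mu * ||f_i - f_j||^2 * t_ij
    feature_term = np.sum(pairwise_sq_dists(X) * S)      # ||x_i - x_j||^2 * s_ij
    ridge_term = lam1 * np.sum(S ** 2)                   # lambda_1 * s_ij^2
    coupling_term = lam2 * np.sum((S - T) ** 2)          # lambda_2 * (s_ij - t_ij)^2
    return label_term + feature_term + ridge_term + coupling_term
```

Monitoring this value across iterations is one simple way to check that an alternating solver does not increase the energy.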

Progressive Graph-Based Transductive Learning on Multiple Modalities.

Suppose we have \( M \) modalities. For each subject \( I_{i} \), we can extract multi-modal image features \( {\mathbf{x}}_{i}^{m} , m = 1, \ldots ,M \). For the \( m \)-th modality, we optimize the affinity matrix \( {\mathbf{S}}^{m} \). As shown in Fig. 1(a) and (b), the affinity matrices across modalities could be different. Thus, we require the intrinsic data representation \( {\mathbf{T}} \) to be close to all \( {\mathbf{S}}^{m} , m = 1, \ldots ,M \). It is straightforward to extend the above pGTL method to the multi-modal scenario:

$$ \begin{array}{*{20}c} {\arg\min_{{\mathbf{S}}^{m} ,{\mathbf{T}},{\mathbf{F}}} \sum\limits_{i,j = 1}^{N} \left\{ {\mu \left\| {{\mathbf{f}}_{i} - {\mathbf{f}}_{j} } \right\|_{2}^{2} t_{ij} + \sum\limits_{m = 1}^{M} \left[ {\left\| {{\mathbf{x}}_{i}^{m} - {\mathbf{x}}_{j}^{m} } \right\|_{2}^{2} s_{ij}^{m} + \lambda_{1} \left( {s_{ij}^{m} } \right)^{2} + \lambda_{2} \left\| {s_{ij}^{m} - t_{ij} } \right\|_{2}^{2} } \right]} \right\}} \\ {s.t.\,0 \le s_{ij}^{m} \le 1,\;\left( {{\mathbf{s}}_{i}^{m} } \right)^{'} 1 = 1,\;0 \le t_{ij} \le 1,\;{\mathbf{t}}_{i}^{'} 1 = 1,\;{\mathbf{F}} = \left[ {{\mathbf{F}}_{P} {\mathbf{F}}_{Q} } \right]} \\ \end{array} $$
(3)

It is worth noting that, although the multi-modal information leads to multiple affinity matrices in the feature domain, they share the same intrinsic data representation \( {\mathbf{T}} \).

2.2 Optimization

Since the proposed energy function in Eq. (3) is convex with respect to each set of variables \( {\mathbf{S}}^{m} \), \( {\mathbf{T}} \), and \( {\mathbf{F}} \) when the others are fixed, we adopt an alternating (divide-and-conquer) scheme that optimizes one set of variables at a time while fixing the other sets. We initialize each affinity matrix as \( s_{ij}^{m} = \exp ( - \left\| {{\mathbf{x}}_{i}^{m} - {\mathbf{x}}_{j}^{m} } \right\|_{2}^{2} /2\sigma^{2} ) \), where \( \sigma \) is an empirical parameter, and set \( {\mathbf{T}} = \sum\nolimits_{m = 1}^{M} {{\mathbf{S}}^{m} /M} \) and \( {\mathbf{F}}_{Q} = \left\{ 0 \right\}^{Q \times C} \).
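The initialization can be sketched as follows. The function names `gaussian_affinity` and `initialise` are hypothetical, and the sketch applies the Gaussian kernel exactly as written above without any row normalization, since none is specified for this step.

```python
import numpy as np

def gaussian_affinity(X, sigma):
    """s_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for all pairs of rows of X."""
    sq = np.sum(X * X, axis=1)
    D = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-D / (2.0 * sigma ** 2))

def initialise(X_list, sigma, num_test, num_classes):
    """Initial S^m per modality, T as their average, and F_Q as all zeros."""
    S_list = [gaussian_affinity(X, sigma) for X in X_list]
    T = sum(S_list) / len(S_list)
    F_Q = np.zeros((num_test, num_classes))
    return S_list, T, F_Q
```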

Estimation of Affinity Matrix \( {\mathbf{S}}^{m} \) for Each Modality.

Removing the terms in Eq. (3) that do not involve \( {\mathbf{S}}^{m} \), the optimization of \( {\mathbf{S}}^{m} \) reduces to the following objective function:

$$ \arg\min_{{\mathbf{S}}^{m}} \sum\nolimits_{i,j = 1}^{N} \left\{ {\left\| {{\mathbf{x}}_{i}^{m} - {\mathbf{x}}_{j}^{m} } \right\|_{2}^{2} s_{ij}^{m} + \lambda_{1} \left( {s_{ij}^{m} } \right)^{2} } \right\} + \lambda_{2} \sum\nolimits_{i,j = 1}^{N} {\left\| {s_{ij}^{m} - t_{ij} } \right\|_{2}^{2} } $$
(4)

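subject to \( 0 \le s_{ij}^{m} \le 1 \) and \( ({\mathbf{s}}_{i}^{m} )^{'} 1 = 1 \). Since the terms in Eq. (4) decouple across the subject index \( i \), each \( {\mathbf{s}}_{i}^{m} \) can be solved separately, and completing the square in \( s_{ij}^{m} \) lets us reformulate Eq. (4) in vector form as below: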

$$ { \arg }\,{ \hbox{min} }_{{{\mathbf{s}}_{i}^{m} }} \left\| {{\mathbf{s}}_{i}^{m} + \frac{{{\mathbf{d}}_{i} }}{{2r_{1} }}} \right\|_{2}^{2} $$
(5)

where \( {\mathbf{s}}_{i}^{m} \) collects the \( i \)-th row of the affinity matrix \( {\mathbf{S}}^{m} \) as a column vector (its \( j \)-th element is \( s_{ij}^{m} \)), \( {\mathbf{d}}_{i} = \left[ {d_{ij} } \right]_{j = 1, \ldots ,N} \) is a vector with each \( d_{ij} = \left\| {{\mathbf{x}}_{i}^{m} - {\mathbf{x}}_{j}^{m} } \right\|_{2}^{2} - 2\lambda_{2} t_{ij} \), and \( r_{1} = \lambda_{1} + \lambda_{2} \). The problem in Eq. (5) is equivalent to a Euclidean projection onto the probability simplex, which has a closed-form solution in [7]. After we solve each \( {\mathbf{s}}_{i}^{m} \), we obtain the affinity matrix \( {\mathbf{S}}^{m} \).
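For illustration, one row update of Eq. (5) can be sketched as below. Note that the projection routine here is the widely used sort-based simplex projection, swapped in as an assumption; it is not necessarily the closed-form procedure of [7]. The function names are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {s : s >= 0, sum(s) = 1} (sort-based)."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                               # entries in descending order
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]   # last index that stays positive
    theta = (css[rho] - 1.0) / (rho + 1)               # common shift applied to all entries
    return np.maximum(v - theta, 0.0)

def update_affinity_row(x_sq_dists_i, t_i, lam1, lam2):
    """Solve Eq. (5) for one subject: project -d_i / (2 r_1) onto the simplex."""
    d_i = x_sq_dists_i - 2.0 * lam2 * t_i   # d_ij = ||x_i - x_j||^2 - 2 * lambda_2 * t_ij
    r1 = lam1 + lam2
    return project_to_simplex(-d_i / (2.0 * r1))
```

Because the projection clips small entries to zero, the resulting rows are typically sparse, so each subject ends up connected only to its closest neighbours.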

Estimate the Intrinsic Data Representation \( {\mathbf{T}} \).

Fixing \( {\mathbf{S}}^{m} \) and \( {\mathbf{F}} \), the objective function w.r.t. \( {\mathbf{T}} \) reduces to:

$$ { \arg }\,{ \hbox{min} }_{{\mathbf{T}}} \sum\nolimits_{i,j = 1}^{N} {\mu \left\| {{\mathbf{f}}_{i} - {\mathbf{f}}_{j} } \right\|_{2}^{2} t_{ij} } + \lambda_{2} \sum\nolimits_{m = 1}^{M} {\sum\nolimits_{i,j = 1}^{N} {\left( {\left\| {s_{ij}^{m} - t_{ij} } \right\|_{2}^{2} } \right)} } $$
(6)

Similarly, we can reformulate Eq. (6) by solving each \( {\mathbf{t}}_{i} \) at a time:

$$ { \arg }\,{ \hbox{min} }_{{{\mathbf{t}}_{i} }} \left\| {{\mathbf{t}}_{i} + \frac{{{\mathbf{h}}_{i} }}{{2r_{2} }}} \right\|_{2}^{2 } $$
(7)

where \( {\mathbf{h}}_{i} = \left[ {h_{ij} } \right]_{j = 1, \ldots ,N} \) is a vector with each element \( h_{ij} = \mu \left\| {{\mathbf{f}}_{i} - {\mathbf{f}}_{j} } \right\|_{2}^{2} - 2\lambda_{2} \sum\nolimits_{m = 1}^{M} {s_{ij}^{m} } \), and \( r_{2} = M\lambda_{2} \) is a scalar.
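The corresponding row update of Eq. (7) can be sketched in the same way. As before, the sort-based simplex projection is an assumed stand-in for the closed-form solver, and the function names are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {t : t >= 0, sum(t) = 1} (sort-based)."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def update_intrinsic_row(f_sq_dists_i, s_rows_i, mu, lam2):
    """Solve Eq. (7) for one subject.

    f_sq_dists_i: vector of ||f_i - f_j||^2 over j.
    s_rows_i: array of shape (M, N) holding the i-th row of every S^m.
    """
    s_rows_i = np.asarray(s_rows_i, dtype=float)
    M = s_rows_i.shape[0]
    h_i = mu * f_sq_dists_i - 2.0 * lam2 * s_rows_i.sum(axis=0)
    r2 = M * lam2
    return project_to_simplex(-h_i / (2.0 * r2))
```

When the label distances are all zero, \( - {\mathbf{h}}_{i} /(2r_{2} ) \) equals the average of the rows \( {\mathbf{s}}_{i}^{m} \), which already lies on the simplex; in that case the update simply averages the modality-specific affinities.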

Update the Latent Labels \( {\mathbf{F}}_{Q} \) on Testing Subjects.

Given both \( {\mathbf{S}}^{m} \) and \( {\mathbf{T}} \), the objective function for the latent labels \( {\mathbf{F}}_{Q} \) can be derived from Eq. (3) as below:

$$ { \arg }\,{ \hbox{min} }_{{\mathbf{F}}} \sum\nolimits_{i,j = 1}^{N} {\left\| {{\mathbf{f}}_{i} - {\mathbf{f}}_{j} } \right\|_{2}^{2} t_{ij} \Rightarrow { \arg }} \,{ \hbox{min} }_{{\mathbf{F}}} {\text{Trace}}({\mathbf{F^{\prime}LF}}), $$
(8)

where \( {\text{Trace}}( \cdot ) \) denotes the matrix trace operator and \( {\mathbf{L}} = {\text{diag}}\left( {\mathbf{T}} \right) - \left( {{\mathbf{T^{\prime}}} + {\mathbf{T}}} \right)/2 \) is the Laplacian matrix of \( {\mathbf{T}} \), with \( {\text{diag}}\left( {\mathbf{T}} \right) \) denoting the diagonal degree matrix whose \( i \)-th entry is the \( i \)-th row sum of \( \left( {{\mathbf{T^{\prime}}} + {\mathbf{T}}} \right)/2 \). By differentiating Eq. (8) w.r.t. \( {\mathbf{F}} \) and setting the gradient to zero, i.e., \( {\mathbf{LF}} = 0 \), we obtain the following equation: \( \left[ {\begin{array}{*{20}c} {{\mathbf{L}}_{PP} } & {{\mathbf{L}}_{PQ} } \\ {{\mathbf{L}}_{QP} } & {{\mathbf{L}}_{QQ} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {{\mathbf{F}}_{P} } \\ {{\mathbf{F}}_{Q} } \\ \end{array} } \right] = 0, \) where \( {\mathbf{L}}_{PP} \), \( {\mathbf{L}}_{PQ} , {\mathbf{L}}_{QP} \), and \( {\mathbf{L}}_{QQ} \) denote the top-left, top-right, bottom-left, and bottom-right blocks of \( {\mathbf{L}} \). Since \( {\mathbf{F}}_{P} \) is known and fixed, only the bottom block row \( {\mathbf{L}}_{QP} {\mathbf{F}}_{P} + {\mathbf{L}}_{QQ} {\mathbf{F}}_{Q} = 0 \) needs to be solved, which gives \( {\hat{\mathbf{F}}}_{Q} = - \left( {{\mathbf{L}}_{QQ} } \right)^{ - 1} {\mathbf{L}}_{QP} {\mathbf{F}}_{P} \).
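A minimal sketch of this label update is given below. It assumes that the degree matrix is built from the row sums of the symmetrized affinity \( \left( {{\mathbf{T^{\prime}}} + {\mathbf{T}}} \right)/2 \) and that \( {\mathbf{L}}_{QQ} \) is invertible, which requires every group of testing subjects to be connected, directly or indirectly, to at least one training subject. The function name `propagate_labels` is hypothetical.

```python
import numpy as np

def propagate_labels(T, F_P):
    """Return F_Q = -(L_QQ)^{-1} L_QP F_P for the Laplacian of the symmetrised T.

    The first P rows/columns of T are assumed to correspond to training subjects.
    """
    P = F_P.shape[0]
    W = (T + T.T) / 2.0                   # symmetrised affinity
    L = np.diag(W.sum(axis=1)) - W        # graph Laplacian (degree matrix minus affinity)
    L_QQ = L[P:, P:]
    L_QP = L[P:, :P]
    # Solve the linear system instead of forming the inverse explicitly.
    return -np.linalg.solve(L_QQ, L_QP @ F_P)

# Toy graph (hypothetical): two labeled subjects and one testing subject that is
# linked to them with weights 0.75 and 0.25.
T_toy = np.array([[0.0, 0.0, 0.75],
                  [0.0, 0.0, 0.25],
                  [0.75, 0.25, 0.0]])
F_Q_toy = propagate_labels(T_toy, np.eye(2))
```

In this toy graph the testing subject receives likelihoods proportional to its edge weights, which matches the intuition that labels flow along stronger connections.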

Discussion.

Taking the MRI and PET modalities as an example, Fig. 2(a) illustrates the optimization of Eq. (3) by alternating the following three steps. (1) Estimate each affinity matrix \( {\mathbf{S}}^{m} \), which depends on the observed image features \( {\mathbf{x}}^{m} \) and the currently estimated intrinsic data representation \( {\mathbf{T}} \) (red arrows); (2) Estimate the intrinsic data representation \( {\mathbf{T}} \), which requires the estimations of both \( {\mathbf{S}}^{1} \) and \( {\mathbf{S}}^{2} \) and also the subject-wise relationship in the label domain (purple arrows); (3) Update the latent labels \( {\mathbf{F}}_{Q} \) on the testing subjects, which needs guidance from the learned intrinsic data representation \( {\mathbf{T}} \) (blue arrows). It is apparent that the intrinsic data representation \( {\mathbf{T}} \) links the feature domain and the label domain, which eventually leads to the dynamic graph learning model.
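The three alternating steps can be put together as in the sketch below. This is a simplified illustration under several assumptions rather than the exact implementation: a fixed number of sweeps replaces a convergence test, the sort-based simplex projection stands in for the closed-form solver, the degree matrix is built from the symmetrized \( {\mathbf{T}} \), \( {\mathbf{L}}_{QQ} \) is assumed to stay invertible, and all function names and default parameter values are hypothetical.

```python
import numpy as np

def sq_dists(A):
    """Squared Euclidean distances between all pairs of rows of A."""
    sq = np.sum(A * A, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * A @ A.T, 0.0)

def project_rows_to_simplex(V):
    """Project every row of V onto the probability simplex (sort-based)."""
    n = V.shape[1]
    U = np.sort(V, axis=1)[:, ::-1]
    css = np.cumsum(U, axis=1)
    cond = U + (1.0 - css) / np.arange(1, n + 1) > 0
    rho = n - 1 - np.argmax(cond[:, ::-1], axis=1)        # last True index per row
    theta = (css[np.arange(V.shape[0]), rho] - 1.0) / (rho + 1)
    return np.maximum(V - theta[:, None], 0.0)

def pgtl(X_list, F_P, num_test, mu=1.0, lam1=1.0, lam2=1.0, sigma=1.0, n_iter=10):
    """Alternate the S^m, T, and F_Q updates; training subjects come first in X."""
    P, C = F_P.shape
    M = len(X_list)
    DX = [sq_dists(X) for X in X_list]
    S = [np.exp(-D / (2.0 * sigma ** 2)) for D in DX]     # initial affinities
    T = sum(S) / M
    F = np.vstack([F_P, np.zeros((num_test, C))])
    for _ in range(n_iter):
        # Step 1: update each S^m given T (Eq. (5)).
        S = [project_rows_to_simplex(-(D - 2.0 * lam2 * T) / (2.0 * (lam1 + lam2)))
             for D in DX]
        # Step 2: update T given all S^m and the current labels (Eq. (7)).
        H = mu * sq_dists(F) - 2.0 * lam2 * sum(S)
        T = project_rows_to_simplex(-H / (2.0 * M * lam2))
        # Step 3: update F_Q given T (Eq. (8)); assumes L_QQ is invertible.
        W = (T + T.T) / 2.0
        L = np.diag(W.sum(axis=1)) - W
        F[P:] = -np.linalg.solve(L[P:, P:], L[P:, :P] @ F_P)
    return F, T, S
```

The predicted class of each testing subject is then the index of the largest entry in its row of \( {\mathbf{F}}_{Q} \).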

Fig. 2. (a) The dynamic procedure of the proposed pGTL method, (b) Classification accuracy as a function of the number of training samples used.

3 Experiments

Subject Information and Image Processing.

In the following experiments, we select 93 AD subjects, 202 MCI subjects, and 101 NC subjects from the ADNI dataset. Since MCI is a highly heterogeneous group, we further separate these subjects into 55 progressive MCI subjects (pMCI), who develop AD within the next 24 months, and 63 stable MCI subjects (sMCI), who do not convert to AD within 24 months. The remaining MCI subjects include a group that did not convert within 24 months but converted within 36 months, and another group with baseline observations but missing information at 24 months. Each subject has both MR and 18-Fluoro-DeoxyGlucose PET (FDG-PET) images.

For each subject, we first align the PET image to the MR image. Then we remove the skull and cerebellum from the MR image and segment it into white matter, gray matter, and cerebrospinal fluid. Next, we parcellate each subject image into 93 ROIs (Regions of Interest) by registering the template (with manual annotation of 93 ROIs) to the subject image domain. Finally, the gray matter volume and the mean PET intensity in each ROI are combined to form a 186-dimensional feature vector.

Experiment Settings.

First, we compare our proposed pGTL method with classic classification methods that are widely used in AD studies: Canonical Correlation Analysis (CCA) [8] based SVM (denoted as CCA in the following), Multi-Kernel SVM (MKSVM) [9], and a conventional GTL method. To demonstrate the overall performance of our method, we consider three classification tasks, i.e., AD vs NC, MCI vs NC, and pMCI vs sMCI. In each experiment, we use a 10-fold cross-validation strategy, with 9 folds of data as the training dataset and the remaining fold as the testing dataset. Second, we compare our proposed method with three recently published state-of-the-art classification methods: (1) a random-forest based classification method [10], (2) a multi-modal graph-fusion method [4], and (3) a multi-modal deep learning method [11]. It is worth noting that, for a fair comparison, we only use the classification accuracies reported in their papers.
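The 10-fold protocol can be sketched as follows. Whether the folds are shuffled or stratified by class is not specified above, so the random shuffling, the seed, and the function name `ten_fold_indices` are assumptions made for illustration only.

```python
import numpy as np

def ten_fold_indices(num_subjects, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs: 9 folds for training, 1 fold for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_subjects)        # assumed: random, non-stratified split
    folds = np.array_split(order, n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, test_idx

# Example: the AD vs NC task with 93 + 101 = 194 subjects.
splits = list(ten_fold_indices(194))
```

In the transductive setting, the testing fold is still part of the graph during learning; only its labels are withheld.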

Parameter Settings.

In the following experiments, we use the same greedy strategy to select the best parameters for CCA, MKSVM, and our proposed method. For example, we obtain the optimal values for \( \mu \), \( \lambda_{1} \) and \( \lambda_{2} \) in our method by exhaustive search in the range from \( 10^{ - 3} \) to \( 10^{3} \) on a small portion of the training dataset.
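A sketch of such a search is shown below. The grid spacing (one value per decade) and the scoring callback `score_fn`, which would return, for example, the accuracy on the held-out portion of the training data, are assumptions; only the search range is taken from the text.

```python
import itertools
import numpy as np

# Candidate values 1e-3, 1e-2, ..., 1e3 (one per decade; the spacing is an assumption).
grid = np.logspace(-3, 3, num=7)

def grid_search(score_fn, values=grid):
    """Exhaustively search (mu, lambda_1, lambda_2) and keep the best-scoring triple."""
    best_params, best_score = None, -np.inf
    for mu, lam1, lam2 in itertools.product(values, repeat=3):
        score = score_fn(mu, lam1, lam2)   # e.g. validation accuracy (hypothetical callback)
        if score > best_score:
            best_params, best_score = (mu, lam1, lam2), score
    return best_params, best_score
```

With seven candidate values per parameter, this amounts to 343 evaluations of the scoring function.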

Comparison with Classic CCA, GTL and Multi-kernel SVM (MKSVM).

The classification performance of CCA, MKSVM, GTL, and our method is evaluated in three classification tasks (AD vs NC, MCI vs NC, and pMCI vs sMCI). The averaged classification accuracy (ACC), sensitivity (SEN), and specificity (SPE) with 10-fold cross-validation are summarized in Table 1. Our proposed method outperforms the competing classification methods in all three classification tasks, with significant improvement under a paired t-test (p < 0.001, designated by ‘*’ in Table 1).
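For reference, the three measures follow the standard definitions and can be computed per fold as in the sketch below. The function name and the convention of treating the patient group as the positive class (label 1) are assumptions, and the sketch assumes that both classes are present in the fold.

```python
import numpy as np

def acc_sen_spe(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (true positive rate), and specificity (true negative rate)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    acc = (tp + tn) / y_true.size
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    return acc, sen, spe
```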

Table 1. Comparison of classification performance by different methods.

Furthermore, we evaluate the classification performance w.r.t. the number of training samples, as shown in Fig. 2(b). It is clear that (1) our proposed method always achieves higher classification accuracy than both the CCA and MKSVM methods; and (2) all methods improve in classification accuracy as the number of training samples increases. It is worth noting that our proposed method achieves a large improvement over MKSVM when only 10% of the data is used as the training dataset. The reason is that supervised methods require a sufficient number of samples to train a reliable classifier. Since training samples with known labels are expensive to collect in the medical imaging area, this experiment indicates that our method has high potential to be deployed in current neuroimaging studies.

Comparison with Recently Published State-of-the-Art Methods.

Table 2 summarizes the subject information, imaging modality, and average classification accuracy of the state-of-the-art methods. These comparison methods represent four typical machine learning techniques. Since the classification between the pMCI and sMCI groups is not reported in [4, 10, 11], we only show the classification results for the AD vs NC and MCI vs NC tasks. Our method achieves higher classification accuracy than both the random forest and graph fusion methods, even though those two methods use additional CSF and genetic information.

Table 2. Comparison with the classification accuracies reported in the literature (%).

Discussion.

The deep learning approach in [11] learns the feature representation in a layer-by-layer manner. Thus, it is time-consuming to re-train the deep neural network from scratch. In contrast, our proposed method only uses hand-crafted features for classification. It is noteworthy that we can complete the classification on a new dataset (including greedy parameter tuning) within three hours on a regular PC (8 CPU cores and 16 GB memory), which is much more economical than the massive training cost in [11]. Complementary information in multi-modal data can help improve the classification performance. Therefore, in order to find the intrinsic data representation, we combine our proposed pGTL with multi-modal information.

4 Conclusion

In this paper, we present a novel pGTL method to identify individual subjects at different stages of AD progression, using multi-modal imaging data. Compared to conventional methods, our method seeks the intrinsic data representation, which can be learned from the observed imaging features and simultaneously validated on the existing labels of the training data. Since the learned intrinsic data representation is more relevant to label propagation, our method achieves promising classification performance in the AD vs NC, MCI vs NC, and pMCI vs sMCI tasks, as shown by comprehensive comparison with classic and recent state-of-the-art methods.