1 Introduction

Alzheimer’s Disease (AD), the most common form of dementia, is a neurodegenerative disorder that severely impairs patients’ thinking, memory and behavior. Current consensus emphasizes the need for early recognition of the disease, which makes it possible to stop or slow its progression [8]. The effectiveness of neuroimaging in predicting the progression of AD or cognitive performance has been reported in many studies [4, 12]. However, many previous studies based their predictions on baseline data only and thus neglected the correlations among longitudinal cognitive performance. Because AD is a progressive neurodegenerative disorder, it is important to discover neuroimaging measures that affect the progression of the disease along the time axis.

In the association study of predicting cognitive scores from imaging features, the input data usually consist of two matrices: the imaging feature matrix \(\mathcal {X}\) and the cognitive score matrix \(\mathcal {Y}\). Denote the number of samples as n, the number of features as d, and the number of different measures of a certain cognitive performance test as m. Then \(\mathcal {X}\) and \(\mathcal {Y}\) take the following form: \(\mathcal {X} = [X_1,\cdots ,~X_T] \in \mathbb {R}^{d \times nT}\) contains the imaging features at T consecutive time points, where \(X_t \in \mathbb {R}^{d \times n}\) is the imaging marker matrix at the t-th time point; \(\mathcal {Y} = [Y_1,\cdots ,~Y_T] \in \mathbb {R}^{n \times mT}\) contains the cognitive scores at T consecutive time points, with \(Y_t \in \mathbb {R}^{n \times m}\) denoting the measurements at the t-th time point.
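The data layout described above can be sketched in NumPy as follows. The sizes and the random values are purely illustrative assumptions, not the ADNI data; only the shapes follow the definitions of \(\mathcal {X}\) and \(\mathcal {Y}\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m, T = 5, 8, 3, 4  # hypothetical sizes: features, samples, measures, time points

# One (d x n) imaging matrix and one (n x m) score matrix per time point.
X_list = [rng.standard_normal((d, n)) for _ in range(T)]
Y_list = [rng.standard_normal((n, m)) for _ in range(T)]

# Concatenate along columns to obtain the longitudinal matrices.
X = np.hstack(X_list)  # shape (d, n*T)
Y = np.hstack(Y_list)  # shape (n, m*T)
```

Keeping the per-time-point blocks in lists is a convenience choice of this sketch, since later steps operate on each \(X_t\) and \(Y_t\) separately.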

If we regard the prediction of one cognitive measure at one time point as one task, then the association study between cognitive scores and imaging features can be treated as a multi-task problem. In our longitudinal setting, the number of tasks is mT. The goal of the association study is to find a weight matrix \(\mathcal {W} = [W_1,\cdots , W_T] \in \mathbb {R}^{d \times mT}\), with \(W_t \in \mathbb {R}^{d \times m}\) the weights at the t-th time point, which captures the relevant features for predicting the cognitive scores.

A straightforward method is to perform linear regression at each time point and determine \(W_t\) separately. However, linear regression treats all tasks independently and ignores the useful information contained in the changes along the time continuum. Since AD is a progressive neurodegenerative disorder and cognitive performance is an intuitive indicator of disease status, it is reasonable to regard the various tasks as related. In one cognitive assessment, the results of a certain measure at different time points may be correlated, and different cognitive measures at the same time point may also influence each other. To exploit these correlations among the cognitive scores, several multi-task models have been proposed.
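As a point of reference, the independent per-time-point regression can be sketched as below. This is a minimal illustration under the assumption that each \(X_t\) is stored as a (d, n) array and each \(Y_t\) as an (n, m) array; the function name `fit_independent_ls` is hypothetical.

```python
import numpy as np

def fit_independent_ls(X_list, Y_list):
    """Fit W_t separately per time point by least squares; returns W of shape (d, m*T)."""
    W_blocks = []
    for X_t, Y_t in zip(X_list, Y_list):
        # Solve min_W ||X_t^T W - Y_t||_F^2 with X_t: (d, n), Y_t: (n, m).
        W_t, *_ = np.linalg.lstsq(X_t.T, Y_t, rcond=None)
        W_blocks.append(W_t)
    return np.hstack(W_blocks)
```

Each block is estimated without looking at the other time points, which is exactly the independence that the multi-task models try to avoid.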

One possible method is the longitudinal \(\ell _{2,1}\)-norm regression model [6, 11]. In this model, the \(\ell _{2,1}\)-norm regularization enforces structured sparsity, which helps to detect features related to all the cognitive measures along the whole time axis. Moreover, under the assumption that imaging features may be correlated with each other and therefore overlap in their effects on brain structure or disease progression, the trace norm (also known as the nuclear norm) regularization can be used to impose a low-rank restriction. There are also models that combine these two regularization terms to enforce both structured sparsity and a low-rank constraint [13, 14].
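For concreteness, the two regularizers mentioned above can be computed as in the following sketch. It assumes the weight matrix is stored with one row per imaging feature, so that the \(\ell _{2,1}\)-norm sums the Euclidean norms of the rows; the function names are illustrative.

```python
import numpy as np

def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W (one row per feature)."""
    return np.sum(np.linalg.norm(W, axis=1))

def trace_norm(W):
    """Sum of the singular values of W (nuclear norm)."""
    return np.sum(np.linalg.svd(W, compute_uv=False))
```

A row of zeros contributes nothing to `l21_norm`, which is why penalizing it tends to switch off whole features across all tasks, whereas `trace_norm` penalizes the spectrum and so favors low rank.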

These models impose trace norm regularization on the whole parameter matrix, so that the common subspace globally shared by all prediction tasks can be extracted. However, the longitudinal prediction tasks may be interrelated in different groups rather than as a whole. A straightforward way to discover such interrelated groups is to conduct a clustering analysis first and extract the group structures. However, such a heuristic step is independent of the longitudinal learning model, so the detected group structures are not optimal for the longitudinal learning process.

To address this challenging problem, we propose a novel longitudinal structured low-rank learning model to uncover the interrelations among different cognitive measures and utilize the learned interrelated structures to enhance cognitive function prediction tasks.

2 Longitudinal Structured Low-Rank Regression Model

In our multi-task problem, suppose the mT tasks come from c groups, where the tasks in each group are correlated. We introduce and optimize a set of group index matrices \(Q = \{Q_1,~Q_2,\ldots ,Q_c\}\) to discover this group structure. Each \(Q_i\) is a diagonal matrix with \({Q_i} \in \{0,1\}^{mT \times mT}\) indicating the assignment of tasks to the i-th group. For the k-th diagonal element of \(Q_i\), \((Q_i)_{kk} = 1\) means that the k-th task belongs to the i-th group, while \((Q_i)_{kk} = 0\) means that it does not. To avoid overlap between groups, we constrain \(\sum _{i = 1}^c {{Q_i}} = I\).
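The group index matrices can be illustrated with a small sketch that builds them from a vector of group labels. The label vector and the helper name `group_index_matrices` are hypothetical and serve only to show the structure of the \(Q_i\).

```python
import numpy as np

def group_index_matrices(labels, c):
    """Build diagonal 0/1 matrices Q_1..Q_c from a length-mT vector of group labels."""
    labels = np.asarray(labels)
    return [np.diag((labels == i).astype(float)) for i in range(c)]

labels = [0, 0, 1, 2, 1, 0]  # hypothetical assignment of 6 tasks to 3 groups
Q = group_index_matrices(labels, c=3)
```

Because every task carries exactly one label, the matrices built this way sum to the identity, and multiplying \(\mathcal {W}\) by \(Q_i\) on the right keeps the columns of the i-th group and zeroes out the others.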

Since the tasks in each group are correlated, we can reasonably assume that the latent subspace of each group has a low-rank structure. We impose the Schatten p-norm as a low-rank constraint to uncover the common subspace shared by different tasks. As discussed below, the Schatten p-norm approximates the low-rank constraint better than the popular trace norm regularization [7].

For a matrix \(A \in \mathbb {R}^{d \times n}\), let \(\sigma _i\) be its i-th singular value. The rank of A can then be written as \(\mathrm {rank}(A) = \sum _{i=1}^{\min \{d, n\}} \sigma _i^0\), with the convention \(0^0 = 0\). The p-th power of the Schatten p-norm (\(0< p <\infty \)) of A is defined as \(\left\| A \right\| _{S_p}^p = Tr((A^TA)^{\frac{p}{2}})= \sum _{i=1}^{\min \{d, n\}} \sigma _i^p\). In particular, when p = 1, the Schatten p-norm of A is exactly its trace norm: \(\left\| A \right\| _{S_1} = Tr((A^TA)^{\frac{1}{2}}) = \sum _{i=1}^{\min \{d, n\}} \sigma _i = \left\| A \right\| _{*}\).
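The definition above translates directly into a few lines of NumPy. The sketch below computes \(\left\| A \right\| _{S_p}^p\) from the singular values; the function name and the example matrix are illustrative assumptions.

```python
import numpy as np

def schatten_p_power(A, p):
    """Return ||A||_{S_p}^p, the sum of the p-th powers of the singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s ** p)

A = np.diag([2.0, 0.5, 0.0])          # a rank-2 example matrix
trace_norm_value = schatten_p_power(A, 1.0)
small_p_value = schatten_p_power(A, 0.1)
```

For this rank-2 example, the value with p = 0.1 lies much closer to the rank than the value with p = 1 does, which is the behavior the low-rank argument relies on.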

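As p approaches 0, \(\sigma _i^p\) approaches \(\sigma _i^0\) for every singular value, so \(\left\| A \right\| _{S_p}^p\) approaches \(\mathrm {rank}(A)\). Hence, when \(0< p < 1\), the Schatten p-norm is a closer approximation of the rank, and thus a better low-rank regularization, than the trace norm. Accordingly, our longitudinal structured low-rank regression model is: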

$$\begin{aligned} \begin{aligned} \mathop {\min }\limits _{\left. {\mathcal {W},{Q_i}} \right| _{i = 1}^c \in \{0,1\}^{mT \times mT}, \sum \limits _{i = 1}^c {{Q_i}} = I}&\sum \limits _{t = 1}^T {\left\| {W_t^T{X_t} - {Y_t}} \right\| _F^2} + \gamma \sum \limits _{i = 1}^c ({{{\left\| {\mathcal {W}{Q_i}} \right\| _{{S_p}}^p}}})^l. \end{aligned} \end{aligned}$$
(1)

In Problem (1), the grouping structure tends to be unstable when p is small, so we add a power parameter l to the regularization term to make the model more robust. The resulting objective function is non-convex and non-smooth, and therefore difficult to solve. In the next section, we propose an alternating optimization method for Problem (1).
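To make the objective of Problem (1) concrete, the sketch below evaluates it for given \(W_t\) and \(Q_i\). It assumes \(X_t\) is stored as (d, n) and \(Y_t\) as (n, m), so the residual is formed as \(W_t^T X_t - Y_t^T\) to make the shapes agree; this storage convention and the function name `objective` are assumptions of the sketch.

```python
import numpy as np

def objective(W_list, X_list, Y_list, Q, gamma, p, l):
    """Squared loss plus the grouped Schatten-p regularizer of Problem (1)."""
    # Residual W_t^T X_t - Y_t^T has shape (m, n) when Y_t is stored as (n, m).
    loss = sum(
        np.linalg.norm(W_t.T @ X_t - Y_t.T, "fro") ** 2
        for W_t, X_t, Y_t in zip(W_list, X_list, Y_list)
    )
    W = np.hstack(W_list)  # (d, m*T)
    reg = 0.0
    for Q_i in Q:
        # Singular values of the columns of W that belong to group i.
        s = np.linalg.svd(W @ Q_i, compute_uv=False)
        reg += np.sum(s ** p) ** l
    return loss + gamma * reg
```

With a single group, p = 1 and l = 1, the regularizer reduces to the trace norm of \(\mathcal {W}\), which gives a quick sanity check of an implementation.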

3 Optimization Algorithm for Solving Problem (1)

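Since every \(Q_i\) is a diagonal 0/1 matrix, it satisfies \({Q_i}^2 = Q_i\), and hence \((\mathcal {W}Q_i)(\mathcal {W}Q_i)^T = \mathcal {W}Q_i\mathcal {W}^T\). Using this property, Problem (1) can be rewritten as: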

$$\begin{aligned} \begin{aligned} \mathop {\min }\limits _{{\mathcal {W},\left. {{Q_i}} \right| _{i = 1}^c \in \{0,1\}^{mT \times mT},\sum \limits _{i = 1}^c {{Q_i}} = I}} \sum \limits _{t = 1}^T {\left\| {W_t^T{X_t} - {Y_t}} \right\| _F^2} + \gamma \sum \limits _{i = 1}^c Tr({\mathcal {W}^T D_i \mathcal {W} Q_i}), \end{aligned} \end{aligned}$$
(2)

where \(D_i\) is defined as:

$$\begin{aligned} D_i = \frac{lp}{2}({{{\left\| {\mathcal {W}{Q_i}} \right\| _{{S_p}}^p}}})^{l-1}({\mathcal {W}{Q_i} \mathcal {W}^T})^{\frac{p-2}{2}}. \end{aligned}$$
(3)
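Eq. (3) involves a negative matrix power, which is undefined when \(\mathcal {W}Q_i\mathcal {W}^T\) is singular. The sketch below therefore adds a small multiple of the identity before taking powers; this smoothing term `eps`, like the function name `compute_D`, is an assumption of the sketch and not part of the model as stated.

```python
import numpy as np

def compute_D(W, Q_i, p, l, eps=1e-8):
    """D_i of Eq. (3), computed from an eigendecomposition of W Q_i W^T + eps * I."""
    d = W.shape[0]
    # eps * I is an added assumption that keeps the negative matrix power finite.
    M = W @ Q_i @ W.T + eps * np.eye(d)
    evals, evecs = np.linalg.eigh(M)
    evals = np.maximum(evals, eps)
    schatten = np.sum(evals ** (p / 2))                  # Tr(M^{p/2}), smoothed ||W Q_i||_{S_p}^p
    M_pow = (evecs * evals ** ((p - 2) / 2)) @ evecs.T   # M^{(p-2)/2}
    return 0.5 * l * p * schatten ** (l - 1) * M_pow
```

When \(\mathcal {W}Q_i\mathcal {W}^T\) has full rank and `eps` is tiny, \(Tr(\mathcal {W}^T D_i \mathcal {W} Q_i)\) is close to \(\frac{lp}{2}(\left\| \mathcal {W}Q_i \right\| _{S_p}^p)^l\), which can serve as a numerical check.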

We can solve Problem (2) via an alternating optimization method.

The first step is to fix \(\mathcal {W}\) and solve for Q, in which case Problem (2) becomes:

$$\begin{aligned} \begin{aligned} \mathop {\min }\limits _{\left. {{Q_i}} \right| _{i = 1}^c \in \{0,1\}^{mT \times mT},\sum \limits _{i = 1}^c {{Q_i}} = I} \sum \limits _{i = 1}^c Tr((\mathcal {W}^T D_i \mathcal {W})Q_i). \end{aligned} \end{aligned}$$
(4)

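Let \(A_i = \mathcal {W}^T D_i \mathcal {W}\). Because each \(Q_i\) is diagonal, the objective in Problem (4) equals \(\sum _{i = 1}^c \sum _{k = 1}^{mT} (A_i)_{kk}(Q_i)_{kk}\), so each task k can be assigned independently to the group with the smallest \((A_i)_{kk}\). The solution for \(Q_i\) is therefore: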

$$\begin{aligned} {(Q_i)_{kk}} = \left\{ {\begin{array}{lc} 1&{}\quad {i = \arg \mathop {\min }\limits _j {(A_j)_{kk}}}\\ 0&{}\quad {otherwise} \end{array}} \right. \end{aligned}$$
(5)
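The assignment rule in Eq. (5) amounts to a column-wise arg-min over the diagonals of the \(A_i\). A minimal sketch, with the hypothetical function name `update_Q` and ties broken in favor of the lowest group index (an assumption, since the text does not specify tie handling):

```python
import numpy as np

def update_Q(W, D_list):
    """Assign each task k to the group i with the smallest (W^T D_i W)_kk, as in Eq. (5)."""
    # Row i holds the diagonal of A_i = W^T D_i W; einsum avoids forming the full product.
    A_diag = np.stack([np.einsum("dk,de,ek->k", W, D_i, W) for D_i in D_list])
    winners = np.argmin(A_diag, axis=0)  # one group index per task
    return [np.diag((winners == i).astype(float)) for i in range(len(D_list))]
```

Since every task receives exactly one winner, the returned matrices automatically satisfy the constraint that they sum to the identity.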

The second step is to fix Q and solve for \(\mathcal {W}\), in which case Problem (2) becomes:

$$\begin{aligned} \begin{aligned} \mathop {\min }\limits _{\mathcal {W}} \sum \limits _{t = 1}^T {\left\| {W_t^T{X_t} - {Y_t}} \right\| _F^2} + \gamma \sum \limits _{i = 1}^c Tr({\mathcal {W}^T D_i \mathcal {W} Q_i}). \end{aligned} \end{aligned}$$
(6)

Write \(Q_i\) in block-diagonal form as \(Q_i = diag(Q_{i1}, Q_{i2}, \ldots , Q_{iT})\), where \(Q_{it} \in \{0,1\}^{m \times m}\) is the block for the t-th time point. Since \(Tr({\mathcal {W}^T D_i \mathcal {W} Q_i}) = \sum \limits _{t = 1}^T Tr({W_t^T D_i W_t Q_{it}})\), we can decouple Problem (6) for each t:

$$\begin{aligned} \begin{aligned} \mathop {\min }\limits _{W_t} {\left\| {W_t^T{X_t} - {Y_t}} \right\| _F^2} + \gamma \sum \limits _{i = 1}^c Tr({W_t^T D_i W_t Q_{it}}). \end{aligned} \end{aligned}$$
(7)

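Problem (7) can be further decoupled for each column \((\mathbf {w_t})_k\) of \(W_t\), with \((\mathbf {y_t})^k\) denoting the scores of the n samples on the k-th measure at the t-th time point, as follows: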

$$\begin{aligned} \begin{aligned} \mathop {\min }\limits _{(\mathbf {w_t})_k} {\left\| {(\mathbf {w_t}^T)_k{X_t} - {(\mathbf {y_t})^k}} \right\| _2^2} + \gamma ~Tr({(\mathbf {w_t}^T)_k (\sum \limits _{i = 1}^c (Q_{it})_{kk} D_i) (\mathbf {w_t})_k }). \end{aligned} \end{aligned}$$
(8)

Taking the derivative of the objective in Problem (8) w.r.t. \((\mathbf {w_t})_k\) and setting it to zero, we get:

$$\begin{aligned} (\mathbf {w_t})_k = ({X_t}{X_t}^T + \gamma ~(\sum \limits _{i = 1}^c (Q_{it})_{kk} D_i))^{-1}{X_t}((\mathbf {y_t})^k)^T. \end{aligned}$$
(9)
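The closed-form update in Eq. (9) is one regularized linear solve per task. The sketch below assumes \(X_t\) is stored as (d, n) and \(Y_t\) as (n, m) with one column per cognitive measure; the function name `update_W_t` and the argument layout are illustrative.

```python
import numpy as np

def update_W_t(X_t, Y_t, D_list, Q_t_list, gamma):
    """Closed-form update of Eq. (9) for every column of W_t.

    X_t: (d, n); Y_t: (n, m), one column per measure;
    D_list: c matrices of shape (d, d); Q_t_list: the c diagonal blocks Q_{it}, each (m, m).
    """
    d, m = X_t.shape[0], Y_t.shape[1]
    G = X_t @ X_t.T
    W_t = np.zeros((d, m))
    for k in range(m):
        # Weighted sum of the D_i; with non-overlapping groups only one term is nonzero.
        D_k = sum(Q_it[k, k] * D_i for D_i, Q_it in zip(D_list, Q_t_list))
        # Solve the linear system instead of forming the inverse explicitly.
        W_t[:, k] = np.linalg.solve(G + gamma * D_k, X_t @ Y_t[:, k])
    return W_t
```

With a single group and \(D_1 = I\), the update reduces to ordinary ridge regression, which is a convenient special case for testing.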

We iteratively update Q, \(\mathcal {W}\) and D with the alternating steps described above. The algorithm for solving Problem (2) is summarized in Algorithm 1.

Algorithm 1. Alternating optimization algorithm for solving Problem (2).
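The alternating steps of Eqs. (3), (5) and (9) can be put together as in the following sketch. It is not a reproduction of Algorithm 1: the random initialization, the fixed iteration count, the smoothing term `eps`, the update order within an iteration and the names `matrix_D` and `fit` are all assumptions made for illustration. \(X_t\) is assumed to be stored as (d, n) and \(Y_t\) as (n, m).

```python
import numpy as np

def matrix_D(W, q, p, l, eps=1e-8):
    """Eq. (3) for one group; q is the 0/1 diagonal of Q_i. eps is an assumed smoother."""
    WQ = W * q  # same as W @ diag(q): keeps the columns of this group
    evals, evecs = np.linalg.eigh(WQ @ W.T + eps * np.eye(W.shape[0]))
    evals = np.maximum(evals, eps)
    schatten = np.sum(evals ** (p / 2))
    M_pow = (evecs * evals ** ((p - 2) / 2)) @ evecs.T
    return 0.5 * l * p * schatten ** (l - 1) * M_pow

def fit(X_list, Y_list, c, gamma, p=0.1, l=3, n_iter=20, seed=0):
    """Alternate between the D, Q and W updates; returns W (d, m*T) and task labels."""
    rng = np.random.default_rng(seed)
    d, m, T = X_list[0].shape[0], Y_list[0].shape[1], len(X_list)
    W = rng.standard_normal((d, m * T))      # assumed random initialization
    labels = rng.integers(0, c, size=m * T)  # assumed random initial grouping
    for _ in range(n_iter):
        D = [matrix_D(W, (labels == i).astype(float), p, l) for i in range(c)]
        # Q step, Eq. (5): each task joins the group with the smallest (W^T D_i W)_kk.
        A_diag = np.stack([np.einsum("dk,de,ek->k", W, D_i, W) for D_i in D])
        labels = np.argmin(A_diag, axis=0)
        # W step, Eq. (9): one regularized linear solve per task.
        for t in range(T):
            G, XY = X_list[t] @ X_list[t].T, X_list[t] @ Y_list[t]
            for k in range(m):
                j = t * m + k  # column of task (t, k) in W = [W_1, ..., W_T]
                W[:, j] = np.linalg.solve(G + gamma * D[labels[j]], XY[:, k])
    return W, labels
```

The group assignment is kept as a label vector rather than as explicit \(Q_i\) matrices, which is equivalent under the non-overlap constraint and avoids storing c matrices of size mT by mT. A stopping rule based on the change in the objective could replace the fixed `n_iter`.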

Convergence Analysis: Our algorithm uses an alternating optimization method, whose convergence has been proved in [1]. In Algorithm 1, every variable has a closed-form solution in each iteration and can be computed quickly. In the following experiments on the ADNI data, each iteration takes about 0.005 s, and our method usually converges within one second.

4 Experimental Results

In this section, we evaluate the prediction performance of our proposed method by applying it to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database.

4.1 Data Description

Data used in the preparation of this article were obtained from the ADNI database (adni.loni.usc.edu). Each MRI T1-weighted image was first anterior commissure (AC)-posterior commissure (PC) corrected using MIPAV, intensity inhomogeneity corrected using the N3 algorithm [10], skull stripped [16] with manual editing, and cerebellum-removed [15]. We then used FAST [17] in the FSL package to segment the image into gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF), and used HAMMER [9] to register the images to a common space. GM volumes obtained from 93 ROIs defined in [5], normalized by the total intracranial volume, were extracted as features. Longitudinal scores were downloaded from three independent cognitive assessments: the Fluency Test, Rey’s Auditory Verbal Learning Test (RAVLT) and the Trail Making Test (TRAILS). The details of these cognitive assessments can be found in the ADNI procedure manuals. The time points examined in this study for both imaging markers and cognitive assessments were baseline (BL), Month 6 (M6), Month 12 (M12) and Month 24 (M24). All participants with no missing BL/M6/M12/M24 MRI measurements and cognitive measures were included in this study. A total of 385 subjects were involved, comprising 56 AD samples, 181 MCI samples and 148 healthy control (HC) samples. Seven cognitive scores were included: (1) RAVLT TOTAL, RAVLT TOT6 and RAVLT RECOG scores from the RAVLT cognitive assessment; (2) FLU ANIM and FLU VEG scores from the Fluency cognitive assessment; (3) Trails A and Trails B scores from the Trail Making Test.

4.2 Performance Comparison on the ADNI Cohort

We first evaluate the ability of our method to predict a given set of cognitive scores from neuroimaging markers. We tracked the process along the time axis and aimed to find the set of markers that influence the cognitive scores across the time points. As evaluation metrics, we report the Root Mean Square Error (RMSE) and the Correlation Coefficient (CorCoe) between the predicted scores and the ground truth.
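The two metrics can be computed as in the sketch below. It assumes that CorCoe is the Pearson correlation and that predictions for several measures are flattened into one vector before the metrics are taken; the text does not specify how scores are aggregated, so both are assumptions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between ground truth and prediction."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def corcoe(y_true, y_pred):
    """Pearson correlation coefficient between flattened ground truth and prediction."""
    return np.corrcoef(np.ravel(y_true), np.ravel(y_pred))[0, 1]
```

RMSE depends on the scale of the scores, while the correlation coefficient does not, which is one reason to report both.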

We compared our method with the counterparts discussed in the introduction: Multivariate Linear Regression (MLR), Multivariate Ridge Regression (MRR), Longitudinal Trace-norm Regression (LTR), Longitudinal \(\ell _{2,1}\)-norm Regression (L21R) and their combination (L21R + LTR). To illustrate the advantage of learning task correlations and longitudinal features simultaneously, we also used as a baseline the method that first clusters the tasks with K-means and then applies LTR within each group (K-means + LTR).

Table 1. Cognitive assessment FLUENCY, RAVLT and TRAILS prediction comparison via RMSE and CorCoe. Better performance corresponds to lower RMSE or higher CorCoe value.

We used 10-fold cross validation and repeated it 50 times for each method. The average RMSE and CorCoe over these 500 trials are reported. Since MLR and MRR were not designed for longitudinal tasks, we computed their weight matrices for each time point separately and then merged them into the final weight matrix according to the definition \(\mathcal {W} = [W_1,\cdots ,W_T]\). In this experiment, the number of time points T is 4. Our initial analyses indicated that our model performs fairly stably when parameter l is chosen from \(\{2, 2.5, \dots , 5\}\) and parameter p from \(\{0.1, 0.2, \dots , 0.8\}\) (data not shown). In our experiments, we fixed \(p = 0.1\) and \(l = 3\).
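The repeated cross-validation protocol can be sketched as an index generator that yields 50 times 10 train/test splits. The shuffling scheme, the seed and the function name `repeated_kfold_indices` are assumptions; the text only states the number of folds and repetitions.

```python
import numpy as np

def repeated_kfold_indices(n, k=10, repeats=50, seed=0):
    """Yield (train, test) index arrays for `repeats` independent k-fold partitions."""
    rng = np.random.default_rng(seed)
    for _ in range(repeats):
        perm = rng.permutation(n)  # fresh shuffle of the subjects for every repetition
        for fold in np.array_split(perm, k):
            train = np.setdiff1d(perm, fold)
            yield train, fold

splits = list(repeated_kfold_indices(n=385))  # 385 subjects, as in the ADNI cohort used here
```

Splitting by subject index keeps all time points of one participant in the same fold, so the same indices can be applied to every \(X_t\) and \(Y_t\).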

The experimental results are summarized in Table 1. Our method consistently outperforms all other methods on all data sets. The reasons are as follows. MLR and MRR assume that the cognitive measures at different time points are independent and therefore do not consider the correlations along time. This neglect of the longitudinal correlation within the data is detrimental to their prediction ability. L21R, LTR and their combination L21R + LTR take the longitudinal information into account, but they cannot handle the possible group structure within the cognitive scores. That is why they outperform standard methods such as MLR and MRR in most cases but are inferior to our proposed method. For K-means + LTR, the clustering step is detached from the longitudinal association study, so the learned interrelation structure is not optimal for the subsequent longitudinal learning process. Our proposed method not only captures the longitudinal correlations among imaging features but also detects the group structure within the cognitive scores. As discussed in the theoretical sections, our model is able to find features that affect the cognitive results at different stages while clustering the cognitive results into groups. Thus, our model can capture features responsible for some, but not necessarily all, cognitive measures along the time continuum, which preserves more useful information for the prediction.

Fig. 1. Heat maps of our learned weight matrices on the RAVLT cognitive assessment via MRI data. The weight matrices at four time points, BL, M6, M12 and M24, are plotted. We draw two matrices for each time point, where the left figure is for the left hemisphere and the right figure for the right hemisphere. For each weight matrix, columns denote neuroimaging features while rows represent three different RAVLT scores, which are RAVLT TOTAL, RAVLT TOT6 and RAVLT RECOG, respectively. Imaging features (columns) with larger weights possess higher correlation with the corresponding cognitive measure.

4.3 Identification of Longitudinal Imaging Markers

We further take the RAVLT assessment as an example to analyze the results of our model. RAVLT is composed of three cognitive measures: (1) the total number of words remembered by the participant in the first five trials, RAVLT TOTAL; (2) the number of words recalled during the 6th trial, RAVLT TOT6; and (3) the number of words recognized after a gap of 30 min, RAVLT RECOG. Intuitively, these three measures should be interrelated and therefore clustered into the same group by our model. The results are consistent with this expectation: regardless of the value of c (the number of groups), our model invariably puts all three measures into the same group. In particular, when c is larger than the real number of groups, the extra groups become empty.

Figure 1 shows the heat maps of the weight matrices learned by our method. The figures show that the model captures a small set of features that are consistently associated with a certain group of cognitive measures (here the group includes all measures). Among the selected features, the top two are the hippocampal formation and the thalamus, whose associations with AD have been reported in previous papers [2, 3]. In summary, our model is able to select a small set of features that consistently correlate with a certain group of cognitive measures along the time axis, and the relevance of the selected features is supported by previous reports in the literature.

5 Conclusion

In this paper, we proposed a novel longitudinal structured low-rank regression model for longitudinal cognitive score prediction. Our model simultaneously uncovers the interrelation structures among different prediction tasks and uses the learned structures to enhance the longitudinal learning model. Moreover, we used the Schatten p-norm to extract the common subspace shared by the prediction tasks. We applied the new model to the ADNI cohort for cognitive impairment prediction using MRI data. Empirical results validate the effectiveness of our model and show its potential to provide a reference for current clinical research.