1 Introduction

Accruing scientific evidence has demonstrated that neuroimaging techniques, such as magnetic resonance imaging (MRI), are important for the detection of early Alzheimer's Disease (AD) [2, 4, 7, 13]. Current American Academy of Neurology (AAN) guidelines [3] for dementia diagnosis recommend imaging to identify structural brain diseases that can cause cognitive impairment. Because AD is a neurodegenerative disorder characterized by progressive impairment of cognitive functions, it is important to quantify the degree of brain impairment and its influence on the performance of cognitive tests. As a result, many studies have focused on using regression models to predict cognitive scores and track AD progression [10, 11]. In [10], voxel-based morphometry (VBM) features extracted from the entire brain were analyzed by the relevance vector regression method to predict different clinical scores individually. However, different neuroimaging features and different cognitive scores are often interrelated. To exploit these interrelations, several recent studies, such as [11, 12], employed multi-task learning models to uncover the inherent structures among neuroimaging features and cognitive scores. Low-rank regularization is an effective method for extracting the common subspace shared by multiple tasks. Although the trace norm is a widely used convex relaxation of the rank function [1], it is easily influenced by large singular values. For example, when the largest singular values of a matrix M increase, the rank of M does not change, but the trace norm of M increases correspondingly.

To address these problems, in this paper we propose a novel multi-task learning model that learns the associations between neuroimaging features and cognitive scores and uncovers the low-rank common subspace among different tasks by minimizing the k smallest singular values. Our new k minimal singular values regularization is a tighter relaxation of rank minimization than the trace norm, so our new multi-task learning model can achieve better prediction performance. We derive a new optimization algorithm to solve the proposed objective function and prove its convergence. The proposed model is applied to analyze data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort [16]. In all empirical results, our new multi-task learning method consistently outperforms the widely used multivariate regression method as well as several state-of-the-art multi-task learning approaches.

2 New Multi-task Learning Model

2.1 New Objective Function

In our new model, we focus on minimizing the k smallest singular values of W while ignoring the largest ones, so that our new regularization function is a tighter relaxation of the rank function than the trace norm. Thus, we propose to solve the following problem for multi-task learning:

$$\begin{aligned} J_{_{opt}} = \mathop {\min }\limits _{W = [{W_1},...,{W_T}]} \sum \limits _{t = 1}^T {f(W_t^T{X_t},{Y_t})} + \gamma \sum \limits _{i = 1}^k {\sigma _i(W)} \end{aligned}$$
(1)

Suppose there are T learning tasks, where the t-th task has \(n_t\) training data points \(X_t=[x_1^t,x_2^t,...,x_{n_t}^t] \in \mathbb {R}^{d \times n_t}\). For each data point \(x_i^t\), a label \(y_i^t\) is given, forming the label matrix \(Y_t=[y_1^t,y_2^t,...,y_{n_t}^t] \in \mathbb {R}^{c_t \times n_t}\) for task t. \(W_t \in \mathbb {R}^{d\times c_t}\) is the projection matrix to be learned, \(W \in \mathbb {R}^{d\times c}\), and \(c=\sum \limits _{t=1}^T c_t\). Here \(\sigma _i(W)\) denotes the i-th smallest singular value of W.

It is interesting to see that when \(\gamma \) is large enough, the k smallest singular values of the optimal solution W to problem (1) will be zero, because all singular values of a matrix are non-negative. That is, when \(\gamma \) is large enough, problem (1) is equivalent to constraining the rank of W to be at most \(r=m-k\), where \(m=\min (d,c)\) is the number of singular values of W.
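
To make this concrete, here is a minimal numerical sketch (in NumPy; the matrix and its singular values are our own illustrative choices, not data from the paper) of why the sum of the k smallest singular values tracks the rank while the trace norm does not:

```python
import numpy as np

def k_smallest_sv_sum(W, k):
    """Sum of the k smallest singular values of W (the regularizer in Eq. (1))."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending order
    return s[-k:].sum()

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((10, 6)))
V, _ = np.linalg.qr(rng.standard_normal((6, 6)))
s = np.array([5.0, 3.0, 1.0, 1e-9, 1e-9, 1e-9])  # numerical rank 3
W = U @ np.diag(s) @ V.T

print(np.linalg.svd(W, compute_uv=False).sum())  # trace norm, about 9.0
print(k_smallest_sv_sum(W, k=3))                 # about 0, reflecting rank 3

# Inflate the largest singular value: the rank is unchanged and the
# k-smallest sum stays near zero, but the trace norm grows.
W_big = W + 4.0 * np.outer(U[:, 0], V[:, 0])
print(np.linalg.svd(W_big, compute_uv=False).sum())  # about 13.0
print(k_smallest_sv_sum(W_big, k=3))                 # still about 0
```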

2.2 Optimization Algorithm

From the definition of \(\left\| W \right\| _*\) and the singular value decomposition of W, it is known that:

$$\begin{aligned} \sum \limits _{i = 1}^k {{\sigma _i}(W)} = {\left\| W \right\| _*} - \mathop {\max }\limits _{F \in {\mathbb {R}^{d \times r}},{F^T}F = I,\atop G \in {\mathbb {R}^{c \times r}},{G^T}G = I} Tr({F^T}WG)\,, \end{aligned}$$
(2)

where \(\left\| W \right\| _*\) is the sum of all the singular values of W, and the optimal value of the maximization term is the sum of the r largest singular values of W, attained when the columns of F are the corresponding r left singular vectors of W and the columns of G are the corresponding r right singular vectors of W.

According to Eq. (2), the objective \(J_{_{opt}}\) in Eq. (1) is equivalent to:

$$\begin{aligned} \!\! \mathop {\min }\limits _{W = [{W_1},...,{W_T}],\atop {F \in {\mathbb {R}^{d \times r}},{F^T}F = I,\atop G \in {\mathbb {R}^{c \times r}},{G^T}G = I}} \sum \limits _{t = 1}^T {f(W_t^T{X_t},{Y_t})} + \gamma {\left\| W \right\| _*} - \gamma Tr({F^T}WG)\,. \end{aligned}$$
(3)

When W is fixed, the problem (3) becomes:

$$\begin{aligned} \mathop {\max }\limits _{F \in {\mathbb {R}^{d \times r}},{F^T}F = I,\atop G \in {\mathbb {R}^{c \times r}},{G^T}G = I} Tr({F^T}WG) \end{aligned}$$
(4)

The optimal solution F to problem (4) is formed by the r left singular vectors of W corresponding to the r largest singular values, and the optimal solution G is formed by the corresponding r right singular vectors of W.
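
As a sanity check, the following NumPy sketch (our own illustration; `top_r_factors` is a hypothetical helper name) computes the optimal F and G of problem (4) by a truncated SVD and verifies the identity in Eq. (2) numerically:

```python
import numpy as np

def top_r_factors(W, r):
    """Solution of Eq. (4): the r left/right singular vectors of W
    associated with its r largest singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r], Vt[:r, :].T          # F (d x r), G (c x r)

# Numerical check of the identity in Eq. (2):
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 5))
k = 2
r = min(W.shape) - k
F, G = top_r_factors(W, r)
s = np.linalg.svd(W, compute_uv=False)
lhs = s[-k:].sum()                        # sum of the k smallest singular values
rhs = s.sum() - np.trace(F.T @ W @ G)     # ||W||_* - Tr(F^T W G)
assert np.isclose(lhs, rhs)
```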

When F and G are fixed, we define:

$$\begin{aligned} g({W_t}) = f(W_t^T{X_t},{Y_t}) - \gamma Tr(W_t^TF{G_t^T}), \end{aligned}$$
(5)

where \(G_t \in \mathbb {R}^{c_t \times r}\) denotes the block of rows of G corresponding to task t, so that \(Tr(F^TWG) = \sum _{t=1}^T Tr(W_t^TFG_t^T)\). Then problem (3) becomes:

$$\begin{aligned} \mathop {\min }\limits _{W = [{W_1},...,{W_T}]} \sum \limits _{t = 1}^T {g({W_t})} + \gamma {\left\| W \right\| _*}. \end{aligned}$$
(6)

Using the reweighted method [6], we can solve problem (6) by iteratively solving the following problem:

$$\begin{aligned} \mathop {\min }\limits _{W = [{W_1},...,{W_T}]} \sum \limits _{t = 1}^T {g({W_t})} + \gamma \sum \limits _{t = 1}^T {Tr(W_t{W_t^T}D)}, \end{aligned}$$
(7)

where D is computed from the solution \(W^*\) of the previous iteration and is defined as:

$$\begin{aligned} D = \frac{1}{2}(W^* {W^*}^T )^{ - \frac{1}{2}}. \end{aligned}$$
(8)
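
A minimal sketch of this reweighting step (the eigendecomposition route and the small eps safeguard against rank deficiency are our own numerical choices, not part of the paper):

```python
import numpy as np

def reweight_D(W, eps=1e-8):
    """D = 0.5 * (W W^T)^(-1/2), Eq. (8), via an eigendecomposition of the
    symmetric PSD matrix W W^T.  eps floors tiny eigenvalues so the inverse
    square root stays well defined when W W^T is rank deficient."""
    vals, vecs = np.linalg.eigh(W @ W.T)
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T
    return 0.5 * inv_sqrt
```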

In problem (7), the subproblems for different tasks are independent of each other. Thus, if we use the least squares loss function, the objective for each task t can be written as:

$$\begin{aligned} \mathop {\min }\limits _{W_t, b_t} { \left\| {W_t^T X_t + b_t \mathbf {1}_t^T - Y_t } \right\| _F^2-\gamma Tr(W_t^TF{G_t^T})} + \gamma {Tr(W_t{W_t^T}D)}. \end{aligned}$$
(9)

Taking the derivatives of Eq. (9) with respect to \(b_t\) and \(W_t\) and setting them to zero, we obtain the optimal solution to problem (9):

$$\begin{aligned} {W_t} = {({X_t}HX_t^T + \gamma D)^{ - 1}}({X_t}HY_t^T + \frac{1}{2}\gamma F{G_t^T}), \quad H = I - \frac{1}{n_t} \mathbf {1}_t\mathbf {1}_t^T, \end{aligned}$$
(10)
$$\begin{aligned} b_t = \frac{1}{{n_t }}Y_t \mathbf {1}_t - \frac{1}{{n_t }}W_t^T X_t \mathbf {1}_t, \end{aligned}$$
(11)

where H is the centering matrix and \(\mathbf {1}_t \in \mathbb {R}^{n_t}\) is the all-ones vector.
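
A per-task sketch of Eqs. (10) and (11) in NumPy (assuming the least squares loss; `update_task` and its argument names are our own):

```python
import numpy as np

def update_task(Xt, Yt, F, Gt, D, gamma):
    """Closed-form update of (W_t, b_t) from Eqs. (10)-(11).
    Xt: d x n_t features, Yt: c_t x n_t labels, F: d x r, Gt: c_t x r."""
    nt = Xt.shape[1]
    H = np.eye(nt) - np.ones((nt, nt)) / nt            # centering matrix
    A = Xt @ H @ Xt.T + gamma * D                      # d x d
    B = Xt @ H @ Yt.T + 0.5 * gamma * F @ Gt.T         # d x c_t
    Wt = np.linalg.solve(A, B)                         # Eq. (10)
    bt = (Yt - Wt.T @ Xt).mean(axis=1)                 # Eq. (11)
    return Wt, bt
```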

We summarize the detailed algorithm to solve the objective \(J_{_{opt}}\) in Algorithm 1.

Algorithm 1. Solve the objective \(J_{_{opt}}\) in Eq. (1). Initialize W; repeat: (1) update F and G by the r left and right singular vectors of W corresponding to its r largest singular values (Eq. (4)); (2) update D by Eq. (8); (3) update each \(W_t\) and \(b_t\) by Eqs. (10) and (11); until convergence.
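
Putting the pieces together, here is a sketch of the full alternating loop (reusing `top_r_factors`, `reweight_D`, and `update_task` from the sketches above; the random initialization and fixed iteration budget are our own choices):

```python
import numpy as np

def solve_jopt(Xs, Ys, k, gamma, n_iter=50):
    """Alternating optimization for Eq. (3): Xs[t] is d x n_t, Ys[t] is c_t x n_t."""
    rng = np.random.default_rng(0)
    d = Xs[0].shape[0]
    cs = [Y.shape[0] for Y in Ys]
    r = min(d, sum(cs)) - k                      # target rank, r = m - k
    W = 0.01 * rng.standard_normal((d, sum(cs)))
    offsets = np.cumsum([0] + cs)
    for _ in range(n_iter):
        F, G = top_r_factors(W, r)               # F, G step, Eq. (4)
        D = reweight_D(W)                        # reweighting step, Eq. (8)
        Ws, bs = [], []
        for t, (Xt, Yt) in enumerate(zip(Xs, Ys)):
            Gt = G[offsets[t]:offsets[t + 1], :] # rows of G for task t
            Wt, bt = update_task(Xt, Yt, F, Gt, D, gamma)
            Ws.append(Wt)
            bs.append(bt)
        W = np.hstack(Ws)                        # W = [W_1, ..., W_T]
    return Ws, bs
```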

2.3 Algorithm Analysis

Algorithm 1 monotonically decreases the objective of the problem in Eq. (1) in each iteration. To prove this, we need the following lemma:

Lemma 1

For any positive definite matrices \(A,A_t \in \mathbb {R}^{m\times m}\), the following inequality holds when \(0 < p \le 2\):

$$\begin{aligned} Tr(A^\frac{p}{2})-\frac{p}{2}Tr(AA_t^\frac{p-2}{2}) \le Tr(A_t^\frac{p}{2})-\frac{p}{2}Tr(A_tA_t^\frac{p-2}{2}). \end{aligned}$$
(12)

Lemma 1 is proved in [6]. Based on this lemma, we have the following theorem:

Theorem 1

Algorithm 1 monotonically decreases the objective of the problem in Eq. (3) in each iteration until convergence.

Proof. In each iteration, we first fix W and compute \(\tilde{F}\) and \(\tilde{G}\). Since \(\tilde{F}\) and \(\tilde{G}\) are optimal for problem (4), we know:

$$\begin{aligned} - \gamma Tr({\tilde{F}^T}W\tilde{G}) \le - \gamma Tr({{F}^T}W{G}). \end{aligned}$$
(13)

When \(\tilde{F}\) and \(\tilde{G}\) are fixed, the problem becomes Eq. (7). Denoting its solution in the current iteration by \(\tilde{W}\), we have:

$$\begin{aligned} \mathop \sum \limits _{t = 1}^T {g({\tilde{W}_t})} + \frac{\gamma }{2} Tr({\tilde{W}}\tilde{W}^T(WW^T)^{-\frac{1}{2}}) \le \sum \limits _{t = 1}^T {g({W_t})} + \frac{\gamma }{2} Tr({W}W^T(WW^T)^{-\frac{1}{2}}). \end{aligned}$$
(14)

On the other hand, applying Lemma 1 with \(p=1\), \(A=\tilde{W}\tilde{W}^T\), and \(A_t=WW^T\), we have:

$$\begin{aligned} Tr(({\tilde{W}}\tilde{W}^T)^\frac{1}{2})-\frac{1}{2}Tr({\tilde{W}}\tilde{W}^T(WW^T)^{-\frac{1}{2}}) \le Tr((WW^T)^\frac{1}{2})-\frac{1}{2}Tr((WW^T)(WW^T)^{-\frac{1}{2}}). \end{aligned}$$
(15)

Combining (13), (14), and (15), we arrive at:

$$\begin{aligned} \sum \limits _{t = 1}^T {f(\tilde{W}_t^T{X_t},{Y_t})} + \gamma {\left\| \tilde{W} \right\| _*} - \gamma Tr({\tilde{F}^T}\tilde{W}\tilde{G}) \le \sum \limits _{t = 1}^T {f(W_t^T{X_t},{Y_t})} + \gamma {\left\| W \right\| _*} - \gamma Tr({{F}^T}W{G}). \end{aligned}$$
(16)

Thus Algorithm 1 does not increase the objective function in (3) in any iteration. Note that the equalities in the above inequalities hold only when the algorithm converges. Therefore, Algorithm 1 monotonically decreases the objective value in each iteration until convergence.

Because we alternately solve for F, G, and W, Algorithm 1 converges to a local optimum of problem (3), which is equivalent to the proposed objective function in Eq. (1).

3 Experimental Results and Discussions

3.1 Data Set Description

Data used in this paper were obtained from the ADNI database (adni.loni.usc.edu). One goal of ADNI has been to test whether serial MRI, PET, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early AD. For up-to-date information, we refer interested readers to www.adni-info.org.

The data processing steps are as follows. Each T1-weighted MR image was first anterior commissure (AC)-posterior commissure (PC) corrected using MIPAV, intensity inhomogeneity corrected using the N3 algorithm [9], skull stripped [15] with manual editing, and cerebellum removed [14]. We then used FAST [17] in the FSL package to segment each image into gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF), and further used HAMMER [8] to register the images to a common space. GM volumes obtained from 93 regions of interest (ROIs) defined in [5], normalized by the total intracranial volume, were extracted as features. Nine cognitive scores from five independent cognitive assessments were downloaded, including three scores from the RAVLT cognitive assessment, two scores from the Fluency cognitive assessment (FLU), and two scores from the Trail Making Test (TRAIL). A total of 525 subjects are involved in our study, including 78 AD, 260 MCI, and 187 healthy control (HC) participants.

3.2 Improved Cognitive Status Prediction for Individual Assessment Tests

First, we apply the proposed method to the ADNI cohort and separately predict each of the following three sets of cognitive scores: RAVLT, TRAILS, and FLUENCY. The morphometric variables are \(\{x_i\}_{i=1}^n \subset \mathbb {R}^d\), with \(d=93\) in this experiment.

Table 1. Prediction performance measured by RMSE (mean ± std)

We compare the proposed multi-task learning method with three closely related methods for cognitive performance prediction: multivariate regression (MRV), multi-task learning with \(\ell _{2,1}\)-norm regularization (\(\ell _{2,1}\)) [11], and multi-task learning with trace-norm regularization (LS_TRACE) [1]. For each test case, we use 5-fold cross-validation, and the prediction performance is assessed by the root mean squared error (RMSE). All experimental results are reported in Table 1. The proposed method consistently outperforms the other methods in nearly all test cases for all cognitive tasks.
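
A sketch of this evaluation protocol (assuming scikit-learn for the fold splits; `fit` and `predict` stand in for whichever model is being compared, and the fold seeding is our own choice):

```python
import numpy as np
from sklearn.model_selection import KFold

def rmse(y_true, y_pred):
    """Root mean squared error, the metric reported in Tables 1 and 2."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def cv_rmse(X, Y, fit, predict, n_splits=5, seed=0):
    """5-fold cross-validated RMSE; X is d x n, Y is c x n (subjects in
    columns).  Returns the mean and std across folds."""
    scores = []
    for tr, te in KFold(n_splits, shuffle=True, random_state=seed).split(X.T):
        model = fit(X[:, tr], Y[:, tr])
        scores.append(rmse(Y[:, te], predict(model, X[:, te])))
    return np.mean(scores), np.std(scores)
```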

The heat maps of the parameter weights are shown in Fig. 1. Visualizing the parameter weights helps us locate the features that play important roles in the corresponding cognitive prediction tasks. In this way, there is much potential to identify the relevant imaging predictors and explain the effects of morphometric changes in relation to cognitive performance. In the heat map, different coefficient values are represented by different colors; values at the blue and red extremes of the color scale indicate a strong effect of the corresponding features on cognitive score prediction.

Fig. 1. Heat map of corresponding features for cognitive score prediction.

Table 2. Prediction performance measured by RMSE (mean ± std) for joint assessment tests.

3.3 Improved Cognitive Performance Prediction for Joint Assessment Tests

To further evaluate the multi-task joint analysis power, we apply the proposed method to predict all of the cognitive scores (RAVLT, TRAILS, FLUENCY) jointly. Such experiments demonstrate how the interrelations among cognitive assessment tests can be exploited to enhance prediction performance.

Similar to the previous experiment, we compare our method with the same three related models. For each test case, we use 5-fold cross-validation to evaluate the average performance of each algorithm. The prediction results are evaluated by RMSE and reported in Table 2. In all prediction cases, our method outperforms the other methods.

4 Conclusion

In this paper, we proposed a new multi-task learning model that minimizes the k smallest singular values of the projection matrix to predict cognitive scores for complex brain disorders. The proposed low-rank regularization is a tighter approximation of the rank minimization problem than the standard trace norm regularization, so our new multi-task learning method can uncover the shared common subspace more effectively. As a result, cognitive score prediction is enhanced by the learned hidden structures among tasks and features. We also introduced an efficient optimization algorithm to solve the proposed objective function, together with a rigorous theoretical analysis. Our experiments, conducted on the MRI and multiple cognitive score data of the ADNI cohort, yielded promising results: (1) the prediction performance of the proposed multi-task learning model is better than that of all related methods in all cases; (2) our method can predict multiple cognitive scores at the same time and has the potential to play an important role in determining cognitive functions and characterizing AD progression.