Keywords

1 Introduction

Usually, the first step of voxel-based neuroimage analysis requires preprocessing the T\(_{1}\)-weighted image, such as segmentation and registration of grey matter (GM), white matter (WM) and cerebral spinal fluid (CSF). However, some systematic biases due to scanner difference and different population etc., can be introduced in this pipeline [2]. Part of them can be helpful to the discrimination of subjects from normal controls (NC), but may not be directly related to the disease. For example in structural Magnetic Resonance Imaging (sMRI) images of subjects with Alzheimer’s Disease (AD), after spatial normalization during simultaneous registration of GM, WM and CSF, the GM voxels surrounding lateral ventricle and subarachnoid space etc. may be mistakenly enlarged caused by the enlargement of CSF space in those locations [2] compared to normal template, as shown in Fig. 1. Although these voxels/features are highly correlated with disease, they can’t be regarded as lesion features in an interpretable model. In this paper we refer to them as “Procedural Bias”, which should be identified but is neglected in the literature. We observe that it can be harnessed in our voxel-based image analysis to improve the prediction of disease.

Fig. 1.
figure 1

The overlapped voxels among top 150 negative value voxels in each fold of \(\beta _{pre}\) at the time corresponding to the best average prediction result in the path of GSplit LBI using 10-fold cross-validation. For subjects with AD, they represent enlarged GM voxels surrounding lateral ventricle, subarachnoid space, edge of gyrus, etc.

Together with procedural bias, the lesion features are vital for prediction and lesion regions analysis tasks, which are commonly solved by two types of regularization models. Specifically, one kind of models such as general losses with \(l_{2}\) penalty, elastic net [13] and graphnet [5] select strongly correlated features to minimize classification error. However, such models don’t differentiate features either introduced by disease or procedural bias and may also introduce redundant features. Hence, the interpretability of such models are poor and the models are prone to over-fit. The other kind of models with sparsity enforcement such as TV-\(L_{1}\) (Combination of Total Variation [9] and \(L_{1}\)) and particularly \(n^{2}\) GFL [12] enforce strong prior of disease on the parameters of the models introduced in order to capture the lesion features. Although such features are disease-relevant and the selection is stable, the models ignore the inevitable procedural bias, hence, they are losing some prediction power.

To incorporate both tasks of prediction and selection of lesion features, we propose an iterative dual-task algorithm namely Generalized Split LBI (GSplit LBI) which can have better model selection consistency than generalized lasso [11]. Specifically, by the introduction of variable splitting term inspired by Split LBI [6], two estimators are introduced and split apart. One estimator is for prediction and the other is for selecting lesion features, both of which can be pursued separately with a gap control. Following an iterative scheme, they will be mutually monitored by each other: the estimator for selecting lesion features is gradually monitored to pursue stable lesion features; on the other hand, the estimator for prediction is also monitored to exploit both the procedural bias and lesion features to improve prediction. To show the validity of the proposed method, we successfully apply our model to voxel-based sMRI analysis for AD, which is challenging and attracts increasing attention.

2 Method

2.1 GSplit LBI Algorithm

Our dataset consists of N samples \(\{x_{i},y_{i}\}_{1}^{N}\) where \(x_{i} \in \mathbb {R}^{p}\) collects the \(i^{th}\) neuroimaging data with p voxels and \(y_{i} = \{\pm 1\}\) indicates the disease status (\(-1\) for Alzheimer’s disease in this paper). \(X \in \mathbb {R}^{N \times p}\) and \(y \in \mathbb {R}^{p}\) are concatenations of \(\{x_{i}\}_{i}\) and \(\{y_{i}\}_{i}\). Consider a general linear model to predict the disease status (with the intercept parameter \(\beta _{0} \in \mathbb {R}\)),

$$\begin{aligned} \log P( y_i=1| x_i) - \log P(y_i=-1|x_i) = x_i^T \beta _{pre} + \beta _{0}. \end{aligned}$$
(2.1)

A desired estimator \(\beta _{pre} \in \mathbb {R}^p\) should not only fit the data by maximizing the log-likelihood in logistic regression, but also satisfy the following types of structural sparsity: (1) the number of voxels involved in the disease prediction is small, so \(\beta _{pre}\) is sparse; (2) the voxel activities should be geometrically clustered or 3D-smooth, suggesting a TV-type sparsity on \(D_G \beta _{pre}\) where \(D_{G}\) is a graph difference operatorFootnote 1; (3) the degenerate GM voxels in AD are captured by nonnegative component in \(\beta _{pre}\). However, the existing procedural bias may violate these a priori sparsity properties, esp. the third one, yet increase the prediction power.

To overcome this issue, we adopt a variable splitting idea in [6] by introducing an auxiliary variable \(\gamma \in \mathbb {R}^{|V|+|E|}\) to achieve these sparsity requirements separately, while controlling the gap from \(D\beta _{pre}\) with penalty \(S_{\rho }(\beta _{pre},\gamma ) := \Vert D\beta _{pre} - \gamma \Vert _{2}^{2} := \Vert \beta _{pre} - \gamma _{V} \Vert _{2}^{2} + \Vert \rho D_{G}\beta _{pre} - \gamma _{G} \Vert _{2}^{2}\) with \(\displaystyle \gamma = \left[ \begin{array}{cc} \gamma _{V}^T&\gamma _{G}^T \end{array} \right] ^T\) and \(\displaystyle D = \left[ \begin{array}{cc} I&\rho D_{G}^T\end{array} \right] ^T\). Here \(\rho \) controls the trade-off between different types of sparsity. Our purpose is thus of two-folds: (1) use \(\beta _{pre}\) for prediction; (2) enforce sparsity on \(\gamma \). Such a dual-task scheme can be illustrated by Fig. 2.

Fig. 2.
figure 2

Illustration of GSplit LBI. The gap between \(\beta _{pre}\) for fitting data and \(\gamma \) for sparsity is controlled by \(S_{\rho }(\beta _{pre},\gamma )\). The estimate \(\beta _{les}\), as a projection of \(\beta _{pre}\) on support set of \(\gamma \), can be used for stable lesion features analysis when \(\nu \rightarrow 0\) (Sect. 3.2). When \(\nu \nrightarrow 0\) (Sect. 3.1) with appropriately large value, \(\beta _{pre}\) can be used for prediction by capturing both lesion features and procedural bias.

To implement it, we generalize the Split Linearized Bregman Iteration (Split LBI) algorithm in [6] to our setting with generalized linear models (GLM) and the three types of structural sparsity above, hence called Generalized Split LBI (or GSplit LBI). Algorithm 1 describes the procedure with a new loss:

$$\begin{aligned} \ell (\beta _{0},\beta _{pre},\gamma ;\{x_{i},y_{i}\}_{1}^{N},\nu ) := \ell (\beta _{0},\beta _{pre};\{x_{i},y_{i}\}_{1}^{N}) + \frac{1}{2\nu } S_\rho (\beta _{pre},\gamma ), \end{aligned}$$
(2.2)

where \(\ell (\beta _{pre};\{x_{i},y_{i}\}_{1}^{N})\) is the negative log-likelihood function for GLM and \(\nu >0\) tunes the strength of gap control. The algorithm returns a sequence of estimates as a regularization path, \(\{\beta _0^k, \beta _{pre}^{k},\gamma ^{k},\beta _{les}^{k}\}_{k\ge 0}\). In particular, \(\gamma ^k\) shows a variety of sparsity levels and \(\beta _{pre}^k\) is generically dense with different prediction powers. The projection of \(\beta _{pre}^k\) onto the subspace with the same support of \(\gamma ^k\) gives estimate \(\beta _{les}^k\), satisfying those a priori sparsity properties (sparse, 3D-smooth, nonnegative) and hence being regarded as the interpretable lesion features for AD. The remainder of this projection is heavily influenced by procedural bias; in this paper the non-zero elements in \(\beta _{pre}^k\) which are negative (−1 denotes disease label) with comparably large magnitude are identified as procedural bias, while others with tiny values can be treated as nuisance or weak features. In summary, \(\beta _{les}\) only selects lesion features; while \(\beta _{pre}\) also captures additional procedural bias. Hence, such two kinds of features can be differentiated, as illustrated in Fig. 2.

figure a

2.2 Setting the Parameters

A stopping time at \(t^{k}\) (line 10) is the regularization parameter, which can be determined via cross-validation to minimize the prediction error [7]. Parameter \(\rho \) is a tradeoff between geometric clustering and voxel sparsity. Parameter \(\kappa \), \(\alpha \) is damping factor and step size, which should satisfy \(\kappa \alpha \le \nu / \kappa (1 + \nu \Lambda _{H} + \Lambda _{D}^{2})\) to ensure the stability of iterations. Here \(\Lambda _{(\cdot )}\) denotes the largest singular value of a matrix and H denotes the Hessian matrix of \(\ell (\beta _{0},\beta _{pre};\{x_{i},y_{i}\}_{1}^{N})\).

Parameter \(\nu \) balances the prediction task and sparsity enforcement in feature selection. In this paper, it is task-dependent, as shown in Fig. 2. For prediction of disease, \(\beta _{pre}\) with appropriately larger value of \(\nu \) may increase the prediction power by harnessing both lesion features and procedural bias. For lesion features analysis, \(\beta _{les}\) with a small value of \(\nu \) is helpful to enhance stability of feature selection. For details please refer to supplementary information.

3 Experimental Results

We apply our model to AD/NC classification (namely ADNC) and MCI (Mild Cognitive Impairment)/NC (namely MCINC) classification, which are two fundamental challenges in diagnosis of AD. The data are obtained from ADNIFootnote 2 database, which is split into 1.5 T and 3.0T (namely 15 and 30) MRI scan magnetic field strength datasets. The 15 dataset contains 64 AD, 208 MCI and 90 NC; while the 30 dataset contains 66 AD and 110 NC. DARTEL VBM pipeline [1] is then implemented to preprocess the data. Finally, the input features consist of 2,527 8 \(\times \) 8 \(\times \) 8 mm\(^{3}\) size voxels with average values in GM population template greater than 0.1. Experiments are designed on 15ADNC, 30ADNC and 15MCINC tasks.

3.1 Prediction and Path Analysis

10-fold cross-validation is adopted for classification evaluation. Under exactly the same experimental setup, comparison is made between GSplit LBI and other classifiers: SVM, MLDA (univariate model via t-test + LDA) [3], Graphnet [5], Lasso [10], Elastic Net, TV+L\(_{1}\) and \(n^{2}\)GFL. For each model, optimal parameters are determined by grid-search. For GSplit LBI, \(\rho \) is chosen from \(\{1,2,...,10\}\), \(\kappa \) is set to 10; \(\alpha = \nu / \kappa (1 + \nu \Lambda _{X}^{2} + \Lambda _{D}^{2})\) Footnote 3; specifically, \(\nu \) is set to 0.2 (corresponding to \(\nu \nrightarrow 0\) in Fig. 2)Footnote 4. The regularization coefficient \(\lambda \) is ranged in \(\{0,0.05, 0.1,...,0.95,1,10,10^{2}\}\) for lassoFootnote 5 and \(2^{\{-20,-19,...,0,...,20\}}\) for SVM. For other models, parameters are optimized from \(\lambda : \{0.05,0.1,...,0.95,1,10,10^{2}\}\) and \(\rho : \{0.5,1,..,10\}\) (in addition, the mixture parameter \(\alpha \): \(\{0,0.05,...,0.95\}\) for Elastic Net).

Table 1. Comparison of GSplit LBI with other models

The best accuracy in the path of GSplit LBI and counterpart are reported. Table 1 shows that \(\beta _{pre}\) of our model outperforms that of others in all cases. Note that although our accuracies may not be superior to models with multi-modality data [8], they are the state-of-the-art results for only sMRI modality.

Fig. 3.
figure 3

Left image: Accuracy of \((\beta _{pre},\beta _{les})\) vs \(log t\) (t: regularization parameter). Right image: Six 2-d brain slice images of selected degenerative voxels of \(\beta _{les}\) and \(\beta _{pre}\) are sorted orderly at \(\{t_{1},...t_{6}\}\). As t grows, \(\beta _{pre}\) and \(\beta _{les}\) identify similar lesion features.

The process of feature selection combined with prediction accuracy can be analyzed together along the path. The result of 30ADNC is used as an illustration in Fig. 3. We can see that \(\beta _{pre}\) (blue curve) outperforms \(\beta _{les}\) (red curve) in the whole path for additional procedural bias captured by \(\beta _{pre}\). Specifically, at \(\beta _{pre}\)’s highest accuracy (\(t_{5}\)), there is a more than \(8\%\) increase in prediction accuracy by \(\beta _{pre}\). Early stopping regularization at \(t_5\) is desired, as \(\beta _{pre}\) converges to \(\beta _{les}\) in prediction accuracy with overfitting when t grows. Recall that positive (negative) features represent degenerate (enlarged) voxels. In each fold of \(\beta _{pre}\) at \(t_5\), the commonly selected voxels among top 150 negative (enlargement) voxels are identified as procedural bias shown in Fig. 1, where most of these GM voxels are enlarged and located near lateral ventricle or subarachnoid space etc., possibly due to enlargement of CSF space in those locations that are different from the lesion features.

3.2 Lesion Features Analysis

To quantitatively evaluate the stability of selected lesion features, multi-set Dice Coefficient (mDC)Footnote 6 [4, 12] is applied as a measurement. The 30ADNC task is again applied as an example, the mDC is computed for \(\beta _{les}\) which achieves highest accuracy by 10-fold cross-validation. As shown from Table 2, when \(\nu = 0.0002\) (corresponding to \(\nu \rightarrow 0\) in Fig. 2), the \(\beta _{les}\) of our model can obtain more stable lesion feature selection results than other models with comparable prediction power. Besides, the average number of selected features (line 3 in Table 2) are also recorded. Note that although elastic net is of slightly higher accuracy than \(\beta _{les}\), it selects much more features than necessary.

Table 2. mDC comparison between GSplit LBI and other models

For the meaningfulness of selected lesion features, they are shown in Fig. 4(a)–(c), located in hippocampus, parahippocampal gyrus and medial temporal lobe etc., which are believed to be early damaged regions for AD patients.

Fig. 4.
figure 4

(a)–(c): Stability of selected lesion features of \(\beta _{les}\) shown in 2-d 110 slice brain images when \(\nu = 0.0002\). (a)–(b): Results of fold 2 and fold 10. (c): The overlapped features in 10 folds. (d): The 2-d slice brain image of selected voxels with \(2\times 2\times 2\) mm\(^{3}\) using coarse-to-fine approach.

To further investigate the locus of lesion features, we conduct a coarse-to-fine experiment. Specifically, we project the selected overlapped voxels of \(8 \times 8 \times 8\) mm\(^{3}\) size (shown in Fig. 4(c)) onto MRI image with more finer scale voxels, i.e. in size of \(2 \times 2 \times 2\) mm\(^{3}\). Totally 4,895 voxels are served as input features after projection. Again, the GSplit LBI is implemented using 10-fold cross-validation. The prediction accuracy of \(\beta _{pre}\) is \(90.34\%\) and on average 446.6 voxels are selected by \(\beta _{les}\). As desired, these voxels belong to parts of lesion regions, such as those located in hippocampal tail, as shown in Fig. 4(d).

4 Conclusions

In this paper, a novel iterative dual task algorithm is proposed to incorporate both disease prediction and lesion feature selection in neuroimage analysis. With variable splitting term, the estimators for prediction and selecting lesion features can be separately pursued and mutually monitored under a gap control. The gap here is dominated by procedural bias, some specific features crucial for prediction yet ignored in a priori disease knowledge. With experimental studies conducted on 15ADNC, 30ADNC and 15MCINC tasks, we have shown that the leverage of procedural bias can lead to significant improvements in both prediction and model interpretability. In future works, we shall extend our model to other neuroimaging applications including multi-modality data.