Abstract
Neuroimaging data often take the form of high-dimensional arrays, also known as tensors. Addressing scientific questions arising from such data demands new regression models that take multidimensional arrays as covariates. Simply turning an image array into a vector would both cause extremely high dimensionality and destroy the inherent spatial structure of the array. In a recent work, Zhou et al. (J Am Stat Assoc, 108(502):540–552, 2013) proposed a family of generalized linear tensor regression models based upon the CP (CANDECOMP/PARAFAC) decomposition of the regression coefficient array. Low-rank approximation brings the ultrahigh dimensionality down to a manageable level and leads to efficient estimation. In this article, we propose a tensor regression model based on the more flexible Tucker decomposition. Compared to the CP model, the Tucker regression model allows a different number of factors along each mode. Such flexibility leads to several advantages that are particularly suited to neuroimaging analysis, including further reduction of the number of free parameters, accommodation of images with skewed dimensions, explicit modeling of interactions, and a principled way of image downsizing. We also compare the Tucker model with CP numerically on both simulated data and real magnetic resonance imaging data, and demonstrate its effectiveness in finite samples.
References
ADHD (2017) The ADHD-200 sample. http://fcon_1000.projects.nitrc.org/indi/adhd200/. Accessed Mar 2017
ADNI (2017) Alzheimer’s disease neuroimaging initiative. http://adni.loni.ucla.edu. Accessed Mar 2017
Caffo B, Crainiceanu C, Verduzco G, Joel S, Mostofsky SH, Bassett S, Pekar J (2010) Two-stage decompositions for the analysis of functional connectivity for fMRI with application to Alzheimer’s disease risk. Neuroimage 51(3):1140–1149
Chen SS, Donoho DL, Saunders MA (2001) Atomic decomposition by basis pursuit. SIAM Rev 43(1):129–159
de Leeuw J (1994) Block-relaxation algorithms in statistics. In: Information systems and data analysis. Springer, Berlin, pp 308–325
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360
Frank IE, Friedman JH (1993) A statistical view of some chemometrics regression tools. Technometrics 35(2):109–135
Friston K, Ashburner J, Kiebel S, Nichols T, Penny W (eds) (2007) Statistical parametric mapping: the analysis of functional brain images. Academic Press, London
Goldsmith J, Huang L, Crainiceanu C (2014) Smooth scalar-on-image regression via spatial bayesian variable selection. J Comput Graph Stat 23:46–64
Kolda TG, Bader BW (2009) Tensor decompositions and applications. SIAM Rev 51(3):455–500
Lange K (2004) Optimization. Springer texts in statistics. Springer, New York
Lange K (2010) Numerical analysis for statisticians. Statistics and computing, second edn. Springer, New York
Lehmann EL, Romano JP (2005) Testing statistical hypotheses. Springer texts in statistics, third edn. Springer, New York
Li Y, Zhu H, Shen D, Lin W, Gilmore JH, Ibrahim JG (2011) Multiscale adaptive regression models for neuroimaging data. J R Stat Soc 73:559–578
Li F, Zhang T, Wang Q, Gonzalez M, Maresh E, Coan J (2015) Spatial Bayesian variable selection and grouping in high-dimensional scalar-on-image regressions. Ann Appl Stat (in press)
McCullagh P, Nelder JA (1983) Generalized linear models. Monographs on statistics and applied probability. Chapman & Hall, London
Reiss P, Ogden R (2010) Functional generalized linear models with images as predictors. Biometrics 66:61–69
Rothenberg TJ (1971) Identification in parametric models. Econometrica 39(3):577–591
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc 58(1):267–288
Tucker LR (1966) Some mathematical notes on three-mode factor analysis. Psychometrika 31:279–311
van der Vaart AW (1998) Asymptotic statistics, volume 3 of Cambridge series in statistical and probabilistic mathematics. Cambridge University Press, Cambridge
Wang X, Nan B, Zhu J, Koeppe R (2014) Regularized 3D functional regression for brain image data via haar wavelets. Ann Appl Stat 8:1045–1064
Yue Y, Loh JM, Lindquist MA (2010) Adaptive spatial smoothing of fMRI images. Stat Interface 3:3–14
Zhou H, Li L (2014) Regularized matrix regression. J R Stat Soc 76:463–483
Zhou H, Li L, Zhu H (2013) Tensor regression with applications in neuroimaging data analysis. J Am Stat Assoc 108(502):540–552
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc 67(2):301–320
Appendix
1.1 Proof of Lemma 1
We rewrite the array inner product
where the second and fourth equalities follow from (4) and the third follows from the invariance of the trace function under cyclic permutation.
1.2 Proof of Proposition 1
It is easy to see that the block relaxation algorithm monotonically increases the objective values, i.e., \(\ell (\varvec{\theta }^{(t+1)}) \ge \ell (\varvec{\theta }^{(t)})\) for all \(t \ge 0\). Therefore its global convergence property follows from the standard theory for monotone algorithms [5, 11, 12]. Specifically, global convergence is guaranteed under the following conditions: (i) \(\ell \) is coercive, (ii) the stationary points of \(\ell \) are isolated, (iii) the algorithmic mapping is continuous, (iv) \(\varvec{\theta }\) is a fixed point of the algorithm if and only if it is a stationary point of \(\ell \), and (v) \(\ell (\varvec{\theta }^{(t+1)}) \ge \ell (\varvec{\theta }^{(t)})\) with equality if and only if \(\varvec{\theta }^{(t)}\) is a fixed point of the algorithm. Condition (i) is guaranteed by the compactness of the set \(\{\varvec{\theta }: \ell (\varvec{\theta }) \ge \ell (\varvec{\theta }^{(0)})\}\). Condition (ii) is assumed. Condition (iii) follows from the strict concavity assumption and the implicit function theorem. By Fermat's principle, \(\varvec{\theta }= ({\varvec{G}}, {\varvec{B}}_1, \ldots , {\varvec{B}}_D)\) is a fixed point of the block relaxation algorithm if and only if \(D\ell ({\varvec{G}}) = \mathbf{0}\) and \(D\ell ({\varvec{B}}_d)=\mathbf{0}\) for all d. Thus \(\varvec{\theta }\) is a fixed point if and only if it is a stationary point of \(\ell \), i.e., condition (iv) is satisfied. Condition (v) follows from the monotonicity of the block relaxation algorithm. Local convergence follows from the classical Ostrowski theorem, which states that the algorithmic sequence \(\varvec{\theta }^{(t)}\) is locally attracted to a strict local maximum \(\varvec{\theta }^{(\infty )}\) if the spectral radius of the differential of the algorithmic map, \(\rho [dM(\varvec{\theta }^{(\infty )})]\), is strictly less than one. This follows from the strict concavity assumption on the block updates. See Zhou et al. [25] for more details.
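The monotone-ascent property in condition (v) is easy to see concretely. Below is a minimal numeric sketch, not the authors' implementation: a rank-1 matrix (two-mode tensor) regression fit by block relaxation, where each block update is an exact least-squares solve because the model is linear in each factor with the other held fixed, so the loss can never increase. All names and dimensions are illustrative.

```python
import numpy as np

# Toy model: y_i = <u v^T, X_i> + noise. Holding v fixed, the model is
# linear in u (and vice versa), so each block update is an ordinary
# least-squares solve and the squared-error loss decreases monotonically.
rng = np.random.default_rng(1)
n, p1, p2 = 200, 6, 7
Xs = rng.standard_normal((n, p1, p2))
u_true, v_true = rng.standard_normal(p1), rng.standard_normal(p2)
y = np.einsum("nij,i,j->n", Xs, u_true, v_true) + 0.1 * rng.standard_normal(n)

def loss(u, v):
    return np.mean((y - np.einsum("nij,i,j->n", Xs, u, v)) ** 2)

u, v = rng.standard_normal(p1), rng.standard_normal(p2)
losses = [loss(u, v)]
for _ in range(20):
    A = np.einsum("nij,j->ni", Xs, v)          # design matrix for the u-block
    u = np.linalg.lstsq(A, y, rcond=None)[0]   # exact block update for u
    A = np.einsum("nij,i->nj", Xs, u)          # design matrix for the v-block
    v = np.linalg.lstsq(A, y, rcond=None)[0]   # exact block update for v
    losses.append(loss(u, v))

# Monotone descent of the loss (i.e., ascent of the Gaussian log-likelihood),
# exactly as condition (v) requires.
assert all(l1 <= l0 + 1e-12 for l0, l1 in zip(losses, losses[1:]))
```

Maximizing the Gaussian log-likelihood is equivalent to minimizing this loss, so descent here matches the ascent of \(\ell\) in the proof.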
1.3 Proof of Lemma 2
Assume \({\varvec{B}}\) admits the Tucker decomposition (3). By (4),
Using the well-known fact that \(\mathrm {vec}({\varvec{X}}{\varvec{Y}}{\varvec{Z}}) = ({\varvec{Z}}^{\tiny {\text{ T }}}\otimes {\varvec{X}}) \mathrm {vec}({\varvec{Y}})\),
Thus by the chain rule we have
Again by the chain rule, \(D\eta ({\varvec{B}}_d) = D\eta ({\varvec{B}}) \cdot D{\varvec{B}}({\varvec{B}}_d) = (\mathrm {vec}{\varvec{X}}) ^{\tiny {\text{ T }}}{\varvec{J}}_d\). For the derivative in \({\varvec{G}}\), the duality Lemma 1 implies \(\langle {\varvec{B}}, {\varvec{X}}\rangle = \langle {\varvec{G}}, \tilde{\varvec{X}}\rangle \) for \(\tilde{\varvec{X}}= \llbracket {\varvec{X}}; {\varvec{B}}_1 ^{\tiny {\text{ T }}}, \ldots , {\varvec{B}}_D ^{\tiny {\text{ T }}}\rrbracket \). Then, by (4), we have
Combining these results gives the gradient displayed in Lemma 2.
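The two identities the derivation rests on can both be checked numerically. The sketch below, with arbitrary illustrative shapes, verifies (a) the vec–Kronecker identity \(\mathrm{vec}({\varvec{X}}{\varvec{Y}}{\varvec{Z}}) = ({\varvec{Z}}^{\text{T}} \otimes {\varvec{X}})\,\mathrm{vec}({\varvec{Y}})\) with column-major vectorization, and (b) the duality of Lemma 1, \(\langle {\varvec{B}}, {\varvec{X}}\rangle = \langle {\varvec{G}}, \tilde{\varvec{X}}\rangle\), for a three-way Tucker tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
vec = lambda A: A.flatten(order="F")  # column-major vec, matching the identity

# (a) vec(X Y Z) = (Z^T kron X) vec(Y)
X, Y, Z = (rng.standard_normal(s) for s in [(3, 4), (4, 5), (5, 2)])
assert np.allclose(vec(X @ Y @ Z), np.kron(Z.T, X) @ vec(Y))

# (b) Duality <B, X> = <G, X~> with B = [[G; B1, B2, B3]] and
#     X~ = [[X; B1^T, B2^T, B3^T]] (D = 3, illustrative dimensions).
p, R = (4, 5, 6), (2, 3, 2)
G = rng.standard_normal(R)
B1, B2, B3 = (rng.standard_normal((p[d], R[d])) for d in range(3))
T = rng.standard_normal(p)                               # the covariate array
B = np.einsum("abc,ia,jb,kc->ijk", G, B1, B2, B3)        # Tucker coefficient tensor
T_tilde = np.einsum("ijk,ia,jb,kc->abc", T, B1, B2, B3)  # X multiplied by B_d^T in each mode
assert np.isclose(np.sum(B * T), np.sum(G * T_tilde))
```

The `einsum` calls implement the Tucker multilinear multiplication directly, which keeps the mode bookkeeping explicit.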
Next we consider the Hessian \(d^2 \eta \). Because \({\varvec{B}}\) is linear in \({\varvec{G}}\), the block \({\varvec{H}}_{{\varvec{G}},{\varvec{G}}}\) vanishes. For the block \({\varvec{H}}_{{\varvec{B}},{\varvec{B}}}\), the \((i_d,r_d,i_{d'},r_{d'})\)-entry is
The second derivative in the summand is nonzero only if \(j_d=i_d\), \(j_{d'}=i_{d'}\), \(s_d=r_d\), \(s_{d'}=r_{d'}\), and \(d \ne d'\). Therefore
The first sum is over \(\prod _{d''\ne d,d'}p_{d''}\) terms and the second term is over \(\prod _{d''\ne d,d'} R_{d''}\) terms. A careful inspection reveals that the sub-block \({\varvec{H}}_{dd'}\) shares the same entries as the matrix
Finally, for the \({\varvec{H}}_{{\varvec{G}},{\varvec{B}}}\) block, the \(\{(r_1,\ldots ,r_D),(i_d,r_d)\}\)-entry is
where the sum is over \(\prod _{d' \ne d} p_{d'}\) terms. The sub-block \({\varvec{H}}_d \in \mathrm {I \! R} ^{\prod _d R_d \times p_dR_d} \) has at most \(p_d \prod _d R_d\) nonzero entries. A close inspection suggests that the nonzero entries coincide with those in the matrix
1.4 Proof of Proposition 2
Since \(\mu = b'(\theta )\), \(d\mu /d\theta = b''(\theta ) = \sigma ^2 /a(\phi )\) and
by Lemma 2. Further differentiating shows
It is easy to see that \(\mathbf {E}[\nabla \ell ({\varvec{G}}, {\varvec{B}}_1,\ldots ,{\varvec{B}}_D)] = \mathbf{0}\). Moreover, \(\mathbf {E}[-d^2\ell ({\varvec{G}}, {\varvec{B}}_1,\ldots ,{\varvec{B}}_D)] = {\varvec{I}}({\varvec{G}}, {\varvec{B}}_1,\ldots ,{\varvec{B}}_D)\). Then (8) follows.
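The two moment identities used here, \(\mathbf {E}[\nabla \ell ] = \mathbf{0}\) and \(\mathbf {E}[-d^2\ell ] = \mathbf {E}[\nabla \ell \, \nabla \ell ^{\text{T}}] = {\varvec{I}}\), can be checked exactly in the simplest special case: a logistic GLM with a fixed covariate vector (the tensor structure only changes what plays the role of the covariate, not these identities). The sketch below computes the expectation over \(y \in \{0,1\}\) in closed form; all numbers are illustrative.

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])     # fixed covariate vector (illustrative)
beta = np.array([0.3, 0.1, -0.7])  # illustrative parameter value
mu = 1.0 / (1.0 + np.exp(-x @ beta))  # P(y = 1 | x) under the canonical link

def score(y):
    return (y - mu) * x            # gradient of the log-likelihood at beta

# Exact expectations over y in {0, 1}.
E_score = (1 - mu) * score(0) + mu * score(1)
E_outer = (1 - mu) * np.outer(score(0), score(0)) + mu * np.outer(score(1), score(1))
neg_hessian = mu * (1 - mu) * np.outer(x, x)  # -d^2 l, which is free of y here

assert np.allclose(E_score, 0)        # E[score] = 0
assert np.allclose(E_outer, neg_hessian)  # information matrix equality
```

For the Tucker model the same computation goes through with `x` replaced by the gradient factors of Lemma 2.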
1.5 Proof of Proposition 3
The proof follows from a classical result [18] which states that, if \(\theta _0\) is a regular point of the information matrix \(I(\theta )\), then \(\theta _0\) is locally identifiable if and only if \(I(\theta _0)\) is nonsingular. The regularity assumptions are satisfied by the Tucker regression model: (1) the parameter space \(\mathcal {{\varvec{B}}}\) is open, (2) the density \(p(y,{\varvec{x}}|{\varvec{B}})\) is proper for all \({\varvec{B}}\in \mathcal {{\varvec{B}}}\), (3) the support of the density \(p(y,{\varvec{x}}|{\varvec{B}})\) is the same for all \({\varvec{B}}\in \mathcal {{\varvec{B}}}\), (4) the log-density \(\ell ({\varvec{B}}|y,{\varvec{x}}) = \ln p(y,{\varvec{x}}|{\varvec{B}})\) is continuously differentiable, and (5) the information matrix
is continuous in \({\varvec{B}}\) by Proposition 2. Therefore \({\varvec{B}}\in \mathcal {{\varvec{B}}}\) is locally identifiable if and only if \({\varvec{I}}({\varvec{B}})\) is nonsingular.
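The role of the nonsingularity condition can be illustrated numerically. In the unconstrained Tucker parameterization the map is invariant under \(({\varvec{B}}_1, {\varvec{G}}, {\varvec{B}}_2) \mapsto ({\varvec{B}}_1 {\varvec{O}}_1, {\varvec{O}}_1^{-1} {\varvec{G}} {\varvec{O}}_2^{-\text{T}}, {\varvec{B}}_2 {\varvec{O}}_2)\), so the Jacobian of the parameterization, and hence the information matrix, loses \(R_1^2 + R_2^2\) rank directions. The sketch below checks this for the matrix case \(D = 2\), \({\varvec{B}} = {\varvec{B}}_1 {\varvec{G}} {\varvec{B}}_2^{\text{T}}\), with arbitrary illustrative dimensions; it is an illustration of the indeterminacy, not part of the paper's proof.

```python
import numpy as np

rng = np.random.default_rng(2)
p1, p2, R1, R2 = 4, 5, 2, 2
B1 = rng.standard_normal((p1, R1))
G = rng.standard_normal((R1, R2))
B2 = rng.standard_normal((p2, R2))

# Exact Jacobian of vec(B1 G B2^T): the map is multilinear, so each column
# is the derivative in one coordinate direction, computed by the product rule.
cols = []
for i in range(p1):          # perturb B1 entrywise
    for r in range(R1):
        E = np.zeros((p1, R1)); E[i, r] = 1.0
        cols.append((E @ G @ B2.T).ravel())
for r in range(R1):          # perturb G entrywise
    for s in range(R2):
        E = np.zeros((R1, R2)); E[r, s] = 1.0
        cols.append((B1 @ E @ B2.T).ravel())
for j in range(p2):          # perturb B2 entrywise
    for s in range(R2):
        E = np.zeros((p2, R2)); E[j, s] = 1.0
        cols.append((B1 @ G @ E.T).ravel())

J = np.column_stack(cols)
n_params = p1 * R1 + R1 * R2 + p2 * R2        # 22 raw parameters
# Generic rank = raw parameters minus the R1^2 + R2^2 indeterminacy: 22 - 8 = 14.
assert np.linalg.matrix_rank(J) == n_params - R1**2 - R2**2
```

The deficit matches the dimension of the group \(\mathrm{GL}(R_1) \times \mathrm{GL}(R_2)\) acting on the factors, which is why identifiability requires restricting the parameterization.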
1.6 Proof of Theorem 1
The asymptotics for tensor regression follow from the standard theory of M-estimation. The key observation is that the nonlinear part of the tensor model (4) is a degree-\((D+1)\) polynomial in the parameters \({\varvec{G}}\) and \({\varvec{B}}_d\), and the collection of polynomials \(\{\langle {\varvec{B}}, {\varvec{X}}\rangle , {\varvec{B}}\in \mathcal {{\varvec{B}}}\}\) forms a Vapnik–Červonenkis (VC) class. Then the classical uniform convergence theory applies [21]. The arguments in [25] extend the classical argument for GLM [21, Example 5.40] to the CP tensor regression model. The same proof applies to the Tucker model with minor changes and is thus omitted here. For the asymptotic normality, we need to establish that the log-likelihood function of the Tucker regression model is quadratic mean differentiable (q.m.d.) [13]. By a well-known result [13, Theorem 12.2.2] or [21, Lemma 7.6], it suffices to verify that the density is continuously differentiable in the parameter for \(\mu \)-almost all x and that the Fisher information matrix exists and is continuous. The derivative of the density is
which is well defined and continuous by Proposition 2. The same proposition shows that the information matrix exists and is continuous. Therefore the Tucker regression model is q.m.d. and the asymptotic normality follows from the classical result for q.m.d. families [21, Theorem 5.39].
Li, X., Xu, D., Zhou, H. et al. Tucker Tensor Regression and Neuroimaging Analysis. Stat Biosci 10, 520–545 (2018). https://doi.org/10.1007/s12561-018-9215-6