Volume 17, Issue 4, pp 515–545

Handling Multiplicity in Neuroimaging Through Bayesian Lenses with Multilevel Modeling

  • Gang Chen
  • Yaqiong Xiao
  • Paul A. Taylor
  • Justin K. Rajendra
  • Tracy Riggins
  • Fengji Geng
  • Elizabeth Redcay
  • Robert W. Cox
Original Article


Here we address the current issues of inefficiency and over-penalization in the massively univariate approach followed by correction for multiple testing, and propose a more efficient model that pools and shares information among brain regions. Using Bayesian multilevel (BML) modeling, we control two types of error that are more relevant than the conventional false positive rate (FPR): incorrect sign (type S) and incorrect magnitude (type M). BML also aims to achieve two goals: 1) improving modeling efficiency by having one integrative model and thereby dissolving the multiple testing issue, and 2) turning the focus of conventional null hypothesis significance testing (NHST) on FPR into quality control by calibrating type S errors while maintaining a reasonable level of inference efficiency. The performance and validity of this approach are demonstrated through an application at the region of interest (ROI) level, with all the regions on an equal footing: unlike the current approaches under NHST, small regions are not disadvantaged simply because of their physical size. In addition, compared to the massively univariate approach, BML may simultaneously achieve increased spatial specificity and inference efficiency, and promote reporting results in full and with transparency. The benefits of BML are illustrated in performance and quality checking using an experimental dataset. The methodology also avoids the current practice of sharp and arbitrary thresholding in the p-value funnel to which the multidimensional data are reduced. The BML approach with its auxiliary tools is available as part of the AFNI suite for general use.
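The two error types named above can be made concrete with a small simulation. The following is a minimal sketch, not the BML procedure itself: it assumes a single effect whose estimate is normally distributed around a known true value, replicates the study many times, and examines only the replications that pass a conventional two-sided threshold, in the spirit of the design analysis of Gelman and Carlin (2014). The function name `type_s_m` and all parameter names are hypothetical and chosen for illustration only.

```python
import random
from statistics import NormalDist, fmean


def type_s_m(true_effect, std_error, alpha=0.05, n_sims=100_000, seed=0):
    """Estimate power, type S rate, and type M ratio by simulation.

    Assumption: the effect estimate is normal with mean `true_effect`
    (nonzero) and standard deviation `std_error`.
    """
    rng = random.Random(seed)
    # Two-sided significance threshold on the scale of the estimate.
    threshold = NormalDist().inv_cdf(1 - alpha / 2) * std_error
    # Hypothetical replications of the same study.
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_sims)]
    significant = [e for e in estimates if abs(e) > threshold]
    if not significant:
        raise ValueError("no replication reached significance; increase n_sims")
    power = len(significant) / n_sims
    # Type S: among significant estimates, the share with the wrong sign.
    type_s = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    # Type M: average exaggeration of magnitude among significant estimates.
    type_m = fmean(abs(e) for e in significant) / abs(true_effect)
    return power, type_s, type_m


if __name__ == "__main__":
    for effect, se in [(2.0, 8.1), (3.0, 1.0)]:
        power, s_rate, m_ratio = type_s_m(effect, se)
        print(f"effect={effect}, se={se}: power={power:.3f}, "
              f"type S={s_rate:.3f}, type M={m_ratio:.2f}")
```

When the true effect is small relative to its standard error, the significant replications frequently carry the wrong sign and overstate the magnitude several-fold; when the effect is large relative to its standard error, both errors largely vanish. This contrast is why controlling type S and type M errors can be more informative than controlling the FPR alone.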


Null Hypothesis Significance Testing (NHST); False Positive Rate (FPR); Type S and type M errors; Regions of Interest (ROIs); General Linear Model (GLM); Linear Mixed-Effects (LME) modeling; Bayesian Multilevel (BML) modeling; Markov Chain Monte Carlo (MCMC); Stan; Priors; Leave-one-out (LOO) cross-validation



The research and writing of the paper were supported (GC, PAT, and RWC) by the NIMH and NINDS Intramural Research Programs (ZICMH002888) of the NIH/HHS, USA, and by the NIH grant R01HD079518A to TR and ER. Much of the modeling work here was inspired by Andrew Gelman’s blog. We are indebted to Paul-Christian Bürkner and the Stan development team members Ben Goodrich, Daniel Simpson, Jonah Sol Gabry, Bob Carpenter, and Michael Betancourt for their help and technical support. The simulations were performed in the R language for statistical computing and the figures were generated with the R package ggplot2 (Wickham 2009).


  1. Amrhein, V., & Greenland, S. (2017). Remove, rather than redefine, statistical significance. Nature Human Behaviour, 1, 0224.
  2. Bates, D., Maechler, M., Bolker, B., Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1–48.
  3. Benjamin, D.J., Berger, J., Johannesson, M., Nosek, B.A., Wagenmakers, E.-J., Berk, R., Johnson, V.E. (2017). Redefine statistical significance. Nature Human Behaviour, 1, 0189.
  4. Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289–300.
  5. Carp, J. (2012). On the plurality of (Methodological) worlds: estimating the analytic flexibility of fMRI experiments. Frontiers in Neuroscience, 6, 149.
  6. Chen, G., Saad, Z.S., Nath, A.R., Beauchamp, M.S., Cox, R.W. (2012). FMRI group analysis combining effect estimates and their variances. NeuroImage, 60, 747–765.
  7. Chen, G., Saad, Z.S., Britton, J.C., Pine, D.S., Cox, R.W. (2013). Linear mixed-effects modeling approach to FMRI group analysis. NeuroImage, 73, 176–190.
  8. Chen, G., Adleman, N.E., Saad, Z.S., Leibenluft, E., Cox, R.W. (2014). Applications of multivariate modeling to neuroimaging group analysis: a comprehensive alternative to univariate general linear model. NeuroImage, 99, 571–588.
  9. Chen, G., Taylor, P.A., Shin, Y.W., Reynolds, R.C., Cox, R.W. (2017a). Untangling the relatedness among correlations, part II: inter-subject correlation group analysis through linear mixed-effects modeling. NeuroImage, 147, 825–840.
  10. Chen, G., Taylor, P.A., Cox, R.W. (2017b). Is the statistic value all we should care about in neuroimaging? NeuroImage, 147, 952–959.
  11. Chen, G., Taylor, P.A., Haller, S.P., Kircanski, K., Stoddard, J., Pine, D.S., Leibenluft, E., Brotman, M.A., Cox, R.W. (2018a). Intraclass correlation: improved modeling approaches and applications for neuroimaging. Human Brain Mapping, 39(3), 1187–1206.
  12. Chen, G., Cox, R.W., Glen, D.R., Rajendra, J.K., Reynolds, R.C., Taylor, P.A. (2018b). A tail of two sides: Artificially doubled false positive rates in neuroimaging due to the sidedness choice with t-tests. Human Brain Mapping. In press.
  13. Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003.
  14. Cox, R.W. (1996). AFNI: software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomedical Research, 29, 162–173.
  15. Cox, R.W., Chen, G., Glen, D.R., Reynolds, R.C., Taylor, P.A. (2017). FMRI clustering in AFNI: false-positive rates redux. Brain Connectivity, 7(3), 152–171.
  16. Cox, R.W. (2018). Equitable Thresholding and Clustering. In preparation.
  17. Cox, R.W., & Taylor, P.A. (2017). Stability of Spatial Smoothness and Cluster-Size Threshold Estimates in FMRI using AFNI. arXiv:1709.07471.
  18. Cremers, H.R., Wager, T.D., Yarkoni, T. (2017). The relation between statistical power and inference in fMRI. PLoS ONE, 12(11), e0184923.
  19. Eklund, A., Nichols, T.E., Knutsson, H. (2016). Cluster failure: why fMRI inferences for spatial extent have inflated false-positive rates. PNAS, 113(28), 7900–7905.
  20. Forman, S.D., Cohen, J.D., Fitzgerald, M., Eddy, W.F., Mintun, M.A., Noll, D.C. (1995). Improved assessment of significant activation in functional magnetic resonance imaging (fMRI): use of a cluster-size threshold. Magnetic Resonance in Medicine, 33, 636–647.
  21. Gelman, A. (2015). Statistics and the crisis of scientific replication. Significance, 12(3), 23–25.
  22. Gelman, A. (2016). The problems with p-values are not just with p-values. The American Statistician, Online Discussion.
  23. Gelman, A., & Carlin, J.B. (2014). Beyond power calculations: assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 1–11.
  24. Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B. (2014). Bayesian data analysis, Third edition. Boca Raton: Chapman & Hall/CRC Press.
  25. Gelman, A., & Hennig, C. (2017). Beyond subjective and objective in statistics. Journal of the Royal Statistical Society: Series A (Statistics in Society), 180(4), 1–31.
  26. Gelman, A., Hill, J., Yajima, M. (2012). Why we (usually) don’t have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5, 189–211.
  27. Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.
  28. Gelman, A., & Shalizi, C.R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66, 8–38.
  29. Gelman, A., Simpson, D., Betancourt, M. (2017). The prior can generally only be understood in the context of the likelihood. arXiv:1708.07487.
  30. Gelman, A., & Tuerlinckx, F. (2000). Type S error rates for classical and Bayesian single and multiple comparison procedures. Computational Statistics, 15, 373–390.
  31. Gonzalez-Castillo, J., Saad, Z.S., Handwerker, D.A., Inati, S.J., Brenowitz, N., Bandettini, P.A. (2012). Whole-brain, time-locked activation with simple tasks revealed using massive averaging and model-free analysis. PNAS, 109(14), 5487–5492.
  32. Gonzalez-Castillo, J., Chen, G., Nichols, T., Cox, R.W., Bandettini, P.A. (2017). Variance decomposition for single-subject task-based fMRI activity estimates across many sessions. NeuroImage, 154, 206–218.
  33. Lazzeroni, L.C., Lu, Y., Belitskaya-Lévy, I. (2016). Solutions for quantifying P-value uncertainty and replication power. Nature Methods, 13, 107–110.
  34. Lewandowski, D., Kurowicka, D., Joe, H. (2009). Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis, 100, 1989–2001.
  35. Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584–585.
  36. McElreath, R. (2016). Statistical Rethinking: a Bayesian course with examples in R and Stan. Boca Raton: Chapman & Hall/CRC Press.
  37. McShane, B.B., Gal, D., Gelman, A., Robert, C., Tackett, J.L. (2017). Abandon statistical significance. arXiv:1709.07588.
  38. Mejia, A., Yue, Y.R., Bolin, D., Lindgren, F., Lindquist, M.A. (2017). A Bayesian general linear modeling approach to cortical surface fMRI data analysis. arXiv:1706.00959.
  39. Morey, R.D., Hoekstra, R., Rouder, J.N., Lee, M.D., Wagenmakers, E.-J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin and Review, 23(1), 103–123.
  40. Mueller, K., Lepsien, J., Möller, H.E., Lohmann, G. (2017). Commentary: cluster failure: why fMRI inferences for spatial extent have inflated false-positive rates. Frontiers in Human Neuroscience, 11, 345.
  41. Nichols, T.E., & Holmes, A.P. (2001). Nonparametric permutation tests for functional neuroimaging: a primer with examples. Human Brain Mapping, 15(1), 1–25.
  42. Olszowy, W., Aston, J., Rua, C., Williams, G.B. (2017). Accurate autocorrelation modeling substantially improves fMRI reliability. arXiv:1711.09877.
  43. Poline, J.B., & Brett, M. (2012). The general linear model and fMRI: does love last forever? NeuroImage, 62(2), 871–880.
  44. R Core Team. (2017). R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
  45. Saad, Z.S., Reynolds, R.C., Argall, B., Japee, S., Cox, R.W. (2004). SUMA: an interface for surface-based intra- and inter-subject analysis with AFNI. In Proceedings of the 2004 IEEE International Symposium on Biomedical Imaging (pp. 1510–1513).
  46. Schaefer, A., Kong, R., Gordon, E.M., Zuo, X.N., Holmes, A.J., Eickhoff, S.B., Yeo, B.T. (2017). Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity MRI. Cerebral Cortex. In press.
  47. Simmons, J.P., Nelson, L.D., Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.
  48. Smith, S.M., & Nichols, T.E. (2009). Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. NeuroImage, 44(1), 83–98.
  49. Stan Development Team. (2017). Stan modeling language users guide and reference manual, Version 2.17.0.
  50. Steegen, S., Tuerlinckx, F., Gelman, A., Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5), 702–712.
  51. Wasserstein, R.L., & Lazar, N.A. (2016). The ASA’s statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129–133.
  52. Vehtari, A., Gelman, A., Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432.
  53. Westfall, J., Nichols, T.E., Yarkoni, T. (2017). Fixing the stimulus-as-fixed-effect fallacy in task fMRI. Wellcome Open Research, 1, 23.
  54. Wickham, H. (2009). ggplot2: elegant graphics for data analysis. New York: Springer.
  55. Worsley, K.J., Marrett, S., Neelin, P., Evans, A.C. (1992). A three-dimensional statistical analysis for CBF activation studies in human brain. Journal of Cerebral Blood Flow and Metabolism, 12, 900–918.
  56. Xiao, Y., Geng, F., Riggins, T., Chen, G., Redcay, E. (2018). Neural correlates of developing theory of mind competence in early childhood. Under review.
  57. Yeung, A.W.K. (2018). An updated survey on statistical thresholding and sample size of fMRI studies. Frontiers in Human Neuroscience, 12, 16.

Copyright information

© This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply 2019

Authors and Affiliations

  1. Scientific and Statistical Computing Core, National Institute of Mental Health, Bethesda, USA
  2. Department of Psychology, University of Maryland, College Park, USA
