Abstract
Biomarkers have great potential to improve disease diagnosis and treatment. Disease may arise via multiple pathways, however, each associated with distinct complex interactions among multiple biomarkers, and hence patients exhibit considerable heterogeneity in the biomarker-disease association despite sharing the same clinical diagnosis. Thus identification of clinically useful biomarker combinations requires statistical methods that accommodate population heterogeneity and enable discovery of possibly complex interactions among biomarkers that associate with disease. We address jointly modeling binary and continuous disease outcomes when the association between predictors and these outcomes exhibits heterogeneity. In the context of binary biomarkers, we use ideas from logic regression to find Boolean combinations of these biomarkers that predict the binary disease outcome. The associated continuous outcome is modeled as Gaussian. Heterogeneity is cast as unknown subgroups in the population, with the associations between the joint outcome and biomarkers and other covariates varying by subgroup. We adopt a mixture of finite mixtures (MFM) fully Bayesian formulation to simultaneously estimate the number of subgroups, the subgroup membership structure, and the subgroup-specific relationships between outcomes and predictors. We describe how our model incorporates the Boolean relations as parameters arising from the MFM model and our approach to the associated challenges of specifying the prior distribution and estimation using Markov chain Monte Carlo. We illustrate the performance of the methods using simulation and discuss application.
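To make the structure described above concrete, the following is a minimal simulation sketch rather than the chapter's model or sampler. It assumes two subgroups, four independent binary biomarkers, one hypothetical Boolean rule per subgroup, a logistic link for the binary outcome, and a unit-variance Gaussian for the continuous outcome. The function name `simulate` and all rules, weights, and coefficients are illustrative assumptions.

```python
import numpy as np

def simulate(n=500, seed=0):
    """Simulate joint binary/continuous outcomes from a two-subgroup mixture.

    All rules and parameter values are hypothetical and chosen only to
    illustrate subgroup-specific Boolean biomarker combinations.
    """
    rng = np.random.default_rng(seed)

    # Binary biomarkers X1..X4, independent Bernoulli(0.5) (assumption).
    X = rng.integers(0, 2, size=(n, 4)).astype(bool)

    # Latent subgroup labels z_i with hypothetical mixing weights.
    z = rng.choice(2, size=n, p=[0.6, 0.4])

    # Subgroup-specific Boolean combinations (hypothetical rules).
    rule0 = X[:, 0] & ~X[:, 1]   # X1 AND (NOT X2)
    rule1 = X[:, 2] | X[:, 3]    # X3 OR X4
    L = np.where(z == 0, rule0, rule1)

    # Binary outcome: logistic link on the active rule (assumed link).
    intercept = np.array([-1.0, -0.5])
    slope = np.array([2.5, 2.0])
    prob = 1.0 / (1.0 + np.exp(-(intercept[z] + slope[z] * L)))
    y_bin = rng.random(n) < prob

    # Continuous outcome: Gaussian with subgroup-specific mean and rule effect.
    mu = np.array([0.0, 3.0])
    beta = np.array([1.0, -1.0])
    y_cont = rng.normal(mu[z] + beta[z] * L, 1.0)

    return X, z, L, y_bin, y_cont
```

In a real analysis the labels `z`, the number of subgroups, and the Boolean rules are all unknown. The sketch shows only the data-generating side that the MFM formulation is meant to recover.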
Notes
1. Henceforth we refer to the classes defining the underlying subpopulation structure as clusters for greater consistency with the machine learning and Bayesian literature. The cluster configuration is the cluster assignment information encoded by the {z_i}. Because each individual is assigned to exactly one cluster, the cluster configuration is, equivalently, a partition of the n observations into K groups.
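The equivalence between cluster labels and a partition can be shown with a short sketch. The helper name `partition_from_labels` and the example label vectors are hypothetical. The point is that relabeling the clusters leaves the partition unchanged.

```python
def partition_from_labels(z):
    """Convert cluster labels z_1..z_n into a partition of indices 0..n-1."""
    groups = {}
    for i, label in enumerate(z):
        groups.setdefault(label, []).append(i)
    # A set of sets ignores both the label names and the order of the blocks.
    return frozenset(frozenset(block) for block in groups.values())

z_a = [0, 0, 1, 2, 1]   # n = 5 observations, K = 3 clusters
z_b = [2, 2, 0, 1, 0]   # same grouping with the labels permuted

same_partition = partition_from_labels(z_a) == partition_from_labels(z_b)
num_clusters = len(partition_from_labels(z_a))
```

Here `same_partition` is `True` and `num_clusters` is 3, because both label vectors group the observations as {0, 1}, {2, 4}, and {3}.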
Acknowledgements
The authors were partially supported by grants R01MH104423, R01HD078410 and R01HD093055 from the National Institutes of Health. Portions of this work were revised while E. Slate was the Visiting Scholar in Honor of David C. Jordan at AbbVie, Inc. in North Chicago, IL and also a Research Fellow with the Statistical and Applied Mathematical Sciences Institute in Durham, NC. Additional support from the Graduate School and Department of Statistics at Florida State University is gratefully acknowledged. Figures 1 and 2 were adapted from a figure provided by Dr. Zhengwu Zhang, Univ. of Rochester. The authors thank the reviewers for comments that led to improvement of this manuscript.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Geng, J., Slate, E.H. (2020). Discovery Among Binary Biomarkers in Heterogeneous Populations. In: Zhao, Y., Chen, DG. (eds) Statistical Modeling in Biomedical Research. Emerging Topics in Statistics and Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-030-33416-1_11
DOI: https://doi.org/10.1007/978-3-030-33416-1_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33415-4
Online ISBN: 978-3-030-33416-1
eBook Packages: Mathematics and Statistics (R0)