Discovery Among Binary Biomarkers in Heterogeneous Populations

  • Junxian Geng
  • Elizabeth H. Slate
Part of the Emerging Topics in Statistics and Biostatistics book series (ETSB)


Biomarkers have great potential to improve disease diagnosis and treatment. However, disease may arise via multiple pathways, each associated with distinct, complex interactions among multiple biomarkers; hence patients exhibit considerable heterogeneity in the biomarker-disease association despite sharing the same clinical diagnosis. Thus, identification of clinically useful biomarker combinations requires statistical methods that accommodate population heterogeneity and enable discovery of possibly complex interactions among biomarkers that associate with disease. We address the joint modeling of binary and continuous disease outcomes when the association between predictors and these outcomes exhibits heterogeneity. In the context of binary biomarkers, we use ideas from logic regression to find Boolean combinations of these biomarkers that predict the binary disease outcome. The associated continuous outcome is modeled as Gaussian. Heterogeneity is cast as unknown subgroups in the population, with the associations between the joint outcome and biomarkers and other covariates varying by subgroup. We adopt a fully Bayesian mixture of finite mixtures (MFM) formulation to simultaneously estimate the number of subgroups, the subgroup membership structure, and the subgroup-specific relationships between outcomes and predictors. We describe how our model incorporates the Boolean relations as parameters arising from the MFM model, and our approach to the associated challenges of specifying the prior distribution and estimating the model using Markov chain Monte Carlo. We illustrate the performance of the methods using simulation and discuss application.
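To make the setting concrete, the following is a minimal simulation sketch of the kind of data the abstract describes, not the authors' model or code. It assumes two latent subgroups, each with its own hypothetical Boolean relation among four binary biomarkers, a binary outcome that is likely when the relation holds, and a Gaussian outcome whose mean depends on the subgroup and the relation. The rules, the probabilities 0.9 and 0.1, the means, and all function names are illustrative assumptions. The subgroup labels are treated as known here purely for comparison; in the chapter they are unknown and estimated.

```python
import random

# Hypothetical Boolean relations, one per latent subgroup (illustrative only).
def rule_a(x):
    # (X1 AND X2) OR (NOT X3)
    return (x[0] and x[1]) or (not x[2])

def rule_b(x):
    # X4 AND (NOT X1)
    return x[3] and (not x[0])

RULES = {0: rule_a, 1: rule_b}

def simulate(n, seed=0, p_true=0.9, p_false=0.1):
    """Draw n subjects as tuples (z, x, y, w): latent subgroup z,
    binary biomarkers x, binary outcome y, Gaussian outcome w."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = rng.randrange(2)                         # latent subgroup label
        x = [rng.random() < 0.5 for _ in range(4)]   # four binary biomarkers
        holds = bool(RULES[z](x))                    # subgroup-specific relation
        y = rng.random() < (p_true if holds else p_false)
        # Assumed subgroup-specific Gaussian mean, shifted when the relation holds.
        mean = (1.0 + 2.0 * holds) if z == 0 else (-1.0 - 2.0 * holds)
        w = rng.gauss(mean, 1.0)
        data.append((z, x, y, w))
    return data

def accuracy(data, predict):
    """Fraction of subjects whose binary outcome matches the prediction."""
    return sum(predict(z, x) == y for z, x, y, _ in data) / len(data)

data = simulate(2000)
# One Boolean rule applied to the whole population, ignoring subgroups.
acc_pooled_a = accuracy(data, lambda z, x: bool(rule_a(x)))
acc_pooled_b = accuracy(data, lambda z, x: bool(rule_b(x)))
# The rule matched to each subject's (here, known) subgroup.
acc_by_group = accuracy(data, lambda z, x: bool(RULES[z](x)))

print(f"pooled rule A: {acc_pooled_a:.3f}")
print(f"pooled rule B: {acc_pooled_b:.3f}")
print(f"subgroup-specific rules: {acc_by_group:.3f}")
```

In this toy setup a single Boolean rule applied to everyone predicts the binary outcome noticeably worse than the subgroup-specific rules, which is the motivation for estimating the subgroup structure and the Boolean relations together.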


Keywords: Bayesian semiparametric model; Clustering; Joint modeling; Markov chain Monte Carlo; Product partition model



The authors were partially supported by grants R01MH104423, R01HD078410 and R01HD093055 from the National Institutes of Health. Portions of this work were revised while E. Slate was the Visiting Scholar in Honor of David C. Jordan at AbbVie, Inc. in North Chicago, IL and also a Research Fellow with the Statistical and Applied Mathematical Sciences Institute in Durham, NC. Additional support from the Graduate School and Department of Statistics at Florida State University is gratefully acknowledged. Figures 1 and 2 were adapted from a figure provided by Dr. Zhengwu Zhang, Univ. of Rochester. The authors thank the reviewers for comments that led to improvement of this manuscript.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Boehringer Ingelheim Pharmaceuticals Inc., Ridgefield, USA
  2. Department of Statistics, Florida State University, Tallahassee, USA
