Discovery Among Binary Biomarkers in Heterogeneous Populations

Statistical Modeling in Biomedical Research

Part of the book series: Emerging Topics in Statistics and Biostatistics (ETSB)

Abstract

Biomarkers have great potential to improve disease diagnosis and treatment. Disease may arise via multiple pathways, however, each associated with distinct, complex interactions among multiple biomarkers. Patients therefore exhibit considerable heterogeneity in the biomarker-disease association despite sharing the same clinical diagnosis. Identifying clinically useful biomarker combinations thus requires statistical methods that accommodate population heterogeneity and enable discovery of possibly complex interactions among biomarkers that associate with disease. We address the joint modeling of binary and continuous disease outcomes when the association between predictors and these outcomes exhibits heterogeneity. In the context of binary biomarkers, we use ideas from logic regression to find Boolean combinations of these biomarkers that predict the binary disease outcome. The associated continuous outcome is modeled as Gaussian. Heterogeneity is cast as unknown subgroups in the population, with the associations between the joint outcome and the biomarkers and other covariates varying by subgroup. We adopt a fully Bayesian mixture of finite mixtures (MFM) formulation to simultaneously estimate the number of subgroups, the subgroup membership structure, and the subgroup-specific relationships between outcomes and predictors. We describe how our model incorporates the Boolean relations as parameters arising from the MFM model, and our approach to the associated challenges of specifying the prior distribution and performing estimation using Markov chain Monte Carlo. We illustrate the performance of the methods using simulation and discuss application.
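
To make the data structure described in the abstract concrete, the following is a minimal simulation sketch. It is not the chapter's model specification: the two subgroups, the particular Boolean rules, the logistic link, and every coefficient value are assumptions chosen for illustration, and the function name `simulate` is hypothetical. It only shows how a subgroup-specific Boolean combination of binary biomarkers could drive both a binary outcome and a Gaussian outcome.

```python
import numpy as np

def simulate(n=500, p=6, seed=0):
    """Simulate binary biomarkers, latent subgroups, and a joint (binary, Gaussian) outcome.

    All rules and coefficients below are illustrative assumptions,
    not values taken from the chapter.
    """
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n, p)).astype(bool)  # binary biomarkers
    z = rng.integers(0, 2, size=n)                    # latent subgroup labels (K = 2 assumed)

    # Hypothetical subgroup-specific Boolean combinations of biomarkers.
    rule0 = X[:, 0] & ~X[:, 1]
    rule1 = X[:, 2] | (X[:, 3] & X[:, 4])
    L = np.where(z == 0, rule0, rule1)                # rule that applies to each individual

    # Binary outcome: Bernoulli with a logistic link on the Boolean rule (assumed link).
    alpha = np.array([-1.0, -0.5])
    beta = np.array([2.0, 1.5])
    prob = 1.0 / (1.0 + np.exp(-(alpha[z] + beta[z] * L)))
    y = (rng.random(n) < prob).astype(int)

    # Continuous outcome: Gaussian with a subgroup-specific mean shifted by the rule.
    mu = np.array([0.0, 3.0])
    gamma = np.array([1.0, -1.0])
    w = rng.normal(mu[z] + gamma[z] * L, 1.0)
    return X, z, L, y, w

X, z, L, y, w = simulate()
```

In an analysis, only `X`, `y`, and `w` would be observed; the labels `z`, the number of subgroups, and the Boolean rules are the unknowns that the abstract says are estimated jointly.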

Notes

  1. Henceforth we refer to the classes defining the underlying subpopulation structure as clusters for greater consistency with the machine learning and Bayesian literature. The cluster configuration is the cluster assignment information encoded by the {z_i}. Because each individual is assigned to exactly one cluster, the cluster configuration is, equivalently, a partition of the n observations into K groups.
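
The equivalence between cluster labels and a partition can be illustrated with a short sketch. The helper name `labels_to_partition` and the example labels are hypothetical; the code only shows that two label vectors which differ by a relabeling of the clusters encode the same partition of the observations.

```python
from collections import defaultdict

def labels_to_partition(z):
    """Convert cluster labels z_1..z_n into a partition of the indices 0..n-1."""
    groups = defaultdict(list)
    for i, label in enumerate(z):
        groups[label].append(i)
    # Order blocks by their smallest member so the result ignores label names.
    return sorted(groups.values(), key=lambda block: block[0])

z = [2, 0, 2, 1, 0]            # n = 5 observations, K = 3 clusters
z_relabeled = [0, 1, 0, 2, 1]  # same grouping with the cluster labels permuted

partition = labels_to_partition(z)
print(partition)  # [[0, 2], [1, 4], [3]]
print(partition == labels_to_partition(z_relabeled))  # True
```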

Acknowledgements

The authors were partially supported by grants R01MH104423, R01HD078410 and R01HD093055 from the National Institutes of Health. Portions of this work were revised while E. Slate was the Visiting Scholar in Honor of David C. Jordan at AbbVie, Inc. in North Chicago, IL and also a Research Fellow with the Statistical and Applied Mathematical Sciences Institute in Durham, NC. Additional support from the Graduate School and Department of Statistics at Florida State University is gratefully acknowledged. Figures 1 and 2 were adapted from a figure provided by Dr. Zhengwu Zhang, Univ. of Rochester. The authors thank the reviewers for comments that led to improvement of this manuscript.

Author information

Corresponding author

Correspondence to Elizabeth H. Slate.

Copyright information

© 2020 Springer Nature Switzerland AG

Cite this chapter

Geng, J., Slate, E.H. (2020). Discovery Among Binary Biomarkers in Heterogeneous Populations. In: Zhao, Y., Chen, DG. (eds) Statistical Modeling in Biomedical Research. Emerging Topics in Statistics and Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-030-33416-1_11
