Statistics in Biosciences, Volume 10, Issue 1, pp 41–58

A Two-Stage Hidden Markov Model Design for Biomarker Detection, with Application to Microbiome Research

  • Yi-Hui Zhou
  • Paul Brooks
  • Xiaoshan Wang


It has been recognized that, for appropriately ordered data, hidden Markov models (HMMs) with local false discovery rate (FDR) control can increase the power to detect significant associations. For many high-throughput technologies, cost still limits their application. Two-stage designs are attractive: a set of interesting features or biomarkers is identified in a first stage and then followed up in a second stage. However, to our knowledge, no two-stage FDR control procedure with HMMs has been developed. In this paper, we study an efficient HMM–FDR-based two-stage design, using a simple integrated analysis procedure across the stages. Numerical studies show its excellent performance compared with available methods. A power analysis method is also proposed. We use examples from microbiome data to illustrate the methods.


Keywords: Biomarker; False discovery rates; Hidden Markov model; Metagenomics; Metatranscriptomics; PCR



This work was supported by R21HG007840.

Supplementary material

Supplementary material 1: 12561_2017_9187_MOESM1_ESM.pdf (PDF, 687 KB)



Copyright information

© International Chinese Statistical Association 2017

Authors and Affiliations

  1. Department of Biological Sciences, Bioinformatics Research Center, North Carolina State University, Raleigh, USA
  2. Department of Statistical Sciences and Operations Research and Department of Supply Chain Management and Analytics, Virginia Commonwealth University, Richmond, USA
  3. IMEDACS, LLC, Ann Arbor, USA
