Estimating total species using a weighted combination of expected mixture distribution component counts


In this paper we present a weighted mixture distribution component counts (MDCC) approach for estimating the total number of species. The proposed method combines conditional estimates of component counts from several candidate mixture distributions and uses the bootstrap to estimate confidence intervals. The distribution specification is flexible and can be adjusted to suit a variety of datasets. Smoothing techniques can also be incorporated to improve the modeling of sparse data. The method is tested in a simulation study and applied to two microbiome datasets for illustration. Simulation results indicate lower bias and mean squared error and better confidence interval coverage relative to comparison methods, as well as robustness to the underlying data structure.
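The two general ideas named in the abstract, a weighted combination of candidate estimates and a bootstrap confidence interval, can be illustrated with a minimal sketch. This is not the MDCC estimator: the bias-corrected Chao1 estimator is swapped in as a well-known stand-in, the weights passed to `weighted_estimate` are hypothetical, and the function names `chao1`, `weighted_estimate` and `bootstrap_ci` are illustrative assumptions rather than anything defined in the paper. The resampling scheme is a naive nonparametric bootstrap over observed individuals, which cannot generate unobserved species and may differ from the scheme the authors use.

```python
import random
from collections import Counter


def chao1(abundances):
    """Bias-corrected Chao1 richness estimate (stand-in, not MDCC)."""
    observed = [a for a in abundances if a > 0]
    f1 = sum(1 for a in observed if a == 1)  # species seen exactly once
    f2 = sum(1 for a in observed if a == 2)  # species seen exactly twice
    return len(observed) + f1 * (f1 - 1) / (2 * (f2 + 1))


def weighted_estimate(estimates, weights):
    """Weighted average of candidate estimates (hypothetical weights)."""
    total = sum(weights)
    return sum(e * w for e, w in zip(estimates, weights)) / total


def bootstrap_ci(abundances, estimator, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling individuals with replacement."""
    rng = random.Random(seed)
    # Expand per-species counts into one species label per individual.
    individuals = [i for i, a in enumerate(abundances) for _ in range(a)]
    stats = []
    for _ in range(n_boot):
        resample = rng.choices(individuals, k=len(individuals))
        stats.append(estimator(list(Counter(resample).values())))
    stats.sort()
    alpha = 1 - level
    lower = stats[int((alpha / 2) * (n_boot - 1))]
    upper = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


counts = [10, 5, 3, 2, 2, 1, 1, 1, 1]  # made-up abundance data
point = chao1(counts)
lower, upper = bootstrap_ci(counts, chao1)
```

With the made-up counts above there are 9 observed species, 4 singletons and 2 doubletons, so the stand-in point estimate is 9 + (4 × 3) / (2 × 3) = 11. In a weighted scheme, several such candidate estimates would be passed to `weighted_estimate` together with weights chosen by some criterion that this sketch leaves unspecified.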

Data availability

Office dust microbiome data is available from Qiita (Study ID: 10423). Plant microbiome data is available from Dryad.




We would like to thank the Associate Editor and anonymous reviewer for their comments and feedback, which improved the substance and presentation of this manuscript.

Author information



Corresponding author

Correspondence to Konstantin Shestopaloff.

Additional information

Handling Editor: Pierre Dutilleul.

Appendix A: simulation scenarios

See Table 4.

Table 4 List of component weights for each of the simulation scenarios, ordered by the expected proportion of zeros or unobserved species


Cite this article

Shestopaloff, K., Xu, W. & Escobar, M.D. Estimating total species using a weighted combination of expected mixture distribution component counts. Environ Ecol Stat (2020).


  • Mixture distribution
  • Statistical ecology
  • Total species
  • Unobserved species
  • Weighted estimator