Abstract
We consider the classic problem of estimating T, the total number of species in a population, from repeated counts in a simple random sample. We first show that the frequently used Chao-Lee estimator can in fact be obtained by Bayesian methods with a Dirichlet prior, and then use this insight to develop a new estimator. Numerical tests and some real experiments show that the new estimator is more flexible than existing ones, in the sense that it adapts to changes in the normalized interspecies variance γ². Our method involves simultaneous estimation of T, of γ², and of the parameter λ in the Dirichlet prior; the only limitation seems to come from the required convergence of the prior, which imposes the restriction γ² ≤ 1. We also obtain confidence intervals for T and an estimate of the species' distribution. Some numerical examples are given, together with applications to sampling from a Census database that closely follows Benford's law; these show good performance of the new estimator, even beyond γ² = 1. Tests on confidence intervals show that the coverage frequency is in good agreement with the desired confidence level.
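For orientation, the coverage-based estimator usually attributed to Chao and Lee (1992) can be sketched in a few lines. This is a minimal illustration of that existing estimator only, not of the new estimator proposed in the paper; the function name `chao_lee_estimate` and its return layout are hypothetical, and γ² is estimated here by the usual squared-coefficient-of-variation formula truncated at zero.

```python
from collections import Counter


def chao_lee_estimate(sample):
    """Coverage-based estimate of the number of species (Chao-Lee style sketch).

    `sample` is a sequence of species labels, one per sampled individual.
    Returns (t_hat, gamma2, coverage). Names are illustrative assumptions.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")

    counts = Counter(sample)         # species -> number of times observed
    freq = Counter(counts.values())  # i -> f_i, species observed exactly i times
    d = len(counts)                  # number of distinct species observed
    f1 = freq.get(1, 0)              # number of singletons

    coverage = 1.0 - f1 / n          # Good-Turing estimate of sample coverage
    if coverage == 0.0:
        raise ValueError("all species are singletons; coverage estimate is zero")

    t0 = d / coverage                # estimate under equal abundances
    pair_sum = sum(i * (i - 1) * f for i, f in freq.items())
    # Estimated squared coefficient of variation of the species probabilities.
    gamma2 = max(t0 * pair_sum / (n * (n - 1)) - 1.0, 0.0)

    t_hat = t0 + n * (1.0 - coverage) / coverage * gamma2
    return t_hat, gamma2, coverage


if __name__ == "__main__":
    # Eight individuals of one species plus two singletons.
    print(chao_lee_estimate(["a"] * 8 + ["b", "c"]))
```

When the estimated γ² is zero the result reduces to the equal-abundance estimate D/C, where D is the number of observed species and C the estimated coverage; larger values of γ² push the estimate upward.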
References
Benford, F. (1938). The law of anomalous numbers. Proc. Am. Philos. Soc., 78, 551–572.
Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W. (1975). Discrete multivariate analysis: theory and practice. MIT Press, Cambridge.
Boender, C.G.E. and Rinnooy Kan, A.H.G. (1987). A multinomial Bayesian approach to the estimation of population and vocabulary size. Biometrika, 74, 849–856.
Böhning, D. and Schön, D. (2005). Nonparametric maximum likelihood estimation of population size based on the counting distribution. J. R. Stat. Soc. Ser. C Appl. Stat., 54, Part 4, 721–737.
Böhning, D., Suppawattanabodee, B., Kusolvisitkul, W. and Viwatwongkasem, C. (2004). Estimating the number of drug users in Bangkok 2001: A capture-recapture approach using repeated entries in one list. Eur. J. Epidemiol., 19, 1075–1083.
Brose, U., Martinez, M.D. and Williams, R.J. (2003). Estimating species richness: sensitivity to sample coverage and insensitivity to spatial patterns. Ecology, 84, 2364–2377.
Bunge, J. and Fitzpatrick, M. (1993). Estimating the number of species: a review. J. Amer. Statist. Assoc., 88, 364–373.
Burnham, K.P. and Overton, W.S. (1979). Robust estimation of population size when capture probabilities vary among animals. Ecology, 60, 927–936.
Burton, D. (2005). The history of mathematics: an introduction. McGraw-Hill.
Chao, A. (1984). Non-parametric estimation of the number of classes in a population. Scand. J. Stat., 11, 265–270.
Chao, A. (2004). Species richness estimation. In Encyclopedia of Statistical Sciences (N. Balakrishnan, C. B. Read and B. Vidakovic, eds.). Wiley, New York.
Chao, A. and Lee, S.M. (1992). Estimating the number of classes via sample coverage. J. Amer. Statist. Assoc., 87, 210–217.
Chao, A., Ma, M.-C. and Yang, M.C.K. (1993). Stopping rules and estimation for recapture debugging with unequal failure rates. Biometrika, 80, 193–201.
Chao, A., Hwang, W.-H., Chen, Y.-C. and Kuo, C.-Y. (2000). Estimating the number of shared species in two communities. Statist. Sinica, 10, 227–246.
Church, K.W. and Gale, W.A. (1991). Enhanced Good-Turing and Cat-Cal: two new methods for estimating probabilities of English bigrams. Comput. Speech Lang., 5, 19–54.
Darroch, J.N. and Ratcliff, D. (1980). A note on capture-recapture estimation. Biometrics, 36, 149–153.
Efron, B. and Thisted, R. (1976). Estimating the number of unseen species: how many words did Shakespeare know? Biometrika, 63, 435–467.
Esty, W.W. (1985). Estimation of the number of classes in a population and the coverage of a sample. Mathematical Scientist, 10, 41–50.
Esty, W.W. (1986). The size of a coverage. Numismatic Chronicle, 146, 185–215.
Fewster, R.M. (2009). A simple explanation of Benford’s Law. Am. Stat., 63, 26–32.
Gandolfi, A. and Sastri, C.C.A. (2004). Nonparametric estimations about species not observed in a random sample. Milan J. Math., 72, 81–105.
Good, I.J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40, 237–266.
Good, I.J. (1965). The estimation of probabilities: an essay on modern Bayesian methods. Research Monograph No. 30, MIT Press, Cambridge, MA.
Good, I.J. (1967). A Bayesian significance test for multinomial distributions. J. Roy. Statist. Soc. Ser. B, 29, 399–431.
Good, I.J. and Toulmin, G. (1956). The number of new species and the increase in population coverage when a sample is increased. Biometrika, 43, 45–63.
Harris, B. (1968). Statistical inference in the classical occupancy problem: unbiased estimation of the number of classes. J. Amer. Statist. Assoc., 63, 837–847.
Hill, T.P. (1995). The significant-digit phenomenon. Am. Math. Month., 102, 322–327.
Jeffreys, H. (1961). Theory of probability. Clarendon Press, Oxford, Third Edition.
Johnson, W.E. (1932). Probability: the deductive and inductive problems. Mind, 49, 409–423.
Laplace (1995). Philosophical essays in probabilities. Springer Verlag, New York.
Lewand, R.E. (2008). Relative frequencies of letters in general English plain text. Cryptographical Mathematics.
Lijoi, A., Mena, H.R. and Prünster, I. (2007). Bayesian nonparametric estimation of the probability of discovering new species. Biometrika, 94, 769–786.
Lindsay, B.G. and Roeder, K. (1987). A unified treatment of integer parameter models. J. Amer. Statist. Assoc., 82, 758–764.
Mao, C.X. (2004). Predicting the conditional probability of discovering a new class. J. Amer. Statist. Assoc., 99, 1108–1118.
Marchand, J.P. and Schroeck, F.E. (1982). On the estimation of the number of equally likely classes in a population. Comm. Statist. Theory Methods, 11, 1139–1146.
Orlitsky, A., Santhanam, N.P. and Zhang, J. (2003). Always Good Turing: asymptotically optimal probability estimation. Science, 302, 427–431.
Pietronero, L., Tosatti, E., Tosatti, V. and Vespignani, A. (2001). Explaining the uneven distribution of numbers in nature: The laws of Benford and Zipf. Phys. A, 293, 297–304.
Shen, T-J., Chao, A. and Lin, C-F. (2003). Predicting the number of new species in further taxonomic sampling. Ecology, 84, 798–804.
Tao, T. (2009). Benford’s law, Zipf’s law, and the Pareto distribution, Terence Tao’s blog. http://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-law-and-the-pareto-distribution/
Zabell, S.L. (1982). W. E. Johnson’s “Sufficientness” postulate. Ann. Statist., 10, 1090–1099.
Zipf, G.K. (1935). The psychobiology of language; an introduction to dynamic philology. Houghton Mifflin, Boston.
Acknowledgement
This work was done during visits by one of us (CCAS) to the Università di Roma, Tor Vergata; Università di Milano-Bicocca; and Università di Firenze. He takes pleasure in thanking those universities for their warm hospitality and GNAMPA for its support. We would also like to thank J.S. Rao, M. Scarsini, J. Sethuraman and S.R.S. Varadhan for helpful comments.
Cite this article
Cecconi, L., Gandolfi, A. & Sastri, C.C.A. A new estimator for the number of species in a population. Sankhya A 74, 80–100 (2012). https://doi.org/10.1007/s13171-012-0012-x
DOI: https://doi.org/10.1007/s13171-012-0012-x