# Asymptotic properties of Turing’s formula in relative error

## Abstract

Turing’s formula allows one to estimate the total probability associated with letters from an alphabet, which are not observed in a random sample. In this paper we give conditions for the consistency and asymptotic normality of the relative error of Turing’s formula of any order. We then show that these conditions always hold when the distribution is regularly varying with index \(\alpha \in (0,1]\).

## Keywords

Asymptotic normality, Consistency, Distributions on alphabets, Missing mass, Regular variation, Turing’s formula

## 1 Introduction

In many situations one works with data that has no natural ordering and is categorical in nature. In such cases, an important problem is to estimate the probability of seeing a new category that has not been observed before. This probability is called the missing mass. See McAllester and Ortiz (2003), Berend and Kontorovich (2013), Ben-Hamou et al. (2017), Decrouez et al. (2016) and the references therein for a discussion of its properties. The problem of estimating the missing mass arises in many applications, including ecology (Good and Toulmin 1956; Chao 1981; Chao et al. 2015), genomics (Mao and Lindsay 2002), speech recognition (Gupta et al. 1992; Chen and Goodman 1999), authorship attribution (Efron and Thisted 1976; Thisted and Efron 1987; Zhang and Huang 2007), and computer networks (Zhang 2005). Perhaps the most famous estimator of the missing mass is Turing’s formula, sometimes also called the Good–Turing formula. This formula was first published by Good (1953), where the idea is credited, largely, to Alan Turing. To discuss this estimator and how it works, we begin by formally defining our framework.

*n* from \({\mathcal {A}}\) according to \({\mathcal {P}}\), let

*r* times.

We will prove (4) under a new sufficient condition. Interestingly, this condition turns out to be more restrictive than the one for (3). This is likely due to the fact that, since we are now dividing by \(\pi _{0,n}\), we must make sure that it does not approach zero too quickly. We note that our approach is quite different from the one used in Ohannessian and Dahleh (2012) to prove (2). That approach uses tools specific to regularly varying distributions, which we will not need.

The remainder of the paper is organized as follows. In Sect. 2 we recall Turing’s formulae of higher orders and give conditions for the asymptotic normality and consistency of their relative errors. In Sect. 3, we give some comments on the assumptions of the main results and give alternate ways of checking them. Then, in Sect. 4, we show that the assumptions always hold when the distribution, \({\mathcal {P}}\), is regularly varying with index \(\alpha \in (0,1]\). For \(\alpha \ne 1\) this is the condition under which the consistency results of Ohannessian and Dahleh (2012) were obtained. Finally, proofs of the main results are given in Sect. 5.

Before proceeding we introduce some notation. For \(x>0\) we denote the gamma function by \(\Gamma (x)=\int _0^\infty e^{-t} t^{x-1}\mathrm dt\). For real valued functions *f* and *g*, we write \(f(x)\sim g(x)\) as \(x\rightarrow c\) to mean \(\lim _{x\rightarrow c}\frac{f(x)}{g(x)} = 1\). For sequences \(a_n\) and \(b_n\), we write \(a_n\sim b_n\) to mean \(\lim _{n\rightarrow \infty }\frac{a_n}{b_n}=1\). We write \(N(\mu ,\sigma ^2)\) to refer to a normal distribution with mean \(\mu \) and variance \(\sigma ^2\). We write \(\mathop {\rightarrow }\limits ^{d}\) to refer to convergence in distribution and \(\mathop {\rightarrow }\limits ^{p}\) to refer to convergence in probability.

## 2 Main results

*r* times, and let

*r*. Turing’s formula of order 0 is just called Turing’s formula and is the most useful in applications as it estimates the probability of seeing a letter that has never been observed before. We now recall the results about asymptotic normality given in Zhang and Huang (2008) and Zhang (2013).
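As an illustration of the order 0 case, the following is a minimal Python sketch. It assumes the classical form of Turing’s formula, namely the number of letters observed exactly once divided by the sample size *n*. The function name `turing_formula` and the toy sample are hypothetical and serve only as an example.

```python
from collections import Counter


def turing_formula(sample):
    """Order-0 Turing estimate of the missing mass (assumed form: N_1 / n)."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)  # how many times each letter was observed
    # N_1: number of letters that appear exactly once in the sample
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / n


# Toy sample: 'c' and 'd' each appear once among 11 draws.
sample = list("abracadabra")
print(turing_formula(sample))
```

The estimate depends on the sample only through the count of singletons, so no knowledge of the alphabet or of the underlying distribution is needed.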

*s*, we say that Condition \(A_s\) is satisfied if

### Lemma 1

We are now ready to state our main result.

### Theorem 1

### Proof

The proof is given in Sect. 5. \(\square \)

### Remark 1

Note that Theorem 1 does not, in general, give \(\sqrt{n}\)-convergence. In fact, the rate of convergence is different for different distributions. In Sect. 4 we will characterize the rates for the case of regularly varying distributions.

Since the most important case is when \(r=0\), we restate Theorem 1 for this case.

### Corollary 1

The results of Theorem 1 may not appear to be of practical use since we generally do not know the values of \(g_n\), \(\mu _{r,n}\), \(c_{r+1}\), or \(c_{r+2}\). However, it turns out that we do not need to know these quantities. So long as a sequence \(g_n\) satisfying the assumptions exists, it and everything else can be estimated.

### Corollary 2

### Proof

The proof is given in Sect. 5. \(\square \)

In the proof of Theorem 1, it is shown that, under the assumptions of that theorem, \(\mu _{r,n} g_n\rightarrow \infty \). This means that we can immediately get consistency.

### Corollary 3

The assumptions of Corollary 3 are quite general. Different conditions are given in Corollary 5.3 of Ben-Hamou et al. (2017). The most general possible conditions for (8) are not known, but it is known that some conditions are necessary. In fact, Mossel and Ohannessian (2015) showed that there cannot exist an estimator of \(\pi _{0,n}\) for which (8) holds for every distribution.

## 3 Discussion

### Lemma 2

### Proof

The proof is given in Sect. 5. \(\square \)

### Remark 2

An intuitive interpretation of \(\Phi _s(n)\) is as follows. Consider the case where the sample size is not fixed at *n*, but is a random variable \(n^*\), where \(n^*\) follows a Poisson distribution with mean *n*. In this case \(y_{k,n^*}\) follows a Poisson distribution with mean \(np_k\) and \(\Phi _s(n)=\mathrm E[N_{s,n^*}]\). In this sense, Condition (d) can be thought of as a Poissonization of Condition (c). Poissonization is a useful tool when studying the occupancy problem and is discussed, at length, in Gnedin et al. (2007).
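The Poissonized interpretation can be made concrete numerically. Assuming that \(N_{s,n}\) counts the letters observed exactly *s* times, each letter *k* contributes the probability that a Poisson variable with mean \(np_k\) equals *s*, so that \(\Phi _s(n)=\sum _k e^{-np_k}(np_k)^s/s!\). The Python sketch below evaluates this sum for a finite list of probabilities; the function name `poissonized_count` and the restriction to a finite (or truncated) alphabet are assumptions made for illustration.

```python
import math


def poissonized_count(probs, n, s):
    """Expected number of letters seen exactly s times under Poisson sampling.

    Assumes each letter k is observed Poisson(n * p_k) times, independently.
    """
    total = 0.0
    for p in probs:
        if p <= 0:
            continue  # letters with zero probability are not part of the alphabet
        lam = n * p  # Poisson mean for this letter
        total += math.exp(-lam) * lam**s / math.factorial(s)
    return total


# Two equally likely letters, Poisson sample size with mean 2.
print(poissonized_count([0.5, 0.5], n=2, s=1))
```

Summing the returned values over all \(s\ge 0\) recovers the number of letters with positive probability, which gives a quick sanity check of an implementation.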

We now turn to the effects of Condition \(A_s\).

### Lemma 3

- 1. Let \(s\ge 1\) be an integer. If Condition \(A_s\) holds with \(c_s>0\) then
  $$\begin{aligned} \lim _{n\rightarrow \infty }\frac{g_n}{n^{1/2}}=\infty \end{aligned}$$ (9)
  and
  $$\begin{aligned} \lim _{n\rightarrow \infty }\frac{g_n}{g_{n+1}}=1. \end{aligned}$$ (10)
- 2. When \(r\ge 2\) and Condition \(A_{r+1}\) holds, (7) is equivalent to
  $$\begin{aligned} \limsup _{n\rightarrow \infty } \sum _{k:p_k<1/n} \left( g_np_k\right) ^2<\infty . \end{aligned}$$

### Proof

The proof is given in Sect. 5. \(\square \)

It is important to note that (9) is implicitly used in the proof of Lemma 1 as given in Zhang and Huang (2008) and Zhang (2013), although it is not directly mentioned there. Further, (9) implies that the assumption in (5) that \(\beta \in (0,1/2)\) is not much of a restriction. It also tells us that \(g_n\) must approach infinity quickly, while (5) tells us that it should not do so too quickly. On the other hand, (10) is a smoothness assumption. It looks like a regular variation condition, but is a bit weaker; see Theorem 1.9.8 in Bingham et al. (1987).

### Remark 3

Lemmas 2 and 3 help to explain what kind of distributions satisfy the assumptions of Theorem 1. Specifically, the two lemmas imply that \(r!n^{-1}g_n^2\mu _{r,n} \rightarrow c_{r+1}>0\). In light of (5), this means that \(\mu _{r,n}\) cannot approach zero too quickly. Thus, \(\pi _{r,n}\) cannot approach zero quickly either. This condition means that the distribution must have heavy tails of some kind. Arguably, the best known distributions with heavy tails are those that are regularly varying, which we focus on in the next section.

## 4 Regular variation

*y*, and the function

### Proposition 1

### Proof

Next, we turn to the case when \(\alpha =1\).

### Proposition 2

We note that the integral in the definition of \(\ell _1\) converges, see the proof of Proposition 14 in Gnedin et al. (2007). Further, by Karamata’s Theorem (see e.g. Theorem 2.1 in Resnick 2007), \(\ell _1\) is slowly varying at infinity.

### Proof

### Remark 4

### Remark 5

When \(\alpha =0\) the distributions may no longer be heavy tailed and the results of Theorem 1 need not hold. In fact, while all geometric distributions are regularly varying with \(\alpha =0\), Ohannessian and Dahleh (2012) showed that for some of them (8) does not hold, and thus neither does the result of Theorem 1.

Combining Corollaries 2 and 3 with Propositions 1 and 2 gives the following.

### Corollary 4

Note that, in the above, we do not need to know what \(\alpha \) and \(\ell \) are, only that they exist.
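To see what the relative error looks like in a simulation, where the true distribution is known, the following Python sketch may help. It assumes that the relative error is \(T_{0,n}/\pi _{0,n}-1\), where \(T_{0,n}\) is the classical order 0 Turing estimate (singletons divided by *n*) and \(\pi _{0,n}\) is the total probability of the unobserved letters. The power law \(p_k\propto k^{-1/\alpha }\) is used as a standard example of a heavy-tailed distribution; it is truncated to finitely many letters here, so it only approximates a regularly varying distribution. All function names and parameter values are hypothetical.

```python
import random
from collections import Counter


def turing_formula(sample):
    """Order-0 Turing estimate (assumed form): singletons divided by n."""
    counts = Counter(sample)
    return sum(1 for c in counts.values() if c == 1) / len(sample)


def missing_mass(sample, probs):
    """Total probability of the letters (indices into probs) not in the sample."""
    seen = set(sample)
    return sum(p for k, p in enumerate(probs) if k not in seen)


def relative_error(sample, probs):
    """Relative error of the Turing estimate with respect to the missing mass."""
    pi0 = missing_mass(sample, probs)
    if pi0 == 0:
        raise ValueError("every letter was observed; relative error undefined")
    return turing_formula(sample) / pi0 - 1


def truncated_power_law(num_letters, alpha):
    """Probabilities proportional to k^(-1/alpha), k = 1, ..., num_letters."""
    weights = [(k + 1) ** (-1 / alpha) for k in range(num_letters)]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative run: the parameter values are arbitrary choices.
rng = random.Random(0)
probs = truncated_power_law(num_letters=20000, alpha=0.5)
sample = rng.choices(range(len(probs)), weights=probs, k=2000)
print(relative_error(sample, probs))
```

In practice the missing mass is not observable, so the relative error can only be computed in simulations like this one. Repeating the run for increasing sample sizes gives a rough empirical picture of the consistency discussed above.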

## 5 Proofs

*at most*

*r* times, and let

### Lemma 4

The proof is similar to that of Lemma 20 in Ohannessian and Dahleh (2012). We include it for completeness.

### Proof

Define the events \(A=[-\epsilon /2<\Pi _{r,n}-M_{r,n}<\epsilon /2]\), \(B=[-\epsilon /2<M_{r-1,n}-\Pi _{r-1,n}<\epsilon /2]\), and \(C=[-\epsilon<\pi _{r,n}-\mu _{r,n}<\epsilon ]\). Since \(A\cap B\subset C\) it follows that \(P(C^c) \le P( A^c\cup B^c)\le P(A^c)+P(B^c)\), which gives the first inequality. The second follows by Chebyshev’s inequality. \(\square \)

We will need bounds on the variances in the above lemma.

### Lemma 5

This result follows from the fact that the random variables \(\{y_{k,n}:k=1,2,\dots \}\) are negatively associated, see Dubhashi and Ranjan (1998). For completeness we give a detailed proof.

### Proof

*n* and \(p_k\) gives

To help simplify the above bound, we give the following result.

### Lemma 6

### Proof

### Proof of Lemma 2

### Proof of Lemma 3

*s*/*n*, 1] and that \(sn^{-1}<n^{-1/2}\) for large enough *n* gives

### Proof of Theorem 1

### Proof of Corollary 2

## Notes

### Acknowledgements

The authors wish to thank the anonymous referees whose detailed comments led to improvements in the presentation of this paper.

## References

- Abramowitz, M., & Stegun, I. A. (1972). *Handbook of mathematical functions* (10th ed.). New York: Dover Publications.
- Ben-Hamou, A., Boucheron, S., & Ohannessian, M. I. (2017). Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. *Bernoulli*, *23*(1), 249–287.
- Berend, D., & Kontorovich, A. (2013). On the concentration of the missing mass. *Electronic Communications in Probability*, *18*(3), 1–7.
- Bingham, N. H., Goldie, C. M., & Teugels, J. L. (1987). *Regular variation. Encyclopedia of mathematics and its applications*. Cambridge: Cambridge University Press.
- Chao, A. (1981). On estimating the probability of discovering a new species. *The Annals of Statistics*, *9*(6), 1339–1342.
- Chao, A., Hsieh, T. C., Chazdon, R. L., Colwell, R. K., & Gotelli, N. J. (2015). Unveiling the species-rank abundance distribution by generalizing the Good–Turing sample coverage theory. *Ecology*, *96*(5), 1189–1201.
- Chen, S. F., & Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. *Computer Speech and Language*, *13*(4), 359–394.
- Cohen, A., & Sackrowitz, H. B. (1990). Admissibility of estimators of the probability of unobserved outcomes. *Annals of the Institute of Statistical Mathematics*, *42*(4), 623–636.
- Decrouez, G., Grabchak, M., & Paris, Q. (2016). Finite sample properties of the mean occupancy counts and probabilities. *Bernoulli* (to appear). arXiv:1601.06537v2.
- Dubhashi, D., & Ranjan, D. (1998). Balls and bins: A study in negative dependence. *Random Structures and Algorithms*, *13*(2), 99–124.
- Efron, B., & Thisted, R. (1976). Estimating the number of unseen species: How many words did Shakespeare know? *Biometrika*, *63*(3), 435–447.
- Esty, W. W. (1983). A normal limit law for a nonparametric estimator of the coverage of a random sample. *Annals of Statistics*, *11*(3), 905–912.
- Favaro, S., Nipoti, B., & Teh, Y. W. (2016). Rediscovery of Good–Turing estimators via Bayesian nonparametrics. *Biometrics*, *72*(1), 136–145.
- Gnedin, A., Hansen, B., & Pitman, J. (2007). Notes on the occupancy problem with infinitely many boxes: General asymptotics and power laws. *Probability Surveys*, *4*, 146–171.
- Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. *Biometrika*, *40*(3/4), 237–264.
- Good, I. J., & Toulmin, G. H. (1956). The number of new species, and the increase in population coverage, when a sample is increased. *Biometrika*, *43*(1–2), 45–63.
- Grabchak, M., & Cosme, V. (2017). On the performance of Turing’s formula: A simulation study. *Communication in Statistics: Simulation and Computation*, *46*(6), 4199–4209.
- Gupta, V., Lennig, M., & Mermelstein, P. (1992). A language model for very large-vocabulary speech recognition. *Computer Speech and Language*, *6*(4), 331–344.
- Karlin, S. (1967). Central limit theorems for certain infinite urn schemes. *Journal of Mathematical Mechanics*, *17*, 373–401.
- Mallows, C. L. (1968). An inequality involving multinomial probabilities. *Biometrika*, *55*(2), 422–424.
- Mao, C. X., & Lindsay, B. G. (2002). A Poisson model for the coverage problem with a genomic application. *Biometrika*, *89*(3), 669–681.
- McAllester, D. A., & Schapire, R. E. (2000). On the convergence rate of Good–Turing estimators. In *COLT ’00: Proceedings of the thirteenth annual conference on computational learning theory* (pp. 1–6).
- McAllester, D. A., & Ortiz, L. E. (2003). Concentration inequalities for the missing mass and for histogram rule error. *Journal of Machine Learning Research*, *4*(Oct), 895–911.
- Mossel, E., & Ohannessian, M. I. (2015). On the impossibility of learning the missing mass. arXiv:1503.03613v1.
- Ohannessian, M. I., & Dahleh, M. A. (2012). Rare probability estimation under regularly varying heavy tails. In *JMLR workshop and conference proceedings* (Vol. 23, pp. 21.1–21.24).
- Resnick, S. I. (2007). *Heavy-tail phenomena: Probabilistic and statistical modeling*. New York: Springer.
- Robbins, H. E. (1968). Estimating the total probability of the unobserved outcomes of an experiment. *Annals of Mathematical Statistics*, *39*(1), 256–257.
- Thisted, R., & Efron, B. (1987). Did Shakespeare write a newly discovered poem. *Biometrika*, *74*(3), 445–455.
- Zhang, C. H. (2005). Estimation of sums of random variables: Examples and information bounds. *The Annals of Statistics*, *33*(5), 2022–2041.
- Zhang, C. H., & Zhang, Z. (2009). Asymptotic normality of a nonparametric estimator of sample coverage. *Annals of Statistics*, *37*(5A), 2582–2595.
- Zhang, Z. (2013). A multivariate normal law for Turing’s formulae. *Sankhya A*, *75*(1), 51–73.
- Zhang, Z., & Huang, H. (2007). Turing’s formula revisited. *Journal of Quantitative Linguistics*, *14*(2–3), 222–241.
- Zhang, Z., & Huang, H. (2008). A sufficient normality condition for Turing’s formula. *Journal of Nonparametric Statistics*, *20*(5), 431–446.