Estimation of Tail Probabilities by Repeated Augmented Reality

Abstract

Synthetic data, when properly used, can enhance patterns in real data and thus provide insights into different problems. Here, the estimation of tail probabilities of rare events from a moderately large number of observations is considered. The problem is approached by a large number of augmentations or fusions of the real data with computer-generated synthetic samples. The tail probability of interest is approximated by subsequences created by a novel iterative process. The estimates are found to be quite precise.

Introduction

The citation accompanying his U.S. National Medal of Science in 2002 honored Calyampudi Radhakrishna Rao “as a prophet of new age for his pioneering contributions to the foundations of statistical theory and multivariate statistical methodology and their applications.” When Professor Rao organized the ‘International Conference on the Future of Statistics, Practice and Education’ in Hyderabad (Indian School of Business, 12.29.04–01.01.05), one of us participated in it. Befitting this connection, we decided to contribute what we believe is a “futuristic” application of augmented reality in his honor.

In its February 4th 2017 edition, The Economist noted the promise of augmented reality, claiming that “Replacing the real world with a virtual one is a neat trick. Combining the two could be more useful.” We concur. Combining real data with synthetic data, i.e., augmented reality (AR), opens up new perspectives regarding statistical inference. Indeed, augmentation of real data with virtual information is an idea that has already found applications in fields such as robotics, medicine, and education.

In this article, we advance the notion of repeated augmented reality in the estimation of very small tail probabilities even from moderately sized samples. Our approach, much like the bootstrap, is computationally intensive and could not have been viable without the computing power of modern systems. However, rather than looking repeatedly inside the sample, we look repeatedly outside the sample. Fusing a given sample repeatedly with computer-generated data is referred to as repeated out of sample fusion (ROSF) in Pan et al. [1, 2]. Related ideas concerning a single fusion are studied in Fithian and Wager [3], Fokianos and Qin [4], Katzoff et al. [5], and Zhou [6].

In 1984, the so-called Watras incident led to intense media and congressional attention in the USA to the problem of residential exposure to radon, a known carcinogenic gas. Radon in the home of Stanley Watras, a construction engineer, located in Boyertown, Berks county, on the Reading Prong geological formation in Pennsylvania, was recorded as almost 700 times the safe level, which is a lung cancer risk equivalent of smoking 250 packs of cigarettes per day! As noted by George [7], this news caused a major alarm and led the US EPA to establish a radon measurement program. In this regard, the present article will review the underpinnings of ROSF in estimation of small tail exceedance probabilities. We will illustrate its application using residential radon level data from Beaver County, Pennsylvania.

The Problem

Consider a random variable \(X \sim g\) and the corresponding moderately large random sample \(\varvec{X}_0=(X_1,\ldots ,X_{n_{0}})\) where all the observations are smaller than a high threshold T, that is, \({\mathrm{max}}(\varvec{X}_0) < T\). We wish to estimate \(p=P(X > T)\) without knowing g. However, as is, the sample may not contain a sufficient amount of information to tackle the problem. To gain more information, we combine or fuse the sample repeatedly with externally generated computer data, that is, ROSF.

The Approach

Let \(\varvec{X}_i\) denote the ith computer-generated sample of size \(n_1=n_0\). Then, the fused samples are the augmentations

$$\begin{aligned} (\varvec{X}_0,\varvec{X}_1),(\varvec{X}_0,\varvec{X}_2),(\varvec{X}_0,\varvec{X}_3)\ldots \end{aligned}$$
(1)

where \(\varvec{X}_0\) is a real reference sample and the \(\varvec{X}_i\) are different independent computer-generated samples supported on (0, U), where \(U > T\). The number of fusions can be as large as we wish. From each pair \((\varvec{X}_0,\varvec{X}_j)\), under a mild condition, we obtain an upper bound \(B_j\) for p in a manner described below. Let \(\{B_{(j)}\}\) be the corresponding sequence of order statistics. Then, the sorted pairs

$$\begin{aligned} (1,B_{(1)}), (2, B_{(2)}), (3,B_{(3)}), \ldots , (N, B_{(N)}) \end{aligned}$$

produce a monotone curve, referred to as the B-curve, which, for large N, contains a point “\(\bullet \)” as in Fig. 1. As N increases, the ordinate of the point essentially coincides with p with probability approaching one. It follows that the sequence

$$\begin{aligned} B_{(1)}, B_{(2)}, B_{(3)}, \ldots B_{(N)} \end{aligned}$$

contains subsequences which approach p. The subsequences can be obtained by an iterative process to be described in Sect. 3.
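As a toy illustration (ours, not part of the authors' procedure), one can simulate stand-in bounds \(B_j\) from any distribution whose support straddles p and observe the bracketing \(B_{(1)}< p < B_{(N)}\); the lognormal below is an arbitrary stand-in for the fusion bounds, whose actual construction comes later:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.001        # true tail probability, assumed known here for illustration
N = 10_000       # number of fusions

# Stand-in for the fusion upper bounds B_1,...,B_N: any distribution
# with 0 < F_B(p) < 1 works; here a lognormal whose median equals p.
B = rng.lognormal(mean=np.log(p), sigma=0.5, size=N)

B_sorted = np.sort(B)                            # B-curve ordinates B_(1) <= ... <= B_(N)
bracketed = bool(B_sorted[0] < p < B_sorted[-1]) # holds with probability -> 1 as N grows
j_star = int(np.argmin(np.abs(B_sorted - p)))    # index of the point "•" on the B-curve
```

Plotting `B_sorted` against its index reproduces the monotone B-curve described above.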

Illustrations of an Iterative Process

Deferring details to later sections, it is helpful to introduce early on our iterative method, which produces estimates of tail probabilities, using reference samples \(\varvec{X}_0\) from \(F_{(2,7)}\) and LN(1, 1) distributions.

In the first illustration, \(\varvec{X}_0\) is a random sample from \(F_{(2,7)}\) with \(T=21.689\), giving \(p=0.001\). Here, \(n_0=n_1=100\), \({\mathrm{max}}(\varvec{X}_0)=12.25072\), and the computer-generated samples consist of independent \({\mathrm{Unif}}(0,50)\) observations. With \(N=10,000\) fusions, and starting from \(j=450\), our iterative process (9) below produces a converging subsequence which approaches p from above, a “Down” subsequence:

$$\begin{aligned}&450 \rightarrow 0.001703351 \rightarrow 438 \rightarrow 0.001603351 \rightarrow 407 \rightarrow 0.001503351 \rightarrow \\&369 \rightarrow 0.001403351 \rightarrow 341 \rightarrow 0.001303351 \rightarrow 312 \rightarrow 0.001203351 \rightarrow \\&278 \rightarrow 0.001103351 \rightarrow 246 \rightarrow 0.001003351 \rightarrow 221 \rightarrow 0.001003351 \cdots \end{aligned}$$

Starting from \(j=210\), our iterative process (9) produces an “Up” subsequence which converges in a single iteration:

$$\begin{aligned} 210 \rightarrow 0.001003351 \rightarrow 219 \rightarrow 0.001003351 \rightarrow 219 \rightarrow 0.001003351 \cdots \end{aligned}$$

In the second illustration, \(\varvec{X}_0\) is a random sample from LN(1, 1) with \(T=59.75377\), giving \(p=0.001\). Here, \(n_0=n_1=200\), \({\mathrm{max}}(\varvec{X}_0)=33.63386\), and the computer-generated samples consist of independent \({\mathrm{Unif}}(0,80)\) observations. With \(N=10,000\) fusions, and starting from \(j=800\), our iterative process (9) below produces a converging “Down” subsequence which approaches p from above in a single iteration:

$$\begin{aligned} 800 \rightarrow 0.001000281 \rightarrow 788 \rightarrow 0.001000281 \rightarrow 788 \rightarrow 0.001000281 \cdots \end{aligned}$$

Starting from \(j=790\), our iterative process (9) produces an “Up” subsequence which converges in a single iteration:

$$\begin{aligned} 790 \rightarrow 0.001000281 \rightarrow 815 \rightarrow 0.001000281 \rightarrow 815 \rightarrow 0.001000281 \cdots \end{aligned}$$

Notice that the “Down-Up” convergence in both illustrations is remarkably close to the true \(p=0.001\). We have obtained many similar results in cases where the tail behavior differed markedly. The computation requires an important parameter, called the “p-increment,” which in the present examples was 0.0001. We shall deal with this numerical issue soon.

A Useful Feature

A useful feature of the present article is the realization that we can come up with educated guesses as to the magnitude of p from the value of \({\mathrm{max}}(\varvec{X}_0)\) relative to T. This in turn suggests a set of discrete points in the interval \(({\mathrm{min}}(B_j),{\mathrm{max}}(B_j))\) at which p-estimates are evaluated, spaced by the “p-increments” mentioned above. The p-increment is a single number used to create the grid on which p-estimates are searched for.

Getting Upper Bounds for p by Data Fusion

Recall that \(\varvec{X}_0=(X_1,\ldots ,X_{n_0})\) is a reference sample from some reference probability density function (pdf) g(x), and let G(x) denote the corresponding cumulative distribution function (CDF). Since we shall deal with radon data, we assume that \(x \in (0,\infty )\). The goal is to estimate a small tail probability

$$\begin{aligned} p=P(X > T)=1-G(T)=\int _{T}^{\infty }g(x){\mathrm{d}}x. \end{aligned}$$

Let \(\varvec{X}_1\) be a computer-generated random sample of size \(n_1\) and assume \(\varvec{X}_1 \sim g_1,G_1\). The augmentation

$$\begin{aligned} \varvec{t}= (t_1,\ldots ,t_{n_0 + n_1}) = (\varvec{X}_0,\varvec{X}_1), \end{aligned}$$
(2)

of size \(n_0 + n_1\) gives the fused data from \(\varvec{X}_0\) and \(\varvec{X}_1\). We shall assume the density ratio model [8, 9]

$$\begin{aligned} \frac{g_1(x)}{g(x)} = \exp (\alpha _1 + \varvec{\beta }_1^{\prime } \varvec{h}(x)) \end{aligned}$$
(3)

where \(\alpha _1\) is a scalar parameter, \(\varvec{\beta }_1\) is an \(r \times 1\) vector parameter, and \(\varvec{h}(x)\) is an \(r \times 1\) vector-valued function. Clearly, to generate \(\varvec{X}_1\), we must know the corresponding \(g_1\). However, beyond the generating process, we do not make use of this knowledge. Thus, by our estimation procedure, none of the probability densities \(g,g_1\) and the corresponding \(G,G_1\), and none of the parameters \(\alpha _1\) and \(\varvec{\beta }_1\) are assumed known, but, strictly speaking, the so-called tilt function \(\varvec{h}\) must be a known function. However, in the present application, the requirement of a known \(\varvec{h}\) is weakened considerably by the mild assumption (4) below, which may hold even for misspecified \(\varvec{h}\), as numerous examples with many different tail types show. Accordingly, based on numerous experiments, some of which are discussed in Pan et al. [1], we assume the “gamma tilt” \(h(x)=(x,\log x)\). Further justification for the gamma tilt is provided by our data analysis below.
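The name “gamma tilt” reflects the fact that the log ratio of any two gamma densities is exactly linear in \((x,\log x)\). This can be checked numerically (our illustration; the shape and scale values below are arbitrary):

```python
import numpy as np
from scipy.stats import gamma

# Two arbitrary gamma densities standing in for g and g_1.
# log(g1(x)/g(x)) = alpha + beta' h(x) with h(x) = (x, log x) holds exactly.
x = np.linspace(0.5, 20.0, 200)
log_ratio = gamma.logpdf(x, a=3, scale=2.0) - gamma.logpdf(x, a=2, scale=1.0)

# Regress the log ratio on (1, x, log x); the fit should be exact up to rounding.
H = np.column_stack([np.ones_like(x), x, np.log(x)])
coef, *_ = np.linalg.lstsq(H, log_ratio, rcond=None)
max_residual = float(np.max(np.abs(H @ coef - log_ratio)))
```

Here the analytic coefficients are \(\beta =(1/1-1/2,\ 3-2)=(0.5,\ 1)\), and `max_residual` is at the level of machine precision.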

Under the density ratio model (3), the maximum likelihood estimate of G(x) based on the fused data \(\varvec{t}=(\varvec{X}_0,\varvec{X}_1)\) is given in (14) in “Appendix A.1”, along with its asymptotic distribution described in Theorem A.1. From the theorem, we obtain confidence intervals for \(p=1-G(T)\) for any threshold T using (17). In particular, we get an upper bound \(B_1\) for p. In the same way, from additional independent computer-generated samples \(\varvec{X}_2,\varvec{X}_3,\ldots ,\varvec{X}_N\) we get upper bounds for p from the pairs \((\varvec{X}_0,\varvec{X}_2),(\varvec{X}_0,\varvec{X}_3),\ldots ,(\varvec{X}_0,\varvec{X}_N)\). Thus, conditional on \(\varvec{X}_0\), the sequence of upper bounds \(B_1,B_2,\ldots ,B_N\) is an independent and identically distributed sequence of random variables from some distribution \(F_{B}\). It is assumed that

$$\begin{aligned} 0< F_B(p) < 1 \end{aligned}$$
(4)

so that

$$\begin{aligned} P(B_1> p) = 1 - F_{B}(p) > 0. \end{aligned}$$

Let \(B_{(1)} \le B_{(2)} \le \cdots \le B_{(N)}\) be the corresponding order statistics, from smallest to largest. Then, as \(N \rightarrow \infty \), \(B_{(1)}\) decreases and \(B_{(N)}\) increases. Hence, as mentioned before, as the number of fusions N increases, the plot consisting of the pairs

$$\begin{aligned} (1,B_{(1)}),(2,B_{(2)}),\ldots ,(N,B_{(N)}) \end{aligned}$$
(5)

contains a point “\(\bullet \)” whose ordinate is p with probability approaching 1. It follows that as \(N \rightarrow \infty \), there is a \(B_{(j)}\) which essentially coincides with p. The plot of points consisting of the pairs \((j,B_{(j)})\) in (5) is referred to as the B-curve.

We now make the following important observations.

a. Assumption (4) implies that as N increases,

$$\begin{aligned} B_{(1)}< p < B_{(N)} \end{aligned}$$
(6)

with probability approaching one.

b. The point “\(\bullet \)” moves down the B-curve as \({\mathrm{max}}(\varvec{X}_0)\) approaches T, and moves up the B-curve as \({\mathrm{max}}(\varvec{X}_0)\) decreases away from T.

Hence, as N increases, the size of \({\mathrm{max}}(\varvec{X}_0)\) relative to T provides useful information as to the approximate magnitude of p. Specifically, the first quartile of \(B_1,B_2,\ldots ,B_N\) is a sensible guess of p when \({\mathrm{max}}(\varvec{X}_0)\) approaches T, and the third quartile, or even \({\mathrm{max}}(B)\), is a sensible approximation of p when \({\mathrm{max}}(\varvec{X}_0)\) is small. Otherwise, the mean or median of \(B_1,B_2,\ldots ,B_N\) provides a practical guess of the approximate magnitude of p.

c. Let \({\hat{F}}_{B}\) be the empirical distribution function obtained from the sequence of upper bounds \(B_1,B_2,\ldots ,B_N\). Then, by the Glivenko–Cantelli theorem, \({\hat{F}}_{B}\) converges to \(F_{B}\) uniformly, almost surely, as N increases. Since the number of fusions can be as large as we wish, \(F_B\) is known for all practical purposes; this is our key idea. Hence, as seen from (b), \(F_{B}\) provides information about p.
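The uniform convergence just invoked is easy to see in a small simulation (ours; exponential draws stand in for the bounds, and the sup-distance shrinks as the sample grows):

```python
import numpy as np

rng = np.random.default_rng(1)

def sup_ecdf_error(n):
    """Sup-distance between the ECDF of n Exp(1) draws and the true CDF.
    The supremum is attained at, or just before, a sample point, so it
    suffices to check the ECDF's value on both sides of each point."""
    x = np.sort(rng.exponential(size=n))
    true_cdf = 1.0 - np.exp(-x)
    above = np.abs(np.arange(1, n + 1) / n - true_cdf)  # ECDF at each point
    below = np.abs(np.arange(0, n) / n - true_cdf)      # ECDF just before it
    return float(max(above.max(), below.max()))

err_small, err_large = sup_ecdf_error(100), sup_ecdf_error(10_000)
```

By the Dvoretzky–Kiefer–Wolfowitz inequality, the sup-distance at \(N=10{,}000\) exceeds 0.05 with negligible probability, which is the sense in which \(F_B\) is “practically known.”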

Knowing \(F_B\) is a significant consequence of repeated out of sample fusion. Its implication is that the exact distribution of any \(B_{(j)}\) is practically known.

Capturing p

For a sufficiently large number of fusions N, the monotonicity of the B-curve and (6) imply there are \(B_{(j)}\) which approach p from above so that there is a \(B_{(j)}\) very close to p. Likewise, the \(B_{(j)}\) can approach p from below. Thus, the B-curve establishes a relationship between j and p.

Another relationship between j and p is obtained from the well-known distribution of order statistics,

$$\begin{aligned} P(B_{(j)} > p) = \sum _{k=0}^{j-1} \binom{N}{k} [F_B(p)]^k [1- F_B(p)]^{N-k} \end{aligned}$$
(7)

which can be computed since \(F_B\) is practically known for sufficiently large N. Iterating between these two relationships provides a way to approximate p as is described next.
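Since \(\{B_{(j)} > p\}\) is the event that at most \(j-1\) of the N bounds fall at or below p, the sum in (7) is a binomial CDF and can be evaluated directly. A minimal sketch, with the empirical \({\hat{F}}_B\) standing in for \(F_B\) and a function name of our own:

```python
import numpy as np
from scipy.stats import binom

def prob_order_stat_exceeds(j, p, B):
    """P(B_(j) > p) as in (7): the probability that at most j-1 of the
    N bounds fall at or below p, i.e., a binomial CDF with success
    probability F_B(p).  F_B is replaced by the empirical CDF of B."""
    B = np.asarray(B)
    F_B_hat = np.mean(B <= p)            # empirical F_B evaluated at p
    return float(binom.cdf(j - 1, B.size, F_B_hat))
```

For fixed p this probability increases with j, matching the monotone B-curve.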

From (7), we can get the smallest \(p_{j}\) such that

$$\begin{aligned} P(B_{(j)} > p_j)= \sum _{k=0}^{j-1} \binom{N}{k} [F_B(p_j)]^k [1- F_B(p_j)]^{N-k} \le 0.95. \end{aligned}$$
(8)

The 0.95 probability bound was chosen arbitrarily and can be replaced by other high probabilities.

It is important to note that in practice, and in what follows, the \(p_j\) in (8) are evaluated on a grid of specified small increments.

Thus, with the \(B_{(j_k)}\) from the B-curve, the \(p_{j_k}\) the smallest p’s satisfying (8) with \(j=j_k\), and \(B_{(j_{k+1})}\) the bound closest to \(p_{j_k}\), \(k=1,2,\ldots \), we have the iterative process [1, 10],

$$\begin{aligned} B_{(j_1)} \rightarrow p_{j_1} \rightarrow B_{(j_2)} \rightarrow \cdots \rightarrow B_{(j_k)} \rightarrow p_{j_k} \rightarrow B_{(j_{k+1})} \rightarrow p_{j_k} \rightarrow B_{(j_{k+1})} \rightarrow p_{j_k} \cdots \end{aligned}$$

so that \(p_{j_k}\) keeps giving the same \(B_{(j_{k+1})}\) (and hence the same \(j_{k+1}\)) and vice versa. This can be expressed more succinctly as,

$$\begin{aligned} j_1 \rightarrow p_{j_1} \rightarrow j_2 \rightarrow p_{j_2} \rightarrow \cdots \rightarrow j_k \rightarrow p_{j_k} \rightarrow j_{k+1} \rightarrow p_{j_k} \rightarrow j_{k+1} \rightarrow p_{j_k} \cdots \end{aligned}$$
(9)

In general, starting with any j, convergence occurs when for the first time \(B_{(j_k)} = B_{(j_{k+1})}\) for some k and we keep getting the same probability \(p_{j_k}\).
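A minimal sketch of iteration (9), with the empirical CDF again standing in for \(F_B\); the function name, the grid construction, and the stopping rule coded below are our reading of the description above, not the authors' code:

```python
import numpy as np
from scipy.stats import binom

def iterate_capture(B, j_start, p_increment, level=0.95, max_iter=100):
    """Iterate between the B-curve and (8): from index j, find the smallest
    grid point p_j with P(B_(j) > p_j) <= level, then jump to the index
    whose ordinate B_(j') is closest to p_j; stop when j stops changing."""
    B_sorted = np.sort(np.asarray(B))
    N = B_sorted.size
    grid = np.arange(p_increment, B_sorted[-1] + p_increment, p_increment)
    F_hat = np.searchsorted(B_sorted, grid, side="right") / N  # empirical CDF on the grid
    j, p_j = j_start, None
    for _ in range(max_iter):
        ok = np.nonzero(binom.cdf(j - 1, N, F_hat) <= level)[0]
        if ok.size == 0:                     # no grid point satisfies (8)
            return j, p_j
        p_j = float(grid[ok[0]])             # smallest p_j satisfying (8)
        j_next = int(np.argmin(np.abs(B_sorted - p_j))) + 1  # 1-based index
        if j_next == j:                      # converged: same j gives same p_j
            return j, p_j
        j = j_next
    return j, p_j
```

Running this from several adjacent starting indices and watching for the transition from “Down” to “Up” mirrors the behavior in the illustrations above.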

Clearly, the \(p_{j_k}\) sequence could decrease or increase producing “down” and “up” subsequences. For example, suppose that the probabilities

$$\begin{aligned} P(B_{(j_1)}> p_{j_1}), \ P(B_{(j_2)} > p_{j_2}), \ldots \end{aligned}$$

are sufficiently high probabilities, and that from the B-curve we get the closest approximations

$$\begin{aligned} p_{j_1} \doteq B_{(j_2)}, \ p_{j_2} \doteq B_{(j_3)}, \ldots \end{aligned}$$

Then, with a high probability we get a decreasing “down” sequence

$$\begin{aligned} B_{(j_1)} \ge B_{(j_2)} \ge B_{(j_3)} \cdots . \end{aligned}$$

However, when the probabilities are sufficiently low it is possible for the closest \(B_{(j)}\) approximations of the \(p_j\) to reverse course leading to an increasing “up” sequence

$$\begin{aligned} B_{(j^{\prime }_1)} \le B_{(j^{\prime }_2)} \le B_{(j^{\prime }_3)} \cdots . \end{aligned}$$

This “down-up” tendency has been observed numerous times with real and artificial data. It manifests itself clearly in the radon examples below.

In particular, as was illustrated earlier in Sect. 1.3, this “down-up” phenomenon tends to occur in a neighborhood of the true p, where a transition or shift occurs from “down” to “up” or vice versa, resulting in a “capture” of p. This is summarized in the following proposition.

Proposition

Assume that the sample size \(n_0\) of \(\varvec{X}_0\) is large enough, and that the number of fusions N is sufficiently large so that \(B_{(1)}< p < B_{(N)}\). Consider the smallest \(p_j \in (0,1)\) that satisfies the inequality (8), where the \(p_j\) are evaluated along appropriate numerical increments. Then, (8) produces “down” and “up” sequences depending on the \(B_{(j)}\) relative to \(p_j\). In particular, in a neighborhood of the true tail probability p, with high probability, there are “down” sequences which converge from above and “up” sequences which converge from below to points close to p.

Illustrations Using Radon Data

We shall now demonstrate the proposition using radon data examples. Many additional examples were given in Pan et al. [1]. All the examples point to remarkable “down-up” patterns in a neighborhood of the true p, providing surprisingly precise estimates of p. It should be noted that the number of iterations decreases as the \(B_{(j)}\) approach p, a telltale sign that convergence is about to occur.

The iterative process (9) is repeated with different starting j’s until a clear pattern emerges in which different adjacent j’s give rise to Down-Up subsequences converging to the same value, which is our estimate \({\hat{p}}\). The process may be repeated with different p-increments.

Computational Considerations

To enable computation with R, the binomial coefficients in (8) were evaluated with \(N=1000\), as if there were only 1000 fusions. However, there are no restrictions on the number of fusions, and \(F_B\) was obtained throughout from 10,000 fusions, and hence 10,000 B’s.

Each entry in the following tables was obtained from a different set of 1000 B’s sampled at random from the 10,000 B’s. More precisely, for each entry we iterated between an approximate B-curve, constructed from the sampled 1000 B’s, and an approximate (8) with \(N=1000\).

Choice of p-Increment

An important consideration is the choice of the increments of p along which the probability (8) is evaluated. Certainly, any approximation of p must reside between consecutive B’s. Hence, sensible p-increments are fractions of the mean, median, first or third quartile, or even fractions of \({\mathrm{max}}(B)=B_{(10,000)}\). In the following examples, the p-increment is approximately one tenth of one of these quantities.
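The guidance above can be collected into a small helper (our naming; which candidate to use depends on where the point “\(\bullet \)” sits on the B-curve, as the examples below illustrate):

```python
import numpy as np

def candidate_increments(B):
    """Candidate p-increments: one tenth of the first quartile, median,
    mean, third quartile, or maximum of the fusion bounds B."""
    B = np.asarray(B)
    q1, med, q3 = np.percentile(B, [25, 50, 75])
    return {"Q1/10": q1 / 10, "median/10": med / 10, "mean/10": B.mean() / 10,
            "Q3/10": q3 / 10, "max/10": B.max() / 10}
```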

Beaver County Radon Tail Probabilities

Radon-222, or simply radon, is a tasteless, colorless and odorless radioactive gas, a decay product of Radium-226 in the Uranium-238 chain, both of which are naturally abundant in the soil. Radon is a known carcinogen, and exposure to it is the leading risk factor for lung cancer among non-smokers. Geological radon exposure takes place mostly through cracks and openings in the ground due to underlying geological formations. Approximately 40 percent of Pennsylvania (PA) homes have radon levels above the US EPA action guideline of 4 pCi/L. Residential radon test levels were collected statewide by the PA Department of Environmental Protection (PADEP) in the period from 1990 to 2007. See Zhang et al. [11] for a study of indoor radon concentrations from Beaver County and its neighboring counties in PA.

In the following examples, ROSF is applied to Beaver County radon data from 1989 to 2017, for various p-increments. There were 7425 radon observations, taken as a population, of which only 2 exceed 200. Hence, with \(T=200\) we wish to estimate the small probability \(p=2/7425=0.0002693603\). Throughout the examples, \(\varvec{X}_0\) is a reference random sample chosen without replacement from the 7425 radon observations. The generated \(\varvec{X}_1\) samples are from \({\mathrm{Unif}}(0,300)\) and \(n_0=n_1=500.\)
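The setup of these examples can be sketched as follows. The array `radon` below is a synthetic stand-in for the 7425 PADEP readings (the actual data are not reproduced here), constructed only so that exactly 2 values exceed \(T=200\):

```python
import numpy as np

rng = np.random.default_rng(2021)
T = 200.0

# Synthetic stand-in for the 7425 Beaver County radon readings,
# of which exactly 2 exceed T, as in the real data.
radon = np.concatenate([rng.uniform(0.0, 199.0, size=7423), [220.0, 250.0]])
p = float(np.mean(radon > T))                    # 2/7425 = 0.0002693603

X0 = rng.choice(radon, size=500, replace=False)  # reference sample, drawn without replacement
X1 = rng.uniform(0.0, 300.0, size=500)           # one computer-generated fusion sample
fused = np.concatenate([X0, X1])                 # the augmentation (X0, X1) of (2)
```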

In the tables below, “Down”, “Up”, “No j change”, means that in the iterative process (9) there was a downward, or upward, or no change in j, respectively.

Fig. 1

B-Curves, 10,000 B’s, from residential radon sample \(\varvec{X}_0\). \(p=0.0002693603\), \(\varvec{X}_1 \sim {\mathrm{Unif}}(0,300)\), \(T=200\), \(n_0=n_1=500\), \(h=(x,\log x)\). \({\mathrm{max}}(\varvec{X}_0)\) values: top left 77.9, top right 107. Bottom left 143, bottom right 193.7. The point “\(\bullet \)” moves to the left as \({\mathrm{max}}(\varvec{X}_0)\) increases relative to \(T=200\). The fusion samples are uniform with support covering T

Figure 1 shows how the “\(\bullet \)” moves along the B-curve as a function of the size of \({\mathrm{max}}(\varvec{X}_0)\) relative to T. The figure should be referred to when reading the following examples.

Example 1

\({\mathrm{max}}(\varvec{X}_0)=107\)

Since 107 is close to T/2, the “\(\bullet \)” point is in the “middle” of the B-curve, far removed from both ends. Hence, we use Median/10 \(\approx 0.000018\) as the p-increment. We observed that the third quartile was 0.0002686, very close to the true p.

From Table 1, the shift from Down to Up occurs at \({\hat{p}}=0.0002689389\), very close to the true \(p=0.0002693603\), giving an error on the order of \(10^{-7}\).

Table 1 \(\mathbf{p=0.0002693603}\), \(\varvec{X}_1 \sim {\mathrm{Unif}}(0,300)\), \({\mathrm{max}}(\varvec{X}_0)=107\), \(T=200\), \(n_0=n_1=500\), \(h=(x,\log x)\), p-increment 0.000018

Example 2

\({\mathrm{max}}(\varvec{X}_0)=123.1\)

A different reference sample \(\varvec{X}_0\) was fused again 10,000 times with different independent samples \(\varvec{X}_1 \sim {\mathrm{Unif}}(0,300)\). Since \({\mathrm{max}}(\varvec{X}_0)=123.1\), we again have, relative to \(T=200\), a “middle” “\(\bullet \)” point, suggesting a p-increment of one tenth of the mean of the B’s. As the mean was of order \(10^{-4}\), we chose a p-increment of 0.00002, which is of the same order as Mean(B)/10.

From Table 2, the shift from Down to Up occurs at \({\hat{p}}=0.0002601254\) not far from \(p=0.0002693603\), giving an error on the order of \(10^{-5}\).

Table 2 \(\mathbf{p=0.0002693603}\), \(\varvec{X}_1 \sim {\mathrm{Unif}}(0,300)\), \({\mathrm{max}}(\varvec{X}_0)=123.1\), \(T=200\), \(n_0=n_1=500\), \(h=(x,\log x)\), p-increment 0.00002

Example 3

\({\mathrm{max}}(\varvec{X}_0)=193.7\)

A different reference sample \(\varvec{X}_0\) was fused again 10,000 times with different independent samples \(\varvec{X}_1 \sim {\mathrm{Unif}}(0,300)\). Since \({\mathrm{max}}(\varvec{X}_0)=193.7\), we have, relative to \(T=200\), a “\(\bullet \)” point close to the lower end of the B-curve, suggesting a p-increment on the order of one tenth of the first quartile of the 10,000 B’s. As the first quartile was 0.0002697, we chose a p-increment of 0.00001. A p-increment of 0.00002 gave identical results. We observe that the first quartile is very close to p.

From Table 3, the shift from Down to Up occurs at \({\hat{p}}=0.0002600818\) not far from \(p=0.0002693603\), giving an error on the order of \(10^{-5}\).

Table 3 \(\mathbf{p=0.0002693603}\), \(\varvec{X}_1 \sim {\mathrm{Unif}}(0,300)\), \({\mathrm{max}}(\varvec{X}_0)=193.7\), \(T=200\), \(n_0=n_1=500\), \(h=(x,\log x)\), p-increment 0.00001

Example 4

\({\mathrm{max}}(\varvec{X}_0)=77.9\)

A different reference sample \(\varvec{X}_0\) was fused again 10,000 times with different independent samples \(\varvec{X}_1 \sim {\mathrm{Unif}}(0,300)\). Since \({\mathrm{max}}(\varvec{X}_0)=77.9\), we have, relative to \(T=200\), a “\(\bullet \)” point close to the upper end of the B-curve, a difficult case, suggesting a p-increment on the order of one tenth of \({\mathrm{max}}(B)\) from the 10,000 B’s. As \({\mathrm{max}}(B)=0.0004583\), we chose a p-increment of 0.00004583.

From Table 4, the shift from Down to Up occurs at \({\hat{p}}=0.0002286204\) not far from \(p=0.0002693603\), giving an error on the order of \(10^{-5}\).

Table 4 \(\mathbf{p=0.0002693603}\), \(\varvec{X}_1 \sim {\mathrm{Unif}}(0,300)\), \({\mathrm{max}}(\varvec{X}_0)=77.9\), \(T=200\), \(n_0=n_1=500\), \(h=(x,\log x)\), p-increment 0.00004583

Summary of ROSF Applied to Beaver Radon Data

Table 5 provides our estimates of \(p=0.0002693603\) from various random radon samples \(\varvec{X}_0\) of size \(n_0=500\) fused repeatedly with independent \(\varvec{X}_1 \sim {\mathrm{Unif}}(0,300)\) of size \(n_1=500\). In all cases, \(h(x)=(x, \log x)\). Some of the \(\varvec{X}_0\) samples are the same but the p-increments are different, still leading to similar results. The mean and standard deviation of the \({\hat{p}}\) in the table are \(\bar{{\hat{p}}}=0.0002606333\) and \(1.052197 \times 10^{-5}\), respectively. In general, variance estimates can be obtained by repeating ROSF again and again using different B-curves and different p-increments. Evidently, \(h(x)=(x, \log x)\) is a reasonable choice, as the present radon analysis and many other examples with very diverse tail types indicate.

Table 5 \(\mathbf{p=0.0002693603}\), \(\varvec{X}_1 \sim {\mathrm{Unif}}(0,300)\), \(T=200\), \(n_0=n_1=500\), \(h=(x,\log x)\)

Discussion

There are numerous situations where the interest is in the prediction of an observable exceeding a large or even a catastrophically large threshold level, while the data at hand fall short of the threshold. For example, consider the daily rainfall amount in a region where all the diurnal amounts fall short of a high threshold level, say, 10 inches in 24 hours, and yet for risk management it is important to obtain the chance that a future amount exceeds 10 inches in 24 hours, an extreme situation by all accounts. Similar problems concern annual flood levels, daily coronavirus counts, monthly insurance claims, earthquake magnitudes, and so on, where the sample values are below certain high thresholds and the interest is in very small tail probabilities. Furthermore, in many cases, the given data could be only moderately large.

In this paper, it has been shown how to approach such problems by a large number of augmentations or fusions of the given data with computer-generated external samples. From this we obtained a curve, called B-curve, containing a point whose ordinate was close to the tail probability of interest. Moreover, the magnitude of the largest sample value relative to a given high threshold provided rough guesses as to the true value of the tail probability. The rough guesses were needed for successful applications of our iterative method which produced accurate estimates of tail probabilities.

Indeed, as illustrated in the paper, \({\mathrm{max}}(\varvec{X}_0)\) relative to T provides useful information about the true tail probability p represented as “the point,” and this fact can be interpreted in terms of under-specification and over-specification of the tail probability under the density ratio model. This clearly is a consequence of the fact that \(F_B\) provides information about p.

The large number of fusions resulted in a large number of upper bounds \(B_1,\ldots ,B_N\), for a tail probability p, from some unknown CDF \(F_B(x)\) where it was assumed that \(0< F_B(p) < 1\). The examples in this paper and many more in Pan et al. [1] indicate that the choice of the (mostly misspecified) tilt function \(h(x)=(x,\log x)\) in the density ratio model did not go against that assumption. Clearly, other tilts are possible as long as \(F_B(p)\) is bounded away from 0 and 1.

The estimation of very small tail probabilities can also be approached by extreme value methods. A well-known method is referred to as peaks-over-threshold [12, 13], where, as the name suggests, only values above a sufficiently high threshold are used. However, if the sample is not large to begin with, discarding the values deemed not sufficiently large reduces the sample size further and calls into question the applicability of the method. A comparison with ROSF is given in Wang [10] and in Pan et al. [1].

The estimation of tail probabilities from fused residential radon data has been studied recently in Zhang et al. [11, 14] by using the density ratio model with variable tilts. There a given radon sample from a county of interest was fused with radon samples from neighboring counties.

References

1. Kedem B, Pan L, Smith P, Wang C (2019) Estimation of small tail probabilities by repeated fusion. Math Stat 7:172–181

2. Kedem B, Pan L, Zhou W, Coelho CA (2016) Interval estimation of small tail probabilities—application in food safety. Stat Med 35:3229–3240

3. Fithian W, Wager S (2015) Semiparametric exponential families for heavy-tailed data. Biometrika 102:486–493

4. Fokianos K, Qin J (2008) A note on Monte Carlo maximization by the density ratio model. J Stat Theory Pract 2:355–367

5. Katzoff M, Zhou W, Khan D, Lu G, Kedem B (2014) Out of sample fusion in risk prediction. J Stat Theory Pract 8:444–459

6. Zhou W (2013) Out of sample fusion. Ph.D. thesis, University of Maryland

7. George AC (2015) The history, development and the present status of the radon measurement programme in the United States of America. Radiat Prot Dosim 167:8–14

8. Lu G (2007) Asymptotic theory for multiple-sample semiparametric density ratio models and its application to mortality forecasting. Ph.D. thesis, University of Maryland

9. Qin J, Zhang B (1997) A goodness of fit test for logistic regression models based on case-control data. Biometrika 84:609–618

10. Wang C (2018) Data fusion based on the density ratio model. Ph.D. thesis, University of Maryland

11. Zhang X, Pyne S, Kedem B (2020) Estimation of residential radon concentration in Pennsylvania counties by data fusion. Appl Stoch Models Bus Ind. https://doi.org/10.1002/asmb.2546

12. Beirlant J, Goegebeur Y, Teugels JL, Segers J (2004) Statistics of extremes: theory and applications. Wiley, Hoboken

13. Ferreira A, De Haan L (2015) On the block maxima method in extreme value theory: PWM estimators. Ann Stat 43:276–298

14. Zhang X, Pyne S, Kedem B (2020) Model selection in radon data fusion. In: Statistics in transition, new series, special issue, pp 167–174

15. Owen A (2001) Empirical likelihood. Chapman & Hall/CRC, Boca Raton

16. Zhang B (2000) A goodness of fit test for multiplicative-intercept risk models based on case-control data. Stat Sin 10:839–865


Acknowledgements

Research supported by a Faculty-Student Research Award, University of Maryland, College Park. The authors are grateful to the referees for their encouragement.

Author information

Corresponding author

Correspondence to Saumyadipta Pyne.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Celebrating the Centenary of Professor C. R. Rao” guest edited by Ravi Khattree, Sreenivasa Rao Jammalamadaka, and M. B. Rao.

A Appendix

The appendix addresses the density ratio model (11) for \(m+1\) data sources, that is, the general case where \(\varvec{X}_0\) is fused with m computer-generated samples. The main text dealt with the special case \(m=1\).

Assume that the reference random sample \(\varvec{X}_0\) of size \(n_0\) follows an unknown reference distribution with probability density g, and let G be the corresponding cumulative distribution function (cdf).

Let

$$\begin{aligned} \varvec{X}_1,\ldots ,\varvec{X}_m, \end{aligned}$$

be additional computer-generated random samples, where \(\varvec{X}_j\) of size \(n_j\) has density \(g_j\) and cdf \(G_j\), \(j=1,\ldots ,m\). The augmentation of the \(m+1\) samples,

$$\begin{aligned} \varvec{t}= (t_1,\ldots ,t_n) = (\varvec{X}_0,\varvec{X}_1,\ldots , \varvec{X}_m), \end{aligned}$$
(10)

of size \(n_0 + n_1 + \cdots + n_m\) gives the fused data. The density ratio model stipulates that

$$\begin{aligned} \frac{g_j(x)}{g(x)} = \exp (\alpha _j + \varvec{\beta }_j^{\prime } \varvec{h}(x)), \quad j=1,\ldots ,m, \end{aligned}$$
(11)

where \(\varvec{\beta }_j\) is an \(r \times 1\) parameter vector, \(\alpha _j\) is a scalar parameter, and \(\varvec{h}(x)\) is an \(r \times 1\) vector-valued distortion or tilt function. None of the densities \(g,g_1,\ldots ,g_m\), the corresponding cdfs \(G,G_1,\ldots ,G_m\), or the parameters \(\alpha _j\) and \(\varvec{\beta }_j\) is assumed known; strictly speaking, however, the tilt function \(\varvec{h}\) must be a known function.
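As a quick illustration of (11): if the reference density g is N(0, 1) and a computer-generated sample follows \(g_1 = N(\mu ,\sigma ^2)\), the log density ratio is a quadratic in x, so the model holds exactly with \(\varvec{h}(x) = (x, x^2)^{\prime }\). A minimal numerical check of this identity (a sketch; the normal example and all variable names are our illustrative assumptions, not part of the paper's computations):

```python
import numpy as np
from scipy.stats import norm

# Reference g = N(0,1); tilted density g1 = N(mu, sigma^2).
mu, sigma = 1.0, 1.5
x = np.linspace(-3.0, 3.0, 7)

# log(g1(x)/g(x)) = alpha + beta1*x + beta2*x^2, i.e. h(x) = (x, x^2)'
alpha = -np.log(sigma) - mu**2 / (2 * sigma**2)
beta1 = mu / sigma**2
beta2 = 0.5 * (1.0 - 1.0 / sigma**2)

lhs = norm.logpdf(x, mu, sigma) - norm.logpdf(x)  # exact log ratio
rhs = alpha + beta1 * x + beta2 * x**2            # tilt representation
assert np.allclose(lhs, rhs)
```

With unequal variances the quadratic term is needed; when \(\sigma =1\) the tilt reduces to \(h(x)=x\).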

A.1 Asymptotic Distribution of \({\hat{G}}(x)\)

Define \(\alpha _0 \equiv 0\), \(\varvec{\beta }_0 \equiv \varvec{0}\), \(w_j(x) = \exp (\alpha _j + \varvec{\beta }_j^{\prime }\varvec{h}(x))\), and \(\rho _j=n_j/n_0\), for \(j = 0, 1, \ldots , m\), so that \(w_0 \equiv 1\) and \(\rho _0 = 1\).

Maximum likelihood estimates for all the parameters and G(x) can be obtained by maximizing the empirical likelihood over the class of step cumulative distribution functions with jumps at the observed values \(t_1,\ldots ,t_n\) [15].

Let \(p_i = {\mathrm{d}}G(t_i)\) be the mass at \(t_i\), for \(i=1,\ldots ,n\). Then, the empirical likelihood becomes

$$\begin{aligned} {\mathcal {L}}(\varvec{\theta },G) = \prod _{i=1}^n p_i \prod _{j=1}^{n_1}\exp (\alpha _1+\beta _1^{\prime }h(x_{1j}))\cdots \prod _{j=1}^{n_m}\exp (\alpha _m+\beta _m^{\prime }h(x_{mj})). \end{aligned}$$
(12)

Maximizing \({\mathcal {L}}(\varvec{\theta },G)\) subject to the constraints

$$\begin{aligned} \sum _{i=1}^np_i = 1, \; \sum _{i=1}^np_i [w_1(t_i)-1] = 0, \ldots , \sum _{i=1}^np_i [w_m(t_i)-1] = 0 \end{aligned}$$
(13)

we obtain the desired estimates. In particular,

$$\begin{aligned} {\hat{G}}(t) = \frac{1}{n_0} \cdot \sum _{i=1}^n \frac{I(t_i \le t)}{1 + \rho _1 \exp ({\hat{\alpha }}_1 + {\hat{\beta }}_1^{\prime } h(t_i)) + \cdots + \rho _m \exp ({\hat{\alpha }}_m + {\hat{\beta }}_m ^{\prime } h(t_i))}, \end{aligned}$$
(14)

where \(I(t_i\le t)\) equals one for \(t_i\le t\) and zero otherwise. Similarly, \(G_j\) is estimated by \({\hat{G}}_j(t) = \sum _{i=1}^{n} \exp ({\hat{\alpha }}_j + {\hat{\beta }}_j ^{\prime } \varvec{h}(t_i))\, {\hat{p}}_i\, I(t_i \le t)\), where \({\hat{p}}_i = {\mathrm{d}}{\hat{G}}(t_i)\) is the estimated mass at \(t_i\).
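To make the estimation step concrete for \(m=1\): the constrained maximization of (12)–(13) reduces to maximizing a profile log-likelihood in \((\alpha _1, \varvec{\beta }_1)\) (Qin and Zhang [9]), after which (14) yields \({\hat{G}}\). The following sketch, with two simulated normal samples, tilt \(h(x)=x\), and all variable names our illustrative assumptions, is not the authors' code:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n0, n1 = 500, 500
samp0 = rng.normal(0.0, 1.0, n0)     # reference sample ~ g = N(0,1)
samp1 = rng.normal(0.5, 1.0, n1)     # synthetic sample ~ g1 = N(0.5,1)
t = np.concatenate([samp0, samp1])   # fused data, as in (10)
rho = n1 / n0

# Equal variances => the exact tilt is h(x) = x (true beta = 0.5).
# Profile empirical log-likelihood in (alpha, beta), up to a constant,
# obtained from (12) under the constraints (13).
def neg_profile_loglik(theta):
    a, b = theta
    return (np.sum(np.log1p(rho * np.exp(a + b * t)))
            - np.sum(a + b * samp1))

res = minimize(neg_profile_loglik, x0=[0.0, 0.0])
a_hat, b_hat = res.x

# Estimated jumps p_i and the cdf estimate (14) with m = 1
p = 1.0 / (n0 * (1.0 + rho * np.exp(a_hat + b_hat * t)))

def G_hat(s):
    return float(np.sum(p * (t <= s)))
```

At the optimum the jumps satisfy \(\sum _i {\hat{p}}_i = 1\), and with these simulated data \({\hat{G}}(0)\) should be close to the true value \(G(0)=0.5\).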

The asymptotic properties of the estimators have been studied by a number of authors including Qin and Zhang [9], Lu [8], and Zhang [16].

Define the following quantities: \(\varvec{\rho }= \text{ diag }\{\rho _1, \ldots , \rho _m\}\),

$$\begin{aligned} A_j(t)&= \int \frac{w_j(y)I(y \le t)}{\sum _{k=0}^m \rho _k w_k(y)} {\mathrm{d}}G(y), \quad B_j(t) = \int \frac{w_j(y) \varvec{h}(y)I(y \le t)}{\sum _{k=0}^m \rho _k w_k(y)} {\mathrm{d}}G(y),\\ {\bar{A}}(t)&= (A_1(t), \ldots , A_m(t))^{\prime }, \quad {\bar{B}}(t) = (B^{\prime }_1(t), \ldots , B^{\prime }_m(t))^{\prime }. \end{aligned}$$

Then, the asymptotic distribution of \({\hat{G}}(t)\) for \(m \ge 1\) is given by the following result due to Lu [8].

Theorem A.1

[8] Assume that the sample size ratios \(\rho _j = n_j/n_0\), \(j=1,\ldots ,m\), are positive and finite and remain fixed as the total sample size \(n = \sum _{j=0}^m n_j \rightarrow \infty \). Then the process \(\sqrt{n}({\hat{G}}(t)- G(t))\) converges to a zero-mean Gaussian process in the space of right-continuous real functions with left limits, with covariance function given by

$$\begin{aligned}&\text{ Cov } \{ \sqrt{n}({\hat{G}}(t)- G(t)), \sqrt{n}({\hat{G}}(s)- G(s)) \}\nonumber \\&\quad = \left( \sum _{k=0}^m \rho _k \right) \biggl ( G(t\wedge s) -G(t)G(s) - \sum _{j=1}^m \rho _j A_j(t\wedge s)\biggr ) \nonumber \\&\qquad + \biggl ({\bar{A}}^{\prime }(s)\varvec{\rho }, {\bar{B}}^{\prime }(s)(\varvec{\rho }\otimes I_p) \biggr ) S^{-1} \left( \begin{array}{c}\varvec{\rho }{\bar{A}}(t) \\ (\varvec{\rho }\otimes I_p) {\bar{B}}(t) \end{array} \right) . \end{aligned}$$
(15)

where \(I_p\) is the \(p \times p\) identity matrix, with p the dimension of \(\varvec{h}\) (denoted r in (11)), S is the information-type matrix defined in Lu [8], and \(\otimes \) denotes the Kronecker product.

For a complete proof, see Lu [8]. The proof for \(m=1\) is given in Zhang [16].

Denote by \({\hat{V}}(t)\) the estimated variance of \({\hat{G}}(t)\), obtained from (15) with \(s=t\), with parameters replaced by their estimates, and divided by n. Then a \(1-\alpha \) level pointwise confidence interval for G(t) is approximated by

$$\begin{aligned} \left( {\hat{G}}(t) - z_{\alpha /2}\sqrt{{\hat{V}}(t)}, \;\; {\hat{G}}(t) + z_{\alpha /2} \sqrt{{\hat{V}}(t)} \right) , \end{aligned}$$
(16)

where \(z_{\alpha /2}\) is the upper \(\alpha /2\) point of the standard normal distribution. Hence, a \(1-\alpha \) level pointwise confidence interval for \(1-G(t)\), and in particular for the tail probability beyond a relatively large threshold t, is approximated by

$$\begin{aligned} \bigg (1-{\hat{G}}(t)-z_{\alpha /2}\sqrt{{\hat{V}}(t)}, 1-{\hat{G}}(t)+z_{\alpha /2}\sqrt{{\hat{V}}(t)}\bigg ). \end{aligned}$$
(17)
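Given \({\hat{G}}(t)\) and \({\hat{V}}(t)\), intervals (16) and (17) are immediate; a small sketch (the function name and the numerical inputs are hypothetical illustrations):

```python
from scipy.stats import norm

def tail_ci(G_hat_t, V_hat_t, alpha=0.05):
    """Pointwise (1-alpha) intervals (16) for G(t) and (17) for 1-G(t)."""
    z = norm.ppf(1.0 - alpha / 2.0)   # upper alpha/2 normal point
    half = z * V_hat_t**0.5
    ci_G = (G_hat_t - half, G_hat_t + half)
    ci_tail = (1.0 - G_hat_t - half, 1.0 - G_hat_t + half)
    return ci_G, ci_tail

# e.g. G_hat(t) = 0.995 with V_hat(t) = 1e-6 brackets a tail
# probability of about 0.005
ci_G, ci_tail = tail_ci(0.995, 1e-6)
```

The two intervals have the same half-width; (17) simply reflects (16) about \(1/2\).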


Cite this article

Kedem, B., Pyne, S. Estimation of Tail Probabilities by Repeated Augmented Reality. J Stat Theory Pract 15, 25 (2021). https://doi.org/10.1007/s42519-020-00152-1


Keywords

  • Repeated out of sample fusion
  • Density ratio model
  • Residential radon
  • Upper bounds
  • Iterative process
  • B-curve

Mathematics Subject Classification

  • Primary 62F40
  • Secondary 62F25