6.1 Introduction to Four Bootstrap Papers

6.1.1 Introduction and Summary

In this short article we discuss four of Peter Bickel’s seminal papers on theory and methodology for the bootstrap. We address the context of the work as well as its contributions and influence. The work began at the dawn of research on Efron’s bootstrap. In fact, Bickel and his co-authors were often the first to lay down the directions that others would follow when attempting to discover the strengths, and occasional weaknesses, of bootstrap methods.

Peter Bickel made major contributions to the development of bootstrap methods, particularly by delineating the range of circumstances where the bootstrap is effective. That topic is addressed in the first, second and fourth papers treated here. Looking back over this work, much of it done 25–30 years ago, it quickly becomes clear just how effectively these papers defined the most appropriate directions for future research.

We shall discuss the papers in chronological order, and pay particular attention to the contributions made by Bickel and Freedman (1981), since this was the first article to demonstrate the effectiveness of bootstrap methods in many cases, as well as to raise concerns about them in other situations. The results that we shall introduce in Sect. 6.1.2, when considering the work of Bickel and Freedman (1981), will be used frequently in later sections, especially Sect. 6.1.5.

The paper by Bickel and Freedman (1984), which we shall discuss in Sect. 6.1.3, pointed to challenges experienced by the bootstrap in the context of stratified sampling. This is ironic, not least because some of the earliest developments of what, today, are called bootstrap methods, involved sampling problems; see, for example, Jones (1956), Shiue (1960), Gurney (1963) and McCarthy (1966, 1969).

Section 6.1.4 will treat the work of Bickel and Yahav (1988), which contributed very significantly to methodology for efficient simulation, at a time when the interest in this area was particularly high. Bickel et al. (1997), which we shall discuss in Sect. 6.1.5, developed deep and widely applicable theory for the m-out-of-n bootstrap. The authors showed that their approach overcame consistency problems inherent in the conventional n-out-of-n bootstrap, and gave rates of convergence applicable to a large class of problems.

6.1.2 Laying Foundations for the Bootstrap

Thirty years ago, when Efron’s (1979) bootstrap method was in its infancy, there was considerable interest in the extent to which it successfully accomplished its goal of estimating parameters, variances, distributions etc. As Bickel and Freedman (1981) noted, Efron’s paper “gives a series of examples in which [the bootstrap] principle works, and establishes the validity of the approach for a general class of statistics when the sample space is finite.” Bickel and Freedman (1981) set out to assess the bootstrap’s success in a much broader setting than this.

In the early 1980s, saying that the bootstrap “works” meant that bootstrap methods gave consistent estimators, and in this sense were competitive with more conventional methods, for example those based on asymptotic analysis. Within about 5 years the goals had changed; it had been established that bootstrap methods “work” in a very wide variety of circumstances, and, although there were counterexamples to this general rule, by the mid 1980s the task had become largely one of comparing the effectiveness of the bootstrap relative to more conventional techniques. But in 1981 the extent to which the bootstrap was consistent was still largely unknown. Bickel and Freedman (1981) contributed mightily to the process of discovery there.

In particular, Bickel and Freedman (1981) were the first to establish rigorously that bootstrap methodology is consistent in a wide range of settings. The impact of their paper was dramatic. It provided motivation for exploring the bootstrap more deeply in a great many settings, and furnished some of the mathematical tools for that development. In the same year, in fact in the preceding paper in the Annals, Singh (1981) explored second-order properties of the bootstrap. However, Bickel and Freedman (1980) also took up that challenge at a particularly early stage.

As a prelude to describing the results of Bickel and Freedman (1981) we give some notation. Let \({\chi }_{n} =\{ {X}_{1},\ldots ,{X}_{n}\}\) denote a sample of n independent observations from a given univariate distribution with finite variance \({\sigma }^{2}\), write \(\bar{{X}}_{n} = {n}^{-1}\,{ \sum \nolimits }_{i}\,{X}_{i}\) for the sample mean, and define

$${\hat{\sigma }}_{n}^{2} ={ 1 \over n} \,\sum \limits_{i=1}^{n}\,{({X}_{ i} -\bar{ {X}}_{n})}^{2}\,,$$

the bootstrap estimator of \({\sigma }^{2}\). Let \({\chi }_{m}^{{_\ast}} =\{ {X}_{1}^{{_\ast}},\ldots ,{X}_{m}^{{_\ast}}\}\) denote a resample of size m drawn by sampling randomly, with replacement, from \({\chi }_{n}\), and put \(\bar{{X}}_{m}^{{_\ast}} = {m}^{-1}\,\sum \limits _{i\leq m}\,{X}_{i}^{{_\ast}}\). Bickel and Freedman's (1981) first result was that the m-resample bootstrap version of \({\hat{\sigma }}_{n}^{2}\), i.e.

$${\hat{\sigma }}_{m}^{{_\ast}2} ={ 1 \over m} \,\sum \limits _{i=1}^{m}\,{({X}_{ i}^{{_\ast}}-\bar{ {X}}_{ m}^{{_\ast}})}^{2}\,,$$

converges to \({\sigma }^{2}\) as both m and n increase, in the sense that, for each ε > 0,

$$P(\vert {\hat{\sigma }}_{m}^{{_\ast}}- \sigma \vert > \epsilon \,\vert \,{\chi }_{ n}) \rightarrow 0$$
(6.1)

with probability 1. Moreover, Bickel and Freedman (1981) showed that the conditional distribution of \({m}^{1/2}\,(\bar{{X}}_{m}^{{_\ast}}-\bar{ {X}}_{n})\), given \({\chi }_{n}\), converges to the normal \(N(0,{\sigma }^{2})\) distribution. Taking m = n, the latter property can be restated as follows:

$$\begin{array}{l} \mbox{ the probabilities $P\{{n}^{1/2}\,({\hat{\theta }}^{{_\ast}}-\hat{\theta }) \leq \sigma \,x\,\vert \,{\chi }_{n}\}$ and $P\{{n}^{1/2}\,(\hat{\theta } - \theta ) \leq \sigma \,x\}$ both} \\ \mbox{ converge to $\Phi (x)$, the former converging with probability 1,} \end{array}$$
(6.2)

where Φ denotes the standard normal distribution and, on the present occasion, θ = E(X i ), \(\hat{\theta } =\bar{ {X}}_{n}\) and \({\hat{\theta }}^{{_\ast}} =\bar{ {X}}_{n}^{{_\ast}}\).
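To make (6.1) and (6.2) concrete, the following minimal numpy sketch (ours, not from the paper) checks both statements by simulation for an exponential population; the sample size n, resample size m and number of resamples B are arbitrary illustrative choices.

# Simulation check of (6.1) and (6.2): resample variances cluster near
# sigma^2, and the roots m^{1/2}(Xbar*_m - Xbar_n) look N(0, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
n, m, B = 1000, 400, 2000                 # illustrative sizes
x = rng.exponential(scale=2.0, size=n)    # any F with finite variance
xbar, sigma2_hat = x.mean(), x.var()      # sample mean and sigma_hat_n^2

idx = rng.integers(0, n, size=(B, m))     # B resamples of size m
roots = np.sqrt(m) * (x[idx].mean(axis=1) - xbar)

print(np.var(x[idx], axis=1).mean(), sigma2_hat)  # cf. (6.1)
print(roots.std(), np.sqrt(sigma2_hat))           # cf. (6.2): sd near sigma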

The second result established by Bickel and Freedman (1981) was a generalisation of this property to multivariate settings. Highlights of subsequent parts of the paper included its contributions to theory for the bootstrap in the context of functionals of a distribution function. For example, Bickel and Freedman (1981) considered von Mises functionals of a distribution function H, defined by

$$g(H) =\int \int \omega (x,y)\,dH(x)\,dH(y)\,,$$

where the function ω of two variables is symmetric, in the sense that ω(x, y) = ω(y, x), and where

$$\int \int \omega {(x,y)}^{2}\,dH(x)\,dH(y) +\int \omega {(x,x)}^{2}\,dH(x) < \infty \,.$$
(6.3)

If we take H to be either \({\widehat{F}}_{n}\), the empirical distribution function of the sample χ n , or \({\widehat{F}}_{n}^{{_\ast}}\), the version of \({\widehat{F}}_{n}\) computed from χ n  ∗ , then

$$g({\widehat{F}}_{n}) ={ 1 \over {n}^{2}} \, \sum \limits_{{i}_{1}=1}^{n}\, \sum \limits _{{i}_{2}=1}^{n}\,\omega ({X}_{{i}_{1}},{X}_{{i}_{2}})\,,\quad g({\widehat{F}}_{n}^{{_\ast}}) ={ 1\over {n}^{2}} \, \sum \limits _{{i}_{1}=1}^{n}\,\sum \limits_{{i}_{2}=1}^{n}\,\omega ({X}_{{ i}_{1}}^{{_\ast}},{X}_{{i}_{2}}^{{_\ast}})\,.$$

Bickel and Freedman (1981) studied properties of this quantity. In particular they proved that if (6.3) holds with H = F, denoting the common distribution function of the X i s, then the distribution of \({n}^{1/2}\,\{g({\widehat{F}}_{n}^{{_\ast}}) - g({\widehat{F}}_{n})\}\), conditional on the data, is asymptotically normal N(0, τ2) where

$${\tau }^{2} = 4\,\left[\int \nolimits \nolimits \{\int \nolimits \nolimits \omega (x,y)\,dF(y)\}{}^{2}\,dF(x) - g{(F)}^{2}\right]\,.$$

This limit distribution is the same as that of \({n}^{1/2}\,\{g({\widehat{F}}_{n}) - g(F)\}\), and so the above result of Bickel and Freedman (1981) confirms, in the context of von Mises functionals of the empirical distribution function, that (6.2) holds once again, provided that σ there is replaced by τ and we redefine θ = g(F), \(\hat{\theta } = g({\widehat{F}}_{n})\) and \({\hat{\theta }}^{{_\ast}} = g({\widehat{F}}_{n}^{{_\ast}})\). That is, the bootstrap correctly captures, once more, first-order asymptotic properties. Subsequent results of Bickel and Freedman (1981) showed that the same property holds for the empirical process, and in particular that the process \({n}^{1/2}\,({\widehat{F}}_{n}^{{_\ast}}-{\widehat{F}}_{n})\) has the same first-order asymptotic properties as \({n}^{1/2}\,({\widehat{F}}_{n} - F)\). Bickel and Freedman (1981) also derived the analogue of this result for the quantile process.
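The von Mises result is easy to check numerically. In the sketch below (our construction, not Bickel and Freedman's) we take the symmetric kernel ω(x, y) = xy, for which g(H) = (∫ x dH(x))² and the formula above gives τ² = 4μ²σ², with μ and σ² the mean and variance of F; condition (6.3) holds whenever F has a finite fourth moment.

# Bootstrap of a von Mises functional with kernel omega(x, y) = x*y,
# so that g(F_hat_n) = (sample mean)^2 and tau = 2*mu*sigma.
import numpy as np

rng = np.random.default_rng(1)
n, B = 1000, 2000
x = rng.exponential(size=n)            # F = Exp(1): mu = 1, sigma^2 = 1
g_n = x.mean() ** 2                    # g(F_hat_n)

idx = rng.integers(0, n, size=(B, n))  # n-out-of-n resamples
roots = np.sqrt(n) * (x[idx].mean(axis=1) ** 2 - g_n)

tau_hat = 2.0 * abs(x.mean()) * x.std()   # plug-in estimate of tau
print(roots.std(), tau_hat)               # both should be near 2 here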

Importantly, Bickel and Freedman (1981) addressed cases where the bootstrap fails to enjoy properties such as (6.2). In their Sect. 6 they gave two counterexamples, one involving U-statistics and the other, spacings between extreme order statistics, where the bootstrap fails to capture large-sample properties even to first order. In both settings the problems are attributable, at least in part, to failure of the bootstrap to correctly capture the relationships among very high-ranked, or very low-ranked, order statistics, and in that context we shall relate below some of the issues to which Bickel and Freedman’s (1981) work pointed. This account will be given in detail because it is relevant to later sections.

Let \({X}_{(1)} < \cdots < {X}_{(n)}\) denote the ordered values in χ n ; we assume that the common distribution of the X i s is continuous, so that the probability of a tie equals zero. In this case the probability, conditional on χ n , of the event ε n that the largest X i , i.e. X (n), is in χ n  ∗ , equals 1 minus the conditional probability that X (n) is not contained in χ n  ∗ . That is, it equals \(1 - {(1 - {n}^{-1})}^{n} = 1 - {e}^{-1} + O({n}^{-1})\). Therefore, as n → ∞,

$$P({X}_{(n)}^{{_\ast}} = {X}_{ (n)}\,\vert \,{\chi }_{n}) = P({X}_{(n)} \in {\chi }_{n}^{{_\ast}}\,\vert \,{\chi }_{ n}) \rightarrow 1 - {e}^{-1}\,,$$

where the convergence is deterministic. Similarly, for each integer k ≥ 1,

$${\pi }_{nk} \equiv P({X}_{(n)}^{{_\ast}} = {X}_{ (n-k)}\,\vert \,{\chi }_{n}) \rightarrow {\pi }_{k} \equiv {e}^{-k}\,\left(1 - {e}^{-1}\right)$$
(6.4)

as n → ∞; again the convergence is deterministic. Consequently the distribution of \({X}_{(n)}^{{_\ast}}\), conditional on \({\chi }_{n}\), is a mixture, and in particular is equal to \({X}_{(n-k)}\) with probability \({\pi }_{nk}\), for k ≥ 0. Therefore:

$$\begin{array}{l} \mbox{ given $\epsilon >0$ and any metric between distributions, for example the Lévy metric, we} \\ \mbox{ may choose $k = k(\epsilon ) \geq 1$ so large that the distribution of ${X}_{(n)}^{{_\ast}}$, conditional on ${\chi }_{n}$,} \\ \mbox{ is no more than $\epsilon $ from that of the discrete mixture ${\sum \nolimits }_{0\leq j\leq k}\,{X}_{(n-j)}\,{I}_{j}$, where (a) exactly} \\ \mbox{ one of the random variables ${I}_{0},{I}_{1},\ldots $ is nonzero, (b) that variable takes the value 1, and} \\ \mbox{ (c) $P({I}_{j} = 1) = {\pi }_{j}$ for $j \geq 0$. The upper bound of $\epsilon $ applies deterministically, in that} \\ \mbox{ it is valid with probability 1, in an unconditional sense.} \end{array}$$
(6.5)
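Incidentally, the probabilities in (6.4) are available in closed form: \(P({X}_{(n)}^{{_\ast}}\leq {X}_{(n-k)}\,\vert \,{\chi }_{n}) =\{ (n - k)/n\}^{n}\), whence \({\pi }_{nk} =\{ (n - k)/n\}^{n} -\{ (n - k - 1)/n\}^{n}\). The following few lines (ours) verify the limit numerically; the value of n is an arbitrary illustrative choice.

# Numerical check of (6.4): the exact atom probabilities approach
# pi_k = e^{-k} (1 - e^{-1}) as n grows.
import numpy as np

n = 10_000
for k in range(4):
    exact = (1 - k / n) ** n - (1 - (k + 1) / n) ** n   # pi_nk, exactly
    limit = np.exp(-k) * (1 - np.exp(-1))               # pi_k in (6.4)
    print(k, round(exact, 4), round(limit, 4))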

To indicate the implications of this property we note that, for many distributions F, there exist constants a n and b n , at least one of them diverging to infinity in absolute value as n increases, and a nonstationary stochastic process ξ0, ξ1, …, such that, for each k ≥ 0, the joint distribution of \(({X}_{(n)} - {a}_{n})/{b}_{n},\ldots ,({X}_{(n-k)} - {a}_{n})/{b}_{n}\) converges to the distribution of (ξ0, …, ξ k ). See, for example, Hall (1978). In view of (6.5) the distribution function of \(({X}_{(n)}^{{_\ast}}- {a}_{n})/{b}_{n}\), conditional on χ n , converges to that of

$$Z =\sum \limits _{j=0}^{\infty }\,{\xi }_{ j}\,{I}_{j}\,,$$

where the sequence I 0, I 1, … is distributed as in (6.5) and is chosen to be independent of ξ0, ξ1, …. In this notation,

$$P({X}_{(n)}^{{_\ast}}- {a}_{ n} \leq {b}_{n}\,z\,\vert \,{\chi }_{n}) \rightarrow P(Z \leq z)$$
(6.6)

in probability, whenever z is a continuity point of the distribution of Z. On the other hand,

$$P({X}_{(n)} - {a}_{n} \leq {b}_{n}\,z) \rightarrow P({\xi }_{0} \leq z)\,.$$
(6.7)

A comparison of (6.6) and (6.7) reveals that there is little opportunity for estimating consistently the distribution of X (n), using standard bootstrap methods. Bickel and Freedman (1981) first drew our attention to this failing of the conventional bootstrap. The issue was to be the object of considerable research for many years after the appearance of Bickel and Freedman’s paper. Methodology for solving the problem, and ensuring consistency, was eventually developed and scrutinised; commonly the m-out-of-n bootstrap is used. See, for example, Swanepoel (1986), Bickel et al. (1997) and Bickel and Sakov (2008).
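The failure is easy to see in simulation. In the sketch below (our illustration) F is the unit exponential distribution, for which one may take a_n = log n and b_n = 1, so that (6.7) holds with a Gumbel limit; yet the bootstrap law in (6.6) retains an atom of mass roughly 1 − e⁻¹ at the sample maximum.

# The conditional law of X*_(n) has an atom at X_(n) of mass ~ 1 - 1/e,
# whereas the law of X_(n) - log n is continuous (Gumbel) in the limit.
import numpy as np

rng = np.random.default_rng(2)
n, B = 2000, 2000
x = rng.exponential(size=n)
boot_max = x[rng.integers(0, n, size=(B, n))].max(axis=1)

print((boot_max == x.max()).mean())   # near 1 - e^{-1} = 0.632...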

6.1.3 The Bootstrap in Stratified Sampling

Bickel and Freedman (1984) explored properties of the bootstrap in the case of stratified sampling from finite or infinite populations, and concluded that, with appropriate scaling, the bootstrap can give consistent distribution estimators in cases where asymptotic methods fail. However, without the proper scaling the bootstrap can be inconsistent.

The problem treated is that of estimating a linear combination,

$$\gamma ={ \sum \nolimits }_{j=1}^{p}\,{c}_{ j}\,{\mu }_{j}\,,$$
(6.8)

of the means μ1, …, μ p of p populations Π 1, …, Π p with corresponding distributions F 1, …, F p . The c j s are assumed known, and the μ j s are estimated from data. To construct estimators, a random sample \(\chi (j) =\{ {X}_{j1},\ldots ,{X}_{j{n}_{j}}\}\) is drawn from the jth population, and the sample mean \(\bar{X}(j) = {n}_{j}^{-1}\,{ \sum \nolimits }_{i}\,{X}_{ji}\) is computed in each case. Bickel and Freedman (1984) considered two different choices of c j , valid in two respective cases: (a) if it is known that each E(X ji ) = μ, not depending on j, and that the variance \({\sigma }_{j}^{2}\) of Π j is proportional to r j , say, then

$${c}_{j} ={ {n}_{j}/{r}_{j} \over {\sum \nolimits }_{k}\,({n}_{k}/{r}_{k})} \,;$$

and (b) if the populations are finite, and in particular Π j is of size N j for j = 1, …, p, then

$${c}_{j} ={ {N}_{j} \over {\sum \nolimits }_{k}\,{N}_{k}} \,.$$

In either case the estimator \(\hat{\gamma }\) of γ reflects the definition of γ at (6.8):

$$\hat{\gamma } =\sum \limits _{j=1}^{p}\,{c}_{ j}\,\bar{X}(j)\,,$$

where \(\bar{X}(j)\) is the mean value of the data in χ(j).
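In code, the estimator is a one-liner once the weights are fixed. The sketch below (hypothetical strata, means and sample sizes, purely for illustration) uses the case (b) weights c_j = N_j / Σ_k N_k.

# Stratified estimator gamma_hat = sum_j c_j * Xbar(j), case (b) weights.
import numpy as np

rng = np.random.default_rng(3)
N = np.array([10_000, 4_000, 1_000])       # stratum sizes N_j (assumed)
mus, n_js = [1.0, 2.0, 3.0], [50, 30, 10]  # hypothetical means and n_j
samples = [rng.normal(loc=mu, size=n_j) for mu, n_j in zip(mus, n_js)]

c = N / N.sum()                            # c_j = N_j / sum_k N_k
gamma_hat = sum(c_j * s.mean() for c_j, s in zip(c, samples))
print(gamma_hat)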

In both cases Bickel and Freedman (1984) showed that, particularly if the sample sizes n j are small, the bootstrap estimator of the distribution of \(\hat{\gamma } - \gamma \) is not necessarily consistent, in the sense that the distribution estimator minus the true distribution may not converge to zero in probability. The asymptotic distribution of \(\hat{\gamma } - \gamma \) is normal \(N(0,{\tau }_{1}^{2})\), say; the bootstrap estimator of that distribution, conditional on the data, is asymptotically normal \(N(0,{\tau }_{2}^{2})\); but the ratio \({\tau }_{1}^{2}/{\tau }_{2}^{2}\) does not always converge to 1. Bickel and Freedman (1984) demonstrated that this difficulty can be overcome by estimating scale externally to the bootstrap process, in effect incorporating a scale correction to set the bootstrap on the right path. Bickel and Freedman also suggested other, more ad hoc remedies.

These contributions added immeasurably to our knowledge of the bootstrap. Combined with the counterexamples given earlier by Bickel and Freedman (1981), those authors showed that the bootstrap was not a device that could be used naively in all cases, without careful consideration.

Some researchers, a little outside the statistics community, had felt that bootstrap resampling methods freed statisticians from influence by a mathematical “priesthood” which was “frank about viewing resampling as a frontal attack upon their own situations” (Simon 1992). To the contrary, the work of Bickel and Freedman (1981, 1984) showed that a mathematical understanding of the problem was fundamental to determining when, and how, to apply bootstrap methods successfully. They demonstrated that mathematical theory was able to provide considerable assistance to the introduction and development of practical bootstrap methods, and they provided that aid to statisticians and non-statisticians alike.

6.1.4 Efficient Bootstrap Simulation

By the mid to late 1980s the strengths and weaknesses of bootstrap methods were becoming clearer, especially the strengths. However, computers with power comparable to that of today’s machines were not readily available at the time, and so efficient methods were required for computation. The work of Bickel and Yahav (1988) was an important contribution to that technology. It shared the limelight with other approaches to achieving computational efficiency, including the balanced bootstrap, which was a version for the bootstrap of Latin hypercube sampling and was proposed by Davison et al. (1986) (see also Graham et al. 1990); importance resampling, suggested by Davison (1988) and Johns (1988); the centring method, proposed by Efron (1990); and antithetic resampling, introduced by Hall (1990).

The main impediment to quick calculation for the bootstrap was the resampling step. In the 1980s, when for many of us computing power was in short supply, bootstrap practitioners nevertheless advocated thousands, rather than hundreds, of simulations for each sample. For example, Efron (1988), writing for an audience of psychologists, argued that “It is not excessive to use 2,000 replications, as in this paper, though we might have stopped at 1,000.” In fact, if the number of simulations, B, is chosen so that the nominal coverage level of a confidence interval can be expressed as \(b/(B + 1)\), where b is an integer, then the size of B has very little bearing on the coverage accuracy of the interval (see Hall 1986). However, choosing B too small can result in overly variable Monte Carlo approximations to endpoints for bootstrap confidence intervals, and to critical points for bootstrap hypothesis tests.
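The point about variability is easily illustrated. In the sketch below (ours; the population, nominal level and values of B are illustrative) the Monte Carlo spread of a percentile-interval endpoint across repeated simulation runs shrinks, roughly like B^{-1/2}, as B grows.

# Run-to-run variability of a 97.5% percentile endpoint, for several B.
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.exponential(size=n)

def upper_endpoint(B):
    # 97.5% percentile endpoint computed from B bootstrap means
    means = x[rng.integers(0, n, size=(B, n))].mean(axis=1)
    return np.quantile(means, 0.975)

for B in (40, 400, 4000):
    reps = [upper_endpoint(B) for _ in range(100)]
    print(B, np.std(reps))     # variability across repeated runs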

It is instructive here to relate a story that G.S. Watson told me in 1988, the year in which Bickel and Yahav’s paper was published. Throughout his professional life Watson was an enthusiast of the latest statistical methods, and the bootstrap was no exception. Shortly after the appearance of Efron’s (1979) seminal paper he began to experiment with the percentile bootstrap technique. Not for Watson a tame problem involving a sample of scalar data; in what must have been one of the first applications of the bootstrap to spatial or spherical data, he used that technique to construct confidence regions for the mean direction derived from a sample of points on a sphere. He wrote a program that constructed bootstrap confidence regions, put the code onto a floppy disc, and passed the disc to a Princeton geophysicist to experiment with. This, he told the geophysicist, was the modern alternative to conventional confidence regions based on the von Mises-Fisher distribution. The latter regions, of course, took their shape from the mathematical form of the fitted distribution, with relatively little regard for any advice that the data might have to offer. What did the geophysicist think of the new approach?

In due course Watson received a reply, to the effect that the method was very interesting and remarkably flexible, adapting itself well to quite different datasets. But it had a basic flaw, the geophysicist said, that made it unattractive—every time he applied the code on the floppy disc to the same set of spherical data, he got a different answer! Watson, limited by the computational resources of the day, and by the relative complexity of computations on a sphere, had produced software that did only about B = 40 simulations each time the algorithm was implemented. Particularly with the extra degree of freedom that two dimensions provided for fluctuations, the results varied rather noticeably from one time-based simulation seed to another.

This tale defines the context of Bickel and Yahav’s (1988) paper. Their goal was to develop algorithms for reducing the variability, and enhancing the accuracy in that sense, of Monte Carlo procedures for implementing the bootstrap. Their approach, a modification for the bootstrap of the technique of Richardson extrapolation (a classical tool in numerical analysis; see Jeffreys and Jeffreys 1988, p. 288), ran as follows. Let \({\widehat{F}}_{n}\) (not to be confused with the same notation, but having a different meaning, in Sect. 6.1.2) denote the data-based distribution function of interest, and let F n be the quantity of which \({\widehat{F}}_{n}\) is an approximation. For example, \({\widehat{F}}_{n}(x)\) might equal \(P({\hat{\theta }}_{n}^{{_\ast}}-{\hat{\theta }}_{n} \leq x\,\vert \,{\chi }_{n})\), where \({\hat{\theta }}_{n}\) denotes an estimator of a parameter θ, computed from a random sample χ n of size n, in which case \({\hat{\theta }}_{n}^{{_\ast}}\) would be the bootstrap version of \({\hat{\theta }}_{n}\). (In this example, \({F}_{n}(x) = P({\hat{\theta }}_{n} - \theta \leq x)\).) Instead of estimating \({\widehat{F}}_{n}\) directly, compute estimators of the distribution functions \({\widehat{F}}_{{n}_{1}},\ldots ,{\widehat{F}}_{{n}_{r}}\), where the sample sizes \({n}_{1},\ldots ,{n}_{r}\) are all smaller than n, and in fact so small that \({n}_{1} + \ldots + {n}_{r}\) is markedly less than n. In some instances we may also know the limit \({F}_{\infty }\) of F n , or at least its form, \({\widetilde{F}}_{\infty }\) say, constructed by replacing any unknown quantities (for example, a variance) by estimators computed from χ n . The quantities \({\widehat{F}}_{{n}_{1}},\ldots ,{\widehat{F}}_{{n}_{r}}\) and \({\widetilde{F}}_{\infty }\) are much less expensive, i.e. much faster, to compute than \({\widehat{F}}_{n}\), and so, by suitable “interpolation” from these functions, we can hope to get a very good approximation to \({\widehat{F}}_{n}\) without going to the expense of actually calculating the latter.
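A minimal sketch of the extrapolation idea follows. Here the quantity extrapolated is a quantile of the bootstrap distribution of the normalised mean, the small sample sizes n_1, …, n_r are mimicked by resamples of those sizes, and the working model q(ν) ≈ a + bν^{-1/2} is our assumption for illustration; Bickel and Yahav (1988) work with expansions of this general kind, not with this exact recipe.

# Extrapolation sketch: fit a cheap model at small sizes, predict at n.
import numpy as np

rng = np.random.default_rng(5)
n, B = 10_000, 1000
x = rng.exponential(size=n)

def boot_quantile(m):
    # 97.5% point of the conditional law of m^{1/2}(Xbar*_m - Xbar_n)
    roots = np.sqrt(m) * (
        x[rng.integers(0, n, size=(B, m))].mean(axis=1) - x.mean())
    return np.quantile(roots, 0.975)

ns = np.array([200, 400, 800])                  # n_1, ..., n_r << n
qs = np.array([boot_quantile(m) for m in ns])   # cheap evaluations

# Least-squares fit of q(nu) = a + b / sqrt(nu), extrapolated to nu = n.
A = np.column_stack([np.ones(len(ns)), 1 / np.sqrt(ns)])
a, b = np.linalg.lstsq(A, qs, rcond=None)[0]
print(a + b / np.sqrt(n))       # extrapolated approximation
print(boot_quantile(n))         # the expensive direct computation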

In general the cost of simulating, or equivalently the time taken to simulate, is approximately proportional to C n B, where C n depends only on n and increases with that quantity. Techniques for enhancing the performance of Monte Carlo methods can either directly produce greater accuracy for a given value of B (the balanced bootstrap has this property), or reduce the value of C n and thereby allow a larger value of B (hence, greater accuracy from the viewpoint of reduced variability) for a given cost. Bickel and Yahav’s (1988) method is of the latter type. By enabling a larger value of B it alleviates the problem encountered by Watson and his geophysicist friend.

Bickel and Yahav’s (1988) technique is particularly widely applicable, and has the potential to improve efficiency more substantially than, say, the balanced bootstrap. Today, however, statisticians’ demands for efficient bootstrap methods have been largely assuaged by the development of more powerful computers. In the last 15 years there have been very few new simulation algorithms tailored to the bootstrap. Philippe Toint’s aphorism that “I would rather have today’s algorithms on yesterday’s computers, than vice versa,” loses impact when an algorithm is to some extent problem-specific, and its implementation requires skills that go beyond those needed to purchase a new, faster computer.

6.1.5 The m-Out-of-n Bootstrap

The m-out-of-n bootstrap is another example revealing that, in science, less is often more. Bickel and Freedman (1981, 1984) had shown that the standard bootstrap can fail, even at the level of statistical consistency, in a variety of settings; and, as we noted in Sect. 6.1.2, the m-out-of-n bootstrap, where m is an order of magnitude smaller than n, is often a remedy. Swanepoel (1986) was the first to suggest this method, which we shall define in the next paragraph. Bickel et al. (1997) made major contributions to the study of its theoretical properties. We shall give an example that provides more detail than we gave in Sect. 6.1.2 about the failure of the bootstrap in certain cases. Then we shall summarise briefly the contributions made by Bickel et al. (1997).

Consider drawing a resample \({\chi }_{m}^{{_\ast}} =\{ {X}_{1}^{{_\ast}},\ldots ,{X}_{m}^{{_\ast}}\}\), of size m, from the original dataset \({\chi }_{n} =\{ {X}_{1},\ldots ,{X}_{n}\}\) of size n, and let \(\hat{\theta } ={ \hat{\theta }}_{n}\) denote the bootstrap estimator of θ computed from χ n . In particular, if we can express θ as a functional, say θ(F), of the distribution function F of the data X i , then

$${ \hat{\theta }}_{n} = \theta ({\widehat{F}}_{n})\,,$$
(6.9)

where \({\widehat{F}}_{n}\) is the empirical distribution function computed from χ n . Likewise we can define \({\hat{\theta }}_{m}^{{_\ast}} = \theta ({\widehat{F}}_{m}^{{_\ast}})\), where \({\widehat{F}}_{m}^{{_\ast}}\) is the empirical distribution function for χ m  ∗ . As we noted in Sect. 6.1.2, Bickel and Freedman (1981) showed that first-order properties of \({\hat{\theta }}_{m}^{{_\ast}}\) are often robust against the value of m. In particular it is often the case that, for each ε > 0,

$$P(\vert {\hat{\theta }}_{m}^{{_\ast}}-{\hat{\theta }}_{ n}\vert > \epsilon \,\vert \,{\chi }_{n}) \rightarrow 0\,,\quad P(\vert {\hat{\theta }}_{n} - \theta \vert > \epsilon ) \rightarrow 0$$
(6.10)

as m and n diverge, where the first convergence is with probability 1. Compare (6.1). For example, (6.10) holds if θ is a moment, such as a mean or a variance, and if the sampling distribution has sufficiently many finite moments.

The definition (6.9) is conventionally used for a bootstrap estimator, and it does not necessarily involve simulation. For example, if θ = ∫xdF(x) is a population mean then

$${\hat{\theta }}_{n} = \int \nolimits \nolimits x\,d{\widehat{F}}_{n}(x) =\bar{ X}\,,\quad {\hat{\theta }}_{m}^{{_\ast}} = \int \nolimits \nolimits x\,d{\widehat{F}}_{m}^{{_\ast}}(x) =\bar{ {X}}^{{_\ast}}$$

are the sample mean and resample mean, respectively. However, in a variety of other cases the most appropriate way of defining and computing \({\hat{\theta }}_{n}\) is in terms of the resample χ n  ∗ ; that is, χ m  ∗  with m = n. Consider, for instance, the case where

$$\theta = P\left ({X}_{(n)} - {X}_{(n-1)} > {X}_{(n-1)} - {X}_{(n-2)}\right )\,,$$
(6.11)

in which, as in Sect. 6.1.2, we take \({X}_{(1)} < \cdots < {X}_{(n)}\) to be an ordering of the data in χ n , assumed to have a common continuous distribution. For many sampling distributions, in particular distributions that lie in the domain of attraction of an extreme-value law, θ depends on n but converges to a strictly positive number as n increases.

In this example the bootstrap estimator, \({\hat{\theta }}_{n}\), of θ, based on a sample of size n, is defined by

$${ \hat{\theta }}_{n} = P\left({X}_{(n)}^{{_\ast}}- {X}_{(n-1)}^{{_\ast}} > {X}_{(n-1)}^{{_\ast}}- {X}_{(n-2)}^{{_\ast}}\;\big\vert \;{\chi }_{n}\right),$$
(6.12)

where \({X}_{(1)}^{{_\ast}}\leq \cdots \leq {X}_{(n)}^{{_\ast}}\) are the ordered data in χ n  ∗ . Analogously, the bootstrap version, \({\hat{\theta }}_{n}^{{_\ast}}\), of \({\hat{\theta }}_{n}\) is defined using the double bootstrap:

$${\hat{\theta }}_{n}^{{_\ast}} = P\left({X}_{(n)}^{{_\ast}{_\ast}}- {X}_{(n-1)}^{{_\ast}{_\ast}} > {X}_{(n-1)}^{{_\ast}{_\ast}}- {X}_{(n-2)}^{{_\ast}{_\ast}}\;\big\vert \;{\chi }_{n}^{{_\ast}}\right),$$

where \({X}_{(1)}^{{_\ast}{_\ast}}\leq \cdots \leq {X}_{(n)}^{{_\ast}{_\ast}}\) are the ordered data in \({\chi }_{n}^{{_\ast}{_\ast}} =\{ {X}_{1}^{{_\ast}{_\ast}},\ldots ,{X}_{n}^{{_\ast}{_\ast}}\}\), drawn by sampling randomly, with replacement, from χ n  ∗ . However, for the reasons given in the paragraph containing (6.5), property (6.10) fails in this example, no matter how we choose m. (The m in (6.10) is different from the m for the m-out-of-n bootstrap.) The bootstrap fails to model accurately the relationships among large order statistics, to such an extent that, in the example characterised by (6.11), \({\hat{\theta }}_{n}\) does not converge to θ.

This problem evaporates if, in defining \({\hat{\theta }}_{n}\) at (6.12), we take the resample χ m  ∗  to have size m = m(n), where

$$m \rightarrow \infty \quad \mbox{ and}\quad m/n \rightarrow 0$$
(6.13)

as n → ∞. That is, instead of (6.12) we define

$${ \hat{\theta }}_{n} = P\left({X}_{(m)}^{{_\ast}}- {X}_{(m-1)}^{{_\ast}} > {X}_{(m-1)}^{{_\ast}}- {X}_{(m-2)}^{{_\ast}}\;\big\vert \;{\chi }_{n}\right),$$
(6.14)

where \({X}_{1}^{{_\ast}},\ldots ,{X}_{m}^{{_\ast}}\) are drawn by sampling randomly, with replacement, from χ n . In this case, provided (6.13) holds, (6.10) is correct in a wide range of settings.

Deriving this result mathematically takes a little effort, but intuitively it is rather clear: By taking m to be of strictly smaller order than n we ensure that the probability that X (m)  ∗  equals any given data value in χ n , for example X (n), converges to zero, and so the difficulties raised in the paragraph containing (6.5) no longer apply. In particular, instead of (6.4) we have:

$$P({X}_{(m-k)}^{{_\ast}} = {X}_{(n-\ell)}\,\vert \,{\chi }_{n}) \rightarrow 0$$

in probability, for each fixed, nonnegative integer k and ℓ, as n → ∞. Further thought along the same lines indicates that the conditional distribution of \({X}_{(m)}^{{_\ast}}- {X}_{(m-1)}^{{_\ast}}\) should now, under mild assumptions, be a consistent estimator of the distribution of \({X}_{(n)} - {X}_{(n-1)}\).
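The contrast between (6.12) and (6.14) can be seen directly by simulation. In the sketch below (our construction) F is unit exponential, for which the Rényi representation gives X (n) − X (n − 1) and X (n − 1) − X (n − 2) as independent exponentials with rates 1 and 2, so that θ at (6.11) equals 2/3 exactly; the choice m = n^{1/2} satisfies (6.13), and all sizes are illustrative.

# Monte Carlo versions of (6.12) (m = n) and (6.14) (m << n) for theta
# at (6.11); here theta = 2/3 exactly for exponential data.
import numpy as np

rng = np.random.default_rng(6)
n, B = 2000, 2000
x = rng.exponential(size=n)

def theta_hat(m):
    # top three order statistics of each of B resamples of size m
    top3 = np.sort(x[rng.integers(0, n, size=(B, m))], axis=1)[:, -3:]
    return np.mean(top3[:, 2] - top3[:, 1] > top3[:, 1] - top3[:, 0])

print(theta_hat(n))              # n-out-of-n, cf. (6.12): inconsistent
print(theta_hat(int(n ** 0.5)))  # m-out-of-n, cf. (6.14): near 2/3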

Bickel et al. (1997) gave a sequence of four counterexamples illustrating cases where the bootstrap fails, and provided two examples of the success of the bootstrap. The first two counterexamples relate to extrema, and so are closely allied to the example considered above. The next two treat, respectively, hypothesis testing and improperly centred U- and V-statistics, and estimating nonsmooth functionals of the population distribution function. Bickel et al. (1997) then developed a deep, general theory which allowed them to construct accurate and insightful approximations to bootstrap statistics \({\hat{\theta }}_{n}\), such as that at (6.9), not just in that case but also when \({\hat{\theta }}_{n}\) is defined using the m-out-of-n bootstrap, as at (6.14). This enabled them to show that, in a large class of problems for which (6.13) holds, the m-out-of-n bootstrap overcomes consistency problems inherent in the conventional n-out-of-n approach, and also to derive rates of convergence.

A reliable way of choosing m empirically is of course necessary if the m-out-of-n bootstrap is to be widely adopted. In many cases this is still an open problem, although important contributions were made recently by Bickel and Sakov (2008).
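To give the flavour of such empirical rules, the sketch below implements one variant of the idea we associate with Bickel and Sakov (2008), as we understand it: compute the m-out-of-n bootstrap distribution along a geometric grid m_j = ⌈q^j n⌉ and select the m at which successive distributions are closest in the Kolmogorov metric. The statistic (a normed bootstrap maximum), the ratio q and all sizes are our illustrative assumptions, not their prescription.

# Adaptive choice of m: stability of successive m-out-of-n distributions.
import numpy as np

rng = np.random.default_rng(7)
n, q, B = 2000, 0.75, 1000
x = rng.exponential(size=n)

def roots(m):
    # conditional law of X*_(m) - log m (normed bootstrap maxima)
    return x[rng.integers(0, n, size=(B, m))].max(axis=1) - np.log(m)

def kolmogorov(a, b):
    # Kolmogorov distance between the empirical laws of two samples
    grid = np.sort(np.concatenate([a, b]))
    Fa = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    Fb = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(Fa - Fb).max()

ms, m = [], n
while m > 30:                       # geometric grid m_j = ceil(q^j * n)
    ms.append(m)
    m = int(np.ceil(q * m))
dists = [kolmogorov(roots(ms[j]), roots(ms[j + 1]))
         for j in range(len(ms) - 1)]
print(ms[int(np.argmin(dists))])    # the selected m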