6.1 Introduction to Four Bootstrap Papers

6.1.1 Introduction and Summary

In this short article we discuss four of Peter Bickel’s seminal papers on theory and methodology for the bootstrap. We address the context of the work as well as its contributions and influence. The work began at the dawn of research on Efron’s bootstrap. In fact, Bickel and his co-authors were often the first to lay down the directions that others would follow when attempting to discover the strengths, and occasional weaknesses, of bootstrap methods.

Peter Bickel made major contributions to the development of bootstrap methods, particularly by delineating the range of circumstances where the bootstrap is effective. That topic is addressed in the first, second and fourth papers treated here. Looking back over this work, much of it done 25–30 years ago, it quickly becomes clear just how effectively these papers defined the most appropriate directions for future research.

We shall discuss the papers in chronological order, and pay particular attention to the contributions made by Bickel and Freedman (1981), since this was the first article to demonstrate the effectiveness of bootstrap methods in many cases, as well as to raise concerns about them in other situations. The results that we shall introduce in Sect. 6.1.2, when considering the work of Bickel and Freedman (1981), will be used frequently in later sections, especially Sect. 6.1.5.

The paper by Bickel and Freedman (1984), which we shall discuss in Sect. 6.1.3, pointed to challenges experienced by the bootstrap in the context of stratified sampling. This is ironic, not least because some of the earliest developments of what, today, are called bootstrap methods, involved sampling problems; see, for example, Jones (1956), Shiue (1960), Gurney (1963) and McCarthy (1966, 1969).

Section 6.1.4 will treat the work of Bickel and Yahav (1988), which contributed very significantly to methodology for efficient simulation, at a time when the interest in this area was particularly high. Bickel et al. (1997), which we shall discuss in Sect. 6.1.5, developed deep and widely applicable theory for the m-out-of-n bootstrap. The authors showed that their approach overcame consistency problems inherent in the conventional n-out-of-n bootstrap, and gave rates of convergence applicable to a large class of problems.

6.1.2 Laying Foundations for the Bootstrap

Thirty years ago, when Efron’s (1979) bootstrap method was in its infancy, there was considerable interest in the extent to which it successfully accomplished its goal of estimating parameters, variances, distributions etc. As Bickel and Freedman (1981) noted, Efron’s paper “gives a series of examples in which [the bootstrap] principle works, and establishes the validity of the approach for a general class of statistics when the sample space is finite.” Bickel and Freedman (1981) set out to assess the bootstrap’s success in a much broader setting than this.

In the early 1980s, saying that the bootstrap “works” meant that bootstrap methods gave consistent estimators, and in this sense were competitive with more conventional methods, for example those based on asymptotic analysis. Within about 5 years the goals had changed; it had been established that bootstrap methods “work” in a very wide variety of circumstances, and, although there were counterexamples to this general rule, by the mid 1980s the task had become largely one of comparing the effectiveness of the bootstrap relative to more conventional techniques. But in 1981 the extent to which the bootstrap was consistent was still largely unknown. Bickel and Freedman (1981) contributed mightily to the process of discovery there.

In particular, Bickel and Freedman (1981) were the first to establish rigorously that bootstrap methodology is consistent in a wide range of settings. The impact of their paper was dramatic. It provided motivation for exploring the bootstrap more deeply in a great many settings, and furnished some of the mathematical tools for that development. In the same year, in fact in the preceding paper in the Annals, Singh (1981) explored second-order properties of the bootstrap. However, Bickel and Freedman (1980) also took up that challenge at a particularly early stage.

As a prelude to describing the results of Bickel and Freedman (1981) we give some notation. Let \({\chi }_{n} =\{ {X}_{1},\ldots ,{X}_{n}\}\) denote a sample of n independent observations from a given univariate distribution with finite variance \({\sigma }^{2}\), write \(\bar{{X}}_{n} = {n}^{-1}\,{ \sum \nolimits }_{i}\,{X}_{i}\) for the sample mean, and define

$${\hat{\sigma }}_{n}^{2} ={ 1 \over n} \,\sum \limits_{i=1}^{n}\,{({X}_{ i} -\bar{ {X}}_{n})}^{2}\,,$$

the bootstrap estimator of \({\sigma }^{2}\). Let \({\chi }_{m}^{{_\ast}} =\{ {X}_{1}^{{_\ast}},\ldots ,{X}_{m}^{{_\ast}}\}\) denote a resample of size m drawn by sampling randomly, with replacement, from \({\chi }_{n}\), and put \(\bar{{X}}_{m}^{{_\ast}} = {m}^{-1}\,\sum \limits _{i\leq m}\,{X}_{i}^{{_\ast}}\). Bickel and Freedman's (1981) first result was that the m-resample bootstrap version of \({\hat{\sigma }}_{n}^{2}\), i.e.

$${\hat{\sigma }}_{m}^{{_\ast}2} ={ 1 \over m} \,\sum \limits _{i=1}^{m}\,{({X}_{ i}^{{_\ast}}-\bar{ {X}}_{ m}^{{_\ast}})}^{2}\,,$$

converges to \({\sigma }^{2}\) as both m and n increase, in the sense that, for each ε > 0,

$$P(\vert {\hat{\sigma }}_{m}^{{_\ast}}- \sigma \vert > \epsilon \,\vert \,{\chi }_{ n}) \rightarrow 0$$
(6.1)

with probability 1. Moreover, Bickel and Freedman (1981) showed that the conditional distribution of \({m}^{1/2}\,(\bar{{X}}_{m}^{{_\ast}}-\bar{ {X}}_{n})\), given \({\chi }_{n}\), converges to the normal \(N(0,{\sigma }^{2})\) distribution. Taking m = n, the latter property can be restated as follows:

$$\begin{array}{l} \mbox{ the probabilities $P\{{n}^{1/2}\,({\hat{\theta }}^{{_\ast}}-\hat{\theta }) \leq \sigma \,x\,\vert \,{\chi }_{n}\}$ and $P\{{n}^{1/2}\,(\hat{\theta } - \theta ) \leq \sigma \,x\}$ both} \\ \mbox{ converge to $\Phi (x)$, the former converging with probability 1,} \end{array}$$
(6.2)

where Φ denotes the standard normal distribution and, on the present occasion, θ = E(X i ), \(\hat{\theta } =\bar{ {X}}_{n}\) and \({\hat{\theta }}^{{_\ast}} =\bar{ {X}}_{n}^{{_\ast}}\).
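To make (6.1) and (6.2) concrete, the following minimal numpy sketch (ours, not from the paper) checks both statements by simulation for an exponential population; the sample size n, resample size m and number of resamples B are arbitrary illustrative choices.

# Simulation check of (6.1) and (6.2): resample variances cluster near
# sigma^2, and the roots m^{1/2}(Xbar*_m - Xbar_n) look N(0, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
n, m, B = 1000, 400, 2000                 # illustrative sizes
x = rng.exponential(scale=2.0, size=n)    # any F with finite variance
xbar, sigma2_hat = x.mean(), x.var()      # sample mean and sigma_hat_n^2

idx = rng.integers(0, n, size=(B, m))     # B resamples of size m
roots = np.sqrt(m) * (x[idx].mean(axis=1) - xbar)

print(np.var(x[idx], axis=1).mean(), sigma2_hat)  # cf. (6.1)
print(roots.std(), np.sqrt(sigma2_hat))           # cf. (6.2): sd near sigma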

The second result established by Bickel and Freedman (1981) was a generalisation of this property to multivariate settings. Highlights of subsequent parts of the paper included its contributions to theory for the bootstrap in the context of functionals of a distribution function. For example, Bickel and Freedman (1981) considered von Mises functionals of a distribution function H, defined by

$$g(H) =\int \int \omega (x,y)\,dH(x)\,dH(y)\,,$$

where the function ω of two variables is symmetric, in the sense that ω(x, y) = ω(y, x), and where

$$\int \int \omega {(x,y)}^{2}\,dH(x)\,dH(y) +\int \omega {(x,x)}^{2}\,dH(x) < \infty \,.$$
(6.3)

If we take H to be either \({\widehat{F}}_{n}\), the empirical distribution function of the sample χ n , or \({\widehat{F}}_{n}^{{_\ast}}\), the version of \({\widehat{F}}_{n}\) computed from χ n  ∗ , then

$$g({\widehat{F}}_{n}) ={ 1 \over {n}^{2}} \, \sum \limits_{{i}_{1}=1}^{n}\, \sum \limits _{{i}_{2}=1}^{n}\,\omega ({X}_{{i}_{1}},{X}_{{i}_{2}})\,,\quad g({\widehat{F}}_{n}^{{_\ast}}) ={ 1\over {n}^{2}} \, \sum \limits _{{i}_{1}=1}^{n}\,\sum \limits_{{i}_{2}=1}^{n}\,\omega ({X}_{{ i}_{1}}^{{_\ast}},{X}_{{i}_{2}}^{{_\ast}})\,.$$

Bickel and Freedman (1981) studied properties of this quantity. In particular they proved that if (6.3) holds with H = F, denoting the common distribution function of the X i s, then the distribution of \({n}^{1/2}\,\{g({\widehat{F}}_{n}^{{_\ast}}) - g({\widehat{F}}_{n})\}\), conditional on the data, is asymptotically normal N(0, τ2) where

$${\tau }^{2} = 4\,\left[\int \nolimits \nolimits \{\int \nolimits \nolimits \omega (x,y)\,dF(y)\}{}^{2}\,dF(x) - g{(F)}^{2}\right]\,.$$

This limit distribution is the same as that of \({n}^{1/2}\,\{g({\widehat{F}}_{n}) - g(F)\}\), and so the above result of Bickel and Freedman (1981) confirms, in the context of von Mises functionals of the empirical distribution function, that (6.2) holds once again, provided that σ there is replaced by τ and we redefine θ = g(F), \(\hat{\theta } = g({\widehat{F}}_{n})\) and \({\hat{\theta }}^{{_\ast}} = g({\widehat{F}}_{n}^{{_\ast}})\). That is, the bootstrap correctly captures, once more, first-order asymptotic properties. Subsequent results of Bickel and Freedman (1981) showed that the same property holds for the empirical process, and in particular that the process \({n}^{1/2}\,({\widehat{F}}_{n}^{{_\ast}}-{\widehat{F}}_{n})\) has the same first-order asymptotic properties as \({n}^{1/2}\,({\widehat{F}}_{n} - F)\). Bickel and Freedman (1981) also derived the analogue of this result for the quantile process.
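The von Mises result is easy to check numerically. In the sketch below (our construction, not Bickel and Freedman's) we take the symmetric kernel ω(x, y) = xy, for which g(H) = (∫ x dH(x))² and the formula above gives τ² = 4μ²σ², with μ and σ² the mean and variance of F; condition (6.3) holds whenever F has a finite fourth moment.

# Bootstrap of a von Mises functional with kernel omega(x, y) = x*y,
# so that g(F_hat_n) = (sample mean)^2 and tau = 2*mu*sigma.
import numpy as np

rng = np.random.default_rng(1)
n, B = 1000, 2000
x = rng.exponential(size=n)            # F = Exp(1): mu = 1, sigma^2 = 1
g_n = x.mean() ** 2                    # g(F_hat_n)

idx = rng.integers(0, n, size=(B, n))  # n-out-of-n resamples
roots = np.sqrt(n) * (x[idx].mean(axis=1) ** 2 - g_n)

tau_hat = 2.0 * abs(x.mean()) * x.std()   # plug-in estimate of tau
print(roots.std(), tau_hat)               # both should be near 2 here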

Importantly, Bickel and Freedman (1981) addressed cases where the bootstrap fails to enjoy properties such as (6.2). In their Sect. 6 they gave two counterexamples, one involving U-statistics and the other, spacings between extreme order statistics, where the bootstrap fails to capture large-sample properties even to first order. In both settings the problems are attributable, at least in part, to failure of the bootstrap to correctly capture the relationships among very high-ranked, or very low-ranked, order statistics, and in that context we shall relate below some of the issues to which Bickel and Freedman’s (1981) work pointed. This account will be given in detail because it is relevant to later sections.

Let \({X}_{(1)} < \cdots < {X}_{(n)}\) denote the ordered values in χ n ; we assume that the common distribution of the X i s is continuous, so that the probability of a tie equals zero. In this case the probability, conditional on χ n , of the event ε n that the largest X i , i.e. X (n), is in χ n  ∗ , equals 1 minus the conditional probability that X (n) is not contained in χ n  ∗ . That is, it equals \(1 - {(1 - {n}^{-1})}^{n} = 1 - {e}^{-1} + O({n}^{-1})\). Therefore, as n → ∞,

$$P({X}_{(n)}^{{_\ast}} = {X}_{ (n)}\,\vert \,{\chi }_{n}) = P({X}_{(n)} \in {\chi }_{n}^{{_\ast}}\,\vert \,{\chi }_{ n}) \rightarrow 1 - {e}^{-1}\,,$$

where the convergence is deterministic. Similarly, for each integer k ≥ 1,

$${\pi }_{nk} \equiv P({X}_{(n)}^{{_\ast}} = {X}_{ (n-k)}\,\vert \,{\chi }_{n}) \rightarrow {\pi }_{k} \equiv {e}^{-k}\,\left(1 - {e}^{-1}\right)$$
(6.4)

as n → ∞; again the convergence is deterministic. Consequently the distribution of \({X}_{(n)}^{{_\ast}}\), conditional on \({\chi }_{n}\), is a mixture, and in particular is equal to \({X}_{(n-k)}\) with probability \({\pi }_{nk}\), for k ≥ 0. Therefore:

$$\begin{array}{l} \mbox{ given $\epsilon >0$ and any metric between distributions, for example the Lévy metric, we} \\ \mbox{ may choose $k = k(\epsilon ) \geq 1$ so large that the distribution of ${X}_{(n)}^{{_\ast}}$, conditional on ${\chi }_{n}$,} \\ \mbox{ is no more than $\epsilon $ from that of the discrete mixture ${\sum \nolimits }_{0\leq j\leq k}\,{X}_{(n-j)}\,{I}_{j}$, where (a) exactly} \\ \mbox{ one of the random variables ${I}_{0},{I}_{1},\ldots $ is nonzero, (b) that variable takes the value 1, and} \\ \mbox{ (c) $P({I}_{j} = 1) = {\pi }_{j}$ for $j \geq 0$. The upper bound of $\epsilon $ applies deterministically, in that} \\ \mbox{ it is valid with probability 1, in an unconditional sense.} \end{array}$$
(6.5)
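Incidentally, the probabilities in (6.4) are available in closed form: \(P({X}_{(n)}^{{_\ast}}\leq {X}_{(n-k)}\,\vert \,{\chi }_{n}) =\{ (n - k)/n\}^{n}\), whence \({\pi }_{nk} =\{ (n - k)/n\}^{n} -\{ (n - k - 1)/n\}^{n}\). The following few lines (ours) verify the limit numerically; the value of n is an arbitrary illustrative choice.

# Numerical check of (6.4): the exact atom probabilities approach
# pi_k = e^{-k} (1 - e^{-1}) as n grows.
import numpy as np

n = 10_000
for k in range(4):
    exact = (1 - k / n) ** n - (1 - (k + 1) / n) ** n   # pi_nk, exactly
    limit = np.exp(-k) * (1 - np.exp(-1))               # pi_k in (6.4)
    print(k, round(exact, 4), round(limit, 4))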

To indicate the implications of this property we note that, for many distributions F, there exist constants a n and b n , at least one of them diverging to infinity in absolute value as n increases, and a nonstationary stochastic process ξ0, ξ1, …, such that, for each k ≥ 0, the joint distribution of \(({X}_{(n)} - {a}_{n})/{b}_{n},\ldots ,({X}_{(n-k)} - {a}_{n})/{b}_{n}\) converges to the distribution of (ξ0, …, ξ k ). See, for example, Hall (1978). In view of (6.5) the distribution function of \(({X}_{(n)}^{{_\ast}}- {a}_{n})/{b}_{n}\), conditional on χ n , converges to that of

$$Z =\sum \limits _{j=0}^{\infty }\,{\xi }_{ j}\,{I}_{j}\,,$$

where the sequence I 0, I 1, … is distributed as in (6.5) and is chosen to be independent of ξ0, ξ1, …. In this notation,

$$P({X}_{(n)}^{{_\ast}}- {a}_{ n} \leq {b}_{n}\,z\,\vert \,{\chi }_{n}) \rightarrow P(Z \leq z)$$
(6.6)

in probability, whenever z is a continuity point of the distribution of Z. On the other hand,

$$P({X}_{(n)} - {a}_{n} \leq {b}_{n}\,z) \rightarrow P({\xi }_{0} \leq z)\,.$$
(6.7)

A comparison of (6.6) and (6.7) reveals that there is little opportunity for estimating consistently the distribution of X (n), using standard bootstrap methods. Bickel and Freedman (1981) first drew our attention to this failing of the conventional bootstrap. The issue was to be the object of considerable research for many years after the appearance of Bickel and Freedman’s paper. Methodology for solving the problem, and ensuring consistency, was eventually developed and scrutinised; commonly the m-out-of-n bootstrap is used. See, for example, Swanepoel (1986), Bickel et al. (1997) and Bickel and Sakov (2008).
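The failure is easy to see in simulation. In the sketch below (our illustration) F is the unit exponential distribution, for which one may take a_n = log n and b_n = 1, so that (6.7) holds with a Gumbel limit; yet the bootstrap law in (6.6) retains an atom of mass roughly 1 − e⁻¹ at the sample maximum.

# The conditional law of X*_(n) has an atom at X_(n) of mass ~ 1 - 1/e,
# whereas the law of X_(n) - log n is continuous (Gumbel) in the limit.
import numpy as np

rng = np.random.default_rng(2)
n, B = 2000, 2000
x = rng.exponential(size=n)
boot_max = x[rng.integers(0, n, size=(B, n))].max(axis=1)

print((boot_max == x.max()).mean())   # near 1 - e^{-1} = 0.632...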

6.1.3 The Bootstrap in Stratified Sampling

Bickel and Freedman (1984) explored properties of the bootstrap in the case of stratified sampling from finite or infinite populations, and concluded that, with appropriate scaling, the bootstrap can give consistent distribution estimators in cases where asymptotic methods fail. However, without the proper scaling the bootstrap can be inconsistent.

The problem treated is that of estimating a linear combination,

$$\gamma ={ \sum \nolimits }_{j=1}^{p}\,{c}_{ j}\,{\mu }_{j}\,,$$
(6.8)

of the means μ1, …, μ p of p populations Π 1, …, Π p with corresponding distributions F 1, …, F p . The c j s are assumed known, and the μ j s are estimated from data. To construct estimators, a random sample \(\chi (j) =\{ {X}_{j1},\ldots ,{X}_{j{n}_{j}}\}\) is drawn from the jth population, and the sample mean \(\bar{X}(j) = {n}_{j}^{-1}\,{ \sum \nolimits }_{i}\,{X}_{ji}\) is computed in each case. Bickel and Freedman (1984) considered two different choices of c j , valid in two respective cases: (a) if it is known that each E(X ji ) = μ, not depending on j, and that the variance \({\sigma }_{j}^{2}\) of Π j is proportional to r j , say, then

$${c}_{j} ={ {n}_{j}/{r}_{j} \over {\sum \nolimits }_{k}\,({n}_{k}/{r}_{k})} \,;$$

and (b) if the populations are finite, and in particular Π j is of size N j for j = 1, …, p, then

$${c}_{j} ={ {N}_{j} \over {\sum \nolimits }_{k}\,{N}_{k}} \,.$$

In either case the estimator \(\hat{\gamma }\) of γ reflects the definition of γ at (6.8):

$$\hat{\gamma } =\sum \limits _{j=1}^{p}\,{c}_{ j}\,\bar{X}(j)\,,$$

where \(\bar{X}(j)\) is the mean value of the data in χ(j).
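In code, the estimator is a one-liner once the weights are fixed. The sketch below (hypothetical strata, means and sample sizes, purely for illustration) uses the case (b) weights c_j = N_j / Σ_k N_k.

# Stratified estimator gamma_hat = sum_j c_j * Xbar(j), case (b) weights.
import numpy as np

rng = np.random.default_rng(3)
N = np.array([10_000, 4_000, 1_000])       # stratum sizes N_j (assumed)
mus, n_js = [1.0, 2.0, 3.0], [50, 30, 10]  # hypothetical means and n_j
samples = [rng.normal(loc=mu, size=n_j) for mu, n_j in zip(mus, n_js)]

c = N / N.sum()                            # c_j = N_j / sum_k N_k
gamma_hat = sum(c_j * s.mean() for c_j, s in zip(c, samples))
print(gamma_hat)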

In both cases Bickel and Freedman (1984) showed that, particularly if the sample sizes n j are small, the bootstrap estimator of the distribution of \(\hat{\gamma } - \gamma \) is not necessarily consistent, in the sense that the distribution estimator minus the true distribution may not converge to zero in probability. The asymptotic distribution of \(\hat{\gamma } - \gamma \) is normal \(N(0,{\tau }_{1}^{2})\), say; the bootstrap estimator of that distribution, conditional on the data, is asymptotically normal \(N(0,{\tau }_{2}^{2})\); but the ratio \({\tau }_{1}^{2}/{\tau }_{2}^{2}\) does not always converge to 1. Bickel and Freedman (1984) demonstrated that this difficulty can be overcome by estimating scale externally to the bootstrap process, in effect incorporating a scale correction to set the bootstrap on the right path. Bickel and Freedman also suggested other, more ad hoc remedies.

These contributions added immeasurably to our knowledge of the bootstrap. Combined with the counterexamples given earlier by Bickel and Freedman (1981), those authors showed that the bootstrap was not a device that could be used naively in all cases, without careful consideration.

Some researchers, a little outside the statistics community, had felt that bootstrap resampling methods freed statisticians from influence by a mathematical “priesthood” which was “frank about viewing resampling as a frontal attack upon their own situations” (Simon 1992). To the contrary, the work of Bickel and Freedman (1981, 1984) showed that a mathematical understanding of the problem was fundamental to determining when, and how, to apply bootstrap methods successfully. They demonstrated that mathematical theory was able to provide considerable assistance to the introduction and development of practical bootstrap methods, and they provided that aid to statisticians and non-statisticians alike.

6.1.4 Efficient Bootstrap Simulation

By the mid to late 1980s the strengths and weaknesses of bootstrap methods were becoming clearer, especially the strengths. However, computers with power comparable to that of today’s machines were not readily available at the time, and so efficient methods were required for computation. The work of Bickel and Yahav (1988) was an important contribution to that technology. It shared the limelight with other approaches to achieving computational efficiency, including the balanced bootstrap, which was a version for the bootstrap of Latin hypercube sampling and was proposed by Davison et al. (1986) (see also Graham et al. 1990); importance resampling, suggested by Davison (1988) and Johns (1988); the centring method, proposed by Efron (1990); and antithetic resampling, introduced by Hall (1990).

The main impediment to quick calculation for the bootstrap was the resampling step. In the 1980s, when for many of us computing power was in short supply, bootstrap practitioners nevertheless advocated thousands, rather than hundreds, of simulations for each sample. For example, Efron (1988), writing for an audience of psychologists, argued that “It is not excessive to use 2,000 replications, as in this paper, though we might have stopped at 1,000.” In fact, if the number of simulations, B, is chosen so that the nominal coverage level of a confidence interval can be expressed as \(b/(B + 1)\), where b is an integer, then the size of B has very little bearing on the coverage accuracy of the interval (see Hall 1986). However, choosing B too small can result in overly variable Monte Carlo approximations to endpoints for bootstrap confidence intervals, and to critical points for bootstrap hypothesis tests.
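The point about variability is easily illustrated. In the sketch below (ours; the population, nominal level and values of B are illustrative) the Monte Carlo spread of a percentile-interval endpoint across repeated simulation runs shrinks, roughly like B^{-1/2}, as B grows.

# Run-to-run variability of a 97.5% percentile endpoint, for several B.
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.exponential(size=n)

def upper_endpoint(B):
    # 97.5% percentile endpoint computed from B bootstrap means
    means = x[rng.integers(0, n, size=(B, n))].mean(axis=1)
    return np.quantile(means, 0.975)

for B in (40, 400, 4000):
    reps = [upper_endpoint(B) for _ in range(100)]
    print(B, np.std(reps))     # variability across repeated runs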

It is instructive here to relate a story that G.S. Watson told me in 1988, the year in which Bickel and Yahav’s paper was published. Throughout his professional life Watson was an enthusiast of the latest statistical methods, and the bootstrap was no exception. Shortly after the appearance of Efron’s (1979) seminal paper he began to experiment with the percentile bootstrap technique. Not for Watson a tame problem involving a sample of scalar data; in what must have been one of the first applications of the bootstrap to spatial or spherical data, he used that technique to construct confidence regions for the mean direction derived from a sample of points on a sphere. He wrote a program that constructed bootstrap confidence regions, put the code onto a floppy disc, and passed the disc to a Princeton geophysicist to experiment with. This, he told the geophysicist, was the modern alternative to conventional confidence regions based on the von Mises-Fisher distribution. The latter regions, of course, took their shape from the mathematical form of the fitted distribution, with relatively little regard for any advice that the data might have to offer. What did the geophysicist think of the new approach?

In due course Watson received a reply, to the effect that the method was very interesting and remarkably flexible, adapting itself well to quite different datasets. But it had a basic flaw, the geophysicist said, that made it unattractive—every time he applied the code on the floppy disc to the same set of spherical data, he got a different answer! Watson, limited by the computational resources of the day, and by the relative complexity of computations on a sphere, had produced software that did only about B = 40 simulations each time the algorithm was implemented. Particularly with the extra degree of freedom that two dimensions provided for fluctuations, the results varied rather noticeably from one time-based simulation seed to another.

This tale defines the context of Bickel and Yahav’s (1988) paper. Their goal was to develop algorithms for reducing the variability, and enhancing the accuracy in that sense, of Monte Carlo procedures for implementing the bootstrap. Their approach, a modification for the bootstrap of the technique of Richardson extrapolation (a classical tool in numerical analysis; see Jeffreys and Jeffreys 1988, p. 288), ran as follows. Let \({\widehat{F}}_{n}\) (not to be confused with the same notation, but having a different meaning, in Sect. 6.1.2) denote the data-based distribution function of interest, and let F n be the quantity of which \({\widehat{F}}_{n}\) is an approximation. For example, \({\widehat{F}}_{n}(x)\) might equal \(P({\hat{\theta }}_{n}^{{_\ast}}-{\hat{\theta }}_{n} \leq x\,\vert \,{\chi }_{n})\), where \({\hat{\theta }}_{n}\) denotes an estimator of a parameter θ, computed from a random sample χ n of size n, in which case \({\hat{\theta }}_{n}^{{_\ast}}\) would be the bootstrap version of \({\hat{\theta }}_{n}\). (In this example, \({F}_{n}(x) = P({\hat{\theta }}_{n} - \theta \leq x)\).) Instead of estimating \({\widehat{F}}_{n}\) directly, compute estimators of the distribution functions \({\widehat{F}}_{{n}_{1}},\ldots ,{\widehat{F}}_{{n}_{r}}\), where the sample sizes \({n}_{1},\ldots ,{n}_{r}\) are all smaller than n, and in fact so small that \({n}_{1} + \ldots + {n}_{r}\) is markedly less than n. In some instances we may also know the limit \({F}_{\infty }\) of F n , or at least its form, \({\widetilde{F}}_{\infty }\) say, constructed by replacing any unknown quantities (for example, a variance) by estimators computed from χ n . The quantities \({\widehat{F}}_{{n}_{1}},\ldots ,{\widehat{F}}_{{n}_{r}}\) and \({\widetilde{F}}_{\infty }\) are much less expensive, i.e. much faster, to compute than \({\widehat{F}}_{n}\), and so, by suitable “interpolation” from these functions, we can hope to get a very good approximation to \({\widehat{F}}_{n}\) without going to the expense of actually calculating the latter.
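A minimal sketch of the extrapolation idea follows. Here the quantity extrapolated is a quantile of the bootstrap distribution of the normalised mean, the small sample sizes n_1, …, n_r are mimicked by resamples of those sizes, and the working model q(ν) ≈ a + bν^{-1/2} is our assumption for illustration; Bickel and Yahav (1988) work with expansions of this general kind, not with this exact recipe.

# Extrapolation sketch: fit a cheap model at small sizes, predict at n.
import numpy as np

rng = np.random.default_rng(5)
n, B = 10_000, 1000
x = rng.exponential(size=n)

def boot_quantile(m):
    # 97.5% point of the conditional law of m^{1/2}(Xbar*_m - Xbar_n)
    roots = np.sqrt(m) * (
        x[rng.integers(0, n, size=(B, m))].mean(axis=1) - x.mean())
    return np.quantile(roots, 0.975)

ns = np.array([200, 400, 800])                  # n_1, ..., n_r << n
qs = np.array([boot_quantile(m) for m in ns])   # cheap evaluations

# Least-squares fit of q(nu) = a + b / sqrt(nu), extrapolated to nu = n.
A = np.column_stack([np.ones(len(ns)), 1 / np.sqrt(ns)])
a, b = np.linalg.lstsq(A, qs, rcond=None)[0]
print(a + b / np.sqrt(n))       # extrapolated approximation
print(boot_quantile(n))         # the expensive direct computation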

In general the cost of simulating, or equivalently the time taken to simulate, is approximately proportional to C n B, where C n depends only on n and increases with that quantity. Techniques for enhancing the performance of Monte Carlo methods can either directly produce greater accuracy for a given value of B (the balanced bootstrap has this property), or reduce the value of C n and thereby allow a larger value of B (hence, greater accuracy from the viewpoint of reduced variability) for a given cost. Bickel and Yahav’s (1988) method is of the latter type. By enabling a larger value of B it alleviates the problem encountered by Watson and his geophysicist friend.

Bickel and Yahav’s (1988) technique is particularly widely applicable, and has the potential to improve efficiency more substantially than, say, the balanced bootstrap. Today, however, statisticians’ demands for efficient bootstrap methods have been largely assuaged by the development of more powerful computers. In the last 15 years there have been very few new simulation algorithms tailored to the bootstrap. Philippe Toint’s aphorism that “I would rather have today’s algorithms on yesterday’s computers, than vice versa,” loses impact when an algorithm is to some extent problem-specific, and its implementation requires skills that go beyond those needed to purchase a new, faster computer.

6.1.5 The m-Out-of-n Bootstrap

The m-out-of-n bootstrap is another example revealing that, in science, less is often more. Bickel and Freedman (1981, 1984) had shown that the standard bootstrap can fail, even at the level of statistical consistency, in a variety of settings; and, as we noted in Sect. 6.1.2, the m-out-of-n bootstrap, where m is an order of magnitude smaller than n, is often a remedy. Swanepoel (1986) was the first to suggest this method, which we shall define in the next paragraph. Bickel et al. (1997) made major contributions to the study of its theoretical properties. We shall give an example that provides more detail than we gave in Sect. 6.1.2 about the failure of the bootstrap in certain cases. Then we shall summarise briefly the contributions made by Bickel et al. (1997).

Consider drawing a resample \({\chi }_{m}^{{_\ast}} =\{ {X}_{1}^{{_\ast}},\ldots ,{X}_{m}^{{_\ast}}\}\), of size m, from the original dataset \({\chi }_{n} =\{ {X}_{1},\ldots ,{X}_{n}\}\) of size n, and let \(\hat{\theta } ={ \hat{\theta }}_{n}\) denote the bootstrap estimator of θ computed from χ n . In particular, if we can express θ as a functional, say θ(F), of the distribution function F of the data X i , then

$${ \hat{\theta }}_{n} = \theta ({\widehat{F}}_{n})\,,$$
(6.9)

where \({\widehat{F}}_{n}\) is the empirical distribution function computed from χ n . Likewise we can define \({\hat{\theta }}_{m}^{{_\ast}} = \theta ({\widehat{F}}_{m}^{{_\ast}})\), where \({\widehat{F}}_{m}^{{_\ast}}\) is the empirical distribution function for χ m  ∗ . As we noted in Sect. 6.1.2, Bickel and Freedman (1981) showed that first-order properties of \({\hat{\theta }}_{m}^{{_\ast}}\) are often robust against the value of m. In particular it is often the case that, for each ε > 0,

$$P(\vert {\hat{\theta }}_{m}^{{_\ast}}-{\hat{\theta }}_{ n}\vert > \epsilon \,\vert \,{\chi }_{n}) \rightarrow 0\,,\quad P(\vert {\hat{\theta }}_{n} - \theta \vert > \epsilon ) \rightarrow 0$$
(6.10)

as m and n diverge, where the first convergence is with probability 1. Compare (6.1). For example, (6.10) holds if θ is a moment, such as a mean or a variance, and if the sampling distribution has sufficiently many finite moments.

The definition (6.9) is conventionally used for a bootstrap estimator, and it does not necessarily involve simulation. For example, if θ = ∫xdF(x) is a population mean then

$${\hat{\theta }}_{n} = \int \nolimits \nolimits x\,d{\widehat{F}}_{n}(x) =\bar{ X}\,,\quad {\hat{\theta }}_{m}^{{_\ast}} = \int \nolimits \nolimits x\,d{\widehat{F}}_{m}^{{_\ast}}(x) =\bar{ {X}}^{{_\ast}}$$

are the sample mean and resample mean, respectively. However, in a variety of other cases the most appropriate way of defining and computing \({\hat{\theta }}_{n}\) is in terms of the resample χ n  ∗ ; that is, χ m  ∗  with m = n. Consider, for instance, the case where

$$\theta = P\left ({X}_{(n)} - {X}_{(n-1)} > {X}_{(n-1)} - {X}_{(n-2)}\right )\,,$$
(6.11)

in which, as in Sect. 6.1.2, we take \({X}_{(1)} < \cdots < {X}_{(n)}\) to be an ordering of the data in χ n , assumed to have a common continuous distribution. For many sampling distributions, in particular distributions that lie in the domain of attraction of an extreme-value law, θ depends on n but converges to a strictly positive number as n increases.

In this example the bootstrap estimator, \({\hat{\theta }}_{n}\), of θ, based on a sample of size n, is defined by

$${ \hat{\theta }}_{n} = P\left({X}_{(n)}^{{_\ast}}- {X}_{(n-1)}^{{_\ast}} > {X}_{(n-1)}^{{_\ast}}- {X}_{(n-2)}^{{_\ast}}\;\big\vert \;{\chi }_{n}\right),$$
(6.12)

where \({X}_{(1)}^{{_\ast}}\leq \cdots \leq {X}_{(n)}^{{_\ast}}\) are the ordered data in χ n  ∗ . Analogously, the bootstrap version, \({\hat{\theta }}_{n}^{{_\ast}}\), of \({\hat{\theta }}_{n}\) is defined using the double bootstrap:

$${\hat{\theta }}_{n}^{{_\ast}} = P\left({X}_{(n)}^{{_\ast}{_\ast}}- {X}_{(n-1)}^{{_\ast}{_\ast}} > {X}_{(n-1)}^{{_\ast}{_\ast}}- {X}_{(n-2)}^{{_\ast}{_\ast}}\;\big\vert \;{\chi }_{n}^{{_\ast}}\right),$$

where \({X}_{(1)}^{{_\ast}{_\ast}}\leq \cdots \leq {X}_{(n)}^{{_\ast}{_\ast}}\) are the ordered data in \({\chi }_{n}^{{_\ast}{_\ast}} =\{ {X}_{1}^{{_\ast}{_\ast}},\ldots ,{X}_{n}^{{_\ast}{_\ast}}\}\), drawn by sampling randomly, with replacement, from χ n  ∗ . However, for the reasons given in the paragraph containing (6.5), property (6.10) fails in this example, no matter how we choose m. (The m in (6.10) is different from the m for the m-out-of-n bootstrap.) The bootstrap fails to model accurately the relationships among large order statistics, to such an extent that, in the example characterised by (6.11), \({\hat{\theta }}_{n}\) does not converge to θ.

This problem evaporates if, in defining \({\hat{\theta }}_{n}\) at (6.12), we take the resample χ m  ∗  to have size m = m(n), where

$$m \rightarrow \infty \quad \mbox{ and}\quad m/n \rightarrow 0$$
(6.13)

as n → ∞. That is, instead of (6.12) we define

$${ \hat{\theta }}_{n} = P\left({X}_{(m)}^{{_\ast}}- {X}_{(m-1)}^{{_\ast}} > {X}_{(m-1)}^{{_\ast}}- {X}_{(m-2)}^{{_\ast}}\;\big\vert \;{\chi }_{n}\right),$$
(6.14)

where \({X}_{1}^{{_\ast}},\ldots ,{X}_{m}^{{_\ast}}\) are drawn by sampling randomly, with replacement, from χ n . In this case, provided (6.13) holds, (6.10) is correct in a wide range of settings.

Deriving this result mathematically takes a little effort, but intuitively it is rather clear: By taking m to be of strictly smaller order than n we ensure that the probability that X (m)  ∗  equals any given data value in χ n , for example X (n), converges to zero, and so the difficulties raised in the paragraph containing (6.5) no longer apply. In particular, instead of (6.4) we have:

$$P({X}_{(m-k)}^{{_\ast}} = {X}_{(n-\ell)}\,\vert \,{\chi }_{n}) \rightarrow 0$$

in probability, for each fixed, nonnegative integer k and ℓ, as n → ∞. Further thought along the same lines indicates that the conditional distribution of \({X}_{(m)}^{{_\ast}}- {X}_{(m-1)}^{{_\ast}}\) should now, under mild assumptions, be a consistent estimator of the distribution of \({X}_{(n)} - {X}_{(n-1)}\).
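The contrast between (6.12) and (6.14) can be seen directly by simulation. In the sketch below (our construction) F is unit exponential, for which the Rényi representation gives X (n) − X (n − 1) and X (n − 1) − X (n − 2) as independent exponentials with rates 1 and 2, so that θ at (6.11) equals 2/3 exactly; the choice m = n^{1/2} satisfies (6.13), and all sizes are illustrative.

# Monte Carlo versions of (6.12) (m = n) and (6.14) (m << n) for theta
# at (6.11); here theta = 2/3 exactly for exponential data.
import numpy as np

rng = np.random.default_rng(6)
n, B = 2000, 2000
x = rng.exponential(size=n)

def theta_hat(m):
    # top three order statistics of each of B resamples of size m
    top3 = np.sort(x[rng.integers(0, n, size=(B, m))], axis=1)[:, -3:]
    return np.mean(top3[:, 2] - top3[:, 1] > top3[:, 1] - top3[:, 0])

print(theta_hat(n))              # n-out-of-n, cf. (6.12): inconsistent
print(theta_hat(int(n ** 0.5)))  # m-out-of-n, cf. (6.14): near 2/3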

Bickel et al. (1997) gave a sequence of four counterexamples illustrating cases where the bootstrap fails, and provided two examples of the success of the bootstrap. The first two counterexamples relate to extrema, and so are closely allied to the example considered above. The next two treat, respectively, hypothesis testing and improperly centred U- and V-statistics, and estimating nonsmooth functionals of the population distribution function. Bickel et al. (1997) then developed a deep, general theory which allowed them to construct accurate and insightful approximations to bootstrap statistics \({\hat{\theta }}_{n}\), such as that at (6.9), not just in that case but also when \({\hat{\theta }}_{n}\) is defined using the m-out-of-n bootstrap, as at (6.14). This enabled them to show that, in a large class of problems for which (6.13) holds, the m-out-of-n bootstrap overcomes consistency problems inherent in the conventional n-out-of-n approach, and also to derive rates of convergence.

A reliable way of choosing m empirically is of course necessary if the m-out-of-n bootstrap is to be widely adopted. In many cases this is still an open problem, although important contributions were made recently by Bickel and Sakov (2008).
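To give the flavour of such empirical rules, the sketch below implements one variant of the idea we associate with Bickel and Sakov (2008), as we understand it: compute the m-out-of-n bootstrap distribution along a geometric grid m_j = ⌈q^j n⌉ and select the m at which successive distributions are closest in the Kolmogorov metric. The statistic (a normed bootstrap maximum), the ratio q and all sizes are our illustrative assumptions, not their prescription.

# Adaptive choice of m: stability of successive m-out-of-n distributions.
import numpy as np

rng = np.random.default_rng(7)
n, q, B = 2000, 0.75, 1000
x = rng.exponential(size=n)

def roots(m):
    # conditional law of X*_(m) - log m (normed bootstrap maxima)
    return x[rng.integers(0, n, size=(B, m))].max(axis=1) - np.log(m)

def kolmogorov(a, b):
    # Kolmogorov distance between the empirical laws of two samples
    grid = np.sort(np.concatenate([a, b]))
    Fa = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    Fb = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(Fa - Fb).max()

ms, m = [], n
while m > 30:                       # geometric grid m_j = ceil(q^j * n)
    ms.append(m)
    m = int(np.ceil(q * m))
dists = [kolmogorov(roots(ms[j]), roots(ms[j + 1]))
         for j in range(len(ms) - 1)]
print(ms[int(np.argmin(dists))])    # the selected m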