
Asymptotic theory is the study of how the distributions of functions of a set of random variables behave when the number of variables becomes large. One practical context for this is statistical sampling, when the number of observations taken is large. Distributional calculations in probability are typically such that exact calculations are difficult or impossible. For example, one of the simplest possible functions of n variables is their sum, and yet in most cases we cannot find the distribution of the sum for fixed n in an exact closed form. But the central limit theorem allows us to conclude that in some cases the sum will behave as a normally distributed random variable when n is large. Similarly, the role of general asymptotic theory is to provide approximate answers, in place of exact solutions, in many types of problems, often very complicated ones. The nature of the theory is such that the approximations have a remarkable, indeed nearly unreasonable, unity of character. Asymptotic theory is the single most unifying theme of probability and statistics. In statistics particularly, nearly every method, rule, or tradition has its root in some result in asymptotic theory. No other branch of probability and statistics has such an incredibly rich body of literature, tools, and applications, in amazingly diverse areas and problems. Skills in asymptotics are nearly indispensable for a serious statistician or probabilist.

Numerous excellent references on asymptotic theory are available. A few among them are Bickel and Doksum (2007), van der Vaart (1998), Lehmann (1999), Hall and Heyde (1980), Ferguson (1996), and Serfling (1980). A recent reference is DasGupta (2008). These references have a statistical undertone. Treatments of asymptotic theory with a probabilistic undertone include Breiman (1968), Ash (1973), Chow and Teicher (1988), Petrov (1975), Bhattacharya and Rao (1986), Cramér (1946), and the all-time classic, Feller (1971). Other specific references are given in the sections.

In this introductory chapter, we lay out the basic concepts of asymptotics with concrete applications. More specialized tools are separately treated in subsequent chapters.

7.1 Some Basic Notation and Convergence Concepts

Some basic definitions, notation, and concepts are put together in this section.

Definition 7.1.

Let a n be a sequence of real numbers. We write a n  = o(1) if \({\lim }_{n\rightarrow \infty }{a}_{n} = 0\). We write a n  = O(1) if \(\exists \,K < \infty \) such that \(\vert {a}_{n}\vert \leq K\ \forall n \geq 1\).

More generally, if a n , b n  > 0 are two sequences of real numbers, we write a n  = o(b n ) if \(\frac{{a}_{n}} {{b}_{n}} = o(1)\); we write a n  = O(b n ) if \(\frac{{a}_{n}} {{b}_{n}} = O(1)\).

Remark.

Note that the definition allows the possibility that a sequence a n which is O(1) is also o(1); however, O(1) does not by itself imply o(1). The converse is always true; that is, a n  = o(1) ⇒ a n  = O(1).

Definition 7.2.

Let a n , b n be two real sequences. We write a n  ∼ b n if \(\frac{{a}_{n}} {{b}_{n}} \rightarrow 1\), as n → ∞. We write a n  ≍ b n if \(0 < liminf\ \frac{{a}_{n}} {{b}_{n}} \leq limsup\ \frac{{a}_{n}} {{b}_{n}} < \infty \), as n → ∞.

Example 7.1.

Let \({a}_{n} = \frac{n} {n+1}\). Then, | a n  | ≤ 1 ∀n ≥ 1; so a n  = O(1). But a n  → 1 as n → ∞; so a n is not o(1).

However, suppose \({a}_{n} = \frac{1} {n}\). Then, again, | a n  | ≤ 1 ∀n ≥ 1; so a n  = O(1). But this time a n  → 0 as n → ∞; so a n is both O(1) and o(1). In this case, a n  = O(1) is a weaker statement than a n  = o(1).

Next, suppose a n  =  − n. Then | a n  |  = n → ∞, as n → ∞; so a n is not O(1), and therefore also cannot be o(1).

Example 7.2.

Let c n  = log n, and \({a}_{n} = \frac{{c}_{n}} {{c}_{n+k}}\), where k ≥ 1 is a fixed positive integer. Thus,

$${a}_{n} = \frac{\log n} {\log (n + k)} = \frac{\log n} {\log n +\log (1 + \frac{k} {n})} = \frac{1} {1 + \frac{1} {\log n}\log (1 + \frac{k} {n})} \rightarrow \frac{1} {1 + 0} = 1.$$

Therefore, a n  = O(1), a n  ∼ 1, but a n is not o(1). The statement that a n  ∼ 1 is stronger than saying a n  = O(1).

Example 7.3.

Let \({a}_{n} = \frac{1} {\sqrt{n+1}}\). Then,

$$\begin{array}{rcl}{ a}_{n}& =& \frac{1} {\sqrt{n}} + \frac{1} {\sqrt{n + 1}} - \frac{1} {\sqrt{n}} \\ & =& \frac{1} {\sqrt{n}} + \frac{\sqrt{n} -\sqrt{n + 1}} {\sqrt{n}\sqrt{n + 1}} = \frac{1} {\sqrt{n}} - \frac{1} {\sqrt{n}\sqrt{n + 1}(\sqrt{n} + \sqrt{n + 1})} \\ & =& \frac{1} {\sqrt{n}} - \frac{1} {\sqrt{n}\sqrt{n}(1 + o(1))(2\sqrt{n} + o(1))} \\ & =& \frac{1} {\sqrt{n}} - \frac{1} {n(1 + o(1))(2\sqrt{n} + o(1))} \\ & =& \frac{1} {\sqrt{n}} - \frac{1} {2{n}^{3/2}(1 + o(1))(1 + o(1))} \\ \end{array}$$
$$\begin{array}{rcl} & =& \frac{1} {\sqrt{n}} - \frac{1} {2{n}^{3/2}(1 + o(1))} \\ & =& \frac{1} {\sqrt{n}} - \frac{1} {2{n}^{3/2}}(1 + o(1)) = \frac{1} {\sqrt{n}} - \frac{1} {2{n}^{3/2}} + o\left ({n}^{-3/2}\right )\end{array}.$$
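Expansions of this kind are easy to sanity-check numerically. The following is a minimal sketch (assuming NumPy is available; the variable names are ours) verifying that the remainder in the last display is indeed \(o({n}^{-3/2})\), by showing that it still tends to zero after being scaled up by n 3∕2:

```python
import numpy as np

# Check: 1/sqrt(n+1) = 1/sqrt(n) - 1/(2 n^{3/2}) + o(n^{-3/2}).
# The remainder, multiplied by n^{3/2}, should tend to zero.
for n in [10, 100, 1000, 10_000]:
    a_n = 1.0 / np.sqrt(n + 1)
    approx = 1.0 / np.sqrt(n) - 1.0 / (2 * n**1.5)
    print(n, n**1.5 * (a_n - approx))
```

The printed values decrease roughly like 3∕(8n), consistent with the next term of the expansion.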

In working with asymptotics, it is useful to be skilled in calculations of the type of this example. To motivate the first probabilistic convergence concept, we give an illustrative example.

Example 7.4.

For n ≥ 1, consider the simple discrete random variables X n with the pmf \(P({X}_{n} = \frac{1} {n}) = 1 - \frac{1} {n},P({X}_{n} = n) = \frac{1} {n}\). Then, for large n, X n is close to zero with a large probability. Although for any given n, X n is never equal to zero, the probability of it being far from zero is very small for large n. For example, P(X n  > .001) ≤ .001 if n ≥ 1000. More formally, for any given ε > 0, \(P({X}_{n} > \epsilon ) \leq P({X}_{n} > \frac{1} {n}) = \frac{1} {n}\), if we take n to be so large that \(\frac{1} {n} < \epsilon \). As a consequence, P( | X n  |  > ε) = P(X n  > ε) → 0, as n → ∞. This example motivates the following definition.
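Before the formal definition, the convergence in this example can be watched in a small simulation; a sketch, assuming NumPy (the seed and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.01
for n in [10, 100, 1000, 10_000]:
    u = rng.random(100_000)
    # X_n = n with probability 1/n, and X_n = 1/n otherwise
    x_n = np.where(u < 1.0 / n, n, 1.0 / n)
    # Once 1/n < eps, this estimates P(|X_n| > eps) = 1/n
    print(n, np.mean(np.abs(x_n) > eps))
```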

Definition 7.3.

Let X n , n ≥ 1, be an infinite sequence of random variables defined on a common sample space Ω. We say that X n converges in probability to c, a specific real constant, if for any given ε > 0, P( | X n  − c |  > ε) → 0, as n → ∞. Equivalently, X n converges in probability to c if given any ε > 0, δ > 0, there exists an n 0 = n 0(ε, δ) such that P( | X n  − c |  > ε) < δ  ∀n  ≥ n 0(ε, δ).

If X n converges in probability to c, we write \({X}_{n}{ \mathcal{P} \atop \Rightarrow } c\), or sometimes also, \({X}_{n}{ P \atop \Rightarrow } c\).

However, sometimes the sequence X n may get close to some random variable, rather than a constant. Here is an example of such a situation.

Example 7.5.

Let X, Y be two independent standard normal variables. Define a sequence of random variables X n as \({X}_{n}\ =\ X + \frac{Y } {n}\). Then, intuitively, we feel that for large n, the \(\frac{Y } {n}\) part is small, and X n is very close to the fixed random variable X. Formally, \(P(\vert {X}_{n} - X\vert > \epsilon ) = P(\vert \frac{Y } {n} \vert > \epsilon ) = P(\vert Y \vert \,>\,n\epsilon )\) = 2[1 − Φ(nε)] → 0, as n → ∞. This motivates a generalization of the previous definition.
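Before stating that general definition, here is a quick numerical illustration of this example; a sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 0.05
x = rng.standard_normal(200_000)
y = rng.standard_normal(200_000)
for n in [1, 10, 50, 100]:
    x_n = x + y / n
    # Estimates P(|X_n - X| > eps) = 2[1 - Phi(n*eps)], which tends to 0
    print(n, np.mean(np.abs(x_n - x) > eps))
```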

Definition 7.4.

Let X n , X, n ≥ 1 be random variables defined on a common sample space Ω. We say that X n converges in probability to X if given any ε > 0, P( | X n  − X |  > ε) → 0, as n → ∞.

We denote it as \({X}_{n}{ \mathcal{P} \atop \Rightarrow } X\), or \({X}_{n}{ P \atop \Rightarrow } X\).

Definition 7.5.

A sequence of random variables X n is said to be bounded in probability or tight if, given ε > 0, one can find a constant k such that P( | X n  |  > k) ≤ ε for all n ≥ 1.

If \({X}_{n}{ \mathcal{P} \atop \Rightarrow } 0\), then we write X n  = o p (1). More generally, if \({a}_{n}{X}_{n}{ \mathcal{P} \atop \Rightarrow } 0\) for some positive sequence a n , then we write \({X}_{n}={o}_{p}( \frac{1} {{a}_{n}})\).

If X n is bounded in probability, then we write X n  = O p (1). If a n X n  = O p (1), we write \({X}_{n} = {O}_{p}( \frac{1} {{a}_{n}})\).

Proposition.

Suppose X n = o p (1). Then, X n = O p (1). The converse is, in general, not true.

Proof.

If X n  = o p (1), then by definition of convergence in probability, given c > 0, P( | X n  |  > c) → 0, as n → ∞. Thus, given any fixed ε > 0, for all large n, say n ≥ n 0(ε), P( | X n  |  > 1) < ε. Next find constants \({c}_{1},{c}_{2},\ldots, {c}_{{n}_{0}}\), such that P( | X i  |  > c i ) < ε, i = 1, 2, …, n 0. Choose \(k =\max \{ 1,{c}_{1},{c}_{2},\ldots, {c}_{{n}_{0}}\}\). Then, P( | X n  |  > k) < ε  ∀n ≥ 1. Therefore, X n  = O p (1).

To see that the converse is, in general, not true, let X ∼ N(0, 1), and define X n  ≡ X,   ∀n ≥ 1. Then, X n  = O p (1). But P( | X n  |  > 1) ≡ P( | X |  > 1), which is a fixed positive number, and so, P( | X n  |  > 1) does not converge to zero. □ 

Definition 7.6.

Let {X n , X} be defined on the same probability space. We say that X n converges almost surely to X (or X n converges to X with probability 1) if P(ω : X n (ω) → X(ω)) = 1. We write \({X}_{n}{ \mathrm{a.s.} \atop \rightarrow } X\) or \({X}_{n}{ \mathrm{a.s.} \atop \Rightarrow } X\).

Remark.

If the limit X is a finite constant c with probability one, we write \({X}_{n}{ \mathrm{a.s.} \atop \Rightarrow } c\). If P(ω : X n (ω) → ∞) = 1, we write \({X}_{n}{ \mathrm{a.s.} \atop \rightarrow } \infty \). Almost sure convergence is a stronger mode of convergence than convergence in probability. In fact, a characterization of almost sure convergence is that for any given ε > 0,

$$\mathop{\lim }\limits_{m\rightarrow \infty }P(\vert {X}_{n} - X\vert \leq \epsilon \quad \forall n \geq m) = 1.$$

It is clear from this characterization that almost sure convergence is stronger than convergence in probability. However, the following relationships hold.

Theorem 7.1.

  1. (a)

    Let \({X}_{n}{ \mathcal{P} \atop \Rightarrow } X\) . Then there is a sub-sequence \({X}_{{n}_{k}}\) such that \({X}_{{n}_{k}}{ \mathrm{a.s.} \atop \rightarrow } X\).

  2. (b)

    Suppose X n is a monotone nondecreasing sequence and that \({X}_{n}{ \mathcal{P} \atop \Rightarrow } X\) . Then \({X}_{n}{ \mathrm{a.s.} \atop \rightarrow } X\).

Example 7.6 (Pattern in Coin Tossing). 

For iid Bernoulli trials with a success probability \(p = \frac{1} {2}\), let T n denote the number of times in the first n trials that a success is followed by a failure. Denoting

$${I}_{i} = I\{i\mbox{th trial is a success and }(i + 1)\mbox{th trial is a failure}\},$$

we have \({T}_{n} ={ \sum \nolimits }_{i=1}^{n-1}{I}_{i}\), and therefore \(E({T}_{n}) = \frac{n-1} {4}\), and

$$\mbox{ Var}({T}_{n}) ={ \sum \nolimits }_{i=1}^{n-1}\mbox{ Var}({I}_{ i})+2{\sum \nolimits }_{i=1}^{n-2}\mbox{ Cov}({I}_{ i},{I}_{i+1}) = \frac{3(n - 1)} {16} -\frac{2(n - 2)} {16} = \frac{n + 1} {16}.$$

It now follows by an application of Chebyshev’s inequality that \(\frac{{T}_{n}} {n}{ P \atop \Rightarrow } \frac{1} {4}\).
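This convergence is easy to watch empirically; a simulation sketch, assuming NumPy (the seed and the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
for n in [100, 10_000, 1_000_000]:
    success = rng.random(n) < 0.5            # iid Bernoulli(1/2) trials
    # I_i = 1 iff trial i is a success and trial i + 1 is a failure
    t_n = np.sum(success[:-1] & ~success[1:])
    print(n, t_n / n)                        # should approach 1/4
```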

Example 7.7 (Uniform Maximum). 

Suppose X 1, X 2, … is an infinite sequence of iid U[0, 1] random variables and let X (n) = max{X 1, …, X n }. Intuitively, X (n) should get closer and closer to 1 as n increases. In fact, X (n) converges almost surely to 1. For P( | 1 − X (n) |  ≤ ε ∀n ≥ m) = P(1 − X (n) ≤ ε ∀n ≥ m) = P(X (n) ≥ 1 − ε ∀n ≥ m) = P(X (m) ≥ 1 − ε) (because X (n) is nondecreasing in n) \(= 1 - {(1 - \epsilon )}^{m} \rightarrow 1\) as m → ∞, and hence \({X}_{(n)}{ \mathrm{a.s.} \atop \Rightarrow } 1\).
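One sample path makes the point; a sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.random(10_000)              # one realization of X_1, ..., X_10000
x_max = np.maximum.accumulate(x)    # the running maximum X_(n)
print(x_max[[9, 99, 999, 9999]])    # X_(10), X_(100), X_(1000), X_(10000) -> 1
```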

Example 7.8 (Spiky Normals). 

Suppose \({X}_{n} \sim N( \frac{1} {n}, \frac{1} {n})\) is a sequence of independent variables. The mean and the variance both converge to zero; therefore, one would intuitively expect that the sequence X n converges to zero in some sense. In fact, it converges almost surely to zero. Indeed, \(P(\vert {X}_{n}\vert \leq \epsilon \quad \forall n \geq m) ={ \prod \nolimits }_{n=m}^{\infty }P(\vert {X}_{n}\vert \leq \epsilon ) ={ \prod \nolimits }_{n=m}^{\infty }[\Phi (\epsilon \sqrt{n} - \frac{1} {\sqrt{n}}) + \Phi (\epsilon \sqrt{n} + \frac{1} {\sqrt{n}}) - 1] ={ \prod \nolimits }_{n=m}^{\infty }[1 + O(\frac{\phi (\epsilon \sqrt{n})} {\sqrt{n}} )] = 1 + O(\frac{({e}^{-\frac{m{\epsilon }^{2}} {2} })} {\sqrt{m}} ) \rightarrow 1\), as m → ∞, implying \({X}_{n}{ \mathrm{a.s.} \atop \Rightarrow } 0\). In the above, the next-to-last equality follows by using the tail property of the standard normal CDF that

$$\frac{1 - \Phi (x)} {\phi (x)} = \frac{1} {x} + O\left ( \frac{1} {{x}^{3}}\right ),\quad \mathrm{as}\ x \rightarrow \infty.$$

Next, we introduce the concept of convergence in mean. It often turns out to be a convenient method for establishing convergence in probability.

Definition 7.7.

Let X n , X, n ≥ 1 be defined on a common sample space Ω. Let p ≥ 1, and suppose \(E(\vert {X}_{n}{\vert }^{p}),E(\vert X{\vert }^{p}) < \infty \). We say that X n converges in pth mean to X or X n converges in L p to X if E( | X n  − X | p) → 0, as n → ∞. If p = 2, we also say that X n converges to X in quadratic mean. If X n converges in L p to X, we write \({X}_{n}{ {L}_{p} \atop \Rightarrow } X\).

Example 7.9 (Some Counterexamples). 

Let X n be the sequence of two point random variables with pmf \(P({X}_{n} = 0) = 1 - \frac{1} {n},P({X}_{n} = n) = \frac{1} {n}\). Then X n converges in probability to zero. But, E( | X n  | ) = 1  ∀n, and hence X n does not converge in L 1 to zero. In fact, it does not converge to zero in L p for any p ≥ 1.

Now take the same sequence X n as above, and assume moreover that they are independent. Take an ε > 0, and positive integers m, k. Then,

$$\begin{array}{rcl} & & P(\vert {X}_{n}\vert < \epsilon \,\,\forall m \leq n \leq m + k) \\ & =& P({X}_{n} = 0\,\,\forall m \leq n \leq m + k) ={ \prod \nolimits }_{n=m}^{m+k}\left (1 - \frac{1} {n}\right ) \\ & =& \frac{m - 1} {m + k}\end{array}.$$

For any m, this converges to zero as k → ∞. Therefore, \({\lim }_{m\rightarrow \infty }P(\vert {X}_{n}\vert < \epsilon \ \forall n \geq m)\) cannot be one, and so X n does not converge almost surely to zero.

Next, let X n have the pmf \(P({X}_{n} = 0) = 1 - \frac{1} {n},P({X}_{n} = \sqrt{n}) = \frac{1} {n}\). Then, X n again converges in probability to zero. Furthermore, \(E(\vert {X}_{n}\vert ) = \frac{1} {\sqrt{n}} \rightarrow 0\), and so X n converges in L 1 to zero. But, E(X n 2) = 1  ∀n, and hence X n does not converge in L 2 to zero.

The next result says that convergence in L p is a useful method for establishing convergence in probability.

Proposition.

Let X n ,X,n ≥ 1 be defined on a common sample space Ω. Suppose X n converges to X in L p for some p ≥ 1. Then \({X}_{n}{ P \atop \Rightarrow } X\).

Proof.

Simply observe that, by using Markov’s inequality,

$$P(\vert {X}_{n} - X\vert > \epsilon ) = P(\vert {X}_{n} - X{\vert }^{p} > {\epsilon }^{p}) \leq \frac{E(\vert {X}_{n} - X{\vert }^{p})} {{\epsilon }^{p}}$$

 → 0, by hypothesis. □ 

Remark.

It is easily established that if p > 1, then

$${X}_{n}\,\,\mbox{ converges in}\,\,{L}_{p}\,\,\mbox{ to}\,\,X \Rightarrow {X}_{n}\,\,\mbox{ converges in}\,\,{L}_{1}\,\,\mbox{ to}\,\,X.$$

7.2 Laws of Large Numbers

The definitions and the treatment in the previous section are for general sequences of random variables. Averages and sums are sequences of special importance in applications. The classic laws of large numbers, which characterize the long run behavior of averages, are given in this section. Truly, the behavior of averages and sums in complete generality is very subtle, and is beyond the scope of this book. Specialized treatments are available in Feller (1971), Révész (1968), and Kesten (1972).

A very useful tool for establishing almost sure convergence is stated first.

Theorem 7.2 (Borel–Cantelli Lemma). 

Let {A n } be a sequence of events on a sample space Ω. If

$${\sum }_{n=1}^{\infty }P({A}_{ n}) < \infty, $$

then P(infinitely many A n occur) = 0.

If {A n } are pairwise independent, and

$${\sum }_{n=1}^{\infty }P({A}_{ n}) = \infty, $$

then P(infinitely many A n occur) = 1.

Proof.

We prove the first statement. In order that infinitely many among the events A n , n ≥ 1, occur, it is necessary and sufficient that given any m, there is at least one event among A m , A m + 1, … that occurs. In other words,

$$\mbox{ Infinitely many ${A}_{n}$ occur}\,\, = {\cap }_{m=1}^{\infty }{\cup }_{ n=m}^{\infty }{A}_{ n}.$$

On the other hand, the events \({B}_{m} = {\cup }_{n=m}^{\infty }{A}_{n}\) are decreasing in m (i.e., B 1 ⊇ B 2 ⊇ ⋯ ). Therefore,

$$\begin{array}{rcl} P(\mbox{ infinitely many ${A}_{n}$ occur})\,\,& =& P({\cap }_{m=1}^{\infty }{B}_{ m}) {=\lim }_{m\rightarrow \infty }P({B}_{m}) \\ & =& \mathop{\lim }\limits_{m\rightarrow \infty }P({\cup }_{n=m}^{\infty }{A}_{ n}) \leq { limsup}_{m\rightarrow \infty }{\sum \nolimits }_{n=m}^{\infty }P({A}_{ n}) \\ & =& 0, \\ \end{array}$$

because, by assumption, \({\sum }_{n=1}^{\infty }P({A}_{n}) < \infty \). □ 

Remark.

Although pairwise independence suffices for the conclusion of the second part of the Borel–Cantelli lemma, common applications involve cases where the A n are mutually independent.

The next example gives an application of the Borel–Cantelli lemma.

Example 7.10 (Tail Runs of Arbitrary Length). 

Most of us feel that in tossing a coin repeatedly, we are unlikely to see many tails or many heads in succession. Problems of this kind were discussed in Chapter 1. This example shows that in some sense, that intuition is wrong.

Consider a sequence of independent Bernoulli trials in which success occurs with probability p and failure with probability q = 1 − p. Suppose p > q, so that successes are more likely than failures. Consider a hypothetical long uninterrupted run of m failures, say FF…F, for some fixed m. Break up the Bernoulli trials into nonoverlapping blocks of m trials, and consider A n to be the event that the nth block consists of only failures. The probability of each A n is \({q}^{m}\), which is free of n, and the A n are independent because the blocks do not overlap. Therefore, \({\sum }_{n=1}^{\infty }P({A}_{n}) = \infty \), and it follows from the second part of the Borel–Cantelli lemma that no matter how large p may be, as long as p < 1, a string of consecutive failures of any given arbitrary length m reappears infinitely many times in the sequence of Bernoulli trials. In particular, if we keep tossing an ordinary coin, then with certainty, we will see 1000 tails (or, 10,000 tails) in succession, and we will see this occur again and again, infinitely many times, as our coin tosses continue indefinitely. A simulation of this blocking argument is sketched below.
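A sketch of the blocking argument, assuming NumPy, with p = 0.7 and failure runs of length m = 5 (all of these choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
p, m, n_blocks = 0.7, 5, 200_000
# n_blocks disjoint blocks of m Bernoulli(p) trials each
blocks = rng.random((n_blocks, m)) < p
a_n = ~blocks.any(axis=1)                   # A_n: the n-th block is all failures
print(a_n.sum(), n_blocks * (1 - p) ** m)   # observed count vs. expected q^m * n_blocks
```

Even with p as large as 0.7, all-failure blocks keep occurring at the constant rate q m, which is the heart of the Borel–Cantelli argument.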

Here is another important application of the Borel–Cantelli lemma.

Example 7.11 (Almost Sure Convergence of Binomial Proportion). 

Let X 1, X 2, … be an infinite sequence of independent Ber(p) random variables, where 0 < p < 1. Let

$$\bar{{X}_{n}} = \frac{{S}_{n}} {n} = \frac{{X}_{1} + {X}_{2} + \ldots + {X}_{n}} {n}.$$

Then, from our previous formula in Chapter 1 for binomial distributions, \(E{({S}_{n} - np)}^{4} = np(1 - p)[1 + 3(n - 2)p(1 - p)]\). Thus, by Markov’s inequality,

$$\begin{array}{rcl} P(\vert \bar{{X}_{n}} - p\vert > \epsilon )& =& P(\vert {S}_{n} - np\vert > n\epsilon ) \\ & =& P({({S}_{n} - np)}^{4} > {(n\epsilon )}^{4}) \leq \frac{E{({S}_{n} - np)}^{4}} {{(n\epsilon )}^{4}} \\ & =& \frac{np(1 - p)[1 + 3(n - 2)p(1 - p)]} {{(n\epsilon )}^{4}} \\ & \leq & \frac{3{n}^{2}{(p(1 - p))}^{2} + np(1 - p)} {{n}^{4}{\epsilon }^{4}} \leq \frac{C} {{n}^{2}} + \frac{D} {{n}^{3}}, \\ \end{array}$$

for finite constants C, D. Therefore,

$$\begin{array}{rcl} {\sum \nolimits }_{n=1}^{\infty }P(\vert \bar{{X}_{ n}} - p\vert > \epsilon )& \leq & C{\sum \nolimits }_{n=1}^{\infty } \frac{1} {{n}^{2}} + D{\sum \nolimits }_{n=1}^{\infty } \frac{1} {{n}^{3}} \\ & <& \infty \end{array}.$$

It follows from the Borel–Cantelli lemma that the binomial sample proportion \(\bar{{X}_{n}}\) converges almost surely to p.

In fact, the convergence of the sample mean \(\bar{{X}_{n}}\) to E(X 1) (i.e., the common mean of the X i ) holds in general. The general results, due to Khintchine and Kolmogorov, are known as the laws of large numbers, stated below.

Theorem 7.3 (Weak Law of Large Numbers). 

Suppose X 1 ,X 2 ,… are independent and identically distributed (iid) random variables (defined on a common sample space Ω), such that E(|X 1 |) < ∞, and E(X 1 ) = μ. Let \(\bar{{X}}_{n} = \frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{X}_{i}\) . Then \(\bar{{X}}_{n}{ \mathcal{P} \atop \Rightarrow } \mu \).

Theorem 7.4 (Strong Law of Large Numbers). 

Suppose X 1 ,X 2 ,… are independent and identically distributed random variables (defined on a common sample space Ω). Then, \(\bar{{X}}_{n}\) has an a.s. (almost sure) limit iff E(|X 1 |) < ∞, in which case \(\bar{{X}}_{n}{ \mathrm{a.s.} \atop \Rightarrow } \mu = E({X}_{1})\).

Remark.

It is not very simple to prove either of the two laws of large numbers in the generality stated above. We prove the weak law in Chapter 8, and the strong law in Chapter 14. If the X i have a finite variance, then Markov’s inequality easily leads to the weak law. If the X i have a finite fourth moment, then a careful argument along the lines of our special binomial proportion example above does lead to the strong law. Once again, the Borel–Cantelli lemma does the trick.

It is extremely interesting that existence of an expectation is not necessary for the WLLN (weak law of large numbers) to hold. That is, it is possible that E( | X | ) = ∞, and yet \(\bar{X}{ \mathcal{P} \atop \Rightarrow } a\), for some real number a. We describe this more precisely shortly.

The SLLN (strong law of large numbers) already tells us that if X 1, X 2, … are independent with a common CDF F (that is, iid), then the sample mean \(\bar{X}\) does not have any almost sure limit if E F ( | X | ) = ∞. An obvious question is what happens to \(\bar{X}\) in such a case. A great deal of deep work has been done on this question, and there are book-length treatments of this issue. The following theorem gives a few key results only for easy reference.

Definition 7.8.

Let x be a real number. The positive and negative part of x are defined as

$${x}^{+} =\max \{ x,0\};\quad {x}^{-} =\max \{ -x,0\}.$$

That is, x  +  = x when x ≥ 0, and 0 when x ≤ 0. On the other hand, x  −  = 0 when x ≥ 0, and − x when x ≤ 0. Consequently, for any real number x,

$${x}^{+},{x}^{-}\geq 0;\quad x = {x}^{+} - {x}^{-};\quad \vert x\vert = {x}^{+} + {x}^{-}.$$

Theorem 7.5 (Failure of the Strong Law). 

Let X 1 ,X 2 ,… be independent observations from a common CDF F on the real line. Suppose E F (|X|) = ∞.

  1. (a)

    For any sequence of real numbers, a n,

    $$P({limsup}_{n}\vert \bar{X} - {a}_{n}\vert = \infty ) = 1.$$
  2. (b)

    If \(E({X}^{+}) = \infty \) and \(E({X}^{-}) < \infty \) , then \(\bar{X}{ \mathrm{a.s.} \atop \Rightarrow } \infty.\)

  3. (c)

    If \(E({X}^{-}) = \infty \) and \(E({X}^{+}) < \infty \) , then \(\bar{X}{ \mathrm{a.s.} \atop \Rightarrow } -\infty.\)

More refined descriptions of the set of all possible limit points of the sequence of means \(\bar{X}\) are worked out in Kesten (1972). See also Chapter 3 in DasGupta (2008).

Example 7.12.

Let F be the CDF of the standard Cauchy distribution. Due to the symmetry, we get \(E({X}^{+}) = E({X}^{-}) = \frac{1} {\pi }{ \int \nolimits \nolimits }_{0}^{\infty } \frac{x} {1+{x}^{2}} dx = \infty \). Therefore, from part (a) of the above theorem, with probability one, \(limsup\vert \bar{X}\vert = \infty \) (i.e., the sequence of sample means cannot remain bounded). Also, from the statement of the strong law itself, the sequence will not settle down near any fixed real number. The four simulated plots in Fig. 7.1 help illustrate these phenomena. In each plot, 1000 standard Cauchy values were simulated, and the sequence of means \(\bar{{X}}_{j} = \frac{1} {j}{ \sum \nolimits }_{i=1}^{j}{X}_{i}\) was plotted, for j = 1, 2, …, 1000.

Fig. 7.1 Sequence of sample means for simulated C(0,1) data
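Plots like those in Fig. 7.1 are easy to reproduce; a sketch, assuming NumPy, printing a few running means instead of plotting:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_cauchy(1000)
xbar = np.cumsum(x) / np.arange(1, 1001)   # running means, j = 1, ..., 1000
# The running means do not settle down; an occasional huge observation
# produces a visible jump, in line with limsup |Xbar| = infinity a.s.
print(xbar[[9, 99, 499, 999]])
```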

Now we consider the possibility of a WLLN when the expectation does not exist. The answer is that the tail of F must not decay too slowly. Here is the precise result.

Theorem 7.6 (Weak Law Without an Expectation). 

Let X 1 ,X 2 ,… be independent observations from a CDF F. Then, there exist constants μ n such that \(\bar{X} - {\mu }_{n}{ \mathcal{P} \atop \Rightarrow } 0\) if and only if

$$x(1 - F(x) + F(-x)) \rightarrow 0,$$

as x →∞, in which case the constants μ n may be chosen as μ n = E F (XI {|X|≤n} ).

In particular, if F is symmetric, x(1 − F(x)) → 0 as x →∞, and \({\int }_{0}^{\infty }(1 - F(x))dx = \infty \) , then E F (|X|) = ∞, whereas \(\bar{X}{ \mathcal{P} \atop \Rightarrow } 0\).

Remark.

See Feller (1971, p. 235) for a proof. It should be noted that the two conditions x(1 − F(x)) → 0 and \({\int }_{0}^{\infty }(1 - F(x))dx = \infty \) are not inconsistent. It is easy to find an F that satisfies both conditions.

We close this section with an important result on the uniform closeness of the empirical CDF to the underlying CDF in the iid case.

Theorem 7.7 (Glivenko–Cantelli Theorem). 

Let F be any CDF on the real line, and X 1 ,X 2 ,… iid with common CDF F. Let \({F}_{n}(x) = \frac{\#\{i\leq n:{X}_{i}\leq x\}} {n}\) be the sequence of empirical CDFs. Then, \({\Delta }_{n} ={ \mathrm{sup}}_{x\in \mathcal{R}}\vert {F}_{n}(x) - F(x)\vert { \mathrm{a.s.} \atop \Rightarrow } 0.\)

Proof.

The main idea of the proof is to discretize the problem, and exploit Kolmogorov’s SLLN.

Fix m and define a 0, a 1, …, a m , a m + 1 by the relationships \([{a}_{i},{a}_{i+1}) =\{ x : \frac{i} {m} \leq F(x) < \frac{i+1} {m} \},i = 1,2,\ldots, m - 1\), and a 0 =  − ∞, a m + 1 = ∞. Now fix an i and look at x ∈ [a i , a i + 1). Then

$${F}_{n}(x) - F(x) \leq {F}_{n}({a}_{i+1}-) - F({a}_{i}) \leq {F}_{n}({a}_{i+1}-) - F({a}_{i+1}-) + \frac{1} {m}.$$

Similarly, for x ∈ [a i , a i + 1),

$$F(x) - {F}_{n}(x) \leq F({a}_{i}) - {F}_{n}({a}_{i}) + \frac{1} {m}.$$

Therefore, because any x belongs to one of these intervals [a i , a i + 1),

$${\mathrm{sup}}_{x\in \mathcal{R}}\vert {F}_{n}(x) - F(x)\vert {\leq \max }_{i}\left \{\vert F({a}_{i}) - {F}_{n}({a}_{i})\vert + \vert F({a}_{i}-) - {F}_{n}({a}_{i}-)\vert + \frac{1} {m}\right \}.$$

For fixed m, as n → ∞, by the SLLN each of the terms within the absolute value sign above goes almost surely to zero, and so, for any fixed m, almost surely, \(\mathop{\lim }\limits_{n}{\mathrm{sup}}_{x\in \mathcal{R}}\vert {F}_{n}(x) - F(x)\vert \leq \frac{1} {m}\). Now let m → ∞, and the result follows. □ 
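The uniform closeness asserted by the theorem can be computed directly for simulated data; a sketch, assuming NumPy, for U[0, 1] samples (for which F(x) = x):

```python
import numpy as np

rng = np.random.default_rng(7)
for n in [100, 1000, 10_000, 100_000]:
    x = np.sort(rng.random(n))            # iid U[0,1], so F(x) = x
    i_over_n = np.arange(1, n + 1) / n    # F_n just after each jump
    # sup_x |F_n(x) - F(x)| is attained at the jump points of F_n
    delta_n = max(np.max(np.abs(i_over_n - x)),
                  np.max(np.abs(i_over_n - 1 / n - x)))
    print(n, delta_n)                     # Delta_n decreases toward 0
```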

7.3 Convergence Preservation

We have already seen the importance of being able to deal with transformations of random variables in Chapters 3 and 4. This section addresses the question of when convergence properties are preserved if we suitably transform a sequence of random variables.

The next important theorem gives some frequently useful results that are analogous to corresponding results on convergence of sequences in calculus.

Theorem 7.8 (Convergence Preservation). 

  1. (a)

    \({X}_{n}{ \mathcal{P} \atop \Rightarrow } X,{Y }_{n}{ \mathcal{P} \atop \Rightarrow } Y \,\Rightarrow \,{X}_{n} \pm {Y }_{n}{ \mathcal{P} \atop \Rightarrow } X \pm Y.\)

  2. (b)

    \({X}_{n}{ \mathcal{P} \atop \Rightarrow } X,{Y }_{n}{ \mathcal{P} \atop \Rightarrow } Y \Rightarrow {X}_{n}{Y }_{n}{ \mathcal{P} \atop \Rightarrow } XY\), \({X}_{n}{ \mathcal{P} \atop \Rightarrow } X,{Y }_{n}{ \mathcal{P} \atop \Rightarrow } Y,P(Y \neq 0) = 1 \Rightarrow \frac{{X}_{n}} {{Y }_{n}}{ \mathcal{P} \atop \Rightarrow } \frac{X} {Y }.\)

  3. (c)

    \({X}_{n}{ \mathrm{a.s.} \atop \Rightarrow } X,{Y }_{n}{ \mathrm{a.s.} \atop \Rightarrow } Y \Rightarrow {X}_{n} \pm {Y }_{n}{ \mathrm{a.s.} \atop \Rightarrow } X \pm Y.\)

  4. (d)

    \({X}_{n}{ \mathrm{a.s.} \atop \Rightarrow } X,{Y }_{n}{ \mathrm{a.s.} \atop \Rightarrow } Y \Rightarrow {X}_{n}{Y }_{n}{ \mathrm{a.s.} \atop \Rightarrow } XY\), \({X}_{n}{ \mathrm{a.s.} \atop \Rightarrow } X,{Y }_{n}{ \mathrm{a.s.} \atop \Rightarrow } Y,P(Y \neq 0) = 1 \Rightarrow \frac{{X}_{n}} {{Y }_{n}}{ \mathrm{a.s.} \atop \Rightarrow } \frac{X} {Y }.\)

  5. (e)

    \({X}_{n}{ {L}_{1} \atop \Rightarrow } X,{Y }_{n}{ {L}_{1} \atop \Rightarrow } Y \Rightarrow {X}_{n} + {Y }_{n}{ {L}_{1} \atop \Rightarrow } X + Y.\)

  6. (f)

    \({X}_{n}{ {L}_{2} \atop \Rightarrow } X,{Y }_{n}{ {L}_{2} \atop \Rightarrow } Y \Rightarrow {X}_{n} + {Y }_{n}{ {L}_{2} \atop \Rightarrow } X + Y.\)

The proofs of each of these parts use relatively simple arguments, such as the triangle, Minkowski, and Cauchy–Schwarz inequalities (see Chapter 1 for their exact statements). We omit the details of these proofs; Chow and Teicher (1988, pp. 254–256) give the details for several parts of this convergence preservation theorem.

Example 7.13.

Suppose X 1, X 2, … are independent \(N({\mu }_{1},{\sigma }_{1}^{2})\) variables, and Y 1, Y 2, … are independent \(N({\mu }_{2},{\sigma }_{2}^{2})\) variables. For n, m ≥ 1, let \(\bar{{X}_{n}} = \frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{X}_{i},\bar{{Y }_{m}} = \frac{1} {m}{ \sum \nolimits }_{j=1}^{m}{Y }_{j}\). By the strong law of large numbers (SLLN), as \(n,m \rightarrow \infty, \bar{{X}_{n}}{ \mathrm{a.s.} \atop \Rightarrow } {\mu }_{1}\), and \(\bar{{Y }_{m}}{ \mathrm{a.s.} \atop \Rightarrow } {\mu }_{2}\). Then, by the theorem above, \(\bar{{X}_{n}} -\bar{ {Y }_{m}}{ \mathrm{a.s.} \atop \Rightarrow } {\mu }_{1} - {\mu }_{2}\).

Also, by the same theorem, \(\bar{{X}_{n}}\bar{{Y }_{m}}{ \mathrm{a.s.} \atop \Rightarrow } {\mu }_{1}{\mu }_{2}\).

Definition 7.9 (The Multidimensional Case). 

Let X n, n ≥ 1, X be d-dimensional random vectors, for some 1 ≤ d < ∞. We say that \(\mathbf{{X}_{n}}{ \mathcal{P} \atop \Rightarrow } \mathbf{X}\), if \(\vert \vert \mathbf{{X}_{n}} -\mathbf{X}\vert \vert { \mathcal{P} \atop \Rightarrow } 0\).

We say that \(\mathbf{{X}_{n}}{ \mathrm{a.s.} \atop \Rightarrow } \mathbf{X}\), if P(ω :  | | X n(ω) − X(ω) | | → 0) = 1. Here, | | . | | denotes Euclidean length (norm).

Operationally, the following equivalent conditions are more convenient.

Proposition.

  1. (a)

    \(\mathbf{{X}_{n}}{ \mathcal{P} \atop \Rightarrow } \mathbf{X}\) if and only if \({X}_{n,i}{ \mathcal{P} \atop \Rightarrow } {X}_{i}\) , for each i = 1,2,…,d. That is, each coordinate of X n converges in probability to the corresponding coordinate of X;

  2. (b)

    \(\mathbf{{X}_{n}}{ \mathrm{a.s.} \atop \Rightarrow } \mathbf{X}\) if and only if \({X}_{n,i}{ \mathrm{a.s.} \atop \Rightarrow } {X}_{i}\) , for each i = 1,2,…,d.

Theorem 7.9 (Convergence Preservation in Multidimension). 

Let X n,Y n ,n ≥ 1, X,Y be d-dimensional random vectors. Let A be some p × d matrix of real elements, where p ≥ 1. Then,

  1. (a)

    \(\mathbf{{X}_{n}}{ \mathcal{P} \atop \Rightarrow } \mathbf{X},\mathbf{{Y}_{n}}{ \mathcal{P} \atop \Rightarrow } \mathbf{Y} \Rightarrow \mathbf{{X}_{n}} \pm \mathbf{{Y}_{n}}{ \mathcal{P} \atop \Rightarrow } \mathbf{X} \pm \mathbf{Y};\) \(\mathbf{{X}_{n}}^{\prime}\mathbf{{Y}_{n}}{ \mathcal{P} \atop \Rightarrow } \mathbf{X}^{\prime}\mathbf{Y};A\mathbf{{X}_{n}}{ \mathcal{P} \atop \Rightarrow } A\mathbf{X}.\)

  2. (b)

    Exactly the same results hold, when convergence in probability is replaced everywhere by almost sure convergence.

Proof.

This theorem follows easily from the convergence preservation theorem in one dimension, and the proposition above, which says that multidimensional convergence is the same as convergence in each coordinate separately. □ 

The next result is one of the most useful results on almost sure convergence and convergence in probability. It says that convergence properties are preserved if we make smooth transformations. However, the force of the result is partially lost if we insist on the transformations being smooth everywhere. To give the most useful version of the result, we need a technical definition.

Definition 7.10.

Let d, p ≥ 1 be positive integers, and \(f : S \subseteq {\mathcal{R}}^{d} \rightarrow {\mathcal{R}}^{p}\) a function. Let C(f) = { x ∈ S : f is continuous at x}. Then C(f) is called the continuity set of f.

We can now give the result on preservation of convergence under smooth transformations.

Theorem 7.10

Let X n,X be d-dimensional random vectors, and \(f : S \subseteq {\mathcal{R}}^{d} \rightarrow {\mathcal{R}}^{p}\) a function. Let C(f) be the continuity set of f. Suppose the random vector X satisfies the condition

$$P(\mathbf{X} \in C(f)) = 1.$$

Then,

  1. (a)

    \(\mathbf{{X}_{n}}{ \mathcal{P} \atop \Rightarrow } \mathbf{X} \Rightarrow f(\mathbf{{X}_{n}}){ \mathcal{P} \atop \Rightarrow } f(\mathbf{X});\)

  2. (b)

    \(\mathbf{{X}_{n}}{ \mathrm{a.s.} \atop \Rightarrow } \mathbf{X} \Rightarrow f(\mathbf{{X}_{n}}){ \mathrm{a.s.} \atop \Rightarrow } f(\mathbf{X}).\)

Proof.

We prove part (b). Let S 1 = { ω ∈ Ω : f is continuous at X(ω)}. Let S 2 = { ω ∈ Ω : X n(ω) → X(ω)}. Then, P(S 1 ∩ S 2) = 1, and for each ω ∈ S 1 ∩ S 2, f(X n(ω)) → f(X(ω)). That means that \(f(\mathbf{{X}_{n}}){ \mathrm{a.s.} \atop \Rightarrow } f(\mathbf{X}).\) □ 

We give two important applications of this theorem next.

Example 7.14 (Convergence of Sample Variance). 

Let X 1, X 2, …, X n be independent observations from a common distribution F, and suppose that F has finite mean μ and finite variance σ2. The sample variance, of immense importance in statistics, is defined as \({s}^{2} = \frac{1} {n-1}{ \sum \nolimits }_{i=1}^{n}{({X}_{i} -\bar{ X})}^{2}\). The purpose of this example is to show that \({s}^{2}{ \mathrm{a.s.} \atop \Rightarrow } {\sigma }^{2}\), as n → ∞.

First note that if we can prove that \(\frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{({X}_{i} -\bar{ X})}^{2}{ \mathrm{a.s.} \atop \Rightarrow } {\sigma }^{2}\), then it follows that s 2 also converges almost surely to σ2, because \(\frac{n} {n-1} \rightarrow 1\) as n → ∞. Now,

$$\frac{1} {n}{\sum \nolimits }_{i=1}^{n}{({X}_{ i} -\bar{ X})}^{2} = \frac{1} {n}{\sum \nolimits }_{i=1}^{n}{X}_{ i}^{2} - {(\bar{X})}^{2}$$

(an algebraic identity). Because F has a finite variance, it also possesses a finite second moment, namely, \({E}_{F}({X}^{2}) = {\sigma }^{2} + {\mu }^{2} < \infty \). By applying the strong law of large numbers to the sequence X 1 2, X 2 2, …, we get \(\frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{X}_{i}^{2}{ \mathrm{a.s.} \atop \Rightarrow } {E}_{F}({X}^{2}) = {\sigma }^{2} + {\mu }^{2}\). By applying the SLLN to the sequence X 1, X 2, …, we get \(\bar{X}{ \mathrm{a.s.} \atop \Rightarrow } \mu \), and therefore by the continuous mapping theorem, \({(\bar{X})}^{2}{ \mathrm{a.s.} \atop \Rightarrow } {\mu }^{2}\). Now, by the theorem on preservation of convergence, we get that \(\frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{X}_{i}^{2} - {(\bar{X})}^{2}{ \mathrm{a.s.} \atop \Rightarrow } {\sigma }^{2} + {\mu }^{2} - {\mu }^{2}={\sigma }^{2}\), which finishes the proof.
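The convergence is visible numerically; a sketch, assuming NumPy, with exponential data whose true variance is known:

```python
import numpy as np

rng = np.random.default_rng(8)
for n in [100, 10_000, 1_000_000]:
    x = rng.exponential(scale=2.0, size=n)  # Exp with mean 2: sigma^2 = 4
    s2 = x.var(ddof=1)                      # sample variance, divisor n - 1
    print(n, s2)                            # approaches 4
```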

Example 7.15 (Convergence of Sample Correlation). 

Suppose F(x, y) is a joint CDF in \({\mathcal{R}}^{2}\), and suppose that \(E({X}^{2}),E({Y }^{2})\) are both finite. Let

$$\rho = \frac{\mbox{ Cov}(X,Y )} {\sqrt{\mbox{ Var} (X)\mbox{ Var} (Y )}} = \frac{E(XY ) - E(X)E(Y )} {\sqrt{\mbox{ Var} (X)\mbox{ Var} (Y )}}$$

be the correlation between X and Y. Suppose (X i , Y i ), 1 ≤ i ≤ n are n independent observations from the joint CDF F. The sample correlation coefficient is defined as

$$r = \frac{{\sum \nolimits }_{i=1}^{n}({X}_{i} -\bar{ X})({Y }_{i} -\bar{ Y })} {\sqrt{{\sum \nolimits }_{i=1}^{n}{({X}_{i} -\bar{ X})}^{2}{ \sum \nolimits }_{i=1}^{n}{({Y }_{i} -\bar{ Y })}^{2}}}.$$

The purpose of this example is to show that r converges almost surely to ρ.

It is convenient to rewrite r in the equivalent form

$$r = \frac{ \frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{X}_{i}{Y }_{i} -\bar{ X}\bar{Y }} {\sqrt{ \frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{({X}_{i} -\bar{ X})}^{2}}\sqrt{ \frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{({Y }_{i} -\bar{ Y })}^{2}}}.$$

By the SLLN, \(\frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{X}_{i}{Y }_{i}\) converges almost surely to E(XY ), and \(\bar{X},\bar{Y }\) converge almost surely to E(X), E(Y ). By the previous example, \(\frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{({X}_{i}\,-\,\bar{X})}^{2}\) converges almost surely to Var(X), and \(\frac{1} {n}{ \sum \nolimits }_{i\,=\,1}^{n}{({Y }_{i}\,-\,\bar{Y })}^{2}\) converges almost surely to Var(Y ). Now consider the function \(f(s,t,u,v,w)\,=\, \frac{s-tu} {\sqrt{v}\sqrt{w}},-\infty \,<\,s,t,u<\,\infty, v,w\,>\,0\). This function is continuous on the set \(S =\{ (s,t,u,v,w) : -\infty < s,t,u < \infty, v,w > 0,{(s - tu)}^{2} \leq vw\}\). The joint distribution of \(\frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{X}_{i}{Y }_{i},\bar{X},\bar{Y }, \frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{({X}_{i} -\bar{ X})}^{2}, \frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{({Y }_{i} -\bar{ Y })}^{2}\) assigns probability one to the set S. By the continuous mapping theorem, it follows that \(r{ \mathrm{a.s.} \atop \Rightarrow } \rho \).

7.4 Convergence in Distribution

Studying distributions of random variables is of paramount importance in both probability and statistics. The relevant random variable may be a member of some sequence X n . Its exact distribution may be cumbersome. But it may be possible to approximate its distribution by a simpler distribution. We can then approximate probabilities for the true distribution of the random variable by probabilities in the simpler distribution. The type of convergence concept that justifies this sort of approximation is called convergence in distribution or convergence in law. Of all the convergence concepts we are discussing, convergence in distribution is among the most useful in answering practical questions. For example, statisticians are usually much more interested in constructing confidence intervals than just point estimators, and a central limit theorem of some kind is necessary to produce a confidence interval.

We start with an illustrative example.

Example 7.16.

Suppose

$${X}_{n} \sim U\left [\frac{1} {2} - \frac{1} {n + 1}, \frac{1} {2} + \frac{1} {n + 1}\right ],n \geq 1.$$

Because the interval \([\frac{1} {2} - \frac{1} {n+1}, \frac{1} {2} + \frac{1} {n+1}]\) is shrinking to the single point \(\frac{1} {2}\), intuitively we feel that the distribution of X n is approaching a distribution concentrated at \(\frac{1} {2}\), that is, a one-point distribution. The CDF of the distribution concentrated at \(\frac{1} {2}\) equals the function F(x) = 0 for \(x < \frac{1} {2}\), and F(x) = 1 for \(x \geq \frac{1} {2}\). Consider now the CDF of X n ; call it F n (x). Fix \(x < \frac{1} {2}\). Then, for all large n, F n (x) = 0, and so lim n F n (x) is also zero. Next fix \(x > \frac{1} {2}\). Then, for all large n, F n (x) = 1, and so lim n F n (x) is also one. Therefore, if \(x < \frac{1} {2}\), or if \(x > \frac{1} {2}\), lim n F n (x) = F(x). If x is exactly equal to \(\frac{1} {2}\), then \({F}_{n}(x) = \frac{1} {2}\). But \(F(\frac{1} {2}) = 1\). So \(x = \frac{1} {2}\) is a problematic point, and the only problematic point, in that \({F}_{n}(\frac{1} {2}) = \frac{1} {2}\) does not converge to \(F(\frac{1} {2}) = 1\). Interestingly, \(x = \frac{1} {2}\) is also exactly the only point at which F is not continuous. However, we do not want this one problematic point to ruin our intuitive feeling that X n is approaching the one-point distribution concentrated at \(\frac{1} {2}\). That is, we do not take into account any points where the limiting CDF is not continuous.

Definition 7.11.

Let X n , X, n ≥ 1, be real-valued random variables defined on a common sample space Ω. We say that X n converges in distribution (in law) to X if P(X n  ≤ x) → P(X ≤ x) as n → ∞, at every point x that is a continuity point of the CDF F of the random variable X.

We denote convergence in distribution by \({X}_{n}{ \mathcal{L} \atop \Rightarrow } X\).

If X n, X are d-dimensional random vectors, then the same definition applies by using the joint CDFs of X n, X, that is, X n converges in distribution to X if P(X n1 ≤ x 1, …, X nd  ≤ x d ) → P(X 1 ≤ x 1, …, X d  ≤ x d ) at each point (x 1, …, x d ) that is a continuity point of the joint CDF F(x 1, …, x d ) of the random vector X.

An important point of caution is the following.

In order to prove that d-dimensional vectors X n converge in distribution to X, it is not, in general, enough to prove that each coordinate of X n converges in distribution to the corresponding coordinate of X. However, convergence of general linear combinations is enough, which is the content of the following theorem.

Theorem 7.11 (Cramér–Wold Theorem). 

Let X n,X be d-dimensional random vectors. Then \(\mathbf{{X}_{n}}{ \mathcal{L} \atop \Rightarrow } \mathbf{X}\) if and only if \(\mathbf{c}^{\prime}\mathbf{{X}_{n}}{ \mathcal{L} \atop \Rightarrow } \mathbf{c}^{\prime}\mathbf{X}\) for all unit d-dimensional vectors c.

The shortest proof of this theorem uses a tool called characteristic functions, which we have not discussed yet. We give a proof in the next chapter by using characteristic functions. Returning to the general concept of convergence in distribution, two basic facts are the following.

Theorem 7.12.

  1. (a)

    If \({X}_{n}{ \mathcal{L} \atop \Rightarrow } X\) , then X n = O p (1);

  2. (b)

    If \({X}_{n}{ \mathcal{P} \atop \Rightarrow } X\) , then \({X}_{n}{ \mathcal{L} \atop \Rightarrow } X\).

Proof.

We sketch a proof of part (b). Take a continuity point x of the CDF F of X, and fix ε > 0. Then,

$$\begin{array}{rcl}{ F}_{n}(x)& =& P({X}_{n} \leq x) \\ & =& P({X}_{n} \leq x,\vert {X}_{n} - X\vert \leq \epsilon ) + P({X}_{n} \leq x,\vert {X}_{n} - X\vert > \epsilon ) \\ & \leq & P(X \leq x + \epsilon ) + P(\vert {X}_{n} - X\vert > \epsilon )\end{array}.$$

Now let n → ∞ on both sides of the inequality. Then, we get \({limsup}_{n}{F}_{n}(x) \leq F(x + \epsilon )\), because P( | X n  − X |  > ε) → 0 by hypothesis. Now, letting ε → 0, we get \({limsup}_{n}{F}_{n}(x) \leq F(x)\), because F(x + ε) → F(x) by right continuity of F.

The proof will be complete if we show that \({liminf }_{n}{F}_{n}(x) \geq F(x)\). This is proved similarly, except we now start with P(X ≤ x − ε) on the left, and follow the same steps. It should be mentioned that it is in this part that the continuity of F at x is used. □ 

Remark.

The fact that if a sequence X n of random variables converges in distribution, then the sequence must be O p (1), tells us that there must be sequences of random variables which do not converge in distribution to anything. For example, take X n  ∼ N(n, 1), n ≥ 1. This sequence X n is not O p (1), and therefore cannot converge in distribution. The question arises whether the O p (1) property suffices for convergence. Even that, evidently, is not true; just consider X 2n − 1 ∼ N(0, 1), and X 2n  ∼ N(1, 1). However, separately, the odd and the even subsequences do converge. This suggests a partial converse to the fact that if a sequence X n converges in distribution, then it must be O p (1). That partial converse is a famous theorem on convergence in distribution, and is stated below.

Theorem 7.13 (Helly’s Theorem). 

Let X n ,n ≥ 1 be random variables defined on a common sample space Ω, and suppose X n is O p (1). Then there is a sub-sequence \({X}_{{n}_{j}},j \geq 1\) , and a random variable X (on the same sample space Ω), such that \({X}_{{n}_{j}}{ \mathcal{L} \atop \Rightarrow } X\) . Furthermore, \({X}_{n}{ \mathcal{L} \atop \Rightarrow } X\) if and only if every convergent sub-sequence \({X}_{{n}_{j}}\) converges in distribution to this same X.

See Port (1994, p. 625) for a proof. Major generalizations of Helly’s theorem to much more general spaces are known. Typically, some sort of a metric structure is assumed in these results; see van der Vaart and Wellner (2000) for such general results on weak compactness.

Example 7.17 (Various Convergence Phenomena Are Possible). 

This quick example shows that a sequence of discrete distributions can converge in distribution to a discrete distribution, or a continuous distribution, and a sequence of continuous distributions can converge in distribution to a continuous one, or a discrete one.

A good example of discrete random variables converging in distribution to a discrete random variable is the sequence \({X}_{n} \sim \mathrm{Bin}(n, \frac{1} {n})\). Although it was not explicitly put in the language of convergence in distribution, we have seen in Chapter 6 that X n converges to a Poisson random variable with mean one. A familiar example of discrete random variables converging in distribution to a continuous random variable is the de Moivre–Laplace central limit theorem (Chapter 1), which says that if X n  ∼ Bin(n, p), then \(\frac{{X}_{n}-np} {\sqrt{np(1-p)}}\) converges to a standard normal variable.

Examples of continuous random variables converging to a continuous random variable are immediately available by using the general central limit theorem (also Chapter 1). For example, if X i are independent U[ − 1, 1] variables, then \(\frac{\sqrt{n}\bar{X}} {\sigma }\), where \({\sigma }^{2} = \frac{1} {3}\), converges to a standard normal variable. Finally, as an example of continuous random variables converging to a discrete random variable, consider \({X}_{n} \sim \mathrm{Be}( \frac{1} {n}, \frac{1} {n})\). Visually, the density of X n for large n is a symmetric U-shaped density, unbounded at both 0 and 1. It is not hard to show that X n converges in distribution to X, where X is a Bernoulli random variable with parameter \(\frac{1} {2}\).

Thus, we see that sequences of either type of random variable can indeed converge in distribution to a limit of the same or of the other type; two of the examples above are checked numerically below.
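A numerical check of two of these limits; a sketch, assuming SciPy is available (the choice n = 1000 is arbitrary):

```python
import numpy as np
from scipy import stats

# Bin(n, 1/n) pmf versus the Poisson(1) pmf
n, k = 1000, np.arange(6)
print(stats.binom.pmf(k, n, 1.0 / n))
print(stats.poisson.pmf(k, 1.0))

# Be(1/n, 1/n): essentially half the mass is within 0.01 of 0 and
# half is within 0.01 of 1, as for a Bernoulli(1/2) limit
print(stats.beta.cdf([0.01, 0.5, 0.99], 1.0 / n, 1.0 / n))
```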

By definition of convergence in distribution, if \({X}_{n}{ \mathcal{L} \atop \Rightarrow } X\), and if X has a continuous CDF F (continuous everywhere), then F n (x) → F(x)  ∀x where F n (x) is the CDF of X n . The following theorem says that much more is true, namely that the convergence is actually uniform; see p. 265 in Chow and Teicher (1988).

Theorem 7.14 (Pólya’s Theorem). 

Let X n ,n ≥ 1 have CDF F n , and let X have CDF F. If F is everywhere continuous, and if \({X}_{n}{ \mathcal{L} \atop \Rightarrow } X\) , then

$${\sup }_{x\in \mathcal{R}}\vert {F}_{n}(x) - F(x)\vert \rightarrow 0,$$

as n →∞.

A large number of equivalent characterizations of convergence in distribution are known. Collectively, these conditions are called the portmanteau theorem. Note that the parts of the theorem are valid for real-valued random variables, or d-dimensional random variables, for any 1 < d < ∞.

Theorem 7.15 (The Portmanteau Theorem). 

Let {X n ,X} be random variables taking values in a finite-dimensional Euclidean space. The following are characterizations of \({X}_{n}{ \mathcal{L} \atop \Rightarrow } X:\)

  1. (a)

    E(g(X n )) → E(g(X)) for all bounded continuous functions g.

  2. (b)

    E(g(X n )) → E(g(X)) for all bounded uniformly continuous functions g.

  3. (c)

    E(g(X n )) → E(g(X)) for all bounded Lipschitz functions g. Here a Lipschitz function is such that |g(x) − g(y)|≤ C||x − y|| for some C, and all x,y.

  4. (d)

    E(g(X n )) → E(g(X)) for all continuous functions g with a compact support.

  5. (e)

    liminf P(X n ∈ G) ≥ P(X ∈ G) for all open sets G.

  6. (f)

    limsup P(X n ∈ S) ≤ P(X ∈ S) for all closed sets S.

  7. (g)

    P(X n ∈ B) → P(X ∈ B) for all (Borel) sets B such that the probability of X belonging to the boundary of B is zero.

See Port (1994, p. 614) for proofs of various parts of this theorem.

Example 7.18.

Consider \({X}_{n}\,\sim \,\mathrm{Uniform}\{ \frac{1} {n}, \frac{2} {n},\ldots, \frac{n-1} {n}, 1\}\). Then, it can be shown easily that the sequence X n converges in law to the U[0, 1] distribution. Consider now the function g(x) = x 10, 0 ≤ x ≤ 1. Note that g is continuous and bounded. Therefore, by part (a) of the portmanteau theorem, \(E(g({X}_{n})) ={ \sum \nolimits }_{k=1}^{n}(\frac{{k}^{10}} {{n}^{11}} ) \rightarrow E(g(X)) ={ \int \nolimits \nolimits }_{0}^{1}{x}^{10}dx = \frac{1} {11}\).

This can be proved by using convergence of Riemann sums to a Riemann integral. But it is interesting to see the link to convergence in distribution.

Example 7.19 (Weierstrass’s Theorem). 

Weierstrass’s theorem says that any continuous function on a closed bounded interval can be uniformly approximated by polynomials. In other words, given a continuous function f(x) on a bounded interval, one can find a polynomial p(x) (of a sufficiently large degree) such that | p(x) − f(x) | is uniformly small. Consider the case of the unit interval; the case of a general bounded interval reduces to this case.

Here we show pointwise convergence by using the portmanteau theorem. Laws of large numbers are needed for establishing uniform approximability. We give a constructive proof. Towards this, for n ≥ 1, 0 ≤ p ≤ 1, and a given continuous function \(g : [0,1] \rightarrow \mathcal{R}\), define the sequence of Bernstein polynomials, \({B}_{n}(p)\,=\,{\sum \nolimits }_{k=0}^{n}g(\frac{k} {n})\left (\begin{array}{c} n\\ k \end{array} \right ){p}^{k}(1- p{)}^{n-k}\). Note that we can think of B n (p) as \({B}_{n}(p)\,=\,E[g(\frac{X} {n} )\,\,\vert X\,\sim \,\mathrm{Bin}(n,p)]\). As \(n \rightarrow \infty, \frac{X} {n}{ \mathcal{P} \atop \Rightarrow } p\), and it follows that \(\frac{X} {n}{ \mathcal{L} \atop \Rightarrow } {\delta }_{p}\), the one-point distribution concentrated at p (we have already seen that convergence in probability implies convergence in distribution). Because g is continuous and hence bounded, it follows from the portmanteau theorem that B n (p) → g(p), at any p.

It is not hard to establish that B n (p) − g(p) converges uniformly to zero, as n → ∞. Here is a sketch. As above, X denotes a binomial random variable with parameters n and p. We need to use the facts that a continuous function on [0, 1] is also uniformly continuous and bounded. Thus, for any given ε > 0, we can find δ > 0 such that | x − y |  < δ ⇒ | g(x) − g(y) | ≤ ε, and also we can find a finite C such that | g(x) | ≤ C  ∀x. So,

$$\begin{array}{rcl} \vert {B}_{n}(p) - g(p)\vert & =& \left \vert E\left [g\left (\frac{X} {n} \right )\right ] - g(p)\right \vert \\ &\leq & E\left [\left \vert g\left (\frac{X} {n} \right ) - g(p)\right \vert \right ] \\ & =& E\left [\left \vert g\left (\frac{X} {n} \right ) - g(p)\right \vert {I}_{\left \{\left \vert \frac{X} {n} -p\right \vert \leq \delta \right \}}\right ] \\ & & +E\left [\left \vert g\left (\frac{X} {n} \right ) - g(p)\right \vert {I}_{\left \{\left \vert \frac{X} {n} -p\right \vert >\delta \right \}}\right ] \\ & \leq & \epsilon + 2CP\left (\left \vert \frac{X} {n} - p\right \vert > \delta \right )\end{array}.$$

Now, in the last line, just apply Chebyshev’s inequality and bound the function p(1 − p) in Chebyshev’s inequality by \(\frac{1} {4}\). It easily follows then that for all large n, the second term \(2CP(\vert \frac{X} {n} - p\vert > \delta )\) is also ≤ ε, which means that for all large n, uniformly in p, | B n (p) − g(p) | ≤ 2ε.
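The Bernstein construction is short to code; a sketch, assuming NumPy, with the continuous but nonsmooth test function g(x) = |x − 1∕2| (our choice):

```python
import numpy as np
from math import comb

def bernstein(g, n, p):
    # B_n(p) = sum_k g(k/n) C(n, k) p^k (1 - p)^(n - k)
    k = np.arange(n + 1)
    binom = np.array([comb(n, j) for j in k], dtype=float)
    return float(np.sum(g(k / n) * binom * p**k * (1 - p) ** (n - k)))

g = lambda x: np.abs(x - 0.5)
grid = np.linspace(0.0, 1.0, 201)
for n in [10, 50, 250]:
    err = max(abs(bernstein(g, n, p) - g(p)) for p in grid)
    print(n, err)           # the uniform error decreases as n grows
```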

The most important result on convergence in distribution is the central limit theorem, which we have already seen in Chapter 1. The proof of the general case is given later in this chapter; it requires some additional development.

Theorem 7.16 (CLT). 

Let X i ,i ≥ 1 be iid with E(X i ) = μ and Var(X i ) = σ 2  < ∞. Then

$$\frac{\sqrt{n}(\bar{X} - \mu )} {\sigma }{ \mathcal{L} \atop \Rightarrow } Z \sim N(0,1).$$

We also write

$$\frac{\sqrt{n}(\bar{X} - \mu )} {\sigma }{ \mathcal{L} \atop \Rightarrow } N(0,1).$$

The multidimensional central limit theorem is stated next. We show that it easily follows from the one-dimensional central limit theorem, by making use of the Cramér–Wold theorem.

Theorem 7.17 (Multivariate CLT). 

Let X i,i ≥ 1, be iid d-dimensional random vectors with E(X 1 ) = μ, and covariance matrix Cov( X 1) = Σ. Then,

$$\sqrt{n}(\bar{\mathbf{X}} - \mu ){ \mathcal{L} \atop \Rightarrow } {N}_{d}(0,\Sigma ).$$

Remark.

If X i , i ≥ 1 are iid with mean μ and variance σ2, then the CLT in one dimension says that \(\frac{{S}_{n}-n\mu } {\sigma \sqrt{n}}{ \mathcal{L} \atop \Rightarrow } N(0,1)\), where S n is the nth partial sum X 1 + ⋯ + X n . In particular, therefore, \(\frac{{S}_{n}-n\mu } {\sigma \sqrt{n}} = {O}_{p}(1)\). In other words, in a distributional sense, \(\frac{{S}_{n}-n\mu } {\sigma \sqrt{n}}\) stabilizes. If we take a large n, then for most sample points \(\omega, \frac{\vert {S}_{n}(\omega )-n\mu \vert } {\sigma \sqrt{n}}\) will be, for example, less than 4. But as n changes, this collection of good sample points also changes. Indeed, any fixed sample point ω is one of the good sample points for certain values of n, and falls into the category of bad sample points for (many) other values of n. The law of the iterated logarithm says that if we fix ω and look at \(\frac{\vert {S}_{n}(\omega )-n\mu \vert } {\sigma \sqrt{n}}\) along such unlucky values of n, then \(\frac{{S}_{n}(\omega )-n\mu } {\sigma \sqrt{n}}\) will not appear to be stable. In fact, it will keep growing with n, although at a slow rate. Here is what the law of the iterated logarithm says.

Theorem 7.18 (Law of Iterated Logarithm(LIL)). 

Let X i ,i ≥ 1 be iid with mean μ, variance σ 2 < ∞, and let \({S}_{n} ={ \sum \nolimits }_{i=1}^{n}{X}_{i},n \geq 1\) . Then,

  1. (a)

    \({limsup}_{n}\frac{{S}_{n}-n\mu } {\sqrt{2n\log \log n}} = \sigma \,\,\mbox{ a.s.}\)

  2. (b)

    \({liminf }_{n}\frac{{S}_{n}-n\mu } {\sqrt{2n\log \log n}} = -\sigma \,\,\mbox{ a.s.}\)

  3. (c)

    If finite constants a,τ satisfy

    $${limsup}_{n}\frac{{S}_{n} - na} {\sqrt{2n\log \log n}} = \tau, $$

    then necessarily Var(X 1 ) < ∞, and a = E(X 1 ),τ 2 = Var(X 1 ).

See Chow and Teicher (1988, p. 355) for a proof. The main use of the LIL is in proving other strong laws. Because \(\sqrt{\log \log n}\) grows at a very slow rate, the practical use of the LIL is quite limited. We remark that the LIL provides another example of a sequence, namely \(\frac{{S}_{n}-n\mu } {\sqrt{2n\log \log n}}\), which converges in probability (to zero), but does not converge almost surely.

7.5 Preservation of Convergence and Statistical Applications

Akin to the results on preservation of convergence in probability and almost sure convergence under various operations, there are similar other extremely useful results on preservation of convergence in distribution. The first theorem is of particular importance in statistics.

7.5.1 Slutsky’s Theorem

Theorem 7.19 (Slutsky’s Theorem). 

Let X n,Y n be d and p -dimensional random vectors for some d,p ≥ 1. Suppose \(\mathbf{{X}_{n}}{ \mathcal{L} \atop \Rightarrow } \mathbf{X}\) , and \(\mathbf{{Y}_{n}}{ \mathcal{P} \atop \Rightarrow } \mathbf{c}\) . Let h(x,y) be a scalar or a vector-valued jointly continuous function in \((x,y) \in {\mathcal{R}}^{d} \times {\mathcal{R}}^{p}\) . Then \(h(\mathbf{{X}_{n}},\mathbf{{Y}_{n}}){ \mathcal{L} \atop \Rightarrow } h(\mathbf{X},\mathbf{c})\) .

Proof.

We use the part of the portmanteau theorem which says that a random vector \(\mathbf{{Z}_{n}}{ \mathcal{L} \atop \Rightarrow } \mathbf{Z}\) if E[g(Z n)] → E[g(Z)] for all bounded uniformly continuous functions g. Now, if we simply repeat the proof of the uniform convergence of the Bernstein polynomials in our example on Weierstrass’s theorem, the result is obtained. □ 

The following are some particularly important consequences of Slutsky’s theorem.

Corollary.

  1. (a)

    Suppose \(\mathbf{{X}_{n}}{ \mathcal{L} \atop \Rightarrow } \mathbf{X},\mathbf{{Y}_{n}}{ \mathcal{P} \atop \Rightarrow } \mathbf{c}\) , where X n,Y n are of the same order. Then, \(\mathbf{{X}_{n}} + \mathbf{{Y}_{n}}{ \mathcal{L} \atop \Rightarrow } \mathbf{X} + \mathbf{c}\).

  2. (b)

    Suppose \(\mathbf{{X}_{n}}{ \mathcal{L} \atop \Rightarrow } \mathbf{X},{Y }_{n}{ \mathcal{P} \atop \Rightarrow } c\) , where Y n are scalar random variables. Then \({Y }_{n}\mathbf{{X}_{n}}{ \mathcal{L} \atop \Rightarrow } c\mathbf{X}\).

  3. (c)

    Suppose \(\mathbf{{X}_{n}}{ \mathcal{L} \atop \Rightarrow } \mathbf{X},{Y }_{n}{ \mathcal{P} \atop \Rightarrow } c\neq 0\) , where Y n are scalar random variables. Then \(\frac{\mathbf{{X}_{n}}} {{Y }_{n}}{ \mathcal{L} \atop \Rightarrow } \frac{\mathbf{X}} {c}\).

Example 7.20 (Convergence of the t to Normal). 

Let X i , i ≥ 1, be iid N(μ, σ2), σ > 0, and let \({T}_{n} = \frac{\sqrt{n}(\bar{X}-\mu )} {s}\), where \({s}^{2} = \frac{1} {n-1}{ \sum }_{i=1}^{n}{({X}_{i} -\bar{ X})}^{2}\), namely the sample variance. We saw in Chapter 15 that T n has the central t(n − 1) distribution. Write

$${T}_{n} = \frac{\sqrt{n}(\bar{X} - \mu )/\sigma } {s/\sigma }.$$

We have seen that \({s}^{2}{ \mathrm{a.s.} \atop \Rightarrow } {\sigma }^{2}\). Therefore, by the continuous mapping theorem, \(s{ \mathrm{a.s.} \atop \Rightarrow } \sigma \), and so \(\frac{s} {\sigma }{ \mathrm{a.s.} \atop \Rightarrow } 1\). On the other hand, by the central limit theorem, \(\frac{\sqrt{n}(\bar{X}-\mu )} {\sigma }{ \mathcal{L} \atop \Rightarrow } N(0,1)\). Therefore, now by Slutsky’s theorem, \({T}_{n}{ \mathcal{L} \atop \Rightarrow } N(0,1)\).

Indeed, this argument shows that whatever the common distribution of the X i is, if \(0 < {\sigma }^{2} = \mathrm{Var}({X}_{1}) < \infty \), then \({T}_{n}{ \mathcal{L} \atop \Rightarrow } N(0,1)\), although the exact distribution of T n is no longer the central t distribution, unless the common distribution of the X i is normal.
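This robustness of the t-statistic is easy to see in simulation; a sketch, assuming NumPy, with exponential (hence non-normal) data of mean 1:

```python
import numpy as np

rng = np.random.default_rng(9)
n, reps = 100, 50_000
x = rng.exponential(size=(reps, n))      # non-normal data, mu = 1
t = np.sqrt(n) * (x.mean(axis=1) - 1.0) / x.std(axis=1, ddof=1)
# Compare tail frequencies with N(0,1): 0.159, 0.05, 0.025;
# the agreement improves as n grows
for c in [1.0, 1.645, 1.96]:
    print(c, np.mean(t > c))
```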

Example 7.21 (A Normal–Cauchy Connection). 

Consider iid standard normal variables, X 1, X 2, …, X 2n , n ≥ 1. Let

$${R}_{n} = \frac{{X}_{1}} {{X}_{2}} + \frac{{X}_{3}} {{X}_{4}} + \cdots + \frac{{X}_{2n-1}} {{X}_{2n}}, \quad \mathrm{and}\quad {D}_{n} = {X}_{1}^{2} + {X}_{ 2}^{2} + \cdots + {X}_{ n}^{2}.$$

Let \({T}_{n} = \frac{{R}_{n}} {{D}_{n}}\).

Recall that the quotient of two independent standard normals is distributed as a standard Cauchy. Thus,

$$\frac{{X}_{1}} {{X}_{2}}, \frac{{X}_{3}} {{X}_{4}},\ldots, \frac{{X}_{2n-1}} {{X}_{2n}}$$

are independent standard Cauchy. In the following, we write C n to denote a random variable having the Cauchy distribution with location parameter zero and scale parameter n. From our results on convolutions, we know that the sum of n independent standard Cauchy random variables is distributed as C n . Thus, \({R}_{n}{ \mathcal{L} \atop =} {C}_{n} \sim C(0,n){ \mathcal{L} \atop =} n{C}_{1}\), where \({C}_{1} \sim C(0,1)\). Therefore,

$${T}_{n}{ \mathcal{L} \atop =} \frac{n{C}_{1}} {{\sum \nolimits }_{i=1}^{n}{X}_{i}^{2}} = \frac{{C}_{1}} { \frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{X}_{i}^{2}}.$$

Now, by the WLLN, \(\frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{X}_{i}^{2}{ \mathcal{P} \atop \Rightarrow } E({X}_{1}^{2}) = 1\). Also, tautologically, a sequence of random variables each distributed exactly as the fixed C(0, 1) distribution converges in distribution to C 1. Applying Slutsky’s theorem, we conclude that \({T}_{n}{ \mathcal{L} \atop \Rightarrow } {C}_{1} \sim C(0,1)\).
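A simulation sketch of this example (a hypothetical illustration assuming NumPy and SciPy; n and the replication count are arbitrary): the empirical quantiles of T n match standard Cauchy quantiles, even though D n is built from the same normals that appear in R n .

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 200, 8000
X = rng.standard_normal((reps, 2 * n))
Rn = (X[:, 0::2] / X[:, 1::2]).sum(axis=1)  # sum of n standard Cauchy ratios
Dn = (X[:, :n] ** 2).sum(axis=1)            # X_1^2 + ... + X_n^2, about n by the WLLN
Tn = Rn / Dn
for q in (0.25, 0.50, 0.75, 0.90):
    # sample quantiles of T_n versus C(0,1) quantiles
    print(q, round(np.quantile(Tn, q), 3), round(stats.cauchy.ppf(q), 3))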

7.5.2 Delta Theorem

The next theorem says that convergence in distribution is appropriately preserved by making smooth transformations. In particular, we present a general version of a theorem of fundamental use in statistics, called the delta theorem.

Theorem 7.20 (Continuous Mapping Theorem). 

(a) Let X n be d-dimensional random vectors and let \(S \subseteq {\mathcal{R}}^{d}\) be such that P(X n ∈ S) = 1 ∀n. Suppose \(\mathbf{{X}_{n}}{ \mathcal{L} \atop \Rightarrow } \mathbf{X}\) . Let \(g : S \rightarrow {\mathcal{R}}^{p}\) be a continuous function, where p is a positive integer. Then \(g(\mathbf{{X}_{n}}){ \mathcal{L} \atop \Rightarrow } g(\mathbf{X})\).

  1. (b)

    (Delta Theorem of Cramér). Let X n be d-dimensional random vectors and let \(S \subseteq {\mathcal{R}}^{d}\) be such that \(P(\mathbf{{X}_{n}} \in S) = 1\,\,\forall n\) . Suppose for some d-dimensional vector μ, and some sequence of reals \({c}_{n} \rightarrow \infty, {c}_{n}(\mathbf{{X}_{n}} - \mu ){ \mathcal{L} \atop \Rightarrow } \mathbf{X}\) . Let \(g : S \rightarrow {\mathcal{R}}^{p}\) be a function with each coordinate of g once continuously differentiable with respect to every coordinate of x ∈ S at x = μ. Then

    $${c}_{n}(g(\mathbf{{X}_{n}}) - g(\mu )){ \mathcal{L} \atop \Rightarrow } Dg(\mu )\mathbf{X},$$

where Dg(μ) is the matrix of partial derivatives \((( \frac{\partial {g}_{i}} {\partial {x}_{j}})){\vert }_{\mathbf{x}=\mu }.\)

Proof.

For part (a), we use the Portmanteau theorem. Denote g(X n) = Y n, g(X) = Y, and consider bounded continuous functions f(Y n). Now, f(Y n) = f(g(X n)) = h(X n), where h(. ) is the composition function f(g(. )). Because h is continuous (f and g both are) and bounded (f is), the Portmanteau theorem implies that E(h(X n)) → E(h(X)), that is, E(f(Y n)) → E(f(Y)). Now the reverse implication in the Portmanteau theorem implies that \(\mathbf{{Y}_{n}}{ \mathcal{L} \atop \Rightarrow } \mathbf{Y}\).

We prove part (b) for the case d = p = 1. First note that it follows from the assumption \({c}_{n} \rightarrow \infty \) that X n  − μ = o p (1). Also, by an application of Taylor’s theorem,

$$g({x}_{0} + h) = g({x}_{0}) + hg^{\prime}({x}_{0}) + o(h)$$

if g is differentiable at x 0. Therefore,

$$g({X}_{n}) = g(\mu ) + ({X}_{n} - \mu )g^{\prime}(\mu ) + {o}_{p}({X}_{n} - \mu ).$$

That the remainder term is o p (X n  − μ) follows from our observation that X n  − μ = o p (1). Taking g(μ) to the left and multiplying by c n , we obtain

$${c}_{n}[g({X}_{n}) - g(\mu )] = {c}_{n}({X}_{n} - \mu )g^{\prime}(\mu ) + {c}_{n}{o}_{p}({X}_{n} - \mu ).$$

The term c n o p (X n  − μ) = o p (1), because c n (X n  − μ) = O p (1). Hence, by an application of Slutsky’s theorem, \({c}_{n}[g({X}_{n}) - g(\mu )]{ \mathcal{L} \atop \Rightarrow } g^{\prime}(\mu )X\). □ 

Example 7.22 (A Quadratic Form). 

Let X i , i ≥ 1 be iid random variables with finite mean μ and finite variance σ2. By the central limit theorem, \(\frac{\sqrt{n}(\bar{X}-\mu )} {\sigma }{ \mathcal{L} \atop \Rightarrow } Z\), where Z ∼ N(0, 1). Therefore, by the continuous mapping theorem, if \({Q}_{n} = \frac{n} {{\sigma }^{2}} {(\bar{X} - \mu )}^{2}\), then \({Q}_{n} ={ \left (\frac{\sqrt{n}(\bar{X}-\mu )} {\sigma } \right )}^{2}{ \mathcal{L} \atop \Rightarrow } {Z}^{2}\). But Z 2 ∼ χ1 2. Therefore, \({Q}_{n}{ \mathcal{L} \atop \Rightarrow } {\chi }_{1}^{2}\).

Example 7.23 (An Important Statistics Example). 

Let X = X n  ∼ Bin(n, p), n ≥ 1, 0 < p < 1. In statistics, p is generally treated as an unknown parameter, and the usual estimate of p is \(\hat{p} = \frac{X} {n}\). Define \({T}_{n} = \vert \frac{\sqrt{n}(\hat{p}-p)} {\sqrt{\hat{p}(1-\hat{p})}}\vert \). The goal of this example is to find the limiting distribution of T n . First, by the central limit theorem,

$$\frac{\sqrt{n}(\hat{p} - p)} {\sqrt{p(1 - p)}} = \frac{X - np} {\sqrt{np(1 - p)}}{ \mathcal{L} \atop \Rightarrow } N(0,1).$$

Next, by the WLLN, \(\hat{p}{ \mathcal{P} \atop \Rightarrow } p\), and hence by the continuous mapping theorem for convergence in probability, \(\sqrt{ \hat{p}(1 -\hat{ p})}{ \mathcal{P} \atop \Rightarrow } \sqrt{ p(1 - p)}\). This gives, by Slutsky’s theorem, \(\frac{\sqrt{n}(\hat{p}-p)} {\sqrt{\hat{p}(1-\hat{p})}}{ \mathcal{L} \atop \Rightarrow } N(0,1).\) Finally, because the absolute value function is continuous, by the continuous mapping theorem for convergence in distribution,

$${T}_{n} = \vert \frac{\sqrt{n}(\hat{p} - p)} {\sqrt{\hat{p}(1 -\hat{ p})}}\vert { \mathcal{L} \atop \Rightarrow } \vert Z\vert, $$

the absolute value of a standard normal.

Example 7.24.

Suppose X i , i ≥ 1, are iid with mean μ and variance \({\sigma }^{2} < \infty \). By the central limit theorem, \(\frac{\sqrt{n}(\bar{X}-\mu )} {\sigma }{ \mathcal{L} \atop \Rightarrow } Z\), where Z ∼ N(0, 1). Consider the function g(x) = x 2. This is continuously differentiable at every x, and g′(x) = 2x. If μ ≠ 0, then g′(μ) = 2μ ≠ 0. By the delta theorem, we get that \(\sqrt{n}(\bar{{X}}^{2} - {\mu }^{2}){ \mathcal{L} \atop \Rightarrow } N(0,4{\mu }^{2}{\sigma }^{2}).\) If μ = 0, this last statement is still true, with a degenerate limit; that is, \(\sqrt{n}\bar{{X}}^{2}{ \mathcal{P} \atop \Rightarrow } 0\) if μ = 0.

Example 7.25 (Sample Variance and Standard Deviation). 

Suppose again X i , i ≥ 1, are iid with mean θ, variance σ2, and \(E({X}_{1}^{4}) < \infty \). Also let \({\mu }_{j} = E{({X}_{1} - \theta )}^{j}\), 1 ≤ j ≤ 4. This example has d = 2, p = 1. Take

$$\mathbf{{X}_{n}} = \left (\begin{array}{c} \bar{X}\\ \frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{X}_{i}^{2} \end{array} \right ),\quad \mu = \left (\begin{array}{c} E{X}_{1} \\ E{X}_{1}^{2} \end{array} \right ),\quad \Sigma = \left (\begin{array}{cc} \mathrm{Var}({X}_{1}) &\mathrm{Cov}({X}_{1},{X}_{1}^{2}) \\ \mathrm{Cov}({X}_{1},{X}_{1}^{2})& \mathrm{Var}({X}_{1}^{2}) \end{array} \right )$$

By the multivariate central limit theorem, \(\sqrt{n}(\mathbf{{X}_{n}} - \mu ){ \mathcal{L} \atop \Rightarrow } N(\mathbf{0},\Sigma )\). Now consider the function g(u, v) = v − u 2. This is once continuously differentiable with respect to each of u, v (in fact at any u, v), and the partial derivatives are g u  =  − 2u, g v  = 1. Using the delta theorem, with a little matrix algebra, it follows that

$$\begin{array}{rcl} \sqrt{ n}\left ( \frac{1} {n}{\sum \nolimits }_{i=1}^{n}{({X}_{ i} -\bar{ X})}^{2} -\mathrm{Var}({X}_{ 1})\right )&{ \mathcal{L} \atop \Rightarrow } & N(0,{\mu }_{4} - {\sigma }^{4})\end{array}.$$

If we choose \({s}_{n}^{2} ={ \sum \nolimits }_{i=1}^{n}{({X}_{i} -\bar{ X})}^{2}/(n - 1)\) then

$$\begin{array}{rcl} \sqrt{ n}({s}_{n}^{2} - {\sigma }^{2})& =& \frac{{\sum \nolimits }_{i=1}^{n}{({X}_{i} -\bar{ X})}^{2}} {(n - 1)\sqrt{n}} + \sqrt{n}\left ( \frac{1} {n}{\sum \nolimits }_{i=1}^{n}{({X}_{ i} -\bar{ X})}^{2} - {\sigma }^{2}\right ) \\ \end{array}$$

\(=\,\sqrt{n}( \frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{({X}_{i} -\bar{ X})}^{2} - {\sigma }^{2}) + {o}_{p}(1)\), which also converges in law to N(0, μ4 − σ4) by Slutsky’s theorem. By another use of the delta theorem, this time with d = p = 1, and with the function \(g(u) = \sqrt{u}\), one gets

$$\begin{array}{rcl} \sqrt{ n}({s}_{n} - \sigma )&{ \mathcal{L} \atop \Rightarrow } & N\left (0, \frac{{\mu }_{4} - {\sigma }^{4}} {4{\sigma }^{2}} \right )\end{array}.$$
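The limit law for \({s}_{n}^{2}\) displayed above can be checked by simulation. The following sketch (an illustration assuming NumPy; Exp(1) data are used, for which σ2 = 1 and μ4 = 9, so μ4 − σ4 = 8) compares the empirical variance of \(\sqrt{n}({s}_{n}^{2} - {\sigma }^{2})\) with μ4 − σ4.

import numpy as np

rng = np.random.default_rng(1)
n, reps = 400, 20000
X = rng.exponential(1.0, (reps, n))   # sigma^2 = 1, mu_4 = 9 for Exp(1)
s2 = X.var(axis=1, ddof=1)            # s_n^2, the sample variance
W = np.sqrt(n) * (s2 - 1.0)           # sqrt(n) (s_n^2 - sigma^2)
print(W.mean(), W.var())              # approximately 0 and mu_4 - sigma^4 = 8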

Example 7.26 (Sample Correlation). 

Another use of the delta theorem is the derivation of the limiting distribution of the sample correlation coefficient r for iid bivariate data (X i , Y i ). We have

$$\begin{array}{rcl}{ r}_{n}& =& \frac{ \frac{1} {n} \sum \nolimits {X}_{i}{Y }_{i} -\bar{ X}\bar{Y }} {\sqrt{ \frac{1} {n} \sum \nolimits {({X}_{i} -\bar{ X})}^{2}}\sqrt{ \frac{1} {n} \sum \nolimits {({Y }_{i} -\bar{ Y })}^{2}}}\end{array}.$$

By taking

$$\begin{array}{rcl}{ T}_{n}& =& \left (\bar{X},\,\bar{Y },\, \frac{1} {n}\sum \nolimits {X}_{i}^{2},\, \frac{1} {n}\sum \nolimits {Y }_{i}^{2},\, \frac{1} {n}\sum \nolimits {X}_{i}{Y }_{i}\right ) \\ \theta & =& \left (E{X}_{1},\;E{Y }_{1},\;E{X}_{1}^{2},\;E{Y }_{ 1}^{2},\;E{X}_{ 1}{Y }_{1}\right ) \\ \end{array}$$

and by taking Σ to be the covariance matrix of (X 1, Y 1, X 1 2, Y 1 2, X 1 Y 1), and on using the transformation \(g({u}_{1},{u}_{2},{u}_{3},{u}_{4},{u}_{5}) = ({u}_{5} - {u}_{1}{u}_{2})/\sqrt{({u}_{3 } - {u}_{1 }^{2 })({u}_{4 } - {u}_{2 }^{2 })}\) it follows from the delta theorem, with d = 5, p = 1, that

$$\begin{array}{rcl} \sqrt{ n}({r}_{n} - \rho )&{ \mathcal{L} \atop \Rightarrow } N(0,{v}^{2})& \\ \end{array}$$

for some v > 0. It is not possible to write a clean formula for v in general. If (X i , Y i ) are iid \({N}_{2}({\mu }_{X},{\mu }_{Y },{\sigma }_{X}^{2},{\sigma }_{Y }^{2},\rho )\), then the calculation of v 2 can be done in closed form, and

$$\begin{array}{rcl} \sqrt{ n}({r}_{n} - \rho )&{ \mathcal{L} \atop \Rightarrow } N(0,{(1 - {\rho }^{2})}^{2}).& \\ \end{array}$$

However, convergence to normality is very slow.

7.5.3 Variance Stabilizing Transformations

A major use of the delta theorem is construction of variance stabilizing transformations (VST), a technique that is of fundamental use in statistics. VSTs are useful tools for constructing confidence intervals for unknown parameters. The general idea is the following. Suppose we want to find a confidence interval for some parameter \(\theta \in \mathcal{R}\). If \({T}_{n} = {T}_{n}({X}_{1},\ldots, {X}_{n})\) is some natural estimate for θ (e.g., sample mean as an estimate of a population mean), then often the CLT, or some generalization of the CLT, will tell us that \(\sqrt{n}({T}_{n} - \theta ){ \mathcal{L} \atop \Rightarrow } N(0,{\sigma }^{2}(\theta ))\), for some suitable function σ2(θ). This implies that in large samples,

$${P}_{\theta }({T}_{n} - {z}_{\frac{\alpha } {2} } \frac{\sigma (\theta )} {\sqrt{n}} \leq \theta \leq {T}_{n} + {z}_{\frac{\alpha } {2} } \frac{\sigma (\theta )} {\sqrt{n}} ) \approx 1 - \alpha, $$

where α is some specified number in (0,1) and \({z}_{\alpha /2}\,=\,{\Phi }^{-1}(1 -\frac{\alpha } {2} )\). Finally, plugging in T n in place of θ in σ(θ), a confidence interval for θ is \({T}_{n} \pm {z}_{\frac{\alpha } {2} } \frac{\sigma ({T}_{n})} {\sqrt{n}}\). The delta theorem provides an alternative solution that is sometimes preferred. By the delta theorem, if g( ⋅) is once differentiable at θ with g′(θ) ≠ 0, then

$$\begin{array}{rcl} \sqrt{ n}\left (g({T}_{n}) - g(\theta )\right )&{ \mathcal{L} \atop \Rightarrow } N(0,{[g^{\prime}(\theta )]}^{2}{\sigma }^{2}(\theta )).& \\ \end{array}$$

Therefore, if we set

$$\begin{array}{rcl}{ [g^{\prime}(\theta )]}^{2}{\sigma }^{2}(\theta )& =& {k}^{2} \\ \end{array}$$

for some constant k, then \(\sqrt{n}\left (g({T}_{n}) - g(\theta )\right ){ \mathcal{L} \atop \Rightarrow } N(0,{k}^{2})\), and this produces a confidence interval for g(θ):

$$g({T}_{n}) \pm {z}_{\frac{\alpha } {2} } \frac{k} {\sqrt{n}}.$$

By retransforming back to θ, we get another confidence interval for θ:

$${g}^{-1}\left (g({T}_{ n}) - {z}_{\frac{\alpha } {2} } \frac{k} {\sqrt{n}}\right ) \leq \theta \leq {g}^{-1}\left (g({T}_{ n}) + {z}_{\frac{\alpha } {2} } \frac{k} {\sqrt{n}}\right ).$$

The reason that this one is sometimes preferred to the first confidence interval, namely, \({T}_{n} \pm {z}_{\frac{\alpha } {2} } \frac{\sigma ({T}_{n})} {\sqrt{n}}\), is that no additional plug-in is needed to estimate the variance function: by construction, the asymptotic variance in this second method is the constant k 2. The transformation g(T n ) obtained from its defining property

$${[g^{\prime}(\theta )]}^{2}{\sigma }^{2}(\theta ) = {k}^{2}$$

has the expression

$$g(\theta ) = k\int \nolimits \nolimits \frac{1} {\sigma (\theta )}d\theta, $$

where the integral is to be interpreted as a primitive. The constant k can be chosen as any nonzero real number, and g(T n ) is called a variance stabilizing transformation. Although the delta theorem is certainly available in \({\mathcal{R}}^{d}\) even when d > 1, unfortunately the concept of VSTs does not generalize to multiparameter cases. It is generally infeasible to find a dispersion-stabilizing transformation when the dimension of θ is more than one. This example is a beautiful illustration of how probability theory leads to useful and novel statistical techniques.

Example 7.27 (VST in Binomial Case). 

Suppose \({X}_{n} \sim \mathrm{Bin}(n,p)\). Then \(\sqrt{n}({X}_{n}/n - p){ \mathcal{L} \atop \Rightarrow } N(0,p(1 - p))\). So, in the notation used above, \(\sigma (p) = \sqrt{p(1 - p)}\) and consequently, on taking \(k = \frac{1} {2}\),

$$g(p) = \int \nolimits \nolimits \frac{1/2} {\sqrt{p(1 - p)}}dp\; =\;\arcsin (\sqrt{p}).$$

Hence, \(g({X}_{n}) =\arcsin (\sqrt{{X}_{n } /n})\) is a variance-stabilizing transformation and indeed,

$$\sqrt{n}\left (\!\arcsin \left (\sqrt{\frac{{X}_{n } } {n}} \right ) -\arcsin \left (\sqrt{p}\right )\right ){ \mathcal{L} \atop \Rightarrow } N\left (0, \frac{1} {4}\right ).$$

Thus, a confidence interval for p is

$$\mathop{\sin }\limits^{2}\left (\!\arcsin \left (\sqrt{\frac{{X}_{n } } {n}} \right ) \mp \frac{{z}_{\alpha /2}} {2\sqrt{n}}\right ).$$
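As a numerical illustration (a sketch assuming NumPy and SciPy; the chosen n, p, and confidence level are arbitrary), one can compare the coverage of this arcsine interval with that of the plug-in interval \(\hat{p} \pm {z}_{\alpha /2}\sqrt{\hat{p}(1 -\hat{p})/n}\); the VST interval often performs better for p near 0 or 1.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p, reps, alpha = 50, 0.1, 20000, 0.05
z = stats.norm.ppf(1 - alpha / 2)
phat = rng.binomial(n, p, reps) / n

# plug-in (Wald) interval
half = z * np.sqrt(phat * (1 - phat) / n)
wald = ((phat - half) <= p) & (p <= (phat + half))

# arcsine VST interval, with the endpoint arguments clamped to [0, pi/2]
a = np.arcsin(np.sqrt(phat))
lo = np.sin(np.clip(a - z / (2 * np.sqrt(n)), 0, np.pi / 2)) ** 2
hi = np.sin(np.clip(a + z / (2 * np.sqrt(n)), 0, np.pi / 2)) ** 2
vst = (lo <= p) & (p <= hi)

# both coverages should be near 0.95; the Wald interval tends to undercover here
print("Wald coverage:", wald.mean(), " VST coverage:", vst.mean())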

Example 7.28 (Fisher’s z). 

Suppose (X i , Y i ), i = 1, …, n, are iid bivariate normal with parameters \({\mu }_{X},{\mu }_{Y },{\sigma }_{X}^{2},{\sigma }_{Y }^{2},\rho \). Then, as we saw above, \(\sqrt{n}({r}_{n} - \rho ){ \mathcal{L} \atop \Rightarrow } N(0,{(1 - {\rho }^{2})}^{2})\), r n being the sample correlation coefficient. Therefore,

$$g(\rho ) = \int \nolimits \nolimits \frac{1} {1 - {\rho }^{2}}d\rho = \frac{1} {2}\log \frac{1 + \rho } {1 - \rho } = \mbox{ arctanh}(\rho )$$

provides a variance-stabilizing transformation for r n . This is the famous arctanh transformation of Fisher, popularly known as Fisher’s z. By the delta theorem, \(\sqrt{n}(\mbox{ arctanh}({r}_{n}) -\mbox{ arctanh}(\rho ))\) converges in distribution to the N(0, 1) distribution. Confidence intervals for ρ are computed from Fisher’s z as

$$\tanh \!\left (\!\mbox{ arctanh}({r}_{n}) \pm \frac{{z}_{\alpha /2}} {\sqrt{n}} \right ).$$

The arctanh transformation z = arctanh(r n ) attains approximate normality much more quickly than r n itself.
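In code, the resulting interval is essentially one line. The sketch below assumes SciPy; it follows the asymptotic display above with \(\sqrt{n}\) in the denominator (the classical finite-sample recommendation replaces \(\sqrt{n}\) by \(\sqrt{n - 3}\)).

import numpy as np
from scipy import stats

def fisher_z_interval(r, n, alpha=0.05):
    # CI for rho: tanh(arctanh(r) -/+ z_{alpha/2} / sqrt(n))
    z = stats.norm.ppf(1 - alpha / 2)
    c = np.arctanh(r)
    return np.tanh(c - z / np.sqrt(n)), np.tanh(c + z / np.sqrt(n))

print(fisher_z_interval(0.6, 100))   # e.g., r_n = 0.6 computed from n = 100 pairs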

Example 7.29 (An Unusual VST). 

Here is a nonregular example on variance stabilization. Suppose we have iid observations \({X}_{1},{X}_{2},\ldots \) from the U[0, θ] distribution. Then, the usual estimate of θ is the sample maximum X (n), and \(n(\theta - {X}_{(n)}){ \mathcal{L} \atop \Rightarrow } \mathrm{Exp}(\theta )\). The asymptotic variance function in the distribution of the sample maximum is therefore simply θ2, and therefore, a VST is

$$g(\theta ) = \int \nolimits \nolimits \frac{1} {\theta }d\theta =\log \theta.$$

So, g(X (n)) = log X (n) is a variance-stabilizing transformation of X (n). In fact, \(n(\log \theta -\log {X}_{(n)}){ \mathcal{L} \atop \Rightarrow } \mathrm{Exp}(1)\). However, the interesting fact is that for every n, the distribution of n(logθ − logX (n)) is exactly a standard exponential. There is no nontrivial example such as this in the regular cases (although N(θ, 1) furnishes a trivial example).
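The exactness claim is easy to confirm numerically; in the sketch below (assuming NumPy and SciPy; θ and the deliberately small n are arbitrary), a Kolmogorov–Smirnov test finds no discrepancy between the pivot and Exp(1) even at n = 5.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
theta, n, reps = 2.5, 5, 20000
Xmax = rng.uniform(0, theta, (reps, n)).max(axis=1)
pivot = n * (np.log(theta) - np.log(Xmax))
print(stats.kstest(pivot, "expon"))  # exactly Exp(1) for every n, so only sampling noise remains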

7.6 Convergence of Moments

If some sequence of random variables X n converges in distribution to a random variable X, then sometimes we are interested in knowing whether moments of X n converge to moments of X. More generally, we may want to find approximations for moments of X n . Convergence in distribution just by itself cannot ensure convergence of any moment. An extra condition that ensures convergence of appropriate moments is uniform integrability. There is another side to this story. If we can show that the moments of X n converge to the moments of some recognizable distribution, then we can sometimes show that X n converges in distribution to that distribution. Some of these issues are discussed in this section.

7.6.1 Uniform Integrability

Definition 7.12.

Let {X n } be a sequence of random variables on some common sample space Ω. The sequence {X n } is called uniformly integrable if \({\mathrm{sup}}_{n\geq 1}E(\vert {X}_{n}\vert ) < \infty \), and if for any ε > 0 there exists a sufficiently small δ > 0 such that whenever P(A) < δ, \({\mathrm{sup}}_{n\geq 1}{\int \nolimits \nolimits }_{A}\vert {X}_{n}\vert \,dP < \epsilon \).

Remark.

We give two results on the link between uniform integrability and convergence of moments.

Theorem 7.21.

Suppose X,X n ,n ≥ 1 are such that E(|X| p ) < ∞, and E(|X n | p )<∞ ∀n ≥ 1. Suppose \({X}_{n}{ \mathcal{P} \atop \Rightarrow } X\) , and |X n | p is uniformly integrable. Then, E(|X n − X| p ) → 0, as n →∞.

For proving this theorem, we need two lemmas. The first one is one of the most fundamental results in real analysis. It uses the terminology of Lebesgue integrals and the Lebesgue measure, which we are not treating in this book. Thus, the statement below uses an undefined concept.

Lemma (Dominated Convergence Theorem). 

Let f n ,f be functions on \({\mathcal{R}}^{d},\) d<∞, and suppose f and each f n is (Lebesgue) integrable. Suppose f n (x) → f(x), except possibly for a set of x values of Lebesgue measure zero. If |f n |≤ g for some integrable function g, then \({\int \nolimits \nolimits }_{{\mathcal{R}}^{d}}{f}_{n}(x)dx \rightarrow {\int \nolimits \nolimits }_{{\mathcal{R}}^{d}}f(x)dx\) , as n →∞.

Lemma.

Suppose X,X n ,n≥1 are such that E(|X| p ) < ∞, and E(|X n | p ) < ∞ ∀n ≥ 1. Then |X n | p is uniformly integrable if and only if |X n − X| p is uniformly integrable.

Proof of Theorem 7.21.

Fix c > 0, and define \({Y }_{n} = \vert {X}_{n} - X{\vert }^{p},{Y }_{n,c} = {Y }_{n}{I}_{\{\vert {X}_{n}-X\vert \leq c\}}\). Because, by hypothesis, \({X}_{n}{ \mathcal{P} \atop \Rightarrow } X\), by the continuous mapping theorem, \({Y }_{n}{ \mathcal{P} \atop \Rightarrow } 0\), and as a consequence, \({Y }_{n,c}{ \mathcal{P} \atop \Rightarrow } 0\). Furthermore, | Y n, c  | ≤ c p, and the dominated convergence theorem implies that E(Y n, c ) → 0. Now,

$$\begin{array}{rcl} E(\vert {X}_{n} - X{\vert }^{p})& =& E({Y }_{ n}) = E({Y }_{n,c}) + E({Y }_{n}{I}_{\{\vert {X}_{n}-X\vert >c\}}) \\ & \leq & E({Y }_{n,c}) +{ \mathrm{sup}}_{n\geq 1}E({Y }_{n}{I}_{\{\vert {X}_{n}-X\vert >c\}}) \\ \Rightarrow {limsup}_{n}E(\vert {X}_{n} - X{\vert }^{p})& \leq &{ \mathrm{sup}}_{ n\geq 1}E({Y }_{n}{I}_{\{\vert {X}_{n}-X\vert >c\}}) \\ \Rightarrow {limsup}_{n}E(\vert {X}_{n} - X{\vert }^{p})& \leq &{ \mathrm{inf}}_{ c}{\mathrm{sup}}_{n\geq 1}E({Y }_{n}{I}_{\{\vert {X}_{n}-X\vert >c\}}) = 0,\end{array}$$

the final equality holding because \(\vert {X}_{n} - X{\vert }^{p}\) is uniformly integrable, by the second lemma.

Therefore, E( | X n  − X | p) → 0. □ 

Remark.

Sometimes we do not need the full force of the result that E( | X n  − X | p) → 0, but all we want is that E(X n p) converges to E(X p). In that case, the conditions in the previous theorem can be relaxed, and in fact from a statistical point of view, the relaxed condition is much more natural. The following result gives the relaxed conditions.

Theorem 7.22.

Suppose X n ,X,n ≥ 1 are defined on a common sample space Ω, that \({X}_{n}{ \mathcal{L} \atop \Rightarrow } X\) , and that for some given p ≥ 1,|X n | p is uniformly integrable. Then E(X n k ) → E(X k ) ∀k ≤ p.

Remark.

To apply these last two theorems, we have to verify that for the appropriate sequence X n , and for the relevant p, | X n  | p is uniformly integrable. Direct verification of uniform integrability from definition is often cumbersome. But simple sufficient conditions are available, and these are often satisfied in many applications. The next result lists a few useful sufficient conditions.

Theorem 7.23 (Sufficient Conditions for Uniform Integrability). 

  1. (a)

    Suppose for some δ > 0, sup n E|X n | 1+δ < ∞. Then {X n } is uniformly integrable.

  2. (b)

    If |X n |≤ Y,n ≥ 1, and E(Y ) < ∞, then {X n } is uniformly integrable.

  3. (c)

    If |X n |≤ Y n ,n ≥ 1, and Y n is uniformly integrable, then {X n } is uniformly integrable.

  4. (d)

    If X n ,n ≥ 1 are identically distributed, and E(|X 1 |) < ∞, then {X n } is uniformly integrable.

  5. (e)

    If {X n } and {Y n } are uniformly integrable then {X n + Y n } is uniformly integrable.

  6. (f)

    If {X n } is uniformly integrable and |Y n |≤ M < ∞, then {X n Y n } is uniformly integrable.

    See Chow and Teicher (1988, p. 94) for further details on the various parts of this theorem.

Example 7.30 (Sample Maximum). 

We saw in Chapter 6 that if \({X}_{1},{X}_{2},\ldots \) are iid, and if \(E(\vert {X}_{1}{\vert }^{k}) < \infty \), then any order statistic X (r) satisfies

$$E(\vert {X}_{(r)}{\vert }^{k}) \leq \frac{n!} {(r - 1)!(n - r)!}E(\vert {X}_{1}{\vert }^{k}).$$

In particular, for the sample maximum X (n) of n observations,

$$E(\vert {X}_{(n)}\vert ) \leq nE(\vert {X}_{1}\vert ) \Rightarrow E\left (\frac{\vert {X}_{(n)}\vert } {n} \right ) \leq E(\vert {X}_{1}\vert ).$$

By itself, this does not ensure that \(\frac{\vert {X}_{(n)}\vert } {n}\) is uniformly integrable.

However, if we also assume that \(E({X}_{1}^{2}) < \infty \), then the same argument gives \(E(\vert {X}_{(n)}{\vert }^{2}) \leq nE({X}_{1}^{2})\), so that \({\mathrm{sup}}_{n}E{(\frac{\vert {X}_{(n)}\vert } {n} )}^{2} < \infty \), which is enough to conclude that \(\frac{\vert {X}_{(n)}\vert } {n}\) is uniformly integrable.

However, we do not need the existence of E(X 1 2) for this conclusion. Note that

$$\vert {X}_{(n)}\vert \leq {\sum \nolimits }_{i=1}^{n}\vert {X}_{ i}\vert \Rightarrow \frac{\vert {X}_{(n)}\vert } {n} \leq \frac{{\sum \nolimits }_{i=1}^{n}\vert {X}_{i}\vert } {n}.$$

If \(E(\vert {X}_{1}\vert ) < \infty \), then in fact \(\frac{{\sum \nolimits }_{i=1}^{n}\vert {X}_{ i}\vert } {n}\) is uniformly integrable, and as a consequence, \(\frac{\vert {X}_{(n)}\vert } {n}\) is also uniformly integrable under just the condition \(E(\vert {X}_{1}\vert ) < \infty \).

7.6.2 The Moment Problem and Convergence in Distribution

We remarked earlier that convergence of moments can be useful to establish convergence in distribution. Clearly, however, if we only know that E(X n k) converges to E(X k) for each k, from that alone we cannot conclude that the distributions of X n converge to the distribution of X. This is because there could, in general, be another random variable Y with a distribution distinct from that of X but with all moments equal to the moments of X. However, if we rule out that possibility, then convergence in distribution follows.

Theorem 7.24.

Suppose for some sequence {X n } and a random variable X, E(X n k ) → E(X k ) ∀k ≥ 1. If the distribution of X is determined by its moments, then \({X}_{n}{ \mathcal{L} \atop \Rightarrow } X\).

When is a distribution determined by its sequence of moments? This is a hard analytical problem, and is commonly known as the moment problem. There is a huge and sophisticated literature on the moment problem. A few easily understood conditions for determinacy by moments are given in the next result.

Theorem 7.25.

  1. (a)

    If a random variable X is uniformly bounded, then it is determined by its moments.

  2. (b)

    If the mgf of a random variable X exists in a nonempty interval containing zero, then it is determined by its moments.

  3. (c)

    Let X have a density function f(x). If there exist positive constants c,α,β,k such that f(x) ≤ ce −α|x| |x| β  ∀x such that |x| > k, then X is determined by its moments.

Remark.

See Feller (1971, pp. 227–228 and p. 251) for the previous two theorems. Basically, part (b) is the primary result here, because if the conditions in (a) or (c) hold, then the mgf exists in an interval containing zero. However, (a) and (c) are useful special sufficient conditions.

Example 7.31 (Discrete Uniforms Converging to Continuous Uniform). 

Consider random variables X n with the discrete uniform distribution on \(\{ \frac{1} {n},\frac{2} {n},\ldots, 1\}\). Fix a positive integer k. Then, \(E({X}_{n}^{k}) = \frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{( \frac{i} {n})}^{k}\). This is the upper Riemann sum corresponding to the partition

$$\left (\frac{i - 1} {n}, \frac{i} {n}\right ],\quad i = 1,2,\ldots, n$$

for the function f(x) = x k on (0, 1]. Therefore, as n → ∞, E(X n k), which is the upper Riemann sum, converges to \({\int \nolimits \nolimits }_{0}^{1}{x}^{k}\,dx\), which is the kth moment of a random variable X having the uniform distribution on the unit interval. Because k is arbitrary, it follows from Theorem 7.24 together with part (a) of the above theorem that the discrete uniform distribution on \(\{ \frac{1} {n}, \frac{2} {n},\ldots, 1\}\) converges in distribution to the uniform distribution on the unit interval.
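The Riemann-sum computation can be verified directly; this small deterministic check (plain NumPy; the grid sizes and powers are arbitrary choices) shows \(E({X}_{n}^{k})\) approaching \(1/(k + 1)\).

import numpy as np

for n in (10, 100, 1000):
    support = np.arange(1, n + 1) / n       # the support {1/n, 2/n, ..., 1} of X_n
    for k in (1, 2, 3):
        m = (support ** k).mean()           # E(X_n^k): the upper Riemann sum
        print(n, k, round(m, 5), "->", 1 / (k + 1))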

7.6.3 Approximation of Moments

Knowing the limiting value of a moment of some sequence of random variables is only a first-order approximation to the moment. Sometimes we want more refined approximations. Suppose X i , 1 ≤ i ≤ d, are jointly distributed random variables, and \({T}_{d}({X}_{1},{X}_{2},\ldots, {X}_{d})\) is some function of \({X}_{1},{X}_{2},\ldots, {X}_{d}\). To find approximations for a moment of T d , one commonly used technique is to approximate the function \({T}_{d}({x}_{1},{x}_{2},\ldots, {x}_{d})\) by a simpler function, say \({g}_{d}({x}_{1},{x}_{2},\ldots, {x}_{d})\), and then use the moment of \({g}_{d}({X}_{1},{X}_{2},\ldots, {X}_{d})\) as an approximation to the moment of \({T}_{d}({X}_{1},{X}_{2},\ldots, {X}_{d})\). Note that this is a formal approximation. It does not come with an automatic quantification of the error of the approximation. Such quantification is usually a harder problem, and limited answers are available. We address these two issues in this section. We consider approximation of the mean and variance of a statistic, because of their special importance.

A natural approximation of a smooth function is obtained by expanding the function around some point in a Taylor series. For the formal approximations, we assume that all the moments of X i that are necessary for the approximation to make sense do exist.

It is natural to expand the statistic around the mean vector μ of \(\mathbf{X} = ({X}_{1},\ldots, {X}_{d})\). For notational simplicity, we write t for T d . Then, the first- and second-order Taylor expansions for \(t({X}_{1},{X}_{2},\ldots, {X}_{d})\) are:

$$t({x}_{1},{x}_{2},\ldots, {x}_{d}) \approx t({\mu }_{1},\ldots, {\mu }_{d}) +{ \sum \nolimits }_{i=1}^{d}({x}_{ i} - {\mu }_{i}){t}_{i}({\mu }_{1},\ldots, {\mu }_{d}),$$

and

$$\begin{array}{rcl} & t({x}_{1},{x}_{2},\ldots, {x}_{d}) \approx t({\mu }_{1},\ldots, {\mu }_{d}) +{ \sum \nolimits }_{i=1}^{d}({x}_{i} - {\mu }_{i}){t}_{i}({\mu }_{1},\ldots, {\mu }_{d})& \\ & +\dfrac{1} {2}{\sum \nolimits }_{1\leq i,j\leq d}({x}_{i} - {\mu }_{i})({x}_{j} - {\mu }_{j}){t}_{ij}({\mu }_{1},\ldots, {\mu }_{d}). & \\ \end{array}$$

If we formally take an expectation on both sides, we get the first- and second-order approximations to \(E[{T}_{d}({X}_{1},{X}_{2},\ldots, {X}_{d})]\):

$$E[{T}_{d}({X}_{1},{X}_{2},\ldots, {X}_{d})] \approx {T}_{d}({\mu }_{1},\ldots, {\mu }_{d}),$$

and

$$E[{T}_{d}({X}_{1},{X}_{2},\ldots, {X}_{d})] \approx {T}_{d}({\mu }_{1},\ldots, {\mu }_{d}) + \frac{1} {2}{\sum \nolimits }_{1\leq i,j\leq d}{t}_{ij}({\mu }_{1},\ldots, {\mu }_{d}){\sigma }_{ij},$$

where σ ij is the covariance between X i and X j .

Consider now the variance approximation problem. From the first-order Taylor approximation

$$t({x}_{1},{x}_{2},\ldots, {x}_{d}) \approx t({\mu }_{1},\ldots, {\mu }_{d}) +{ \sum \nolimits }_{i=1}^{d}({x}_{ i} - {\mu }_{i}){t}_{i}({\mu }_{1},\ldots, {\mu }_{d}),$$

by formally taking the variance of both sides, we get the first-order variance approximation

$$\begin{array}{rcl} \mbox{ Var}({T}_{d}({X}_{1},{X}_{2},\ldots, {X}_{d}))& \approx & \mbox{ Var}\left ({\sum \nolimits }_{i=1}^{d}({X}_{ i} - {\mu }_{i}){t}_{i}({\mu }_{1},\ldots, {\mu }_{d})\right ) \\ & =& {\sum \nolimits }_{1\leq i,j\leq d}\,{t}_{i}({\mu }_{1},\ldots, {\mu }_{d}){t}_{j}({\mu }_{1},\ldots, {\mu }_{d}){\sigma }_{ij}\end{array}.$$

The second-order variance approximation takes more work. By using the second-order Taylor approximation for t(x 1, x 2, , x d ), the second-order variance approximation is

$$\begin{array}{rcl} \mbox{ Var}({T}_{d}({X}_{1},{X}_{2},\ldots, {X}_{d}))& \approx & \mbox{ Var}\left ({\sum \nolimits}_{i}({X}_{i} - {\mu }_{i}){t}_{i}({\mu }_{1},\ldots, {\mu }_{d})\right ) \\ & & +\ \frac{1} {4}\mbox{ Var}\left({\sum \nolimits }_{i,j}({X}_{i} - {\mu }_{i})({X}_{j} - {\mu }_{j}){t}_{ij}({\mu }_{1},\ldots, {\mu }_{d})\right) \\ & & +\ \mbox{ Cov}\Biggl(\sum \nolimits ({X}_{i} - {\mu }_{i}){t}_{i}({\mu }_{1},\ldots, {\mu }_{d}), \\ & & \quad \qquad \;\;\,{\sum \nolimits }_{j,k}({X}_{j} - {\mu }_{j})({X}_{k} - {\mu }_{k}){t}_{jk}({\mu }_{1},\ldots, {\mu }_{d})\Biggr)\end{array}$$

If we denote E(X i  − μ i )(X j  − μ j )(X k  − μ k ) = m 3, ijk , and E(X i  − μ i )(X j  − μ j )(X k  − μ k )(X l  − μ l ) = m 4, ijkl , then the second-order variance approximation becomes

$$\begin{array}{rcl} \mbox{ Var}({T}_{d}({X}_{1},{X}_{2},\ldots, {X}_{d}))& \approx & {\sum \nolimits }_{i,j}{t}_{i}({\mu }_{1},\ldots, {\mu }_{d}){t}_{j}({\mu }_{1},\ldots, {\mu }_{d}){\sigma }_{ij} \\ & & +\ {\sum \nolimits }_{i,j,k}{t}_{i}({\mu }_{1},\ldots, {\mu }_{d}){t}_{jk}({\mu }_{1},\ldots, {\mu }_{d}){m}_{3,ijk} \\ & & +\ \frac{1} {4}{\sum \nolimits }_{i,j,k,l}{t}_{ij}({\mu }_{1},\ldots, {\mu }_{d}){t}_{kl}({\mu }_{1},\ldots, {\mu }_{d}) \\ & & \;[{m}_{4,ijkl} - {\sigma }_{ij}{\sigma }_{kl}]\end{array}.$$

For general d, this is a complicated expression. For d = 1, it reduces to the reasonably simple approximation

$$\mbox{ Var}(T(X)) \approx {[t^{\prime}(\mu )]}^{2}{\sigma }^{2} + t^{\prime}(\mu )t^{{\prime}{\prime}}(\mu )E{(X - \mu )}^{3} + \frac{1} {4}{[t^{{\prime}{\prime}}(\mu )]}^{2}[E{(X - \mu )}^{4} - {\sigma }^{4}].$$
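As a check on this d = 1 formula, consider t(x) = x 2 with X ∼ N(μ, σ2), where E(X − μ)3 = 0, E(X − μ)4 = 3σ4, and Var(X 2) = 4μ2σ2 + 2σ4 can be computed exactly. The sketch below (plain Python arithmetic; the values of μ and σ are arbitrary) shows the second-order approximation reproducing the exact variance, while the first-order approximation misses the 2σ4 term.

mu, sigma = 2.0, 1.5
t1, t2 = 2 * mu, 2.0                          # t'(mu) and t''(mu) for t(x) = x^2
m3, m4 = 0.0, 3 * sigma**4                    # central moments of N(mu, sigma^2)

first = t1**2 * sigma**2                      # first-order approximation
second = first + t1 * t2 * m3 + 0.25 * t2**2 * (m4 - sigma**4)
exact = 4 * mu**2 * sigma**2 + 2 * sigma**4   # exact Var(X^2) for X ~ N(mu, sigma^2)
print(first, second, exact)                   # the second-order value is exact here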

Example 7.32.

Let X, Y be two jointly distributed random variables, with means μ1, μ2, variances σ1 2, σ2 2, and covariance σ12. We work out the second-order approximation to the expectation of T(X, Y ) = XY. Writing t for T as above, the various relevant partial derivatives are t x  = y, t y  = x, t xx  = t yy  = 0, t xy  = 1. Plugging into the general formula for the second-order approximation to the mean, we get \(E(XY ) \approx {\mu }_{1}{\mu }_{2} + \frac{1} {2}[{\sigma }_{12} + {\sigma }_{21}] = {\mu }_{1}{\mu }_{2} + {\sigma }_{12}\). Thus, in this case, the second-order approximation reproduces the exact mean of XY.

Example 7.33 (A Multidimensional Example). 

Let \(\mathbf{X} = ({X}_{1},{X}_{2},\ldots, {X}_{d})\) have mean vector μ and covariance matrix Σ. Assume that μ is not the null vector. We find a second-order approximation to E( | | X | | ). Denoting \(T({x}_{1},\ldots, {x}_{d}) = \vert \vert \mathbf{x}\vert \vert \), the successive partial derivatives are

$${t}_{i}(\mu ) = \frac{{\mu }_{i}} {\vert \vert \mu \vert \vert },\quad {t}_{ii}(\mu ) = \frac{1} {\vert \vert \mu \vert \vert }- \frac{{\mu }_{i}^{2}} {\vert \vert \mu \vert {\vert }^{3}},\quad {t}_{ij}(\mu ) = -\frac{{\mu }_{i}{\mu }_{j}} {\vert \vert \mu \vert {\vert }^{3}}(i\neq j).$$

Plugging these into the general formula for the second-order approximation of the expectation, on some algebra, we get the approximation

$$\begin{array}{rcl} E(\vert \vert \mathbf{X}\vert \vert )& \approx & \vert \vert \mu \vert \vert + \frac{\mathit{tr}\Sigma } {2\vert \vert \mu \vert \vert }-\frac{{\sum \nolimits }_{i}{\mu }_{i}^{2}{\sigma }_{ii}} {2\vert \vert \mu \vert {\vert }^{3}} -\frac{{\sum \nolimits }_{i\neq j}{\mu }_{i}{\mu }_{j}{\sigma }_{ij}} {2\vert \vert \mu \vert {\vert }^{3}} \\ & =& \vert \vert \mu \vert \vert + \frac{1} {2\vert \vert \mu \vert \vert }\left [\mathit{tr}\Sigma -\frac{\mu ^{\prime}\Sigma \mu } {\mu ^{\prime}\mu } \right ]\end{array}.$$

The ratio \(\frac{\mu ^{\prime}\Sigma \mu } {\mu ^{\prime}\mu }\) varies between the minimum and the maximum eigenvalue of Σ, whereas trΣ equals the sum of all the eigenvalues. Thus, \(\mathit{tr}\Sigma -\frac{\mu ^{\prime}\Sigma \mu } {\mu ^{\prime}\mu } \geq 0\), Σ being a nonnegative definite matrix, which implies that the approximation \(\vert \vert \mu \vert \vert + \frac{1} {2\vert \vert \mu \vert \vert }[\mathit{tr}\Sigma -\frac{\mu ^{\prime}\Sigma \mu } {\mu ^{\prime}\mu } ]\) is ≥ | | μ | | . This is consistent with the bound E( | | X | | ) ≥ | | μ | | , as is implied by Jensen’s inequality.

The second-order variance approximation is difficult to work out in this example. However, the first-order approximation is easily worked out, and gives

$$\mbox{ Var}(\vert \vert \mathbf{X}\vert \vert ) \approx \frac{\mu ^{\prime}\Sigma \mu } {\mu ^{\prime}\mu }.$$

Note that no distributional assumption about X was made in deriving the approximations.
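A numeric check of the second-order approximation to E(||X||) is given in the sketch below (assuming NumPy; a normal distribution is used only to draw the samples, and the mean vector and covariance matrix are arbitrary choices).

import numpy as np

rng = np.random.default_rng(5)
mu = np.array([3.0, 1.0, -2.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

nmu = np.linalg.norm(mu)
quad = mu @ Sigma @ mu / (mu @ mu)
approx = nmu + (np.trace(Sigma) - quad) / (2 * nmu)  # the second-order formula

X = rng.multivariate_normal(mu, Sigma, 200000)
print(approx, np.linalg.norm(X, axis=1).mean())      # the two should be close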

Example 7.34 (Variance of the Sample Variance). 

It is sometimes necessary to calculate the variance of a centered sample moment \({m}_{k} = \frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{({X}_{i} -\bar{ X})}^{k}\), for iid observations \({X}_{1},\ldots, {X}_{n}\) from some one-dimensional distribution. Particularly, the case k = 2 is of broad interest in statistics. Because we are considering centered moments, we may assume that E(X i ) = 0, so that E(X i k) for any k will equal the population centered moment μ k  = E(X i  − μ)k. We also recall that \(E({m}_{2}) = \frac{n-1} {n} {\sigma }^{2}\).

Using the algebraic identity \({\sum \nolimits }_{i=1}^{n}{({X}_{i} -\bar{ X})}^{2} ={ \sum \nolimits }_{i=1}^{n}{X}_{i}^{2} - {({\sum \nolimits }_{i=1}^{n}{X}_{i})}^{2}/n\), one can make substantial algebraic simplification towards calculating the variance of m 2. Indeed,

$$\begin{array}{rcl} \mbox{ Var}({m}_{2})& =& E[{m}_{2}^{2}] - {[E({m}_{ 2})]}^{2} \\ & =& \frac{1} {{n}^{2}}E\left [{\sum \nolimits }_{i=1}^{n}{X}_{ i}^{4} +{ \sum \nolimits }_{i\neq j}{X}_{i}^{2}{X}_{ j}^{2} + \frac{8} {{n}^{2}}{ \sum \nolimits }_{i\neq j}{X}_{i}^{2}{X}_{ j}^{2} + \frac{4} {{n}^{2}}{ \sum \nolimits }_{i\neq j\neq k}{X}_{i}^{2}{X}_{ j}{X}_{k}\right. \\ & & \qquad \quad \left.-\frac{4} {n}{\sum \nolimits }_{i\neq j}{X}_{i}^{3}{X}_{ j}\right ] - {[E({m}_{2})]}^{2}\end{array}.$$

The expectation of each term above can be found by using the independence of the X i and the zero mean assumption, and, interestingly, the variance of m 2 can thus be found exactly for any n, namely,

$$\mbox{ Var}({m}_{2}) = \frac{1} {{n}^{3}}[{(n - 1)}^{2}({\mu }_{ 4} - {\sigma }^{4}) + 2(n - 1){\sigma }^{4}].$$

The approximate methods would have produced the answer

$$\mbox{ Var}({m}_{2}) \approx \frac{{\mu }_{4} - {\sigma }^{4}} {n}.$$

It is useful to know that the approximate methods would likewise produce the general first-order variance approximation

$$\mbox{ Var}({m}_{r}) \approx \frac{{\mu }_{2r} + {r}^{2}{\sigma }^{2}{\mu }_{r-1}^{2} - 2r{\mu }_{r-1}{\mu }_{r+1} - {\mu }_{r}^{2}} {n}.$$
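For r = 2 these formulas can be compared directly with a simulation. The sketch below (assuming NumPy; Exp(1) data, so σ2 = 1 and μ4 = 9, with a deliberately small n so the difference is visible) prints the simulated variance of m 2, the exact formula, and the first-order approximation.

import numpy as np

rng = np.random.default_rng(4)
n, reps = 20, 200000
X = rng.exponential(1.0, (reps, n))   # sigma^2 = 1, mu_4 = 9 for Exp(1)
m2 = X.var(axis=1)                    # ddof=0 gives m_2 = (1/n) sum (X_i - Xbar)^2
mu4, s4 = 9.0, 1.0
exact = ((n - 1)**2 * (mu4 - s4) + 2 * (n - 1) * s4) / n**3
approx = (mu4 - s4) / n
print(m2.var(), exact, approx)        # simulation tracks the exact value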

The formal approximations described above may work well in some cases; however, it is useful to have some theoretical quantification of the accuracy of the approximations. This is difficult in general, and we give one result in a special case, with d = 1.

Theorem 7.26.

Suppose X 1 ,X 2 ,… are iid observations with a finite fourth moment. Let E(X 1 ) = μ, and Var (X 1 ) = σ 2 . Let g be a scalar function with four uniformly bounded derivatives. Then

  1. (a)

    \(E[g(\overline{X})] = g(\mu ) + \frac{{g}^{(2)}(\mu ){\sigma }^{2}} {2n} + O({n}^{-2});\)

  2. (b)

    \(\mbox{ Var}[g(\overline{X})] = \frac{{(g^{\prime}(\mu ))}^{2}{\sigma }^{2}} {n} + O({n}^{-2}).\)

See Bickel and Doksum (2007) for a proof of this theorem.

7.7 Convergence of Densities and Scheffé’s Theorem

Suppose X n , X, n ≥ 1 are continuous random variables with densities f n , f, and that X n converges in distribution to X. It is natural to ask whether that implies that f n converges to f pointwise. Simple counterexamples show that this need not be true. We show an example below. However, convergence of densities, when true, is very useful. It ensures a mode of convergence much stronger than convergence in distribution. We discuss convergence of densities in general, and for sample means of iid random variables in particular, in this section.

First, we give an example to show that convergence in distribution does not imply convergence of densities.

Example 7.35 (Convergence in Distribution Is Weaker Than Convergence of Density). 

Suppose X n is a sequence of random variables on [0, 1] with density f n (x) = 1 + cos(2πnx). Then, \({X}_{n}{ \mathcal{L} \atop \Rightarrow } U[0,1]\) by a direct verification of the definition using CDFs. Indeed, \({F}_{n}(x) = x + \frac{\sin (2n\pi x)} {2n\pi } \rightarrow x\,\forall x \in (0,1)\). However, note that the densities f n do not converge to the uniform density 1 as n → ∞.

Convergence of densities is useful to have when true, because it ensures a much stronger form of convergence than convergence in distribution. Suppose X n have CDF F n and density f n , and X has CDF F and density f. If \({X}_{n}{ \mathcal{L} \atop \Rightarrow } X\), then we can only assert that F n (x) = P(X n  ≤ x) → F(x) = P(X ≤ x)  ∀x. However, if we have convergence of the densities, then we can make the much stronger assertion that for any event A, P(X n  ∈ A) → P(X ∈ A), not just for events A of the form \(A = (-\infty, x]\). This is explained below.

Definition 7.13.

Let X, Y be two random variables defined on a common sample space Ω. The total variation distance between the distributions of X and Y is defined as d TV (X, Y ) = sup A  | P(X ∈ A) − P(Y ∈ A) | .

Remark.

Again, actually the set A is not completely arbitrary. We do need the restriction that A be a Borel set, a concept in measure theory. However, we make no further mention of this qualification.

The relation between total variation distance and densities when the random variables X, Y are continuous is described by the following result.

Lemma.

Let X,Y be continuous random variables with densities f,g. Then \({d}_{TV }(X,Y ) = \frac{1} {2}{ \int \nolimits \nolimits }_{-\infty }^{\infty }\vert f(x) - g(x)\vert dx\).

Proof.

The proof is based on two facts:

$${\int \nolimits \nolimits }_{-\infty }^{\infty }\vert f(x) - g(x)\vert dx = 2{\int \nolimits \nolimits }_{-\infty }^{\infty }{(f - g)}^{+}dx,$$

and, for any set A,

$$\vert P(X \in A) - P(Y \in A)\vert \leq {\int \nolimits \nolimits }_{x:f(x)>g(x)}(f(x) - g(x))dx.$$

Putting these two together,

$${d}_{TV }(X,Y ) ={ \mathrm{sup}}_{A}\vert P(X \in A) - P(Y \in A)\vert \leq \frac{1} {2}{\int \nolimits \nolimits }_{-\infty }^{\infty }\vert f(x) - g(x)\vert dx.$$

However, for the particular set A 0 = { x : f(x) > g(x)}, \(\vert P(X \in {A}_{0}) - P(Y \in {A}_{0})\vert = \frac{1} {2}{ \int \nolimits \nolimits }_{-\infty }^{\infty }\vert f(x) - g(x)\vert dx\), and so, that proves that sup A  | P(X ∈ A) − P(Y ∈ A) | exactly equals \(\frac{1} {2}{ \int \nolimits \nolimits }_{-\infty }^{\infty }\vert f(x) - g(x)\vert dx\). □ 

Example 7.36 (Total Variation Distance Between Two Normals). 

Total variation distance is usually hard to find in closed analytical form. The absolute value sign makes closed-form calculations difficult. It is, however, possible to write a closed-form formula for the total variation distance between two arbitrary normal distributions in one dimension. No such formula would be possible in higher dimensions.

Let \(X \sim N({\mu }_{1},{\sigma }_{1}^{2})\), \(Y \sim N({\mu }_{2},{\sigma }_{2}^{2})\). We use the result that \({d}_{TV }(X,Y ) = \frac{1} {2}{ \int \nolimits \nolimits }_{-\infty }^{\infty }\vert f(x) - g(x)\vert dx\), where f, g are the densities of X, Y. To evaluate the integral of | f(x) − g(x) | , we need to find the set of all values of x for which f(x) ≥ g(x). We assume that σ1 > σ2, and use the notation

$$\begin{array}{rcl} c& =& \frac{{\sigma }_{1}} {{\sigma }_{2}},\hspace{118.79993pt} \Delta = \frac{{\mu }_{1} - {\mu }_{2}} {{\sigma }_{2}}, \\ A& =& \frac{\sqrt{({c}^{2 } - 1)2\log c + {\Delta }^{2}} - c\Delta } {{c}^{2} - 1}, \quad B = -\frac{\sqrt{({c}^{2 } - 1)2\log c + {\Delta }^{2}} + c\Delta } {{c}^{2} - 1} \end{array}.$$

The case σ1 = σ2 is commented on below.

With this notation, by making a change of variable,

$${\int \nolimits \nolimits }_{-\infty }^{\infty }\vert f(x) - g(x)\vert dx ={ \int \nolimits \nolimits }_{-\infty }^{\infty }\vert \phi (z) - c\phi (\Delta + cz)\vert dz,$$

and ϕ(z) ≤ cϕ(Δ + cz) if and only if A ≤ z ≤ B. Therefore,

$$\begin{array}{rcl} {\int \nolimits \nolimits }_{-\infty }^{\infty }\vert f(x) - g(x)\vert dx& =& {\int \nolimits \nolimits }_{A}^{B}[c\phi (\Delta + cz) - \phi (z)]dz \\ & & +{\int \nolimits \nolimits }_{-\infty }^{A}[\phi (z) - c\phi (\Delta + cz)]dz \\ & & +{\int \nolimits \nolimits }_{B}^{\infty }[\phi (z) - c\phi (\Delta + cz)]dz \\ & =& 2[(\Phi (\Delta + cB) - \Phi (B)) - (\Phi (\Delta + cA) - \Phi (A))], \\ \end{array}$$

where the quantities Δ + cB, Δ + cA work out to

$$\begin{array}{rcl} \Delta + cB& =& \frac{c\sqrt{({c}^{2 } - 1)2\log c + {\Delta }^{2}} - \Delta } {{c}^{2} - 1}, \\ \Delta + cA& =& -\frac{c\sqrt{({c}^{2 } - 1)2\log c + {\Delta }^{2}} + \Delta } {{c}^{2} - 1} \end{array}.$$

When σ1 = σ2, the expression reduces to \(\Phi (\frac{\vert \Delta \vert } {2} ) - \Phi (-\frac{\vert \Delta \vert } {2} )\). In applying the formula we have derived, it is important to remember that the larger of the two variances has been called σ1. Finally, now,

$${d}_{TV }(X,Y ) = (\Phi (\Delta + cB) - \Phi (B)) - (\Phi (\Delta + cA) - \Phi (A)),$$

with Δ + cA, Δ + cB, A, B as given above explicitly.

We see from the formula that d TV (X, Y ) depends on both individual variances, and on the difference between the means. When the means are equal, the total variation distance reduces to the simpler expression

$$\begin{array}{rcl} & & \;\;\;\;2\left [\Phi \left ( \frac{c\sqrt{2\log c}} {\sqrt{{c}^{2 } - 1}}\right ) - \Phi \left ( \frac{\sqrt{2\log c}} {\sqrt{{c}^{2 } - 1}}\right )\right ] \\ & & \approx \frac{1} {2\sqrt{2\pi e}}(3 - c)(c - 1), \\ \end{array}$$

for c ≈ 1. 
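The closed form can be validated against direct numerical integration of \(\frac{1} {2}\int \vert f - g\vert \). The sketch below (assuming SciPy; the equal-variance case, where the closed form reduces to Φ(|Δ|∕2) − Φ(−|Δ|∕2), is used for the comparison) is one way to do this.

import numpy as np
from scipy import stats
from scipy.integrate import quad

def tv_numeric(mu1, s1, mu2, s2):
    # d_TV = (1/2) * integral of |f - g|; finite limits suffice for normal tails
    f, g = stats.norm(mu1, s1).pdf, stats.norm(mu2, s2).pdf
    val, _ = quad(lambda x: abs(f(x) - g(x)), -30.0, 30.0, limit=200)
    return val / 2

delta = 1.0  # N(0,1) versus N(1,1)
closed_form = stats.norm.cdf(delta / 2) - stats.norm.cdf(-delta / 2)
print(tv_numeric(0, 1, 1, 1), closed_form)   # both are about 0.3829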

The next fundamental result asserts that if X n , X are continuous random variables with densities f n , f, and if f n (x) → f(x)  ∀x, then d TV (X n , X) → 0. This means that pointwise convergence of densities, when true, ensures an extremely strong mode of convergence, namely convergence in total variation.

Theorem 7.27 (Scheffé’s Theorem). 

Let f n ,f be nonnegative integrable functions. Suppose:

$${f}_{n}(x) \rightarrow f(x)\,\forall x;\quad {\int \nolimits \nolimits }_{-\infty }^{\infty }{f}_{ n}(x)dx \rightarrow {\int \nolimits \nolimits }_{-\infty }^{\infty }f(x)dx.$$

Then \({\int \nolimits \nolimits }_{-\infty }^{\infty }\vert {f}_{n}(x) - f(x)\vert dx \rightarrow 0\).

In particular, if f n ,f are all density functions, and if f n (x) → f(x) ∀x, then \({\int \nolimits \nolimits }_{-\infty }^{\infty }\vert {f}_{n}(x) - f(x)\vert dx \rightarrow 0\).

Proof.

The proof is based on the pointwise algebraic identity

$$\vert {f}_{n}(x) - f(x)\vert = {f}_{n}(x) + f(x) - 2\min \{{f}_{n}(x),f(x)\}.$$

Now note that min{f n (x), f(x)} → f(x)  ∀x, as f n (x) → f(x)  ∀x, and min{f n (x), f(x)} ≤ f(x). Therefore, by the dominated convergence theorem (see the previous section), \({\int \nolimits \nolimits }_{-\infty }^{\infty }\min \{ {f}_{n}(x),f(x)\}dx \rightarrow {\int \nolimits \nolimits }_{-\infty }^{\infty }f(x)dx\). The pointwise algebraic identity now gives that

$${\int \nolimits \nolimits }_{-\infty }^{\infty }\vert {f}_{ n}(x)-f(x)\vert dx \rightarrow {\int \nolimits \nolimits }_{-\infty }^{\infty }f(x)dx+{\int \nolimits \nolimits }_{-\infty }^{\infty }f(x)dx-2{\int \nolimits \nolimits }_{-\infty }^{\infty }f(x)dx = 0,$$

which completes the proof. □ 

Remark.

As we remarked before, convergence in total variation is very strong and should not be expected without some additional structure. The following theorems exemplify the kind of structure that may be necessary. The first theorem below is a general one: no assumptions are made on the structural form of the statistic. In the second theorem below, convergence in total variation is considered for sample means of iid random variables: there is a restriction on the structural form of the underlying statistic.

Theorem 7.28 (Ibragimov). 

Suppose X n , X are continuous random variables with densities f n ,f that are unimodal. Then \({X}_{n}{ \mathcal{L} \atop \Rightarrow } X\) if and only if \({\int \nolimits \nolimits }_{-\infty }^{\infty }\vert {f}_{n}(x) - f(x)\vert dx \rightarrow 0\).

See Reiss (1989) for this theorem. The next result for sample means of iid random variables was already given in Chapter 1; we restate it for completeness.

Theorem 7.29 (Gnedenko). 

Let X i ,i ≥ 1 be iid continuous random variables with density f(x), mean μ, and finite variance σ 2 . Let \({Z}_{n} = \frac{\sqrt{n}(\bar{X}-\mu )} {\sigma }\) , and let f n denote the density of Z n . If f is uniformly bounded, then f n converges uniformly to the standard normal density ϕ(x) on \((-\infty, \infty )\), and \({\int \nolimits \nolimits }_{-\infty }^{\infty }\vert {f}_{n}(x) - \phi (x)\vert dx \rightarrow 0\).

Remark.

This is an easily stated result covering many examples. But better results are available. Feller (1971) is an excellent reference for some of the better results, which, however, involve more complex concepts.

Example 7.37.

Let \({X}_{n} \sim N({\mu }_{n},{\sigma }_{n}^{2})\), n ≥ 1. For X n to converge in distribution, each of μ n , σ n 2 must converge. This is because, if X n does converge in distribution, then there is a CDF F(x) such that \(\Phi (\frac{x-{\mu }_{n}} {{\sigma }_{n}} ) \rightarrow F(x)\) at all continuity points of F. This implies, by selecting two suitable continuity points of F, that each of μ n , σ n must converge. If σ n converges to zero, then X n will converge to a one-point distribution. Otherwise, μ n  → μ, σ n  → σ, for some μ, σ with \(-\infty < \mu < \infty \) and 0 < σ < ∞. It follows that \(P({X}_{n} \leq x) \rightarrow \Phi (\frac{x-\mu } {\sigma } )\) for any fixed x, and so, X n converges in distribution to another normal, namely to X ∼ N(μ, σ2). Now, either by direct verification, or from Ibragimov’s theorem, we have that X n also converges to X in total variation. To summarize, if \({X}_{n} \sim N({\mu }_{n},{\sigma }_{n}^{2})\), n ≥ 1, then X n can converge in distribution only to a one-point distribution or to another normal distribution, say N(μ, σ2), in which case μ n  → μ, σ n  → σ, and convergence in total variation also holds. Conversely, if μ n  → μ, σ n  → σ > 0, then X n converges in total variation to X ∼ N(μ, σ2).

7.8 Exercises

Exercise 7.1.

  1. (a)

    Show that \({X}_{n}{ 2 \atop \rightarrow } c\) (i.e., X n converges in quadratic mean to c) if and only if E(X n  − c) and Var(X n ) both converge to zero.

  2. (b)

    Show by an example (different from text) that convergence in probability does not necessarily imply almost sure convergence.

Exercise 7.2.

  1. (a)

    Suppose E | X n  − c | α → 0, where 0 < α < 1. Does X n necessarily converge in probability to c?

  2. (b)

    Suppose \({a}_{n}({X}_{n} - \theta ){ \mathcal{L} \atop \Rightarrow } N(0,1)\). Under what condition on a n can we conclude that \({X}_{n}{ \mathcal{P} \atop \Rightarrow } \theta \)?

  3. (c)

    o p (1) + O p (1) = ?

  4. (d)

    o p (1)O p (1) = ? 

  5. (e)

    o p (1) + o p (1)O p (1) = ? 

  6. (f)

    Suppose \({X}_{n}{ \mathcal{L} \atop \Rightarrow } X.\) Then, o p (1)X n  = ?

Exercise 7.3 (Monte Carlo). 

Consider the purely mathematical problem of finding a definite integral \({\int \nolimits \nolimits }_{a}^{b}f(x)\,dx\) for some (possibly complicated) function f(x). Show that the SLLN provides a method for approximately finding the value of the integral by using appropriate averages \(\frac{1} {n}{ \sum \nolimits }_{i=1}^{n}f({X}_{i})\).

Numerical analysts call this Monte Carlo integration.

Exercise 7.4.

Suppose \({X}_{1},{X}_{2},\ldots \) are iid and that E(X 1) = μ ≠ 0, \(\mathrm{Var}({X}_{1}) = {\sigma }^{2} < \infty \). Let \({S}_{m,p} ={ \sum \nolimits }_{i=1}^{m}{X}_{i}^{p}\), m ≥ 1, p = 1, 2.

  1. (a)

    Identify with proof the almost sure limit of \(\frac{{S}_{m,1}} {{S}_{n,1}}\) for fixed m, and n → .

  2. (b)

    Identify with proof the almost sure limit of \(\frac{{S}_{n-m,1}} {{S}_{n,1}}\) for fixed m, and n → .

  3. (c)

    Identify with proof the almost sure limit of \(\frac{{S}_{n,1}} {{S}_{n,2}}\) as n → .

  4. (d)

    Identify with proof the almost sure limit of \(\frac{{S}_{n,1}} {{S}_{{n}^{2},2}}\) as n → .

Exercise 7.5.

Let A n , n ≥ 1, A be events with respect to a common sample space Ω.

  1. (a)

    Prove that \({I}_{{A}_{n}}{ \mathcal{L} \atop \Rightarrow } {I}_{A}\) if and only if P(A n ) → P(A).

  2. (b)

    Prove that \({I}_{{A}_{n}}{ 2 \atop \Rightarrow } {I}_{A}\) if and only if P(AΔA n ) → 0.

Exercise 7.6.

Suppose \(g : {\mathcal{R}}_{+} \rightarrow \mathcal{R}\) is continuous and bounded. Show that

$${e}^{-n\lambda }{ \sum \nolimits }_{k=0}^{\infty }g(\frac{k} {n})\frac{{(n\lambda )}^{k}} {k!} \rightarrow g(\lambda )$$

as n → ∞.

Exercise 7.7 * (Convergence of Medians). Suppose X n is a sequence of random variables converging in probability to a random variable X; X is absolutely continuous with a strictly positive density. Show that the medians of X n converge to the median of X.

Exercise 7.8. Suppose {A n } is an infinite sequence of independent events. Show that P(infinitely many A n occur) = 1 ⇔ P( ⋃A n ) = 1.

Exercise 7.9 * (Almost Sure Limit of Mean Absolute Deviation). Suppose X i , i ≥ 1 are iid random variables from a distribution F with \({E}_{F}(\vert X\vert ) < \infty \).

  1. (a)

    Prove that the mean absolute deviation \(\frac{1} {n}{ \sum \nolimits }_{i=1}^{n}\vert {X}_{i} -\bar{ X}\vert \) has a finite almost sure limit.

  2. (b)

    Evaluate this limit explicitly when F is standard normal.

Exercise 7.10. * Let X n be any sequence of random variables. Prove that one can always find a sequence of numbers c n such that \(\frac{{X}_{n}} {{c}_{n}}{ \mathrm{a.s.} \atop \Rightarrow } 0\).

Exercise 7.11 (Sample Maximum). 

Let X i , i ≥ 1 be an iid sequence, and X (n) the maximum of X 1, , X n . Let ξ(F) = sup{x : F(x) < 1}, where F is the common CDF of the X i . Prove that \({X}_{(n)}{ \mathrm{a.s.} \atop \Rightarrow } \xi (F)\).

Exercise 7.12.

Suppose {A n } is an infinite sequence of events. Suppose that P(A n ) ≥ δ  ∀n. Show that P(infinitely many A n occur) ≥ δ.

Exercise 7.13.

Let X i be independent N(μ, σ i 2) variables.

  1. (a)

    Find the BLUE (best linear unbiased estimate) of μ.

  2. (b)

Suppose \({\sum \nolimits }_{i=1}^{\infty }{\sigma }_{i}^{-2} = \infty \). Prove that the BLUE converges almost surely to μ.

Exercise 7.14.

Suppose X i are iid standard Cauchy. Show that

  1. (a)

    P( | X n  |  > n infinitely often) = 1,

  2. (b)

    *P( | S n  |  > n infinitely often) = 1.

Exercise 7.15.

Suppose X i are iid standard exponential. Show that \({limsup}_{n}\frac{{X}_{n}} {\log n} = 1\) with probability 1.

Exercise 7.16 * (Coupon Collection). Cereal boxes contain independently and with equal probability exactly one of n different celebrity pictures. Someone having the entire set of n pictures can cash them in for money. Let W n be the minimum number of cereal boxes one would need to purchase to own a complete set of the pictures. Find a sequence a n such that \(\frac{{W}_{n}} {{a}_{n}}{ P \atop \Rightarrow } 1\).

Hint : Approximate the mean of W n .

Exercise 7.17. Let X n  ∼ Bin(n, p). Show that (X n  ∕ n)2 and X n (X n  − 1) ∕ n(n − 1) both converge in probability to p 2. Do they also converge almost surely?

Exercise 7.18. Suppose \({X}_{1},\ldots, {X}_{n}\) are iid standard exponential variables, and let \({S}_{n} = {X}_{1} + \cdots + {X}_{n}\). Apply the Chernoff–Bernstein inequality (see Chapter 1) to show that for c > 1,

$$P({S}_{n} > cn) \leq {e}^{-n(c-1-\ln c)}$$

and hence that P(S n  > cn) → 0 exponentially fast.

Exercise 7.19. Let \({X}_{1},{X}_{2},\ldots \) be iid nonnegative random variables. Show that \(\frac{{X}_{(n)}} {n}{ \mathcal{P} \atop \Rightarrow } 0\) if and only if nP(X 1 > n) → 0.

Is this true in the normal case?

Exercise 7.20 (Failure of Weak Law). 

Let \({X}_{1},{X}_{2},\ldots \) be a sequence of independent variables, with P(X i  = i) = P(X i  =  − i) = 1 ∕ 2. Show that \(\bar{X}\) does not converge in probability to the common mean μ = 0.

Exercise 7.21.

Let \({X}_{1},{X}_{2},{X}_{3},\ldots \) be iid U[0, 1]. Let

$${G}_{n} = {({X}_{1}{X}_{2}\ldots {X}_{n})}^{1/n}.$$

Find c such that \({G}_{n}{ \mathcal{P} \atop \Rightarrow } c\).

Exercise 7.22 * (Uniform Integrability of Sample Mean). Suppose X i , i ≥ 1 are iid from some CDF F with mean zero and variance one. Find a sufficient condition on F for \(E{(\sqrt{n}\bar{X})}^{k}\) to exist and converge to E(Z k), where k is fixed, and Z ∼ N(0, 1).

Exercise 7.23 * (Sufficient Condition for Uniform Integrability). Let {X n }, n ≥ 1 be a sequence of random variables, and suppose for some function \(f : {\mathcal{R}}_{+} \rightarrow {\mathcal{R}}_{+}\) such that f is nondecreasing and \(\frac{f(x)} {x} \rightarrow \infty \) as x → ∞, we know that \({\mathrm{sup}}_{n}E[f(\vert {X}_{n}\vert )] < \infty \). Show that {X n } is uniformly integrable.

Exercise 7.24 (Uniform Integrability of IID Sequence). 

Suppose {X n } is an iid sequence with \(E(\vert {X}_{1}\vert ) < \infty \). Show that {X n } is uniformly integrable.

Exercise 7.25.

Give an example of a sequence {X n }, and an X such that \({X}_{n}{ \mathcal{L} \atop \Rightarrow } X,E({X}_{n}) \rightarrow E(X)\), but E( | X n  | ) does not converge to E( | X | ).

Exercise 7.26.

Suppose X n has a normal distribution with mean μ n and variance σ n 2. Let μ n  → μ and σ n  → σ as n → . What is the limiting distribution of X n ?

Exercise 7.27 (Delta Theorem). 

Suppose \({X}_{1},{X}_{2},\ldots \) are iid with mean μ and variance σ2, a finite fourth moment, and let Z ∼ N(0, 1).

  1. (a)

    Show that \(\sqrt{n}({\overline{X}}^{2} - {\mu }^{2}){ \mathcal{L} \atop \Rightarrow } 2\mu \sigma Z\).

  2. (b)

    Show that \(\sqrt{n}({e}^{\overline{X}} - {e}^{\mu }){ \mathcal{L} \atop \Rightarrow } {e}^{\mu }Z\).

  3. (c)

    Show that \(\sqrt{n}(\log (1/n{\sum \nolimits }_{i=1}^{n}{({X}_{i} -\overline{X})}^{2}) -\log {\sigma }^{2}){ \mathcal{L} \atop \Rightarrow } (1/{\sigma }^{2}){(E{X}_{ 1}^{4})}^{1/2}Z\).

Exercise 7.28 (Asymptotic Variance and True Variance). 

Let \({X}_{1},{X}_{2},\ldots \) be iid observations from a CDF F with four finite moments. For each of the following cases, find the exact variance of \({m}_{2} = \frac{1} {n}{ \sum \nolimits }_{i=1}^{n}{({X}_{i} -\bar{ X})}^{2}\) by using the formula in the text, and also find the asymptotic variance by using the formula in the text. Check when the true variance is larger than the asymptotic variance.

  1. (a)

    F = N(μ, σ2).

  2. (b)

    F = Exp(λ).

  3. (c)

    F = Poi(λ).

Exercise 7.29 (All Distributions as Limits of Discrete). 

Show that any distribution on R d is the limit in distribution of distributions on R d that are purely discrete with finitely many values.

Exercise 7.30 (Conceptual). 

Suppose \({X}_{n}{ \mathcal{L} \atop \Rightarrow } X\), and also \({Y }_{n}{ \mathcal{L} \atop \Rightarrow } X\). Does this mean that X n  − Y n converge in distribution to (the point mass at) zero?

Exercise 7.31.

  1. (a)

Suppose \({a}_{n}({X}_{n} - \theta ){ \mathcal{L} \atop \Rightarrow } N(0,{\tau }^{2})\); what can be said about the limiting distribution of | X n  | when θ ≠ 0, and when θ = 0?

  2. (b)

    * Suppose X i are iid Bernoulli(p); what can be said about the limiting distribution of the sample variance s 2 when \(p = \frac{1} {2};p\neq \frac{1} {2}\)?

Exercise 7.32 (Delta Theorem). 

Suppose \({X}_{1},{X}_{2},\ldots \) are iid Poi(λ). Find the limiting distribution of \({e}^{-\bar{X}}\).

Remark.

It is meant that on suitable centering and norming, you will get a nondegenerate limiting distribution.

Exercise 7.33 (Delta Theorem). 

Suppose \({X}_{1},{X}_{2},\ldots \) are iid N(μ, 1). Find the limiting distribution of \(\Phi (\bar{X})\), where Φ, as usual, is the standard normal CDF.

Exercise 7.34 * (Delta Theorem with Lack of Smoothness). Suppose \({X}_{1},{X}_{2},\ldots \) are iid N(μ, 1). Find the limiting distribution of \(\vert \bar{X}\vert \) when

  1. (a)

    μ≠0.

  2. (b)

    μ = 0.

Exercise 7.35 (Delta Theorem). 

For each F below, find the limiting distributions of \(\frac{\bar{X}} {s}\) and \(\frac{s} {\bar{X}}\):

(i) F = U[0, 1], (ii) F = Exp(λ), (iii) F = χ2(p).

Exercise 7.36 * (Delta Theorem). Suppose \({X}_{1},{X}_{2},\ldots \) are iid N(μ, σ2). Let

$${b}_{1} = \frac{ \frac{1} {n} \sum \nolimits {({X}_{i} -\bar{ X})}^{3}} {{\left [ \frac{1} {n} \sum \nolimits {({X}_{i} -\bar{ X})}^{2}\right ]}^{3/2}}\quad \mbox{ and}\quad {b}_{2} = \frac{ \frac{1} {n} \sum \nolimits {({X}_{i} -\bar{ X})}^{4}} {{\left [ \frac{1} {n} \sum \nolimits {({X}_{i} -\bar{ X})}^{2}\right ]}^{2}} - 3$$

be the sample skewness and kurtosis coefficients. Find the joint limiting distribution of (b 1, b 2).
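
A Monte Carlo look at the marginal limits (a sketch assuming numpy and scipy; not part of the original exercise): under normality, the classical values of the asymptotic variances of \(\sqrt{n}\,{b}_{1}\) and \(\sqrt{n}\,{b}_{2}\) are 6 and 24.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Under normality, sqrt(n)*b1 is approximately N(0, 6) and sqrt(n)*b2
# approximately N(0, 24); scipy's biased estimators match the m-ratio
# definitions of b1 and b2 used in the exercise.
rng = np.random.default_rng(5)
n, reps = 500, 10000

x = rng.normal(size=(reps, n))
b1 = skew(x, axis=1)          # m3 / m2^(3/2), bias=True by default
b2 = kurtosis(x, axis=1)      # m4 / m2^2 - 3 (Fisher), bias=True by default
print("Var(sqrt(n)*b1):", n * b1.var(), " (target 6)")
print("Var(sqrt(n)*b2):", n * b2.var(), " (target 24)")
```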

Exercise 7.37 * (Slutsky). Let X n , Y m be independent Poisson variables with means n and m respectively, m, n ≥ 1. Find the limiting distribution of \(\frac{{X}_{n}\,-\,{Y }_{m}\,-\,(n-m)} {\sqrt{{X}_{n } +{Y }_{m}}}\) as n, m → ∞.
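
A quick empirical check (a sketch assuming numpy; not part of the original exercise) that the statistic standardizes to an approximate N(0, 1):

```python
import numpy as np

# Monte Carlo check that (X_n - Y_m - (n - m)) / sqrt(X_n + Y_m) is
# approximately standard normal for large n, m (CLT plus Slutsky).
rng = np.random.default_rng(3)
n, m, reps = 4000, 2500, 50000

X = rng.poisson(n, size=reps).astype(float)
Y = rng.poisson(m, size=reps).astype(float)
stat = (X - Y - (n - m)) / np.sqrt(X + Y)

print("mean :", stat.mean())   # should be near 0
print("sd   :", stat.std())    # should be near 1
```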

Exercise 7.38 (Approximation of Mean and Variance). 

Let X ∼ Bin(n, p). Find the first- and the second-order approximation to the mean and variance of \(\frac{X} {n-X}\).

Exercise 7.39 (Approximation of Mean and Variance). 

Let X ∼ Poi(λ). Find the first- and the second-order approximation to the mean and variance of e  − X. Compare with the exact mean and variance, which can be obtained from the mgf of X.
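
A numerical sketch of the comparison (assuming numpy; not part of the original exercise), using \(E({e}^{-tX}) = {e}^{\lambda ({e}^{-t}-1)}\) for the exact values; λ = 1 is an arbitrary illustrative choice.

```python
import numpy as np

# For X ~ Poi(lam) and g(X) = exp(-X): compare the Taylor approximations
#   E g(X)   ~ g(lam) + g''(lam)*lam/2 = exp(-lam)*(1 + lam/2)
#   Var g(X) ~ g'(lam)^2 * lam         = lam * exp(-2*lam)
# with the exact values from E(exp(-tX)) = exp(lam*(exp(-t) - 1)).
lam = 1.0

exact_mean = np.exp(lam * (np.exp(-1) - 1))
exact_var = np.exp(lam * (np.exp(-2) - 1)) - exact_mean**2

approx_mean_1 = np.exp(-lam)                   # first order
approx_mean_2 = np.exp(-lam) * (1 + lam / 2)   # second order
approx_var_1 = lam * np.exp(-2 * lam)          # first order

print("mean  exact / 1st / 2nd :", exact_mean, approx_mean_1, approx_mean_2)
print("var   exact / 1st       :", exact_var, approx_var_1)
```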

Exercise 7.40 (Approximation of Mean and Variance). 

Let X 1, …, X n be iid N(μ, σ2). Find the first- and the second-order approximation to the mean and variance of \(\Phi (\bar{X})\).

Exercise 7.41 * (Approximation of Mean and Variance). Let X ∼ Bin(n, p). Find the first- and the second-order approximation to the mean and variance of \(\arcsin (\sqrt{\frac{X} {n}} )\).

Exercise 7.42 * (Approximation of Mean and Variance). Let X ∼ Poi(λ). Find the first- and the second-order approximation to the mean and variance of \(\sqrt{X}\).

Exercise 7.43 * (Multidimensional Approximation of Mean). Let X be a d-dimensional random vector. Find a first- and second-order approximation to the mean of \(\sqrt{ {\sum \nolimits }_{i=1}^{d}{X}_{i}^{4}}\).

Exercise 7.44 * (Expected Length of Poisson Confidence Interval). In Chapter 1, the approximate 95% confidence interval \(X + 1.92 \pm \sqrt{3.69 + 3.84X}\) for a Poisson mean λ was derived. Find a first- and second-order approximation to the expected length of this confidence interval.

Exercise 7.45 * (Expected Length of the t Confidence Interval). The modified t confidence interval for a population mean μ has the limits \(\bar{X} \pm {z}_{\alpha /2} \frac{s} {\sqrt{n}}\), where \(\bar{X}\) and s are the mean and the standard deviation of an iid sample of size n, and \({z}_{\alpha /2} = {\Phi }^{-1}(\frac{\alpha } {2} )\). Find a first- and a second-order approximation to the expected length of the modified t confidence interval when the population distribution is

  1. (a)

    N(μ, σ2).

  2. (b)

    Exp(μ).

  3. (c)

    U[μ − 1, μ + 1].

Exercise 7.46 * (Coefficient of Variation). Given a set of positive iid random variables X 1, X 2, , X n , the coefficient of variation (CV) is defined as \(\mbox{ CV}\, = \frac{s} {\bar{X}}\). Find a second-order approximation to its mean, and a first-order approximation to its variance, in terms of suitable moments of the distribution of the X i . Make a note of how many finite moments you need for each approximation to make sense.

Exercise 7.47 * (Variance-Stabilizing Transformation). Let X i , i ≥ 1, be iid Poi(λ). (An empirical check of part (a) is sketched after this exercise.)

  1. (a)

    Show that, for each a, b, \(\sqrt{\frac{{\sum \nolimits }_{i=1}^{n}{X}_{i}+a} {n+b}}\) is a variance-stabilizing transformation.

  2. (b)

    Find the first- and the second-order approximation to the mean of

    \(\sqrt{\frac{{\sum \nolimits }_{i=1}^{n}{X}_{i}+a} {n+b}}\).

  3. (c)

    Are there some particular choices of a, b that make the approximation

    $$E\left (\sqrt{\frac{{\sum \nolimits }_{i=1}^{n}{X}_{i} + a} {n + b}} \,\right ) \approx \sqrt{\lambda }$$

    more accurate? Justify your answer.
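
An empirical check of part (a) (a sketch assuming numpy; not part of the original exercise): with a = b = 0 for simplicity, n · Var of the transformed statistic should be roughly 1 ∕ 4 across different values of λ.

```python
import numpy as np

# Variance stabilization: Var( sqrt((sum X_i + a)/(n + b)) ) should be
# about 1/(4n), independently of lambda.  Here a = b = 0 for simplicity.
rng = np.random.default_rng(4)
n, reps = 200, 20000

for lam in (0.5, 2.0, 10.0):
    s = rng.poisson(lam, size=(reps, n)).sum(axis=1)
    t = np.sqrt(s / n)
    print(f"lambda={lam:5.1f}   n*Var={n * t.var():.4f}   (target 0.25)")
```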

Exercise 7.48 * (Variance-Stabilizing Transformation). Let X i , i ≥ 1, be iid Ber(p).

  1. (a)

    Show that, for each a, b, \(\arcsin (\sqrt{\frac{{\sum \nolimits }_{i=1}^{n}{X}_{i}+a} {n+b}} )\) is a variance-stabilizing transformation.

  2. (b)

    Find the first- and the second-order approximation to the mean of

    \(\arcsin (\sqrt{\frac{{\sum \nolimits }_{i=1}^{n}{X}_{i}+a} {n+b}} )\).

  3. (c)

    Are there some particular choices of a, b that make the approximation

    $$E\left [\arcsin \left (\sqrt{\frac{{\sum \nolimits }_{i=1}^{n}{X}_{i} + a} {n + b}} \,\right )\right ] \approx \arcsin \left (\sqrt{p}\,\right )$$

    more accurate? Justify your answer.

Exercise 7.49. For each of the following cases, evaluate the total variation distance between the indicated distributions:

  1. (a)

    N(0, 1) and C(0, 1).

  2. (b)

    N(0, 1) and \(N(0,1{0}^{4})\).

  3. (c)

    C(0, 1) and \(C(0,1{0}^{4})\).

Exercise 7.50 (Plotting the Variation Distance). 

Calculate and plot (as a function of μ) d TV (X, Y ) if X ∼ N(0, 1), Y ∼ N(μ, 1).
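
A short computational sketch (assuming scipy and matplotlib; not part of the original exercise). Under the convention \({d}_{TV }(X,Y ) {=\sup }_{A}\vert P(X \in A) - P(Y \in A)\vert \), the two normal densities cross at x = μ ∕ 2, which gives the closed form used below.

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# d_TV(N(0,1), N(mu,1)): the densities cross at x = mu/2, so the distance
# equals Phi(|mu|/2) - Phi(-|mu|/2) under the sup_A |P(A) - Q(A)| convention.
mu = np.linspace(-6, 6, 401)
dtv = norm.cdf(np.abs(mu) / 2) - norm.cdf(-np.abs(mu) / 2)

plt.plot(mu, dtv)
plt.xlabel(r"$\mu$")
plt.ylabel(r"$d_{TV}(N(0,1),\ N(\mu,1))$")
plt.show()
```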

Exercise 7.51 (Convergence of Densities). 

Let Z ∼ N(0, 1) and Y independent of Z. Let \({X}_{n} = Z + \frac{Y } {n}, n \geq 1\).

  1. (a)

    Prove by direct calculation that the density of X n converges pointwise to the standard normal density in each of the following cases.

    1. (i)

      Y ∼ N(0, 1).

    2. (ii)

      Y ∼ U[0, 1].

    3. (iii)

      Y ∼ Exp(1).

  2. (b)

    Hence, or by using Ibragimov’s theorem, prove that X n  → Z in total variation.

Exercise 7.52.

Show that \({d}_{TV }(X,Y ) \leq P(X\neq Y )\).

Exercise 7.53.

Suppose X 1, X 2,  are iid Exp(1). Does \(\sqrt{n}(\bar{X} - 1)\) converge to standard normal in total variation? Prove or disprove.

Exercise 7.54 * (Minimization of Variation Distance). Let X ∼ U[ − a, a] and Y ∼ N(0, 1). Find a that minimizes d TV (X, Y ).

Exercise 7.55. Let X, Y be integer-valued random variables. Show that \({d}_{TV }(X,Y ) = \frac{1} {2}{ \sum \nolimits }_{k}\vert P(X = k) - P(Y = k)\vert \).
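
The identity makes numerical evaluation easy for lattice distributions; here is a small sketch (assuming scipy; not part of the original exercise) with two Poisson laws chosen arbitrarily for illustration.

```python
from scipy.stats import poisson

# d_TV = (1/2) * sum_k |P(X=k) - P(Y=k)| for integer-valued X, Y.
# Illustration with X ~ Poi(1), Y ~ Poi(2); the pmf mass beyond k = 60
# is negligible at these means, so the sum is truncated there.
dtv = 0.5 * sum(abs(poisson.pmf(k, 1.0) - poisson.pmf(k, 2.0))
                for k in range(61))
print("d_TV(Poi(1), Poi(2)) ~", round(dtv, 4))
```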