Abstract
Distributed statistical inference has recently attracted enormous attention. Much of the existing work focuses on the averaging estimator, e.g., Zhang and Duchi (J Mach Learn Res 14:3321–3363, 2013), among many others. We propose a one-step approach that enhances the simple-averaging distributed estimator with a single Newton–Raphson update. We derive the asymptotic properties of the newly proposed estimator and find that it enjoys the same asymptotic properties as the idealized centralized estimator. In particular, asymptotic normality is established for the proposed estimator, while other competitors may not enjoy the same property. The one-step approach requires only one additional round of communication relative to the averaging estimator, so the extra communication burden is insignificant. It also leads to a lower upper bound on the mean squared error than the alternatives. In finite samples, numerical examples show that the proposed estimator outperforms the simple averaging estimator by a large margin in terms of sample mean squared error. A potential application of the one-step approach is to use multiple machines to speed up large-scale statistical inference with little compromise in the quality of the estimator. The proposed method becomes even more valuable when the data are only available on distributed machines with limited communication bandwidth.
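As a concrete illustration of the procedure described above, here is a minimal numpy sketch (our own, not the paper's code; logistic regression is our choice of M-estimation problem and all simulation settings are assumptions): split the data across k machines, fit a local M-estimator on each, average them, then apply a single Newton–Raphson update using the global gradient and Hessian.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_hess(theta, X, y):
    """Gradient and Hessian of the average log-likelihood M_i(theta)."""
    p = sigmoid(X @ theta)
    g = X.T @ (y - p) / len(y)
    H = -(X.T * (p * (1 - p))) @ X / len(y)   # negative definite
    return g, H

def local_mle(X, y, iters=25):
    """Local M-estimator via Newton's method."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        g, H = grad_hess(theta, X, y)
        theta -= np.linalg.solve(H, g)
    return theta

rng = np.random.default_rng(0)
theta0 = np.array([1.0, -0.5])                 # true parameter
N, k = 20000, 20                               # total samples, machines
X = rng.normal(size=(N, 2))
y = rng.binomial(1, sigmoid(X @ theta0))

# Simple averaging estimator theta^(0)
Xs, ys = np.array_split(X, k), np.array_split(y, k)
theta_avg = np.mean([local_mle(Xi, yi) for Xi, yi in zip(Xs, ys)], axis=0)

# One-step estimator theta^(1): one global Newton-Raphson update,
# which costs only one extra round of communication.
g, H = grad_hess(theta_avg, X, y)
theta_one = theta_avg - np.linalg.solve(H, g)
```

On simulated data of this kind the one-step update brings the averaging estimator essentially back to the accuracy of the centralized MLE, at the cost of one extra communication round (the local gradients and Hessians evaluated at \(\theta^{(0)}\)).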
References
Battey, H., Fan, J., Liu, H., Lu, J., Zhu, Z.: Distributed estimation and inference with statistical guarantees (2015). arXiv preprint arXiv:1509.05457
Bickel, P.J.: One-step Huber estimates in the linear model. J. Am. Stat. Assoc. 70(350), 428–434 (1975)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Chen, S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20(1), 33–61 (1998)
Chen, X., Xie, M.: A split-and-conquer approach for analysis of extraordinarily large data. Stat. Sin. 24, 1655–1684 (2014)
Corbett, J.C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., et al.: Spanner: Google's globally distributed database. In: Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (2012)
Fan, J., Chen, J.: One-step local quasi-likelihood estimation. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 61(4), 927–943 (1999)
Jaggi, M., Smith, V., Takác, M., Terhorst, J., Krishnan, S., Hofmann, T., Jordan, M.I.: Communication-efficient distributed dual coordinate ascent. In: Advances in Neural Information Processing Systems, pp. 3068–3076 (2014)
Jordan, M.I., Lee, J.D., Yang, Y.: Communication-efficient distributed statistical inference. J. Am. Stat. Assoc. (2018)
Kushner, H., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications, vol. 35. Springer, Berlin (2003)
Lang, S.: Real and Functional Analysis, vol. 142. Springer, Berlin (1993)
Lee, J.D., Sun, Y., Liu, Q., Taylor, J.E.: Communication-efficient sparse regression: a one-shot approach (2015). arXiv preprint arXiv:1503.04337
Liu, Q., Ihler, A.T.: Distributed estimation, information loss and exponential families. In: Advances in Neural Information Processing Systems, pp. 1098–1106 (2014)
Mitra, S., Agrawal, M., Yadav, A., Carlsson, N., Eager, D., Mahanti, A.: Characterizing web-based video sharing workloads. ACM Trans. Web 5(2), 8 (2011)
Rosenblatt, J., Nadler, B.: On the optimality of averaging in distributed statistical learning (2014). arXiv preprint arXiv:1407.2724
Rosenthal, H.P.: On the subspaces of \({L}^p\) (\(p>2\)) spanned by sequences of independent random variables. Isr. J. Math. 8(3), 273–303 (1970)
Shamir, O., Srebro, N., Zhang, T.: Communication-efficient distributed optimization using an approximate Newton-type method. In: Proceedings of the 31st International Conference on Machine Learning, pp. 1000–1008 (2014)
Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia (2009)
Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
van der Vaart, A.W.: Asymptotic Statistics (Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge University Press, Cambridge (2000)
Zhang, Y., Duchi, J.C., Wainwright, M.J.: Communication-efficient algorithms for statistical optimization. J. Mach. Learn. Res. 14, 3321–3363 (2013)
Zinkevich, M., Weimer, M., Li, L., Smola, A.J.: Parallelized stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 2595–2603 (2010)
Zou, H., Li, R.: One-step sparse estimates in nonconcave penalized likelihood models. Ann. Stat. 36(4), 1509–1533 (2008)
Appendices
Appendix overview
The appendix is organized as follows. In Section A, we derive upper bounds on sums of i.i.d. random vectors and random matrices, which will be useful in later proofs. In Section B, we derive the error bounds of the local M-estimators and the simple averaging estimator. We present the proofs of Theorems 7 and 10 in Sections C and D, respectively. The proof of Corollary 15 is given in Section E.
A Bounds on gradient and Hessian
In order to establish the convergence of the gradients and Hessians of the empirical criterion function to those of the population criterion function, which is essential for the later proofs, we first present some results on upper bounds for sums of i.i.d. random vectors and random matrices. We start by stating a useful inequality on sums of independent random variables from Rosenthal [16].
Lemma 16
(Rosenthal’s Inequality, [16], Theorem 3) For \(q > 2\), there exists a constant C(q) depending only on q such that if \(X_1,\ldots ,X_n\) are independent random variables with \({\mathbb {E}}[X_j] = 0\) and \({\mathbb {E}}[|X_j|^q] < \infty \) for all j, then
$$\begin{aligned} {\mathbb {E}}\left[ \left| \sum _{j=1}^n X_j \right| ^q \right] \le [C(q)]^q \max \left\{ \sum _{j=1}^n {\mathbb {E}}[|X_j|^q] , \left( \sum _{j=1}^n {\mathbb {E}}[|X_j|^2] \right) ^{q/2} \right\} . \end{aligned}$$
Equipped with the above lemma, we can bound the moments of mean of random vectors.
Lemma 17
Let \(X_1, \ldots , X_n \in {\mathbb {R}}^d\) be i.i.d. random vectors with \({\mathbb {E}}[X_i] = {\mathbf {0}}\), and suppose there exist constants \(G>0\) and \(q_0 \ge 2\) such that \({\mathbb {E}} [\Vert X_i \Vert ^{q_0}] < G^{q_0}\). Let \(\overline{X }= \frac{1}{n}\sum _{i=1}^n X_i\); then for \(1 \le q \le q_0\), we have
$$\begin{aligned} {\mathbb {E}} [\Vert \overline{X }\Vert ^q] \le C_v(q,d) \frac{G^q}{n^{q/2}}, \end{aligned}$$where \(C_v(q,d)\) is a constant depending solely on q and d.
Proof
The main idea of this proof is to transform the sum of random vectors into the sum of random variables and then apply Lemma 16. Let \(X_{i,j}\) denote the j-th component of \(X_i\) and \(\overline{X }_j\) denote the j-th component of \(\overline{X }\).
(1) Let us start with the simpler case \(q=2\).
$$\begin{aligned} {\mathbb {E}} [\Vert \overline{X }\Vert ^2] = \sum _{j=1}^d \sum _{i=1}^n {\mathbb {E}}[| X_{i,j}/n |^2] = \sum _{j=1}^d {\mathbb {E}}[|X_{1,j}|^2]/n = n^{-1} {\mathbb {E}}[\Vert X_1\Vert ^2] \le n^{-1}G^2. \end{aligned}$$The last inequality holds because \({\mathbb {E}}[\Vert X_{1}\Vert ^q] \le ({\mathbb {E}}[\Vert X_{1}\Vert ^{q_0}])^{q/q_0} \le G^q\) for \(1 \le q \le q_0\) by Hölder’s inequality.
(2) When \(1 \le q < 2\), we have
$$\begin{aligned} {\mathbb {E}} [\Vert \overline{X }\Vert ^q] \le ({\mathbb {E}} [\Vert \overline{X }\Vert ^2])^{q/2} \le n^{-q/2}G^q. \end{aligned}$$
(3) For \(2 < q \le q_0\), with some simple algebra, we have
$$\begin{aligned} {\mathbb {E}} [\Vert \overline{X }\Vert ^q]&= {\mathbb {E}} \left[ \left( \sum _{j=1}^d |\overline{X }_j|^2\right) ^{q/2} \right] \le {\mathbb {E}} \left[ \left( d \max _{1 \le j \le d} |\overline{X }_j|^2\right) ^{q/2} \right] \\&= d^{q/2} {\mathbb {E}} \left[ \max _{1 \le j \le d} |\overline{X }_j|^q \right] \le d^{q/2} {\mathbb {E}} \left[ \sum _{j=1}^d |\overline{X }_j|^q \right] = d^{q/2} \sum _{j=1}^d {\mathbb {E}} \left[ |\overline{X }_j|^q \right] . \end{aligned}$$As a continuation, we have
$$\begin{aligned}&{\mathbb {E}} [\Vert \overline{X }\Vert ^q] \le d^{q/2} \sum _{j=1}^d {\mathbb {E}}[|\overline{X }_j|^q] \\&\quad \le d^{q/2} \sum _{j=1}^d [C(q)]^q \max \left\{ \sum _{i=1}^n {\mathbb {E}}[|X_{i,j}/n|^q] , \left( \sum _{i=1}^n {\mathbb {E}}[ |X_{i,j}/n|^2 ]\right) ^{q/2} \right\} \qquad \text {(Lemma }16)\\&\quad = d^{q/2} [C(q)]^q \sum _{j=1}^d \max \left\{ \frac{{\mathbb {E}}[|X_{1,j}|^q]}{n^{q-1}} , \frac{({\mathbb {E}}[|X_{1,j}|^2])^{q/2}}{n^{q/2}} \right\} \\&\quad \le d^{q/2+1} [C(q)]^q \max \left\{ \frac{{\mathbb {E}}[\Vert X_{1}\Vert ^q]}{n^{q-1}} , \frac{({\mathbb {E}}[\Vert X_{1}\Vert ^2])^{q/2}}{n^{q/2}} \right\} \qquad \text {(since }{\mathbb {E}}[|X_{1,j}|^q] \le {\mathbb {E}}[\Vert X_{1}\Vert ^q]\text {)}\\&\quad \le d^{q/2+1} [C(q)]^q \max \left\{ \frac{G^q}{n^{q-1}} , \frac{G^q}{n^{q/2}} \right\} \qquad \text {(H}\ddot{\mathrm{o}}\text {lder's inequality)}\\&\quad = \frac{d^{q/2+1} [C(q)]^q}{n^{q/2}}G^q. \qquad (q-1>q/2 \text { when }q>2) \end{aligned}$$To complete this proof, we just need to set \(C_v(q,d)=d^{q/2+1} [C(q)]^q\). \(\square \)
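As a quick Monte Carlo sanity check of this \(n^{-q/2}\) scaling (our own simulation, not from the paper): for i.i.d. standard normal vectors, quadrupling n should divide \({\mathbb {E}}[\Vert \overline{X }\Vert ^q]\) by about \(4^{q/2}\).

```python
import numpy as np

rng = np.random.default_rng(1)
d, q, reps = 3, 4.0, 5000

def qth_moment(n):
    # reps independent draws of Xbar = mean of n standard normal d-vectors
    xbar = rng.normal(size=(reps, n, d)).mean(axis=1)
    return np.mean(np.linalg.norm(xbar, axis=1) ** q)

ratio = qth_moment(100) / qth_moment(400)  # expect about 4^{q/2} = 16
```

For Gaussian vectors the ratio is exactly \(16\) in expectation (since \({\mathbb {E}}[\Vert \overline{X }\Vert ^4] = (d^2+2d)/n^2\)); the Monte Carlo estimate lands close to it.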
To bound the moments of the mean of i.i.d. random matrices, let us consider another matrix norm, the Frobenius norm \({\left| \left| \left| \cdot \right| \right| \right| }_F\), i.e., \({\left| \left| \left| A \right| \right| \right| }_F = \sqrt{\sum _{i,j} | a_{ij} |^2}, \forall A \in {\mathbb {R}}^{d \times d}\). Recall that we also use the matrix norm \({\left| \left| \left| \cdot \right| \right| \right| }\), defined as the maximal singular value, i.e., \({\left| \left| \left| A \right| \right| \right| } = \sup _{u: u \in R^d, \Vert u\Vert \le 1} \Vert Au \Vert \). One can easily show that
$$\begin{aligned} {\left| \left| \left| A \right| \right| \right| } \le {\left| \left| \left| A \right| \right| \right| }_F \le \sqrt{d} {\left| \left| \left| A \right| \right| \right| }, \quad \forall A \in {\mathbb {R}}^{d \times d}. \end{aligned}$$
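The two norm comparisons \({\left| \left| \left| A \right| \right| \right| } \le {\left| \left| \left| A \right| \right| \right| }_F \le \sqrt{d} {\left| \left| \left| A \right| \right| \right| }\) are standard; a purely illustrative numerical spot check (ours, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
for _ in range(200):
    A = rng.normal(size=(d, d))
    op = np.linalg.norm(A, 2)        # spectral norm = maximal singular value
    fro = np.linalg.norm(A, "fro")   # Frobenius norm
    assert op <= fro + 1e-9
    assert fro <= np.sqrt(d) * op + 1e-9
```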
With Frobenius norm, we can regard a random matrix \(X \in {\mathbb {R}}^{d \times d}\) as a random vector in \({\mathbb {R}}^{d^2}\) and apply Lemma 17 to obtain the following lemma.
Lemma 18
Let \(X_1,\ldots ,X_n \in {\mathbb {R}}^{d \times d}\) be i.i.d. random matrices with \({\mathbb {E}}[X_i] = {\mathbf {0}}_{d \times d}\). Let \({\left| \left| \left| X_i \right| \right| \right| }\) denote the norm of \(X_i\), defined as its maximal singular value. Suppose \({\mathbb {E}} [{\left| \left| \left| X_i \right| \right| \right| }^{q_0}] \le H^{q_0}\), where \(q_0 \ge 2\) and \(H>0\). Then for \(\overline{X }= \frac{1}{n}\sum _{i=1}^n X_i\) and \(1 \le q \le q_0\), we have
$$\begin{aligned} {\mathbb {E}} [{\left| \left| \left| \overline{X } \right| \right| \right| }^q] \le C_m(q,d) \frac{H^q}{n^{q/2}}, \end{aligned}$$where \(C_m(q,d)\) is a constant depending on q and d only.
Proof
By the fact \({\left| \left| \left| A \right| \right| \right| }_F \le \sqrt{d} {\left| \left| \left| A \right| \right| \right| }\), we have \({\mathbb {E}} \left[ {\left| \left| \left| X_i \right| \right| \right| }_F^{q_0} \right] \le {\mathbb {E}} \left[ \left( \sqrt{d} {\left| \left| \left| X_i \right| \right| \right| }\right) ^{q_0} \right] \le (\sqrt{d}H)^{q_0}\). Then, by the fact \( {\left| \left| \left| A \right| \right| \right| } \le {\left| \left| \left| A \right| \right| \right| }_F\) and Lemma 17, we have
$$\begin{aligned} {\mathbb {E}} \left[ {\left| \left| \left| \overline{X } \right| \right| \right| }^q \right] \le {\mathbb {E}} \left[ {\left| \left| \left| \overline{X } \right| \right| \right| }_F^q \right] \le C_v(q,d^2) \frac{(\sqrt{d}H)^q}{n^{q/2}} = C_v(q,d^2) d^{\frac{q}{2}} \frac{H^q}{n^{q/2}}. \end{aligned}$$
In the second inequality, we treat \(\overline{X }\) as a \(d^2\)-dimensional random vector and then apply Lemma 17. Then the proof can be completed by setting \(C_m(q,d) = C_v(q,d^2)d^\frac{q}{2}\). \(\square \)
B Error bound of local M-estimator and simple averaging estimator
Since the simple averaging estimator is the average of all local estimators and the one-step estimator is a single Newton–Raphson update from the simple averaging estimator, it is natural to study the upper bounds of the mean squared error (MSE) of a local M-estimator and of the simple averaging estimator. The main idea in the following proof is similar to the thread of the proof of Theorem 1 in Zhang and Duchi [21], but the conclusions are different. Besides, in the following proof, we use a correct analogue of the mean value theorem for vector-valued functions.
1.1 B.1 Bound the error of local M-estimators \(\theta _i, i=1,\ldots ,k\)
In this subsection, we analyze the mean squared error of a local estimator \(\theta _i = \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\theta \in \varTheta } M_i(\theta )\), \(i = 1, \ldots , k\), and prove the following lemma in the rest of this subsection.
Lemma 19
Let \(\varSigma = {\ddot{M}}_0(\theta _0)^{-1} {\mathbb {E}}[{\dot{m}}(X;\theta _0){\dot{m}}(X;\theta _0)^t] {\ddot{M}}_0(\theta _0)^{-1}\), where the expectation is taken with respect to X. Under Assumptions 3, 4, 5 and 6, for each \(i=1,\ldots ,k\), we have
$$\begin{aligned} {\mathbb {E}} [\Vert \theta _i - \theta _0 \Vert ^2] \le \frac{2}{n} \hbox {Tr}(\varSigma ) + O(n^{-2}). \end{aligned}$$
Since \({\dot{M}}_i(\theta _i) = 0\), by Theorem 4.2 in Chapter XIII of Lang [11], we have
$$\begin{aligned} -{\dot{M}}_i(\theta _0)&= \left[ \int _{0}^1 {\ddot{M}}_i((1 - \rho ) \theta _0 +\rho \theta _i) d\rho \right] [\theta _i - \theta _0] \\&= {\ddot{M}}_0(\theta _0) [\theta _i - \theta _0] + [{\ddot{M}}_i(\theta _0) - {\ddot{M}}_0(\theta _0)] [\theta _i - \theta _0] \\&\quad + \left[ \int _{0}^1 {\ddot{M}}_i((1 - \rho ) \theta _0 +\rho \theta _i) d\rho - {\ddot{M}}_i(\theta _0)\right] [\theta _i - \theta _0]. \end{aligned}$$
If the last two terms in the above equation are reasonably small, the lemma follows immediately. Our strategy is therefore as follows. First, we show that the mean squared errors of both \([\int _{0}^1 {\ddot{M}}_i((1 - \rho ) \theta _0 +\rho \theta _i) d\rho \, - {\ddot{M}}_i(\theta _0)] [\theta _i - \theta _0]\) and \([{\ddot{M}}_i(\theta _0) - {\ddot{M}}_0(\theta _0)] [\theta _i - \theta _0]\) are small under certain “good” events. Then we show that the probability of the “bad” events is small enough. Lemma 19 then follows from the fact that \(\varTheta \) is compact.
Suppose \(S_i = \{x_1,\ldots ,x_n\}\) is the data set on local machine i. Let us define some good events:
where \(\delta ' = \min (\delta ,\frac{\lambda }{8L})\), \(\lambda \) is the constant in Assumption 5, and L and \(\delta \) are the constants in Assumption 6. We will show that events \(E_1\) and \(E_2\) ensure that \(M_i(\theta )\) is strictly concave in a neighborhood of \(\theta _0\), and that under event \(E_3\), \(\theta _i\) is fairly close to \(\theta _0\). Let \(E = E_1 \cap E_2 \cap E_3\); then we have the following lemma:
Lemma 20
Under event E, we have \(\Vert \theta _i - \theta _0 \Vert \le \frac{4}{\lambda } \Vert {\dot{M}}_i(\theta _0) \Vert \).
Proof
First, we will show \({\ddot{M}}_i(\theta )\) is a negative definite matrix over a ball centered at \(\theta _0\): \(B_{\delta '} = \{ \theta \in \varTheta : \Vert \theta - \theta _0 \Vert \le \delta ' \} \subset B_{\delta }\). For any fixed \(\theta \in B_{\delta '}\), we have
where we apply event \(E_1\), Assumption 6 and the fact that \(\delta '=\min (\delta ,\frac{\lambda }{8L})\) to the first term, and event \(E_2\) to the second term. Since \({\ddot{M}}_0(\theta _0)\) is negative definite by Assumption 5, the above inequality implies that \({\ddot{M}}_i(\theta )\) is negative definite for all \(\theta \in B_{\delta '}\) and
With negative definiteness of \({\ddot{M}}_i(\theta ), \theta \in B_{\delta '}\), event \(E_3\) and concavity of \(M_i(\theta ), \theta \in \varTheta \), we have
Thus, we know \(\Vert \theta _i - \theta _0 \Vert \le \delta '\), or equivalently, \(\theta _i \in B_{\delta '}\). Then, by applying Taylor’s Theorem to \(M_i(\theta )\) at \(\theta _0\), we have
As \(\theta _i\) is the maximizer of \(M_i(\cdot )\), we know \(M_i(\theta _0) \le M_i(\theta _i)\), thus,
which implies \(\Vert \theta _i - \theta _0 \Vert \le \frac{4}{\lambda } \Vert {\dot{M}}_i(\theta _0) \Vert .\)\(\square \)
For \(1 \le q \le 8\), we can bound \({\mathbb {E}}[ \Vert {\dot{M}}_i(\theta _0) \Vert ^q ]\) by Lemma 17 and Assumption 6,
where \(C_v(q,d)\) is a constant depending on q and d only. Then by conditioning on event E, we have
If we can show \(\hbox {Pr}(E^c) = O(n^{-\frac{q}{2}})\), then \({\mathbb {E}}[ \Vert \theta _i - \theta _0 \Vert ^q] = O(n^{-\frac{q}{2}})\) follows immediately.
Lemma 21
Under Assumption 6, we have \(\hbox {Pr}(E^c) = O(n^{-4}).\)
Proof
Under Assumption 6, by applying Lemmas 17 and 18, we can bound the moments of \({\dot{M}}_i(\theta _0)\) and \({\ddot{M}}_i(\theta _0) - {\ddot{M}}_0(\theta _0)\). Rigorously, for \(1 \le q \le 8\), we have
Therefore, by Markov’s inequality, we have
\(\square \)
Now, we have shown that for \(1 \le q \le 8\),
The term \([{\ddot{M}}_i(\theta _0) - {\ddot{M}}_0(\theta _0)] [\theta _i - \theta _0]\) is now well bounded. Next, we consider the moment bound of \(\int _{0}^1 {\ddot{M}}_i((1 - \rho ) \theta _0 +\rho \theta _i) d\rho - {\ddot{M}}_i(\theta _0)\).
Lemma 22
Under Assumption 6, for \(1 \le q \le 4\),
Proof
By Minkowski’s integral inequality, we have
For simplicity of notation, we use \(\theta ' = (1 - \rho ) \theta _0 +\rho \theta _i\) in this proof. When event E holds, we have \(\Vert \theta ' - \theta _0 \Vert = \Vert \rho (\theta _i - \theta _0) \Vert \le \rho \delta ' \le \delta \), which means that \(\theta ' \in B_{\delta }, \forall \rho \in [0,1]\). Thus, because of the convexity of the matrix norm \({\left| \left| \left| \cdot \right| \right| \right| }\), we can apply Jensen’s inequality and Assumption 6 and get
Then apply Hölder’s inequality,
When event E does not hold, \({\left| \left| \left| {\ddot{M}}_i(\theta ') - {\ddot{M}}_i(\theta _0) \right| \right| \right| }^q\) is still finite, because \(\varTheta \) is compact and \({\ddot{M}}_i(\theta )\) is continuous. By Lemma 21, the probability that event E does not hold is bounded by \(O(n^{-4})\), which implies
Therefore, we have
\(\square \)
Now, recall that we have
For the sum of the last two terms, we have
We have thus established the upper bound for the mean squared error of the local M-estimators: \({\mathbb {E}} [\Vert \theta _i - \theta _0 \Vert ^2] \le \frac{2}{n} \hbox {Tr}(\varSigma ) + O(n^{-2}), \hbox { for } i = 1,\ldots ,k\).
1.2 B.2 Bound the error of simple averaging estimator \(\theta ^{(0)}\)
Next, we study the mean squared error of the simple averaging estimator, \(\theta ^{(0)} = \frac{1}{k} \sum _{i=1}^k \theta _i.\) We start with a lemma that bounds the bias of the local M-estimators \(\theta _i, i=1,\ldots ,k\).
Lemma 23
There exists some constant \({\tilde{C}}>0\) such that for \(i=1,\ldots ,k\), we have
$$\begin{aligned} \Vert {\mathbb {E}}[\theta _i - \theta _0] \Vert \le \frac{{\tilde{C}}}{n} + O(n^{-2}), \end{aligned}$$
where \({\tilde{C}} = 16 [C_v(4,d)]^{\frac{1}{4}}\sqrt{C_v(2,d)} \lambda ^{-3} G^2L + 4 \sqrt{C_m(2,d)}\sqrt{C_v(2,d)} \lambda ^{-2} GH\).
Proof
The main idea of this proof is to use Eq. (B.11) and apply the established error bounds for the Hessian and the local M-estimators. By Eq. (B.11) and the fact that \({\mathbb {E}}[{\dot{M}}_i(\theta _0)] = 0\), we have
Then we can apply Lemmas 18 and 22, and (B.10), to bound each term; thus, we have
Let \({\tilde{C}} = 16 [C_v(4,d)]^{\frac{1}{4}}\sqrt{C_v(2,d)} \lambda ^{-3} G^2L + 4 \sqrt{C_m(2,d)}\sqrt{C_v(2,d)} \lambda ^{-2} GH\), then we have \(\Vert {\mathbb {E}}[\theta _i - \theta _0] \Vert \le \frac{{\tilde{C}}}{n} + O(n^{-2})\). \(\square \)
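The \(O(1/n)\) bias in Lemma 23 is exactly what averaging cannot remove. A toy illustration outside the paper's setting (our own example: the biased local estimator \((\bar{x}_i)^2\) of \(\mu ^2\), whose bias is \(\sigma ^2/n\)) shows that averaging across machines shrinks the variance term but leaves the bias floor:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, n, reps = 1.0, 50, 20000

def mse_of_average(k):
    # local sample means xbar_i ~ N(mu, 1/n), drawn directly;
    # the averaging estimator is the mean of the biased (xbar_i)^2
    xbar = rng.normal(mu, 1.0 / np.sqrt(n), size=(reps, k))
    est = (xbar ** 2).mean(axis=1)
    return np.mean((est - mu ** 2) ** 2)

mse_k1, mse_k100 = mse_of_average(1), mse_of_average(100)
# variance drops roughly like 1/k, but the squared bias (1/n)^2 remains
```

With k = 100 machines the MSE is far below the single-machine MSE, yet it stays above the squared-bias floor \((\sigma ^2/n)^2\), matching the \({\tilde{C}}^2 k^2 / N^2\) term that appears in the bound for \(\theta ^{(0)}\) below.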
Then, we can show that the MSE of \(\theta ^{(0)}\) can be bounded as follows.
Lemma 24
There exists some constant \({\tilde{C}}>0\) such that
$$\begin{aligned} {\mathbb {E}} [ \Vert \theta ^{(0)} - \theta _0 \Vert ^2] \le \frac{2}{N} \hbox {Tr}(\varSigma ) + \frac{{\tilde{C}}^2 k^2}{N^2} + O(kN^{-2}) + O(k^3N^{-3}), \end{aligned}$$
where \({\tilde{C}} = 16 [C_v(4,d)]^{\frac{1}{4}}\sqrt{C_v(2,d)} \lambda ^{-3} G^2L + 4 \sqrt{C_m(2,d)}\sqrt{C_v(2,d)} \lambda ^{-2} GH\).
Proof
The mean squared error of \(\theta ^{(0)}\) can be decomposed into two parts, variance and squared bias. Thus,
where the first term is well bounded by Lemma 19 and the second term could be bounded by Lemma 23. Thus, we know \({\mathbb {E}} [ \Vert \theta ^{(0)} - \theta _0 \Vert ^2] \le \frac{2}{N} \hbox {Tr}(\varSigma ) + \frac{{\tilde{C}}^2 k^2}{N^2} + O(kN^{-2}) + O(k^3N^{-3})\). More generally, for \(1 \le q \le 8\), we have
In summary, we have
\(\square \)
C Proof of Theorem 7
The whole proof can be completed in two steps: first, show that the simple averaging estimator \(\theta ^{(0)}\) is \(\sqrt{N}\)-consistent when \(k=O(\sqrt{N})\); then show the consistency and asymptotic normality of the one-step estimator \(\theta ^{(1)}\). In the first step, we need to show the following.
Lemma 25
Under Assumptions 3, 4, 5 and 6, when \(k=O(\sqrt{N})\), the simple averaging estimator \(\theta ^{(0)}\) is a \(\sqrt{N}\)-consistent estimator of \(\theta _0\), i.e., \(\sqrt{N} \Vert \theta ^{(0)} - \theta _0 \Vert = O_P(1) \text { as } N \rightarrow \infty .\)
Proof
If k is finite and does not grow with N, the proof is trivial, so we only need to consider the case \(k \rightarrow \infty \). We know that \(\Vert {\mathbb {E}}[ \sqrt{n}(\theta _i - \theta _0) ] \Vert \le O(\frac{1}{\sqrt{n}})\) by Lemma 23 and \({\mathbb {E}}[ \Vert \sqrt{n}(\theta _i - \theta _0) \Vert ^2] \le 2 \hbox {Tr}(\varSigma ) + O(n^{-1})\) by Lemma 19. By applying the Lindeberg–Lévy Central Limit Theorem, we have
It suffices to show that \(\lim _{N \rightarrow \infty } \sqrt{nk}{\mathbb {E}}[ \theta _1 - \theta _0 ]\) is finite. By Lemma 23, we have \(\Vert {\mathbb {E}}[\theta _1 - \theta _0] \Vert = O(\frac{1}{n})\), which means that \(\Vert \sqrt{nk}{\mathbb {E}}[ \theta _1 - \theta _0 ] \Vert = O(1)\) if \(k=O(\sqrt{N})=O(n)\). Thus, when \(k=O(\sqrt{N})\), \(\sqrt{N} (\theta ^{(0)} - \theta _0)\) is bounded in probability. \(\square \)
Now, we can prove Theorem 7.
Proof
By the definition of the one-step estimator \(\theta ^{(1)} = \theta ^{(0)} - {\ddot{M}}(\theta ^{(0)})^{-1} {\dot{M}}(\theta ^{(0)})\), and by Theorem 4.2 in Chapter XIII of Lang [11], we have
As shown in (B.12), for any \(\rho \in [0,1]\), when \(k = O(\sqrt{N})\), we have \(\Vert (1 - \rho ) \theta _0 +\rho \theta ^{(0)} - \theta _0 \Vert \le \rho \Vert \theta ^{(0)} - \theta _0 \Vert \xrightarrow {P} 0\). Since \({\ddot{M}}(\cdot )\) is a continuous function, \({\left| \left| \left| {\ddot{M}}(\theta ^{(0)}) - \int _{0}^1 {\ddot{M}}((1 - \rho ) \theta _0 +\rho \theta ^{(0)}) d\rho \right| \right| \right| }\xrightarrow {P} 0\). Thus, we have \(\sqrt{N} {\ddot{M}}(\theta ^{(0)}) (\theta ^{(1)} - \theta _0) = - \sqrt{N} {\dot{M}}(\theta _0) + o_P(1)\). Moreover, \({\ddot{M}}(\theta ^{(0)}) \xrightarrow {P} {\ddot{M}}_0(\theta _0)\) because \(\theta ^{(0)} \xrightarrow {P} \theta _0\) and by the Law of Large Numbers. By applying Slutsky’s Lemma, we obtain \(\sqrt{N} (\theta ^{(1)} - \theta _0) \xrightarrow {d} {\mathbf {N}}(0,\varSigma ) \hbox { as } N \rightarrow \infty \). \(\square \)
Remark 26
If we simply replace the global Hessian \({\ddot{M}}(\theta ^{(0)})^{-1}\) with a local Hessian \({\ddot{M}}_1(\theta ^{(0)})^{-1}\), it is easy to verify that every step in the above proof still holds. Thus, \(\theta ^{(2)} = \theta ^{(0)} - {\ddot{M}}_1(\theta ^{(0)})^{-1} {\dot{M}}(\theta ^{(0)})\) enjoys the same asymptotic properties as \(\theta ^{(1)}\).
D Proof of Theorem 10
Let us recall the formula for the one-step estimator, \(\theta ^{(1)} = \theta ^{(0)} - {\ddot{M}}(\theta ^{(0)})^{-1} {\dot{M}}(\theta ^{(0)})\). Then, by Theorem 4.2 in Chapter XIII of Lang [11], we have
Then we have
We will show that the last two terms are small enough. Similar to the proof of Lemma 19, we define a “good” event: \(E_4 = \{ \Vert \theta ^{(0)} - \theta _0 \Vert \le \delta \}.\) The probability of this event is close to 1 when N is large.
Lemma 27
If event \(E_4\) holds, for \(1 \le q \le 4\), we have
Proof
By Lemma 18, we know \({\mathbb {E}} \left[ {\left| \left| \left| {\ddot{M}}_0(\theta _0)- {\ddot{M}}(\theta _0) \right| \right| \right| }^q \right] \le \frac{C_m(q,d)}{N^{q/2}} H^q\). Under event \(E_4\) and Assumption 6, by applying Jensen’s inequality, we have
Thus, for \(1 \le q \le 4\), we have
As a result, we have, for \(1 \le q \le 4\),
In this proof, we let \(\theta ' = (1 - \rho ) \theta _0 +\rho \theta ^{(0)}\) for simplicity of notation. Note that \(\theta '- \theta _0 = \rho (\theta ^{(0)} - \theta _0)\); then by event \(E_4\), Assumption 6 and inequality (B.12), we have
So, we have
\(\square \)
Therefore, under event \(E_4\), for \(1 \le q \le 4\), we can bound \({\ddot{M}}_0(\theta _0)^{-1} [{\ddot{M}}_0(\theta _0)-{\ddot{M}}(\theta ^{(0)})] (\theta ^{(1)} - \theta _0)\) and \({\ddot{M}}_0(\theta _0)^{-1} [{\ddot{M}}(\theta _0) - \int _{0}^1 {\ddot{M}}((1 - \rho ) \theta _0 +\rho \theta ^{(0)}) d\rho ] (\theta ^{(0)} - \theta _0)\) as follows:
and,
And by Lemma 17, for \(1 \le q \le 8\), we have \({\mathbb {E}} [ \Vert {\dot{M}}(\theta _0) \Vert ^q ] = O(N^{-\frac{q}{2}})\). Therefore, combining the above three bounds and Eq. (D.13), we have, for \(1 \le q \le 4\),
Now, we can give tighter bounds for the first two terms in equation (D.13) by Hölder’s inequality.
and,
Now, we can finalize our proof by using Eq. (B.11) again,
Remark 28
If we replace the global Hessian \({\ddot{M}}(\theta ^{(0)})^{-1}\) with a local Hessian \({\ddot{M}}_1(\theta ^{(0)})^{-1}\), the first inequality in Lemma 27 has to be changed to \({\mathbb {E}} \left[ {\left| \left| \left| {\ddot{M}}_0(\theta _0)-{\ddot{M}}_1(\theta ^{(0)}) \right| \right| \right| }^q \right] \le O(k^{\frac{q}{2}}N^{-\frac{q}{2}})\), since the local Hessian \({\ddot{M}}_1(\cdot )\) is based on only \(n=N/k\) samples. As a result, the first two terms in Eq. (D.13) can only be bounded as
Therefore, the upper bound of mean squared error becomes \({\mathbb {E}}[ \Vert \theta ^{(1)} - \theta _0 \Vert ^2 ] \le 2 \frac{\hbox {Tr}(\varSigma )}{N} + O(kN^{-2}) + O(k^{3}N^{-3})\).
E Proof of Corollary 15
First, we present a lemma on the negative moments of a Binomial random variable, i.e., \({\mathbb {E}} \left[ \frac{1}{Z}1_{(Z>0)} \right] \) and \({\mathbb {E}} \left[ \frac{1}{Z^2} 1_{(Z>0)} \right] \), where \(Z \sim {\mathbf {B}}(k,p)\) and \({\mathbf {B}}(k,p)\) denotes the Binomial distribution with k independent trials and success probability p for each trial. These negative moments have presumably been well studied, but we could not find an appropriate reference for the upper bounds we need. So, we derive the upper bounds below, which will be useful in the proof of Corollary 15.
Lemma 29
Suppose \(Z \sim {\mathbf {B}}(k,p)\) with \(p>0\); then we have
Proof
By definition, we have
Similarly, we have
\(\square \)
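Since the displayed bounds are derived above, a rough Monte Carlo spot check (our own simulation) of the leading-order behavior \({\mathbb {E}}[Z^{-1}1_{(Z>0)}] \approx (kp)^{-1}\) and \({\mathbb {E}}[Z^{-2}1_{(Z>0)}] \approx (kp)^{-2}\):

```python
import numpy as np

rng = np.random.default_rng(4)
k, p, reps = 50, 0.8, 200000
Z = rng.binomial(k, p, size=reps)
pos = (Z > 0).astype(float)
m1 = np.mean(pos / np.maximum(Z, 1))        # E[Z^-1 1_{Z>0}]
m2 = np.mean(pos / np.maximum(Z, 1) ** 2)   # E[Z^-2 1_{Z>0}]
# both sit close to their leading terms 1/(k p) and 1/(k p)^2
```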
Now, we can prove Corollary 15.
Proof
Let the random variable Z denote the number of machines that successfully communicate with the central machine; then Z follows the Binomial distribution \(B(k,1-r)\). By the Law of Large Numbers, \(\frac{Z}{(1-r)k} \xrightarrow {P} 1\text { as } k \rightarrow \infty \). Given Z, the size of the available data is Zn. By Theorem 7, the one-step estimator \(\theta ^{(1)}\) is still asymptotically normal when \(k=O(\sqrt{N})\), i.e., \(\sqrt{Zn}(\theta ^{(1)} - \theta _0) \xrightarrow {d} N(0,\varSigma ) \text { as } n \rightarrow \infty \). Therefore, when \(k \rightarrow \infty \), we have
Since \(\frac{(1-r)N}{Zn}=\frac{(1-r)k}{Z} \xrightarrow {P} 1\), by Slutsky’s Lemma, we have \(\sqrt{(1-r)N}(\theta ^{(1)} - \theta _0) \xrightarrow {d} N(0,\varSigma )\). This result indicates that when the local machines can independently lose communication with the central machine with probability r, the one-step estimator \(\theta ^{(1)}\) shares the same asymptotic properties as the oracle M-estimator using \((1-r) \times 100\%\) of the total samples.
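A small simulation (ours; a Gaussian location model stands in for the oracle estimator that Corollary 15 says \(\theta ^{(1)}\) matches asymptotically) illustrates the variance inflation factor \(\frac{1}{1-r}\) when each machine fails independently with probability r:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, n, k, r, reps = 0.0, 100, 40, 0.3, 5000
errs = []
for _ in range(reps):
    x = rng.normal(mu, 1.0, size=(k, n))
    alive = rng.random(k) > r          # machine i survives w.p. 1 - r
    if not alive.any():
        continue                       # all machines failed (negligible prob.)
    est = x[alive].mean()              # oracle estimator on surviving data
    errs.append((est - mu) ** 2)
mse = np.mean(errs)
# mse is close to 1/((1-r) N): the variance with only (1-r) of the data
```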
Next, we analyze the mean squared error of the one-step estimator in the presence of local machine failures. Note that, when Z is fixed and known, by Theorem 10, we have
By the rule of double expectation and Lemma 29,
\(\square \)
Huang, C., Huo, X. A distributed one-step estimator. Math. Program. 174, 41–76 (2019). https://doi.org/10.1007/s10107-019-01369-0