Abstract
Distributed statistical inference has recently attracted enormous attention. Much of the existing work focuses on the averaging estimator, e.g., Zhang and Duchi (J Mach Learn Res 14:3321–3363, 2013), among many others. We propose a one-step approach that enhances the simple-averaging distributed estimator with a single Newton–Raphson update. We derive the asymptotic properties of the newly proposed estimator and find that it enjoys the same asymptotic properties as the idealized centralized estimator. In particular, asymptotic normality is established for the proposed estimator, while other competitors may not enjoy the same property. The one-step approach requires only one additional round of communication relative to the averaging estimator, so the extra communication burden is insignificant. It also leads to a lower upper bound on the mean squared error than the alternatives. In finite samples, numerical examples show that the proposed estimator outperforms the simple averaging estimator by a large margin in terms of sample mean squared error. A potential application of the one-step approach is to use multiple machines to speed up large-scale statistical inference with little compromise in the quality of the estimator. The proposed method becomes even more valuable when the data are only available on distributed machines with limited communication bandwidth.
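As a concrete illustration of the procedure described above, here is a minimal numpy sketch (our own, not the paper's code; logistic regression is our choice of M-estimation problem and all simulation settings are assumptions): split the data across k machines, fit a local M-estimator on each, average them, then apply a single Newton–Raphson update using the global gradient and Hessian.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_hess(theta, X, y):
    """Gradient and Hessian of the average log-likelihood M_i(theta)."""
    p = sigmoid(X @ theta)
    g = X.T @ (y - p) / len(y)
    H = -(X.T * (p * (1 - p))) @ X / len(y)   # negative definite
    return g, H

def local_mle(X, y, iters=25):
    """Local M-estimator via Newton's method."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        g, H = grad_hess(theta, X, y)
        theta -= np.linalg.solve(H, g)
    return theta

rng = np.random.default_rng(0)
theta0 = np.array([1.0, -0.5])                 # true parameter
N, k = 20000, 20                               # total samples, machines
X = rng.normal(size=(N, 2))
y = rng.binomial(1, sigmoid(X @ theta0))

# Simple averaging estimator theta^(0)
Xs, ys = np.array_split(X, k), np.array_split(y, k)
theta_avg = np.mean([local_mle(Xi, yi) for Xi, yi in zip(Xs, ys)], axis=0)

# One-step estimator theta^(1): one global Newton-Raphson update,
# which costs only one extra round of communication.
g, H = grad_hess(theta_avg, X, y)
theta_one = theta_avg - np.linalg.solve(H, g)
```

On simulated data of this kind the one-step update brings the averaging estimator essentially back to the accuracy of the centralized MLE, at the cost of one extra communication round (the local gradients and Hessians evaluated at \(\theta^{(0)}\)).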
References
Battey, H., Fan, J., Liu, H., Lu, J., Zhu, Z.: Distributed estimation and inference with statistical guarantees (2015). arXiv preprint arXiv:1509.05457
Bickel, P.J.: One-step Huber estimates in the linear model. J. Am. Stat. Assoc. 70(350), 428–434 (1975)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Chen, S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20(1), 33–61 (1998)
Chen, X., Xie, M.: A split-and-conquer approach for analysis of extraordinarily large data. Stat. Sin. 24, 1655–1684 (2014)
Corbett, J.C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., et al.: Spanner: Google's globally distributed database. In: Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (2012)
Fan, J., Chen, J.: One-step local quasi-likelihood estimation. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 61(4), 927–943 (1999)
Jaggi, M., Smith, V., Takác, M., Terhorst, J., Krishnan, S., Hofmann, T., Jordan, M.I.: Communication-efficient distributed dual coordinate ascent. In: Advances in Neural Information Processing Systems, pp. 3068–3076 (2014)
Jordan, M.I., Lee, J.D., Yang, Y.: Communication-efficient distributed statistical inference. J. Am. Stat. Assoc. (2018)
Kushner, H., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications, vol. 35. Springer, Berlin (2003)
Lang, S.: Real and Functional Analysis, vol. 142. Springer, Berlin (1993)
Lee, J.D., Sun, Y., Liu, Q., Taylor, J.E.: Communication-efficient sparse regression: a one-shot approach (2015). arXiv preprint arXiv:1503.04337
Liu, Q., Ihler, A.T.: Distributed estimation, information loss and exponential families. In: Advances in Neural Information Processing Systems, pp. 1098–1106 (2014)
Mitra, S., Agrawal, M., Yadav, A., Carlsson, N., Eager, D., Mahanti, A.: Characterizing web-based video sharing workloads. ACM Trans. Web 5(2), 8 (2011)
Rosenblatt, J., Nadler, B.: On the optimality of averaging in distributed statistical learning (2014). arXiv preprint arXiv:1407.2724
Rosenthal, H.P.: On the subspaces of \({L}^p\) (\(p>2\)) spanned by sequences of independent random variables. Isr. J. Math. 8(3), 273–303 (1970)
Shamir, O., Srebro, N., Zhang, T.: Communication-efficient distributed optimization using an approximate Newton-type method. In: Proceedings of the 31st International Conference on Machine Learning, pp. 1000–1008 (2014)
Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia (2009)
Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
van der Vaart, A.W.: Asymptotic Statistics (Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge University Press, Cambridge (2000)
Zhang, Y., Duchi, J.C., Wainwright, M.J.: Communication-efficient algorithms for statistical optimization. J. Mach. Learn. Res. 14, 3321–3363 (2013)
Zinkevich, M., Weimer, M., Li, L., Smola, A.J.: Parallelized stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 2595–2603 (2010)
Zou, H., Li, R.: One-step sparse estimates in nonconcave penalized likelihood models. Ann. Stat. 36(4), 1509–1533 (2008)
Appendices
Appendix overview
The appendix is organized as follows. In Section A, we derive upper bounds on sums of i.i.d. random vectors and random matrices, which will be useful in later proofs. In Section B, we derive the error bounds of the local M-estimators and the simple averaging estimator. We present the proofs of Theorems 7 and 10 in Sections C and D, respectively. The proof of Corollary 15 is given in Section E.
A Bounds on gradient and Hessian
In order to establish the convergence of the gradients and Hessians of the empirical criterion function to those of the population criterion function, which is essential for the later proofs, we first present some results on upper bounds for sums of i.i.d. random vectors and random matrices. We start by stating a useful inequality on sums of independent random variables from Rosenthal [16].
Lemma 16
(Rosenthal’s Inequality, [16], Theorem 3) For \(q > 2\), there exists a constant C(q) depending only on q such that if \(X_1,\ldots ,X_n\) are independent random variables with \({\mathbb {E}}[X_j] = 0\) and \({\mathbb {E}}[|X_j|^q] < \infty \) for all j, then
$$\begin{aligned} {\mathbb {E}}\left[ \left| \sum _{j=1}^n X_j \right| ^q \right] \le [C(q)]^q \max \left\{ \sum _{j=1}^n {\mathbb {E}}[|X_j|^q] , \left( \sum _{j=1}^n {\mathbb {E}}[|X_j|^2] \right) ^{q/2} \right\} . \end{aligned}$$
Equipped with the above lemma, we can bound the moments of mean of random vectors.
Lemma 17
Let \(X_1, \ldots , X_n \in {\mathbb {R}}^d\) be i.i.d. random vectors with \({\mathbb {E}}[X_i] = {\mathbf {0}}\), and suppose there exist constants \(G>0\) and \(q_0 \ge 2\) such that \({\mathbb {E}} [\Vert X_i \Vert ^{q_0}] < G^{q_0}\). Let \(\overline{X }= \frac{1}{n}\sum _{i=1}^n X_i\); then for \(1 \le q \le q_0\), we have
$$\begin{aligned} {\mathbb {E}} [\Vert \overline{X }\Vert ^q] \le C_v(q,d) \frac{G^q}{n^{q/2}}, \end{aligned}$$where \(C_v(q,d)\) is a constant depending solely on q and d.
Proof
The main idea of this proof is to transform the sum of random vectors into the sum of random variables and then apply Lemma 16. Let \(X_{i,j}\) denote the j-th component of \(X_i\) and \(\overline{X }_j\) denote the j-th component of \(\overline{X }\).
(1) Let us start with the simpler case \(q=2\).
$$\begin{aligned} {\mathbb {E}} [\Vert \overline{X }\Vert ^2] = \sum _{j=1}^d \sum _{i=1}^n {\mathbb {E}}[| X_{i,j}/n |^2] = \sum _{j=1}^d {\mathbb {E}}[|X_{1,j}|^2]/n = n^{-1} {\mathbb {E}}[\Vert X_1\Vert ^2] \le n^{-1}G^2. \end{aligned}$$The last inequality holds because \({\mathbb {E}}[\Vert X_{1}\Vert ^q] \le ({\mathbb {E}}[\Vert X_{1}\Vert ^{q_0}])^{q/q_0} \le G^q\) for \(1 \le q \le q_0\) by Hölder’s inequality.
(2) When \(1 \le q < 2\), we have
$$\begin{aligned} {\mathbb {E}} [\Vert \overline{X }\Vert ^q] \le ({\mathbb {E}} [\Vert \overline{X }\Vert ^2])^{q/2} \le n^{-q/2}G^q. \end{aligned}$$
(3) For \(2 < q \le q_0\), with some simple algebra, we have
$$\begin{aligned} {\mathbb {E}} [\Vert \overline{X }\Vert ^q]&= {\mathbb {E}} \left[ \left( \sum _{j=1}^d |\overline{X }_j|^2\right) ^{q/2} \right] \le {\mathbb {E}} \left[ \left( d \max _{1 \le j \le d} |\overline{X }_j|^2\right) ^{q/2} \right] \\&= d^{q/2} {\mathbb {E}} \left[ \max _{1 \le j \le d} |\overline{X }_j|^q \right] \le d^{q/2} {\mathbb {E}} \left[ \sum _{j=1}^d |\overline{X }_j|^q \right] = d^{q/2} \sum _{j=1}^d {\mathbb {E}} \left[ |\overline{X }_j|^q \right] . \end{aligned}$$As a continuation, we have
$$\begin{aligned}&{\mathbb {E}} [\Vert \overline{X }\Vert ^q] \le d^{q/2} \sum _{j=1}^d {\mathbb {E}}[|\overline{X }_j|^q] \\&\quad \le d^{q/2} \sum _{j=1}^d [C(q)]^q \max \left\{ \sum _{i=1}^n {\mathbb {E}}[|X_{i,j}/n|^q] , \left( \sum _{i=1}^n {\mathbb {E}}[ |X_{i,j}/n|^2 ]\right) ^{q/2} \right\} \qquad \text {(Lemma }16)\\&\quad = d^{q/2} [C(q)]^q \sum _{j=1}^d \max \left\{ \frac{{\mathbb {E}}[|X_{1,j}|^q]}{n^{q-1}} , \frac{({\mathbb {E}}[|X_{1,j}|^2])^{q/2}}{n^{q/2}} \right\} \\&\quad \le d^{q/2+1} [C(q)]^q \max \left\{ \frac{{\mathbb {E}}[\Vert X_{1}\Vert ^q]}{n^{q-1}} , \frac{({\mathbb {E}}[\Vert X_{1}\Vert ^2])^{q/2}}{n^{q/2}} \right\} \qquad \text {(since }{\mathbb {E}}[|X_{1,j}|^q] \le {\mathbb {E}}[\Vert X_{1}\Vert ^q]\text {)}\\&\quad \le d^{q/2+1} [C(q)]^q \max \left\{ \frac{G^q}{n^{q-1}} , \frac{G^q}{n^{q/2}} \right\} \qquad \text {(H}\ddot{\mathrm{o}}\text {lder's inequality)}\\&\quad = \frac{d^{q/2+1} [C(q)]^q}{n^{q/2}}G^q. \qquad (q-1>q/2 \text { when }q>2) \end{aligned}$$To complete this proof, we just need to set \(C_v(q,d)=d^{q/2+1} [C(q)]^q\). \(\square \)
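As a quick Monte Carlo sanity check of this \(n^{-q/2}\) scaling (our own simulation, not from the paper): for i.i.d. standard normal vectors, quadrupling n should divide \({\mathbb {E}}[\Vert \overline{X }\Vert ^q]\) by about \(4^{q/2}\).

```python
import numpy as np

rng = np.random.default_rng(1)
d, q, reps = 3, 4.0, 5000

def qth_moment(n):
    # reps independent draws of Xbar = mean of n standard normal d-vectors
    xbar = rng.normal(size=(reps, n, d)).mean(axis=1)
    return np.mean(np.linalg.norm(xbar, axis=1) ** q)

ratio = qth_moment(100) / qth_moment(400)  # expect about 4^{q/2} = 16
```

For Gaussian vectors the ratio is exactly \(16\) in expectation (since \({\mathbb {E}}[\Vert \overline{X }\Vert ^4] = (d^2+2d)/n^2\)); the Monte Carlo estimate lands close to it.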
To bound the moments of the mean of i.i.d. random matrices, let us consider another matrix norm, the Frobenius norm \({\left| \left| \left| \cdot \right| \right| \right| }_F\), i.e., \({\left| \left| \left| A \right| \right| \right| }_F = \sqrt{\sum _{i,j} | a_{ij} |^2}, \forall A \in {\mathbb {R}}^{d \times d}\). Recall that we also use the matrix norm \({\left| \left| \left| \cdot \right| \right| \right| }\), defined as the maximal singular value, i.e., \({\left| \left| \left| A \right| \right| \right| } = \sup _{u: u \in R^d, \Vert u\Vert \le 1} \Vert Au \Vert \). One can easily show that
$$\begin{aligned} {\left| \left| \left| A \right| \right| \right| } \le {\left| \left| \left| A \right| \right| \right| }_F \le \sqrt{d} {\left| \left| \left| A \right| \right| \right| }, \quad \forall A \in {\mathbb {R}}^{d \times d}. \end{aligned}$$
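The two norm comparisons \({\left| \left| \left| A \right| \right| \right| } \le {\left| \left| \left| A \right| \right| \right| }_F \le \sqrt{d} {\left| \left| \left| A \right| \right| \right| }\) are standard; a purely illustrative numerical spot check (ours, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
for _ in range(200):
    A = rng.normal(size=(d, d))
    op = np.linalg.norm(A, 2)        # spectral norm = maximal singular value
    fro = np.linalg.norm(A, "fro")   # Frobenius norm
    assert op <= fro + 1e-9
    assert fro <= np.sqrt(d) * op + 1e-9
```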
With Frobenius norm, we can regard a random matrix \(X \in {\mathbb {R}}^{d \times d}\) as a random vector in \({\mathbb {R}}^{d^2}\) and apply Lemma 17 to obtain the following lemma.
Lemma 18
Let \(X_1,\ldots ,X_n \in {\mathbb {R}}^{d \times d}\) be i.i.d. random matrices with \({\mathbb {E}}[X_i] = {\mathbf {0}}_{d \times d}\). Let \({\left| \left| \left| X_i \right| \right| \right| }\) denote the norm of \(X_i\), defined as its maximal singular value. Suppose \({\mathbb {E}} [{\left| \left| \left| X_i \right| \right| \right| }^{q_0}] \le H^{q_0}\), where \(q_0 \ge 2\) and \(H>0\). Then for \(\overline{X }= \frac{1}{n}\sum _{i=1}^n X_i\) and \(1 \le q \le q_0\), we have
$$\begin{aligned} {\mathbb {E}} [{\left| \left| \left| \overline{X } \right| \right| \right| }^q] \le C_m(q,d) \frac{H^q}{n^{q/2}}, \end{aligned}$$where \(C_m(q,d)\) is a constant depending on q and d only.
Proof
By the fact \({\left| \left| \left| A \right| \right| \right| }_F \le \sqrt{d} {\left| \left| \left| A \right| \right| \right| }\), we have \({\mathbb {E}} \left[ {\left| \left| \left| X_i \right| \right| \right| }_F^{q_0} \right] \le {\mathbb {E}} \left[ \left( \sqrt{d} {\left| \left| \left| X_i \right| \right| \right| }\right) ^{q_0} \right] \le (\sqrt{d}H)^{q_0}\). Then, by the fact \( {\left| \left| \left| A \right| \right| \right| } \le {\left| \left| \left| A \right| \right| \right| }_F\) and Lemma 17, we have
$$\begin{aligned} {\mathbb {E}} \left[ {\left| \left| \left| \overline{X } \right| \right| \right| }^q \right] \le {\mathbb {E}} \left[ {\left| \left| \left| \overline{X } \right| \right| \right| }_F^q \right] \le C_v(q,d^2) \frac{(\sqrt{d}H)^q}{n^{q/2}} = C_v(q,d^2) d^{\frac{q}{2}} \frac{H^q}{n^{q/2}}. \end{aligned}$$
In the second inequality, we treat \(\overline{X }\) as a \(d^2\)-dimensional random vector and then apply Lemma 17. Then the proof can be completed by setting \(C_m(q,d) = C_v(q,d^2)d^\frac{q}{2}\). \(\square \)
B Error bound of local M-estimator and simple averaging estimator
Since the simple averaging estimator is the average of all local estimators and the one-step estimator is a single Newton–Raphson update from the simple averaging estimator, it is natural to study the upper bounds of the mean squared error (MSE) of a local M-estimator and of the simple averaging estimator. The main idea in the following proof is similar to the thread of the proof of Theorem 1 in Zhang and Duchi [21], but the conclusions are different. Besides, in the following proof, we use a correct analogue of the mean value theorem for vector-valued functions.
1.1 B.1 Bound the error of local M-estimators \(\theta _i, i=1,\ldots ,k\)
In this subsection, we analyze the mean squared error of a local estimator \(\theta _i = \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\theta \in \varTheta } M_i(\theta )\), \(i = 1, \ldots , k\), and prove the following lemma in the rest of this subsection.
Lemma 19
Let \(\varSigma = {\ddot{M}}_0(\theta _0)^{-1} {\mathbb {E}}[{\dot{m}}(X;\theta _0){\dot{m}}(X;\theta _0)^t] {\ddot{M}}_0(\theta _0)^{-1}\), where the expectation is taken with respect to X. Under Assumptions 3, 4, 5 and 6, for each \(i=1,\ldots ,k\), we have
$$\begin{aligned} {\mathbb {E}} [\Vert \theta _i - \theta _0 \Vert ^2] \le \frac{2}{n} \hbox {Tr}(\varSigma ) + O(n^{-2}). \end{aligned}$$
Since \({\dot{M}}_i(\theta _i) = 0\), by Theorem 4.2 in Chapter XIII of Lang [11], we have
$$\begin{aligned} -{\dot{M}}_i(\theta _0)&= \left[ \int _{0}^1 {\ddot{M}}_i((1 - \rho ) \theta _0 +\rho \theta _i) d\rho \right] [\theta _i - \theta _0] \\&= {\ddot{M}}_0(\theta _0) [\theta _i - \theta _0] + [{\ddot{M}}_i(\theta _0) - {\ddot{M}}_0(\theta _0)] [\theta _i - \theta _0] \\&\quad + \left[ \int _{0}^1 {\ddot{M}}_i((1 - \rho ) \theta _0 +\rho \theta _i) d\rho - {\ddot{M}}_i(\theta _0)\right] [\theta _i - \theta _0]. \end{aligned}$$
If the last two terms in the above equation are reasonably small, the lemma follows immediately. Our strategy is therefore as follows. First, we show that the mean squared errors of both \([\int _{0}^1 {\ddot{M}}_i((1 - \rho ) \theta _0 +\rho \theta _i) d\rho \, - {\ddot{M}}_i(\theta _0)] [\theta _i - \theta _0]\) and \([{\ddot{M}}_i(\theta _0) - {\ddot{M}}_0(\theta _0)] [\theta _i - \theta _0]\) are small under certain “good” events. Then we show that the probability of the “bad” events is small enough. Lemma 19 then follows from the fact that \(\varTheta \) is compact.
Suppose \(S_i = \{x_1,\ldots ,x_n\}\) is the data set on local machine i. Let us define some good events:
where \(\delta ' = \min (\delta ,\frac{\lambda }{8L})\), \(\lambda \) is the constant in Assumption 5, and L and \(\delta \) are the constants in Assumption 6. We will show that events \(E_1\) and \(E_2\) ensure that \(M_i(\theta )\) is strictly concave in a neighborhood of \(\theta _0\), and that under event \(E_3\), \(\theta _i\) is fairly close to \(\theta _0\). Let \(E = E_1 \cap E_2 \cap E_3\); then we have the following lemma:
Lemma 20
Under event E, we have \(\Vert \theta _i - \theta _0 \Vert \le \frac{4}{\lambda } \Vert {\dot{M}}_i(\theta _0) \Vert \).
Proof
First, we will show \({\ddot{M}}_i(\theta )\) is a negative definite matrix over a ball centered at \(\theta _0\): \(B_{\delta '} = \{ \theta \in \varTheta : \Vert \theta - \theta _0 \Vert \le \delta ' \} \subset B_{\delta }\). For any fixed \(\theta \in B_{\delta '}\), we have
where we apply event \(E_1\), Assumption 6 and the fact that \(\delta '=\min (\delta ,\frac{\lambda }{8L})\) to the first term, and event \(E_2\) to the second term. Since \({\ddot{M}}_0(\theta _0)\) is negative definite by Assumption 5, the above inequality implies that \({\ddot{M}}_i(\theta )\) is negative definite for all \(\theta \in B_{\delta '}\) and
With negative definiteness of \({\ddot{M}}_i(\theta ), \theta \in B_{\delta '}\), event \(E_3\) and concavity of \(M_i(\theta ), \theta \in \varTheta \), we have
Thus, we know \(\Vert \theta _i - \theta _0 \Vert \le \delta '\), or equivalently, \(\theta _i \in B_{\delta '}\). Then, by applying Taylor’s Theorem to \(M_i(\theta )\) at \(\theta _0\), we have
As \(\theta _i\) is the maximizer of \(M_i(\cdot )\), we know \(M_i(\theta _0) \le M_i(\theta _i)\), thus,
which implies \(\Vert \theta _i - \theta _0 \Vert \le \frac{4}{\lambda } \Vert {\dot{M}}_i(\theta _0) \Vert .\)\(\square \)
For \(1 \le q \le 8\), we can bound \({\mathbb {E}}[ \Vert {\dot{M}}_i(\theta _0) \Vert ^q ]\) by Lemma 17 and Assumption 6,
where \(C_v(q,d)\) is a constant depending on q and d only. Then by conditioning on event E, we have
If we can show \(\hbox {Pr}(E^c) = O(n^{-\frac{q}{2}})\), then \({\mathbb {E}}[ \Vert \theta _i - \theta _0 \Vert ^q] = O(n^{-\frac{q}{2}})\) follows immediately.
Lemma 21
Under Assumption 6, we have \(\hbox {Pr}(E^c) = O(n^{-4}).\)
Proof
Under Assumption 6, by applying Lemmas 17 and 18, we can bound the moments of \({\dot{M}}_i(\theta _0)\) and \({\ddot{M}}_i(\theta _0) - {\ddot{M}}_0(\theta _0)\). Rigorously, for \(1 \le q \le 8\), we have
Therefore, by Markov’s inequality, we have
\(\square \)
Now, we have shown that for \(1 \le q \le 8\),
The term \([{\ddot{M}}_i(\theta _0) - {\ddot{M}}_0(\theta _0)] [\theta _i - \theta _0]\) is now well bounded. Next, we consider the moment bound of \(\int _{0}^1 {\ddot{M}}_i((1 - \rho ) \theta _0 +\rho \theta _i) d\rho - {\ddot{M}}_i(\theta _0)\).
Lemma 22
Under Assumption 6, for \(1 \le q \le 4\),
Proof
By Minkowski’s integral inequality, we have
For simplicity of notation, we use \(\theta ' = (1 - \rho ) \theta _0 +\rho \theta _i\) in this proof. When event E holds, we have \(\Vert \theta ' - \theta _0 \Vert = \Vert \rho (\theta _i - \theta _0) \Vert \le \rho \delta ' \le \delta \), which means that \(\theta ' \in B_{\delta }, \forall \rho \in [0,1]\). Thus, because of the convexity of the matrix norm \({\left| \left| \left| \cdot \right| \right| \right| }\), we can apply Jensen’s inequality and Assumption 6 and get
Then apply Hölder’s inequality,
When event E does not hold, \({\left| \left| \left| {\ddot{M}}_i(\theta ') - {\ddot{M}}_i(\theta _0) \right| \right| \right| }^q\) is still finite, because \(\varTheta \) is compact and \({\ddot{M}}_i(\theta )\) is continuous. By Lemma 21, the probability that event E does not hold is bounded by \(O(n^{-4})\), which implies
Therefore, we have
\(\square \)
Now, recall that we have
For the sum of the last two terms, we have
We have thus established the upper bound for the mean squared error of the local M-estimators: \({\mathbb {E}} [\Vert \theta _i - \theta _0 \Vert ^2] \le \frac{2}{n} \hbox {Tr}(\varSigma ) + O(n^{-2}), \hbox { for } i = 1,\ldots ,k\).
1.2 B.2 Bound the error of simple averaging estimator \(\theta ^{(0)}\)
Next, we study the mean squared error of the simple averaging estimator, \(\theta ^{(0)} = \frac{1}{k} \sum _{i=1}^k \theta _i.\) We start with a lemma that bounds the bias of the local M-estimators \(\theta _i, i=1,\ldots ,k\).
Lemma 23
There exists some constant \({\tilde{C}}>0\) such that for \(i=1,\ldots ,k\), we have
$$\begin{aligned} \Vert {\mathbb {E}}[\theta _i - \theta _0] \Vert \le \frac{{\tilde{C}}}{n} + O(n^{-2}), \end{aligned}$$
where \({\tilde{C}} = 16 [C_v(4,d)]^{\frac{1}{4}}\sqrt{C_v(2,d)} \lambda ^{-3} G^2L + 4 \sqrt{C_m(2,d)}\sqrt{C_v(2,d)} \lambda ^{-2} GH\).
Proof
The main idea of this proof is to use Eq. (B.11) and apply the established error bounds for the Hessian and the local M-estimators. By Eq. (B.11) and the fact that \({\mathbb {E}}[{\dot{M}}_i(\theta _0)] = 0\), we have
Then we can apply Lemmas 18 and 22, and (B.10), to bound each term; thus, we have
Let \({\tilde{C}} = 16 [C_v(4,d)]^{\frac{1}{4}}\sqrt{C_v(2,d)} \lambda ^{-3} G^2L + 4 \sqrt{C_m(2,d)}\sqrt{C_v(2,d)} \lambda ^{-2} GH\), then we have \(\Vert {\mathbb {E}}[\theta _i - \theta _0] \Vert \le \frac{{\tilde{C}}}{n} + O(n^{-2})\). \(\square \)
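The \(O(1/n)\) bias in Lemma 23 is exactly what averaging cannot remove. A toy illustration outside the paper's setting (our own example: the biased local estimator \((\bar{x}_i)^2\) of \(\mu ^2\), whose bias is \(\sigma ^2/n\)) shows that averaging across machines shrinks the variance term but leaves the bias floor:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, n, reps = 1.0, 50, 20000

def mse_of_average(k):
    # local sample means xbar_i ~ N(mu, 1/n), drawn directly;
    # the averaging estimator is the mean of the biased (xbar_i)^2
    xbar = rng.normal(mu, 1.0 / np.sqrt(n), size=(reps, k))
    est = (xbar ** 2).mean(axis=1)
    return np.mean((est - mu ** 2) ** 2)

mse_k1, mse_k100 = mse_of_average(1), mse_of_average(100)
# variance drops roughly like 1/k, but the squared bias (1/n)^2 remains
```

With k = 100 machines the MSE is far below the single-machine MSE, yet it stays above the squared-bias floor \((\sigma ^2/n)^2\), matching the \({\tilde{C}}^2 k^2 / N^2\) term that appears in the bound for \(\theta ^{(0)}\) below.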
Then, we can show that the MSE of \(\theta ^{(0)}\) can be bounded as follows.
Lemma 24
There exists some constant \({\tilde{C}}>0\) such that
$$\begin{aligned} {\mathbb {E}} [ \Vert \theta ^{(0)} - \theta _0 \Vert ^2] \le \frac{2}{N} \hbox {Tr}(\varSigma ) + \frac{{\tilde{C}}^2 k^2}{N^2} + O(kN^{-2}) + O(k^3N^{-3}), \end{aligned}$$
where \({\tilde{C}} = 16 [C_v(4,d)]^{\frac{1}{4}}\sqrt{C_v(2,d)} \lambda ^{-3} G^2L + 4 \sqrt{C_m(2,d)}\sqrt{C_v(2,d)} \lambda ^{-2} GH\).
Proof
The mean squared error of \(\theta ^{(0)}\) can be decomposed into two parts, variance and squared bias. Thus,
where the first term is well bounded by Lemma 19 and the second term could be bounded by Lemma 23. Thus, we know \({\mathbb {E}} [ \Vert \theta ^{(0)} - \theta _0 \Vert ^2] \le \frac{2}{N} \hbox {Tr}(\varSigma ) + \frac{{\tilde{C}}^2 k^2}{N^2} + O(kN^{-2}) + O(k^3N^{-3})\). More generally, for \(1 \le q \le 8\), we have
In summary, we have
\(\square \)
C Proof of Theorem 7
The whole proof can be completed in two steps: first, show that the simple averaging estimator \(\theta ^{(0)}\) is \(\sqrt{N}\)-consistent when \(k=O(\sqrt{N})\); then show the consistency and asymptotic normality of the one-step estimator \(\theta ^{(1)}\). In the first step, we need to show the following.
Lemma 25
Under Assumptions 3, 4, 5 and 6, when \(k=O(\sqrt{N})\), the simple averaging estimator \(\theta ^{(0)}\) is a \(\sqrt{N}\)-consistent estimator of \(\theta _0\), i.e., \(\sqrt{N} \Vert \theta ^{(0)} - \theta _0 \Vert = O_P(1) \text { as } N \rightarrow \infty .\)
Proof
If k is finite and does not grow with N, the proof is trivial, so we only need to consider the case \(k \rightarrow \infty \). We know that \(\Vert {\mathbb {E}}[ \sqrt{n}(\theta _i - \theta _0) ] \Vert \le O(\frac{1}{\sqrt{n}})\) by Lemma 23 and \({\mathbb {E}}[ \Vert \sqrt{n}(\theta _i - \theta _0) \Vert ^2] \le 2 \hbox {Tr}(\varSigma ) + O(n^{-1})\) by Lemma 19. By applying the Lindeberg–Lévy Central Limit Theorem, we have
It suffices to show that \(\lim _{N \rightarrow \infty } \sqrt{nk}{\mathbb {E}}[ \theta _1 - \theta _0 ]\) is finite. By Lemma 23, we have \(\Vert {\mathbb {E}}[\theta _1 - \theta _0] \Vert = O(\frac{1}{n})\), which means that \(\Vert \sqrt{nk}{\mathbb {E}}[ \theta _1 - \theta _0 ] \Vert = O(1)\) if \(k=O(\sqrt{N})=O(n)\). Thus, when \(k=O(\sqrt{N})\), \(\sqrt{N} (\theta ^{(0)} - \theta _0)\) is bounded in probability. \(\square \)
Now, we can prove Theorem 7.
Proof
By the definition of the one-step estimator \(\theta ^{(1)} = \theta ^{(0)} - {\ddot{M}}(\theta ^{(0)})^{-1} {\dot{M}}(\theta ^{(0)})\), and by Theorem 4.2 in Chapter XIII of Lang [11], we have
As shown in (B.12), for any \(\rho \in [0,1]\), when \(k = O(\sqrt{N})\), we have \(\Vert (1 - \rho ) \theta _0 +\rho \theta ^{(0)} - \theta _0 \Vert \le \rho \Vert \theta ^{(0)} - \theta _0 \Vert \xrightarrow {P} 0\). Since \({\ddot{M}}(\cdot )\) is a continuous function, \({\left| \left| \left| {\ddot{M}}(\theta ^{(0)}) - \int _{0}^1 {\ddot{M}}((1 - \rho ) \theta _0 +\rho \theta ^{(0)}) d\rho \right| \right| \right| }\xrightarrow {P} 0\). Thus, we have \(\sqrt{N} {\ddot{M}}(\theta ^{(0)}) (\theta ^{(1)} - \theta _0) = - \sqrt{N} {\dot{M}}(\theta _0) + o_P(1)\). Moreover, \({\ddot{M}}(\theta ^{(0)}) \xrightarrow {P} {\ddot{M}}_0(\theta _0)\) because \(\theta ^{(0)} \xrightarrow {P} \theta _0\) and by the Law of Large Numbers. By applying Slutsky’s Lemma, we obtain \(\sqrt{N} (\theta ^{(1)} - \theta _0) \xrightarrow {d} {\mathbf {N}}(0,\varSigma ) \hbox { as } N \rightarrow \infty \). \(\square \)
Remark 26
If we simply replace the global Hessian \({\ddot{M}}(\theta ^{(0)})^{-1}\) with a local Hessian \({\ddot{M}}_1(\theta ^{(0)})^{-1}\), it is easy to verify that every step in the above proof still holds. Thus, \(\theta ^{(2)} = \theta ^{(0)} - {\ddot{M}}_1(\theta ^{(0)})^{-1} {\dot{M}}(\theta ^{(0)})\) enjoys the same asymptotic properties as \(\theta ^{(1)}\).
D Proof of Theorem 10
Let us recall the formula for the one-step estimator, \(\theta ^{(1)} = \theta ^{(0)} - {\ddot{M}}(\theta ^{(0)})^{-1} {\dot{M}}(\theta ^{(0)})\). Then, by Theorem 4.2 in Chapter XIII of Lang [11], we have
Then we have
We will show that the last two terms are small enough. Similar to the proof of Lemma 19, we define a “good” event: \(E_4 = \{ \Vert \theta ^{(0)} - \theta _0 \Vert \le \delta \}.\) The probability of this event is close to 1 when N is large.
Lemma 27
If event \(E_4\) holds, for \(1 \le q \le 4\), we have
Proof
By Lemma 18, we know \({\mathbb {E}} \left[ {\left| \left| \left| {\ddot{M}}_0(\theta _0)- {\ddot{M}}(\theta _0) \right| \right| \right| }^q \right] \le \frac{C_m(q,d)}{N^{q/2}} H^q\). Under event \(E_4\) and Assumption 6, by applying Jensen’s inequality, we have
Thus, for \(1 \le q \le 4\), we have
As a result, we have, for \(1 \le q \le 4\),
In this proof, we let \(\theta ' = (1 - \rho ) \theta _0 +\rho \theta ^{(0)}\) for simplicity of notation. Note that \(\theta '- \theta _0 = \rho (\theta ^{(0)} - \theta _0)\); then by event \(E_4\), Assumption 6 and inequality (B.12), we have
So, we have
\(\square \)
Therefore, under event \(E_4\), for \(1 \le q \le 4\), we can bound \({\ddot{M}}_0(\theta _0)^{-1} [{\ddot{M}}_0(\theta _0)-{\ddot{M}}(\theta ^{(0)})] (\theta ^{(1)} - \theta _0)\) and \({\ddot{M}}_0(\theta _0)^{-1} [{\ddot{M}}(\theta _0) - \int _{0}^1 {\ddot{M}}((1 - \rho ) \theta _0 +\rho \theta ^{(0)}) d\rho ] (\theta ^{(0)} - \theta _0)\) as follows:
and,
And by Lemma 17, for \(1 \le q \le 8\), we have \({\mathbb {E}} [ \Vert {\dot{M}}(\theta _0) \Vert ^q ] = O(N^{-\frac{q}{2}})\). Therefore, combining the above three bounds and Eq. (D.13), we have, for \(1 \le q \le 4\),
Now, we can give tighter bounds for the first two terms in equation (D.13) by Hölder’s inequality.
and,
Now, we can finalize our proof by using Eq. (B.11) again,
Remark 28
If we replace the global Hessian \({\ddot{M}}(\theta ^{(0)})^{-1}\) with a local Hessian \({\ddot{M}}_1(\theta ^{(0)})^{-1}\), the first inequality in Lemma 27 has to be changed to \({\mathbb {E}} \left[ {\left| \left| \left| {\ddot{M}}_0(\theta _0)-{\ddot{M}}_1(\theta ^{(0)}) \right| \right| \right| }^q \right] \le O(k^{\frac{q}{2}}N^{-\frac{q}{2}})\), since the local Hessian \({\ddot{M}}_1(\cdot )\) is based on only \(n=N/k\) samples. As a result, the first two terms in Eq. (D.13) can only be bounded as
Therefore, the upper bound of mean squared error becomes \({\mathbb {E}}[ \Vert \theta ^{(1)} - \theta _0 \Vert ^2 ] \le 2 \frac{\hbox {Tr}(\varSigma )}{N} + O(kN^{-2}) + O(k^{3}N^{-3})\).
E Proof of Corollary 15
First, we present a lemma on the negative moments of a Binomial random variable, i.e., \({\mathbb {E}} \left[ \frac{1}{Z}1_{(Z>0)} \right] \) and \({\mathbb {E}} \left[ \frac{1}{Z^2} 1_{(Z>0)} \right] \), where \(Z \sim {\mathbf {B}}(k,p)\) and \({\mathbf {B}}(k,p)\) denotes the Binomial distribution with k independent trials and success probability p for each trial. These negative moments have presumably been well studied, but we could not find an appropriate reference for the upper bounds we need. So, we derive the upper bounds below, which will be useful in the proof of Corollary 15.
Lemma 29
Suppose \(Z \sim {\mathbf {B}}(k,p)\) with \(p>0\); then we have
Proof
By definition, we have
Similarly, we have
\(\square \)
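Since the displayed bounds are derived above, a rough Monte Carlo spot check (our own simulation) of the leading-order behavior \({\mathbb {E}}[Z^{-1}1_{(Z>0)}] \approx (kp)^{-1}\) and \({\mathbb {E}}[Z^{-2}1_{(Z>0)}] \approx (kp)^{-2}\):

```python
import numpy as np

rng = np.random.default_rng(4)
k, p, reps = 50, 0.8, 200000
Z = rng.binomial(k, p, size=reps)
pos = (Z > 0).astype(float)
m1 = np.mean(pos / np.maximum(Z, 1))        # E[Z^-1 1_{Z>0}]
m2 = np.mean(pos / np.maximum(Z, 1) ** 2)   # E[Z^-2 1_{Z>0}]
# both sit close to their leading terms 1/(k p) and 1/(k p)^2
```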
Now, we can prove Corollary 15.
Proof
Let the random variable Z denote the number of machines that successfully communicate with the central machine; then Z follows the Binomial distribution \(B(k,1-r)\). By the Law of Large Numbers, \(\frac{Z}{(1-r)k} \xrightarrow {P} 1\text { as } k \rightarrow \infty \). Given Z, the size of the available data is Zn. By Theorem 7, the one-step estimator \(\theta ^{(1)}\) is still asymptotically normal when \(k=O(\sqrt{N})\), i.e., \(\sqrt{Zn}(\theta ^{(1)} - \theta _0) \xrightarrow {d} N(0,\varSigma ) \text { as } n \rightarrow \infty \). Therefore, when \(k \rightarrow \infty \), we have
Since \(\frac{(1-r)N}{Zn}=\frac{(1-r)k}{Z} \xrightarrow {P} 1\), by Slutsky’s Lemma, we have \(\sqrt{(1-r)N}(\theta ^{(1)} - \theta _0) \xrightarrow {d} N(0,\varSigma )\). This result indicates that when the local machines can independently lose communication with the central machine with probability r, the one-step estimator \(\theta ^{(1)}\) shares the same asymptotic properties as the oracle M-estimator using \((1-r) \times 100\%\) of the total samples.
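A small simulation (ours; a Gaussian location model stands in for the oracle estimator that Corollary 15 says \(\theta ^{(1)}\) matches asymptotically) illustrates the variance inflation factor \(\frac{1}{1-r}\) when each machine fails independently with probability r:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, n, k, r, reps = 0.0, 100, 40, 0.3, 5000
errs = []
for _ in range(reps):
    x = rng.normal(mu, 1.0, size=(k, n))
    alive = rng.random(k) > r          # machine i survives w.p. 1 - r
    if not alive.any():
        continue                       # all machines failed (negligible prob.)
    est = x[alive].mean()              # oracle estimator on surviving data
    errs.append((est - mu) ** 2)
mse = np.mean(errs)
# mse is close to 1/((1-r) N): the variance with only (1-r) of the data
```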
Next, we analyze the mean squared error of the one-step estimator in the presence of local machine failures. Note that, when Z is fixed and known, by Theorem 10, we have
By the rule of double expectation and Lemma 29,
\(\square \)
Huang, C., Huo, X. A distributed one-step estimator. Math. Program. 174, 41–76 (2019). https://doi.org/10.1007/s10107-019-01369-0