
Testing Shape Restrictions of Discrete Distributions


Abstract

We study the question of testing structured properties (classes) of discrete distributions. Specifically, given sample access to an arbitrary distribution D over [n] and a property \(\mathcal {P}\), the goal is to distinguish between \(D \in \mathcal {P}\) and \(\ell_{1}({D}, {\mathcal {P}}) > {\varepsilon }\). We develop a general algorithm for this question, which applies to a large range of “shape-constrained” properties, including monotone, log-concave, t-modal, piecewise-polynomial, and Poisson Binomial distributions. Moreover, for all cases considered, our algorithm has near-optimal sample complexity with regard to the domain size and is computationally efficient. For most of these classes, we provide the first non-trivial tester in the literature. In addition, we also describe a generic method to prove lower bounds for this problem, and use it to show our upper bounds are nearly tight. Finally, we extend some of our techniques to tolerant testing, deriving nearly-tight upper and lower bounds for the corresponding questions.


Notes

  1. Recall that the identity testing problem asks, given the explicit description of a distribution D∗ and sample access to an unknown distribution D, to decide whether D is equal to D∗ or far from it; while in closeness testing both distributions to compare are unknown.

  2. For the sample complexity of testing monotonicity, [12] originally states an \(\tilde {O}\left ({\sqrt {n}}/{{\varepsilon }^{4}} \right )\) upper bound, but the proof seems to only result in an \(\tilde {O}\left ({\sqrt {n}}/{{\varepsilon }^{6}} \right )\) bound. Regarding the class of PBDs, [2] obtain an \({n^{1/4}}\cdot \tilde {O}\left ({1/{\varepsilon }^{2}}\right ) + \tilde {O}\left ({1/{\varepsilon }^{6}}\right )\) sample complexity, to be compared with our \(\tilde {O}\left ({n^{1/4}/{\varepsilon }^{7/2}}\right ) + {O\left (\log ^{4} n/{\varepsilon }^{4} \right )}\) upper bound; as well as an \({\Omega }\left ({n^{1/4}/{\varepsilon }^{2}}\right )\) lower bound.

  3. As a simple example, consider the class \(\mathcal {C}\) of all distributions, for which testing membership is trivial.

  4. Tolerant testing of a property \(\mathcal {P}\) is defined as follows: given \(0 \leq {\varepsilon }_{1} < {\varepsilon }_{2} \leq 1\), one must distinguish between (a) \(\ell _{1}({D},{\mathcal {P}}) \leq {\varepsilon }_{1}\) and (b) \(\ell _{1}({D},{\mathcal {P}}) \geq {\varepsilon }_{2}\). This turns out to be, in general, a much harder task than that of “regular” testing (where we take \({\varepsilon }_{1} = 0\)).

  5. Note that this slightly deviates from the Statistics literature, where only the peaks are counted as modes (so that what is usually referred to as a bimodal distribution is, according to our definition, 3-modal).

  6. In more detail, we want to argue that if D is in the class, then a decomposition with at most L pieces is found by the algorithm. Since there is a dyadic decomposition with at most L pieces (namely, \(\mathcal {I}(\gamma ,\gamma ,{D})=(I_{1},\dots ,I_{t})\)), it suffices to argue that the algorithm will never split one of the \(I_{j}\)’s (as every single \(I_{j}\) will eventually be considered by the recursive binary splitting, unless the algorithm stopped recursing in this “path” before even considering \(I_{j}\), which is even better). But this is the case by the above argument, which ensures each such \(I_{j}\) will be recognized as satisfying one of the two conditions for “good decomposition” (being either close to uniform in \(\ell _{2}\) distance, or having very little mass).
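     To make the splitting scheme concrete, here is a minimal sketch (our own illustration, with the “good interval” test left abstract) of the recursive binary splitting this footnote describes:

     ```python
     # A minimal sketch (not the paper's pseudocode) of the recursive binary
     # splitting: keep an interval if it is already "good" (close to uniform in
     # l2 distance, or of very small mass), and otherwise split it in half.
     def dyadic_decompose(lo, hi, is_good):
         """Return a list of intervals (lo, hi), inclusive endpoints, covering [lo, hi]."""
         if lo == hi or is_good(lo, hi):
             return [(lo, hi)]
         mid = (lo + hi) // 2
         return dyadic_decompose(lo, mid, is_good) + dyadic_decompose(mid + 1, hi, is_good)

     # Hypothetical usage, with mass() and close_to_uniform_l2() standing in for
     # the two "good decomposition" conditions of the algorithm:
     #   dyadic_decompose(1, n, lambda a, b: mass(a, b) < gamma or close_to_uniform_l2(a, b))
     ```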

  7. Indeed, it is not hard to show that a monotone distribution can only be ε-far from uniform if it puts probability weight \(1/2+{\Omega \left ({\varepsilon } \right )}\) on the first half of the domain. Estimating this probability weight to an additive O(ε) is thus sufficient to conclude.
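     As an illustration (our own sketch, with illustrative constants that are not taken from the paper), the resulting test for monotone distributions amounts to a single empirical estimate:

     ```python
     # A minimal sketch of the simple monotone-vs-uniform test suggested above:
     # estimate the weight D puts on the first half of [n] and compare it to
     # 1/2 + Theta(eps).  The threshold 0.25 * eps and the 10 / eps**2 sample
     # count are illustrative choices, not constants from the paper.
     import random

     def test_monotone_far_from_uniform(sample, n, eps, num_samples=None):
         """sample() returns one draw from D over {1, ..., n}."""
         m = num_samples or int(10 / eps**2)          # enough for additive O(eps) accuracy
         first_half = sum(1 for _ in range(m) if sample() <= n // 2)
         estimate = first_half / m                    # estimate of D({1, ..., n/2})
         return "far from uniform" if estimate > 0.5 + 0.25 * eps else "close to uniform"

     # Example: a monotone (non-increasing) distribution that is far from uniform.
     n, eps = 1000, 0.1
     weights = [2.0 if i <= n // 2 else 0.5 for i in range(1, n + 1)]
     total = sum(weights)
     dist = [w / total for w in weights]
     draw = lambda: random.choices(range(1, n + 1), weights=dist)[0]
     print(test_monotone_far_from_uniform(draw, n, eps))
     ```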

  8. Specifically, these lower bounds hold as long as \({\varepsilon }={\Omega \left (1/n^{\alpha } \right )}\) for some absolute constant α > 0 (so that the sample complexity of the agnostic learner is indeed negligible in front of \(\sqrt {n}/{\varepsilon }^{2}\)).

  9. Note the quasi-quadratic dependence on ε of the learner, which allows us to get ε into our lower bound for \(n\gg \text {poly}\log (1/{\varepsilon })\).

  10. For any sequence \(x=(x_{1},\dots ,x_{n})\in {\mathbb {R}}^{n}\), \(p > 0 \mapsto \lVert x{\rVert }_{p}\) is non-increasing. In particular, for \(0 < p \leq q <\infty \),

    $$\left( \sum\limits_{i} \left\lvert x_{i} \right\rvert^{q}\right)^{1/q} = \lVert x{\rVert}_{q} \leq \lVert x{\rVert}_{p} = \left( \sum\limits_{i} \left\lvert x_{i} \right\rvert^{p}\right)^{1/p}\;. $$

    To see why, one can easily prove that if \(\lVert x{\rVert }_{p} = 1\), then \(\lVert x{\rVert }_{q}^{q} \leq 1\) (bounding each term \(\left \lvert {x_{i}} \right \rvert ^{q} \leq \left \lvert {x_{i}} \right \rvert ^{p}\)), and therefore \(\lVert {x}{\rVert }_{q} \leq 1 = \lVert {x}{\rVert }_{p}\). Next, for the general case, apply this to \(y = x/\lVert {x}{\rVert }_{p}\), which has unit p norm, and conclude by homogeneity of the norm.
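     A quick numerical sanity check (ours, not part of the paper) of this monotonicity of the map \(p \mapsto \lVert x \rVert_p\):

     ```python
     # Numerically verify that for 0 < p <= q, ||x||_q <= ||x||_p.
     import numpy as np

     rng = np.random.default_rng(0)
     x = rng.normal(size=20)
     norms = {p: np.sum(np.abs(x) ** p) ** (1 / p) for p in (0.5, 1, 2, 3, 10)}
     print(norms)
     assert all(norms[p] >= norms[q] for p, q in [(0.5, 1), (1, 2), (2, 3), (3, 10)])
     ```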

  11. Namely, for \({\varepsilon } \in (0,{\varepsilon }_{0})\), define the mixture \(D_{{\varepsilon }} \overset {\text {def}}{=} \frac {{\varepsilon }}{{\varepsilon }_{0}}{D}+(1-\frac {{\varepsilon }}{{\varepsilon }_{0}}){\operatorname {Bin}\!\left (n, 1/2 \right )}\). Being able to distinguish \({\lVert {{D}_{{\varepsilon }}-{\operatorname {Bin}\!\left (n, 1/2 \right )}}{\rVert }}_1 \leq {\varepsilon }\) from \({\lVert {{D}_{{\varepsilon }}-{\operatorname {Bin}\!\left (n, 1/2 \right )}}{\rVert }}_1 \geq 100{\varepsilon }\) in q samples then allows one to distinguish \({\lVert {{D}-{\operatorname {Bin}\!\left (n, 1/2 \right )}}{\rVert }}_1 \leq {\varepsilon }_{0}\) from \({\lVert {{D}-{\operatorname {Bin}\!\left (n, 1/2 \right )}}{\rVert }}_1 \geq 100{\varepsilon }_{0}\) in O(εq) samples.
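     The key point is that a sample from the mixture only “costs” a sample from D with probability ε/ε₀; a minimal sketch (ours, with hypothetical helper names) of this simulation:

     ```python
     # A sample from D_eps = (eps/eps0) * D + (1 - eps/eps0) * Bin(n, 1/2) can be
     # simulated using a sample from D only with probability eps / eps0, and a
     # Bin(n, 1/2) sample (generated locally, for free) otherwise; so q samples
     # from D_eps cost O(q * eps / eps0) = O(eps * q) samples from D when eps0 is
     # an absolute constant.
     import random

     def sample_D_eps(sample_D, n, eps, eps0):
         if random.random() < eps / eps0:
             return sample_D()                                      # one "real" sample from D
         return sum(random.random() < 0.5 for _ in range(n))        # a Bin(n, 1/2) draw
     ```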

References

  1. Alon, N., Andoni, A., Kaufman, T., Matulef, K., Rubinfeld, R., Xie, N.: Testing k-wise and almost k-wise independence. In: Proceedings of the 39th ACM Symposium on Theory of Computing, STOC 2007, San Diego, California, USA, June 11–13, 2007, pp. 496–505. New York (2007)

  2. Acharya, J., Daskalakis, C.: Testing Poisson binomial distributions. In: Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4–6, 2015, pp. 1829–1840 (2015)

  3. Acharya, J., Daskalakis, C., Kamath, G.C.: Optimal testing for properties of distributions. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28, pp. 3577–3598. Curran Associates, Inc. (2015)

  4. Acharya, J., Diakonikolas, I., Li, J.Z., Schmidt, L.: Sample-optimal density estimation in nearly-linear time. CoRR, arXiv:1506.00671 (2015)

  5. Arora, S., Khot, S.: Fitting algebraic curves to noisy data. J. Comput. Syst. Sci. 67(2), 325–340 (2003). Special Issue on STOC 2002


  6. An, M.Y.: Log-concave probability distributions: theory and statistical testing. Technical report, Centre for Labour Market and Social Research, Denmark (1996)

  7. Bagnoli, M., Bergstrom, T.: Log-concave probability and its applications. Econ. Theory 26(2), 445–469 (2005)


  8. Barlow, R.E., Bartholomew, D.J., Bremner, J.M., Brunk, H.D.: Statistical inference under order restrictions: the theory and application of isotonic regression. Wiley Series in Probability and Mathematical Statistics. Wiley, London, New York (1972)

  9. Batu, T., Dasgupta, S., Kumar, R., Rubinfeld, R.: The complexity of approximating the entropy. SIAM J. Comput. 35(1), 132–150 (2005)


  10. Batu, T., Fischer, E., Fortnow, L., Kumar, R., Rubinfeld, R., White, P.: Testing random variables for independence and identity. In: 42nd Annual IEEE Symposium on Foundations of Computer Science, FOCS 2001, Las Vegas, Nevada, USA, October 14–17 2001, pp. 442–451 (2001)

  11. Batu, T., Fortnow, L., Rubinfeld, R., Smith, W.D., White, P.: Testing that distributions are close. In: 41st Annual IEEE Symposium on Foundations of Computer Science, FOCS 2000, Redondo Beach, California, USA, November 12–14 2000, pp. 259–269 (2000)

  12. Batu, T., Kumar, R., Rubinfeld, R.: Sublinear algorithms for testing monotone and unimodal distributions. In: Proceedings of the 36th ACM Symposium on Theory of Computing, STOC 2004, Chicago, IL, June 13–16, 2004, pp. 381–390. ACM, New York (2004)

  13. Canonne, C.L.: A survey on distribution testing: your data is big. But is it blue? Electronic Colloquium on Computational Complexity (ECCC) 22, 63 (2015)


  14. Canonne, C.L.: Are few bins enough: testing histogram distributions. In: Proceedings of PODS. Association for Computing Machinery (ACM) (2016)

  15. Chan, S., Diakonikolas, I., Servedio, R.A., Sun, X.: Learning mixtures of structured distributions over discrete domains. In: Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013, New Orleans, Louisiana, USA, January 6–8, 2013, pp. 1380–1394 (2013)

  16. Chan, S., Diakonikolas, I., Servedio, R.A., Sun, X.: Efficient density estimation via piecewise polynomial approximation. In: Proceedings of the 46th ACM Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31–June 03, 2014, pp. 604–613. ACM (2014)

  17. Chan, S., Diakonikolas, I., Servedio, R.A., Sun, X.: Near-optimal density estimation in near-linear time using variable-width histograms. In: Annual Conference on Neural Information Processing Systems (NIPS), pp. 1844–1852 (2014)

  18. Chan, S., Diakonikolas, I., Valiant, G., Valiant, P.: Optimal algorithms for testing closeness of discrete distributions. In: Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5–7, 2014, pp. 1193–1203 (2014)

  19. Chakraborty, S., Fischer, E., Goldhirsh, Y., Matsliah, A.: On the power of conditional samples in distribution testing. In: Proceedings of ITCS, pp. 561–580, New York, NY, USA. ACM (2013)

  20. Canonne, C.L., Ron, D., Servedio, R.A.: Testing probability distributions using conditional samples. SIAM J. Comput. (SICOMP) 44(3), 540–616 (2015)


  21. Daskalakis, C., Diakonikolas, I., O’Donnell, R., Servedio, R.A., Tan, L.-Y.: Learning sums of independent integer random variables. In: 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, Berkeley, CA, USA, October 26–29, 2013, pp. 217–226. IEEE Computer Society (2013)

  22. Daskalakis, C., Diakonikolas, I., Servedio, R.A.: Learning k-modal distributions via testing. In: Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, January 17–19, 2012, pp. 1371–1385. Society for Industrial and Applied Mathematics (SIAM) (2012)

  23. Daskalakis, C., Diakonikolas, I., Servedio, R.A.: Learning Poisson binomial distributions. In: Proceedings of the 44th ACM Symposium on Theory of Computing, STOC 2012 Conference, New York, NY, USA, May 19–22, 2012, STOC ’12, New York, NY, pp. 709–728. ACM (2012)

  24. Daskalakis, C., Diakonikolas, I., Servedio, R.A., Valiant, G., Valiant, P.: Testing k-modal distributions: optimal algorithms via reductions. In: Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013, New Orleans, Louisiana, USA, January 6–8, 2013, pp. 1833–1852. Society for Industrial and Applied Mathematics (SIAM) (2013)

  25. Diakonikolas, I.: Learning structured distributions. In: Handbook of Big Data. CRC Press (2016)

  26. Diakonikolas, I., Kane, D.M.: A new approach for testing properties of discrete distributions. In: 57th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2016. IEEE Computer Society (2016)

  27. Diakonikolas, I., Kane, D.M., Nikishkin, V.: Optimal algorithms and lower bounds for testing closeness of structured distributions. In: 56th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2015 (2015)

  28. Diakonikolas, I., Kane, D.M., Nikishkin, V.: Testing identity of structured distributions. In: Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4–6, 2015 (2015)

  29. Diakonikolas, I., Kane, D.M., Stewart, A.: Nearly optimal learning and sparse covers for sums of independent integer random variables. CoRR, arXiv:1505.00662 (2015)

  30. Diakonikolas, I., Kane, D.M., Stewart, A.: Efficient robust proper learning of log-concave distributions. CoRR, arXiv:1606.03077 (2016)

  31. Diakonikolas, I., Kane, D.M., Stewart, A.: Optimal learning via the Fourier transform for sums of independent integer random variables. In: COLT, volume 49 of JMLR Workshop and Conference Proceedings, pp. 831–849. JMLR.org (2016). Full version in [29]

  32. Dvoretzky, A., Kiefer, J., Wolfowitz, J.: Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Stat. 27(3), 642–669 (1956)


  33. Fischer, E., Lachish, O., Vasudev, Y.: Improving and extending the testing of distributions for shape-restricted properties. arXiv:1609.06736 (2016)

  34. Guha, S., McGregor, A., Venkatasubramanian, S.: Streaming and sublinear approximation of entropy and information distances. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, Miami, Florida, USA, January 22–26, 2006, pp. 733–742. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2006)

  35. Goldreich, O., Ron, D.: On testing expansion in bounded-degree graphs. Technical Report TR00-020. In: Electronic Colloquium on Computational Complexity (ECCC) (2000)

  36. Hougaard, P.: Survival models for heterogeneous populations derived from stable distributions. Biometrika 73(2), 387–396 (1986)


  37. Indyk, P., Levi, R., Rubinfeld, R.: Approximating and testing k-histogram distributions in sub-linear time. In: Proceedings of PODS, pp. 15–22 (2012)

  38. Keilson, J., Gerber, H.: Some results for discrete unimodality. J. Am. Stat. Assoc. 66(334), 386–389 (1971)


  39. Mandelbrot, B.: New methods in statistical economics. J. Polit. Econ. 71(5), 421–440 (1963)


  40. Massart, P.: The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality. Ann. Probab. 18(3), 1269–1283 (1990)


  41. Massart, P., Picard, J.: Concentration inequalities and model selection. École d'Été de Probabilités de Saint-Flour XXXIII – 2003. Lecture Notes in Mathematics, vol. 1896. Springer (2007)

  42. Paninski, L.: A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Trans. Inf. Theory 54(10), 4750–4755 (2008)


  43. Ron, D.: Property testing: a learning theory perspective. Found. Trends Mach. Learn. 1(3), 307–402 (2008)


  44. Ron, D.: Algorithmic and analysis techniques in property testing. Found. Trends Theor. Comput. Sci. 5, 73–205 (2010)


  45. Rubinfeld, R.: Taming big probability distributions. XRDS 19(1), 24–28 (2012)


  46. Sengupta, D., Nanda, A.K.: Log-concave and concave distributions in reliability. Nav. Res. Logist. (NRL) 46(4), 419–433 (1999)


  47. Silvapulle, M.J., Sen, P.K.: Constrained Statistical Inference. Wiley, New York (2001)


  48. Tsallis, C., Levy, S.V.F, Souza, A.M.C., Maynard, R.: Statistical-mechanical foundation of the ubiquity of Lévy distributions in nature. Phys. Rev. Lett. 75, 3589–3593 (1995)


  49. Valiant, P.: Testing symmetric properties of distributions. SIAM J. Comput. 40(6), 1927–1968 (2011)


  50. Valiant, G., Valiant, P.: A CLT and tight lower bounds for estimating entropy. Electron. Colloq. Comput. Complex. (ECCC) 17, 179 (2010)


  51. Valiant, G., Valiant, P.: Estimating the unseen: an \(n/\log n\)-sample estimator for entropy and support size, shown optimal via new CLTs. In: Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6–8 June 2011, pp. 685–694 (2011)

  52. Valiant, G., Valiant, P.: The power of linear estimators. In: 52nd Annual IEEE Symposium on Foundations of Computer Science, FOCS 2011, Palm Springs, CA, USA, October 22–25, 2011, pp. 403–412 (2011)

  53. Valiant, G., Valiant, P.: An automatic inequality prover and instance optimal identity testing. In: 55th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18–21, 2014 (2014)

  54. Walther, G.: Inference and modeling with log-concave distributions. Stat. Sci. 24(3), 319–327 (2009)



Acknowledgements

Clément L. Canonne’s research was supported by NSF CCF-1115703 and NSF CCF-1319788. Ilias Diakonikolas was supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship. This work was performed in part while visiting CSAIL, MIT. Ronitt Rubinfeld was supported by NSF grants CCF-1420692 and CCF-1650733, and Israel Science Foundation (ISF) grant 1536/14.

Themis Gouleakis was supported by NSF grants CCF-1420692 and CCF-1650733.

Corresponding author

Correspondence to Clément L. Canonne.

Additional information

This article is part of the Topical Collection on Theoretical Aspects of Computer Science

Appendices

Appendix A: Proof of Lemma 2.10

We now give the proof of Lemma 2.10, restated below:

Lemma 2.10 (Adapted from [28, Theorem 11])

There exists an algorithm Check-Small-\(\ell_{2}\) which, given parameters ε, δ ∈ (0, 1) and \(c\cdot {\sqrt {\left \lvert {I} \right \rvert }}/{\varepsilon ^{2}} \log (1/\delta )\) independent samples from a distribution D over I (for some absolute constant c > 0), outputs either yes or no, and satisfies the following.

  • If \({\lVert {{D}-{\mathcal {U}}_{I}}{\rVert }}_2 > {{\varepsilon }}/{\sqrt {\left \lvert I \right \rvert }}\), then the algorithm outputs no with probability at least 1 − δ;

  • If \({\lVert {{D}-{\mathcal {U}}_{I}}{\rVert }}_2 \leq {{\varepsilon }}/{2\sqrt {\left \lvert I \right \rvert }}\), then the algorithm outputs yes with probability at least 1 − δ.

Proof

We first describe an algorithm that distinguishes between \({\lVert {{D}-{\mathcal {U}}}{\rVert }}_2^{2} \geq {\varepsilon }^{2}/{n}\) and \({\lVert {{{D}-{\mathcal {U}}}}{\rVert }}_2^{2} < \varepsilon ^{2}/(4n)\) with probability at least 2/3, using \(C\cdot \frac {\sqrt {n}}{\varepsilon ^{2}}\) samples. Boosting the success probability to 1 − δ at the price of a multiplicative \(\log \frac {1}{\delta }\) factor can then be achieved by standard techniques.

As in the proof of [28, Theorem 11] (whose algorithm we use, but with a threshold \(\tau \overset {\text {def}}{=} \frac {3}{4}\frac {m^{2}\varepsilon ^{2}}{n}\) instead of \(\frac {4m}{\sqrt {n}}\)), define the quantities

$$Z_{k} \overset{\text{def}}{=} \left( X_{k}-\frac{m}{n}\right)^{2} - X_{k},\qquad k\in[n] $$

and \(Z\overset {\text {def}}{=}{\sum }_{k=1}^{n} Z_{k}\), where the \(X_{k}\)’s (and thus the \(Z_{k}\)’s) are independent by Poissonization, and \(X_{k}\sim {\text {Poisson}\!\left ({m {D}(k)} \right ) }\). It is not hard to see that \(\mathbb {E} Z_{k} = m^{2}{{\Delta }_{k}^{2}}\), where \({\Delta }_{k}\overset {\text {def}}{=} (\frac {1}{n}-{D}(k))\), so that \(\mathbb {E} Z = m^{2}{\lVert {{{D}-{\mathcal {U}}}}{\rVert }}_2^{2}\). Furthermore, we also get

$$\text{Var}~Z_{k} = 2m^{2}\left( \frac{1}{n}-{\Delta}_{k}\right)^{2} + 4m^{3}\left( \frac{1}{n}-{\Delta}_{k}\right){{\Delta}_{k}^{2}} $$

sothat

$$ \text{Var}~Z = 2m^{2}\left( \sum\limits_{k=1}^{n} {{\Delta}_{k}^{2}} + \frac{1}{n} -2 m\sum\limits_{k=1}^{n} {{\Delta}_{k}^{3}}\right) $$
(2)

(after expanding and since \({\sum }_{k=1}^{n} {\Delta }_{k} = 0\)).
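For concreteness, the statistic can be computed as follows (a minimal sketch of ours, with an illustrative constant C chosen to satisfy the requirements stated in the proof, not a tuned implementation):

```python
# Poissonized chi-square-type statistic from the proof: draw Poisson(m)
# samples, form the counts X_k, compute Z = sum_k ((X_k - m/n)^2 - X_k),
# and compare Z to the threshold tau = (3/4) m^2 eps^2 / n.
import numpy as np

def check_small_l2(sample, n, eps, C=600, rng=np.random.default_rng()):
    """sample(size) returns that many i.i.d. draws from D over {0, ..., n-1}."""
    m = int(C * np.sqrt(n) / eps**2)
    m_prime = rng.poisson(m)                        # Poissonization
    counts = np.bincount(sample(m_prime), minlength=n).astype(float)
    Z = np.sum((counts - m / n) ** 2 - counts)
    tau = 0.75 * m**2 * eps**2 / n
    return "yes" if Z < tau else "no"               # yes: ||D - U_I||_2 looks small
```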

Soundness

Almost straight from [28], but the threshold has changed. Assume \({\Delta }^{2}\overset {\text {def}}{=}{\lVert {{D}-{\mathcal {U}}}{\rVert }}_2^{2} \geq {\varepsilon }^{2}/n\); we will show that \(\Pr \!\left [\, Z < \tau \, \right ] \leq 1/3\). By Chebyshev’s inequality, it is sufficient to show that \(\tau \leq \mathbb {E} Z - \sqrt {3}\sqrt {\text {Var}~Z}\), as

$$\Pr\!\left[\, \mathbb{E} Z - Z > \sqrt{3}\sqrt{\text{Var}~Z} \, \right] \leq 1/3\;. $$

As \(\tau < \frac {3}{4}\mathbb {E} Z\), arguing that \(\sqrt {3}\sqrt {\text {Var}~Z} \leq \frac {1}{4}\mathbb {E} Z\) is enough, i.e. that \(48~\text {Var}~Z \leq (\mathbb {E} Z)^{2}\). From (2), this is equivalent to showing

$${\Delta}^{2} + \frac{1}{n} -2 m\sum\limits_{k=1}^{n} {{\Delta}_{k}^{3}} \leq \frac{m^{2}{\Delta}^{4}}{96}\;. $$

We bound the LHS term by term.

  • As \({\Delta }^{2} \geq \frac {{\varepsilon }^{2}}{n}\), we get \(m^{2}{\Delta }^{2} \geq \frac {C^{2}}{{\varepsilon }^{2}}\), and thus \(\frac {m^{2}{\Delta }^{4}}{288} \geq \frac {C^{2}}{288\varepsilon ^{2}}{\Delta }^{2} \geq {\Delta }^{2}\) (as C ≥ 17 and ε ≤ 1).

  • Similarly, \(\frac {m^{2}{\Delta }^{4}}{288} \geq \frac {C^{2}}{288{\varepsilon }^{2}}\cdot \frac {{\varepsilon }^{2}}{n} \geq \frac {1}{n}\).

  • Finally, recalling that (see Footnote 10)

    $$\sum\limits_{k=1}^{n} \left\lvert {\Delta}_{k} \right\rvert^{3} \leq \left( \sum\limits_{k=1}^{n} \left\lvert {\Delta}_{k} \right\rvert^{2} \right)^{3/2} = {\Delta}^{3} $$

    we get that \(\left \lvert 2m{\sum }_{k=1}^{n} \left \lvert {{\Delta }_{k}} \right \rvert ^{3} \right \rvert \leq 2m {\Delta }^{3} = \frac {m^{2} {\Delta }^{4}}{288} \cdot \frac {2\cdot 288}{m{\Delta }} \leq \frac {m^{2} {\Delta }^{4}}{288}\), using the fact that \(\frac {m{\Delta }}{2\cdot 288} \geq \frac {C}{576{\varepsilon }} \geq 1\) (by choice of C ≥ 576).

Overall, the LHS is at most \(3\cdot \frac {m^{2} {\Delta }^{4}}{288} = \frac {m^{2} {\Delta }^{4}}{96}\), as claimed.

Completeness

Assume \({\Delta }^{2}={\lVert {{D}-{\mathcal {U}}}{\rVert }}_2^{2} < {\varepsilon }^{2}/(4n)\). We need to show that \(\Pr \!\left [\, { Z \geq \tau }\, \right ] \leq 1/3\). Chebyshev’s inequality implies

$$\Pr\!\left[\, Z - \mathbb{E} Z > \sqrt{3}\sqrt{\text{Var}~Z} \, \right] \leq 1/3 $$

and therefore it is sufficient to show that

$$\tau \geq \mathbb{E} Z + \sqrt{3}\sqrt{\text{Var}~Z} $$

Recalling the expressions of \(\mathbb {E} Z\) and Var Z from (2), this is tantamount to showing

$$\frac{3}{4}\frac{m^{2}{\varepsilon}^{2}}{n} \geq m^{2}{\Delta}^{2} + \sqrt{6}m\sqrt{{\Delta}^{2} + \frac{1}{n} -2m \sum\limits_{k=1}^{n} {{\Delta}_{k}^{3}}} $$

or equivalently

$$\frac{3}{4} \frac{m}{\sqrt{n}}{\varepsilon}^{2} \geq m \sqrt{n} {\Delta}^{2} + \sqrt{6} \sqrt{1 + n{\Delta}^{2} -2nm \sum\limits_{k=1}^{n} {{\Delta}_{k}^{3}}}\;. $$

Since \(\sqrt {1 + n{\Delta }^{2} -2 n m {\sum }_{k=1}^{n} {{\Delta }_{k}^{3}}} \leq \sqrt {1 + n{\Delta }^{2}} \leq \sqrt {1 + {\varepsilon }^{2}/4} \leq \sqrt {5/4}\), we get that the second term is at most \(\sqrt {30/4} < 3\). All that remains is to show that \(\frac{3}{4} \frac{m}{\sqrt{n}}{\varepsilon}^{2} - m\sqrt {n}{\Delta }^{2} \geq 3\). But as Δ2 < ε 2/(4n), we have \(m\sqrt {n}{\Delta }^{2} \leq m\frac {\varepsilon ^{2}}{4\sqrt {n}}\), so the left-hand side is at least \(\frac{m{\varepsilon}^{2}}{2\sqrt{n}}\); and our choice of \(m \geq C\cdot \frac {\sqrt {n}}{\varepsilon ^{2}}\) for some absolute constant C ≥ 6 ensures this is at least 3. □

Appendix B: Proof of Theorem 4.5

In this section, we prove our structural result for MHR distributions, Theorem 4.5:

Theorem 4.5 (Monotone Hazard Rate)

For all γ, ζ > 0, the class \({\mathcal {MHR}}\) of MHR distributions on [n] is (γ, ζ, L)-decomposable for \(L \overset {\text {def}}{=} {O\left ({\frac {\log \frac {n}{\zeta }}{\gamma }} \right )}\).

Proof

We reproduce and adapt the argument of [15, Section 5.1] to meet our definition of decomposability (which, albeit related, is incomparable to theirs). First, we modify the algorithm at the core of their constructive proof, in Algorithm 4: note that the only two changes are in Steps 2 and 3, where we use parameters respectively \(\frac {\zeta }{n}\) and \(\frac {\zeta }{n^{2}}\).

[Algorithm 4 and the Right-Interval subroutine it uses appear as figures in the original article.]

Following the structure of their proof, we write \(\mathcal {Q}=\{I_{1},\dots ,I_{\left \lvert \mathcal {Q} \right \rvert }\}\) with \(I_{i} = [a_{i},b_{i}]\), and define \(\mathcal {Q}^{\prime }= \left \{ { I_{i}\in \mathcal {Q} } \;\colon \; { {D}(a_{i}) > {D}(a_{i+1})} \right \} \) and \(\mathcal {Q}^{\prime \prime }= \left \{ { I_{i}\in \mathcal {Q} } \;\colon \; { {D}(a_{i}) \leq {D}(a_{i+1})} \right \} \).

We immediately obtain the analogues of their Lemmas 5.2 and 5.3:

Lemma B.1

We have \({\prod }_{I_{i}\in \mathcal {Q}^{\prime }} \frac {{D}(a_{i})}{{D}(a_{i+1})} \leq \frac {n}{\zeta }\) .

Lemma B.2

Step 4 of Algorithm 4 adds at most \({O\left (\frac {1}{\gamma }\log \frac {n}{\zeta } \right )}\) intervals to \(\mathcal {Q}\) .

Proof Sketch

This derives from observing that now \(D(I\cup I^{\prime }) \geq \zeta /n\), which as in [15, Lemma 5.3] in turn implies

$$1 \geq \frac{\zeta}{n}(1+\gamma)^{\left\lvert \mathcal{Q}^{\prime} \right\rvert-1} $$

so that \(\left \lvert \mathcal {Q}^{\prime } \right \rvert = {O\left (\frac {1}{\gamma }\log \frac {n}{\zeta } \right )}\).

Again following their argument, we also get

$$\frac{{D}(a_{\left\lvert \mathcal{Q} \right\rvert+1})}{{D}(a_{1})} = \prod\limits_{I_{i}\in\mathcal{Q}^{\prime\prime}} \frac{{D}(a_{i+1})}{{D}(a_{i})}\cdot \prod\limits_{I_{i}\in\mathcal{Q}^{\prime}} \frac{{D}(a_{i+1})}{{D}(a_{i})} $$

by combining Lemma B.1 with the fact that \(D(a_{\left \lvert \mathcal {Q} \right \rvert +1}) \leq 1\) and that by construction \(D(a_{i}) \geq \zeta /n^{2}\), we get

$$\prod\limits_{I_{i}\in\mathcal{Q}^{\prime\prime}} \frac{{D}(a_{i+1})}{{D}(a_{i})} \leq \frac{n}{\zeta} \cdot \frac{n^{2}}{\zeta} = \frac{n^{3}}{\zeta^{2}}\ . $$

But since each term in the product is at least (1 + γ) (by construction of \(\mathcal {Q}\) and the definition of \(\mathcal {Q}^{\prime \prime }\)), this leads to

$$(1+\gamma)^{\left\lvert \mathcal{Q}^{\prime\prime} \right\rvert} \leq \frac{n^{3}}{\zeta^{2}} $$

and thus \(\left \lvert \mathcal {Q}^{\prime \prime } \right \rvert = {O\left (\frac {1}{\gamma }\log \frac {n}{\zeta } \right )}\) as well. □

It remains to show that \(\mathcal {Q}\cup \{I,I^{\prime },I^{\prime \prime }\}\) is indeed a good decomposition of [n] for D, as per Definition 3.1. Since by construction every interval in \(\mathcal {Q}\) satisfies item (ii), we only are left with the case of I, \(I^{\prime }\) and \(I^{\prime \prime }\). For the first two, as they were returned by Right-Interval either (a) they are singletons, in which case item (ii) trivially holds; or (b) they have at least two elements, in which case they have probability mass at most \(\frac {\zeta }{n}\) (by the choice of parameters for Right-Interval) and thus item (i) is satisfied. Finally, it is immediate to see that by construction \(D(I^{\prime \prime }) \leq n\cdot \zeta /n^{2} = \zeta /n\), and item (i) holds in this case as well. □

Appendix C: Proofs from Section 4

This section contains the proofs omitted from Section 4, namely the distance estimation procedures for t-piecewise degree-d (Theorem 4.13), monotone hazard rate (Lemma 4.14), and log-concave distributions (Lemma 4.15).

C.1 Proof of Theorem 4.13

In this section, we prove the following:

Theorem C.1 (Theorem 4.13, restated)

Let p be an \(\ell\)-histogram over [−1, 1). There is an algorithm ProjectSinglePoly(d, ε) which runs in time poly(\(\ell\), d + 1, 1/ε), and outputs a degree-d polynomial q which defines a pdf over [−1, 1) such that \({\lVert {p-q}{\rVert }}_1 \leq 3 \ell _{1}(p,{\mathcal {P}_{d}}) + O({\varepsilon })\).

As mentioned in Section 4, the proof of this statement is a rather straightforward adaptation of the proof of [16, Theorem 9], with two differences: first, in our setting there is no uncertainty or probabilistic argument due to sampling, as we are provided with an explicit description of the histogram p. Second, Chan et al. require some “well-behavedness” assumption on the distribution p (for technical reasons essentially due to the sampling access), that we remove here. Besides these two points, the proof is almost identical to theirs, and we only reproduce (our modification of) it here for the sake of completeness. (Any error introduced in the process, however, is solely our responsibility.)

Proof

Some preliminary definitions will be helpful:

Definition C.2(Uniform partition)

Let p be a subdistribution on an interval \(I \subseteq [-1,1)\). A partition \(\mathcal {I} = \{I_{1}, \dots , I_{\ell }\}\) of I is (p, η)-uniform if \(p(I_{j}) \leq \eta\) for all \(1 \leq j \leq \ell\).

We will also use the following notation: For this subsection, let I = [−1, 1) (I will denote a subinterval of [−1, 1) when the results are applied in the next subsection). We write \(\|f\|^{(I)}_{1}\) to denote \({\int }_{I} |f(x)| dx\), and we write \(\operatorname {d_{\text {TV}}}^{(I)}(p,q)\) to denote \({\lVert {p-q}{\rVert }}_1^{(I)}/2\). We write \({\textsc {opt}}^{(I)}_{1,d}\) to denote the infimum of the distance \({\lVert {p-g}{\rVert }}_1^{(I)}\) between p and any degree-d subdistribution g on I that satisfies g(I) = p(I).

The key step of ProjectSinglePoly is Step 2 where it calls the FindSinglePoly procedure. In this procedure \(T_{i}(x)\) denotes the degree-i Chebychev polynomial of the first kind. The function F computed by FindSinglePoly should be thought of as the CDF of a “quasi-distribution” f; we say that \(f=F^{\prime }\) is a “quasi-distribution” and not a bona fide probability distribution because it is not guaranteed to be non-negative everywhere on [−1, 1). Step 2 of FindSinglePoly processes f slightly to obtain a polynomial q which is an actual distribution over [−1, 1).

[The procedures ProjectSinglePoly and FindSinglePoly appear as figures in the original article.]

The rest of this subsection gives the proof of Theorem 4.13. The claimed running time bound is obvious (the computation is dominated by solving the poly(d, 1/ε)-size LP in ProjectSinglePoly, with an additional term linear in \(\ell\) when partitioning [−1, 1) in the initial step), so it suffices to prove correctness.

Before launching into the proof we give some intuition for the linear program. Intuitively F(x) represents the cdf of a degree-d polynomial distribution f where \(f=F^{\prime }.\) Constraint (a) captures the endpoint constraints that any cdf must obey if it has the same total weight as p. Intuitively, constraint (b) ensures that for each interval \([i_{j}, i_{k})\), the value \(F(i_{k}) - F(i_{j})\) (which we may alternately write as \(f([i_{j},i_{k}))\)) is close to the weight \(p([i_{j},i_{k}))\) that the distribution puts on the interval. Recall that by assumption p is \({\textsc{opt}}_{1,d}\)-close to some degree-d polynomial r. Intuitively the variable \(w_{\ell}\) represents \({\int }_{[i_{\ell }, i_{\ell +1})} (r-p)\) (note that these values sum to zero by constraint (c)(4)), and \(y_{\ell}\) represents the absolute value of \(w_{\ell}\) (see constraint (c)(5)). The value τ, which by constraint (c)(6) is at least the sum of the \(y_{\ell}\)’s, represents a lower bound on \({\textsc{opt}}_{1,d}\). The constraints in (d) and (e) reflect the fact that as a cdf, F should be bounded between 0 and 1 (more on this below), and the (f) constraints reflect the fact that the pdf \(f=F^{\prime }\) should be everywhere nonnegative (again more on this below).
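To illustrate the objects the LP manipulates (our own small example, not the FindSinglePoly code itself): the candidate CDF F is represented by its coefficients in the Chebyshev basis, and the pdf f = F′ is obtained by differentiating in that basis.

```python
# A small illustration: F(x) = sum_i c_i T_i(x) on [-1, 1), and f = F'.
# Here F is the CDF of the uniform distribution on [-1, 1).
import numpy as np
from numpy.polynomial import chebyshev as C

c = np.array([0.5, 0.5])                 # F(x) = 0.5*T_0(x) + 0.5*T_1(x) = (1 + x) / 2
f_coeffs = C.chebder(c)                  # Chebyshev coefficients of f = F'

xs = np.linspace(-1, 1, 5)
print(C.chebval(xs, c))                  # F: 0 at -1, 1 at 1, as a CDF should be
print(C.chebval(xs, f_coeffs))           # f is the constant 1/2, a valid pdf on [-1, 1)
assert np.all(np.abs(c) <= np.sqrt(2))   # cf. constraint (d): coefficients at most sqrt(2)
```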

We begin by observing that ProjectSinglePoly calls FindSinglePoly with input parameters that satisfy FindSinglePoly’s input requirements:

  (I) the non-singleton intervals \(I_{0},\dots ,I_{z-1}\) are (p, η)-uniform; and

  (II) the singleton intervals each have weight at least \(\frac {\eta }{10}\).

We then proceed to show that, from there, FindSinglePoly’s LP is feasible and has a high-quality optimal solution.

Lemma C.3

Suppose p is an \(\ell\)-histogram over [−1, 1), so that conditions (I) and (II) above hold; then the LP defined in Step 1 of FindSinglePoly is feasible, and the optimal solution τ is at most \({\textsc{opt}}_{1,d}\).

Proof

As above, let r be a degree-d polynomial pdf such that \({\textsc {opt}}_{1,d}= {\lVert {p-r}{\rVert }}_1\) and r(I) = p(I). We exhibit a feasible solution as follows: take F to be the cdf of r (a polynomial of degree d + 1). Take \(w_{\ell}\) to be \({\int }_{[i_{\ell },i_{\ell +1})} ({r-p})\), and take \(y_{\ell}\) to be \(\left \lvert w_{\ell } \right \rvert \). Finally, take τ to be \({\sum }_{0 \leq \ell < {z}} y_{\ell }.\)

We first argue feasibility of the above solution, starting with the easy constraints: since F is the cdf of a subdistribution over I it is clear that constraints (a) and (e) are satisfied, and since both r and p are pdfs with the same total weight it is clear that constraints (c)(4) and (f) are both satisfied. Constraints (c)(5) and (c)(6) also hold. So it remains to argue constraints (b) and (d).

Note that constraint (b) is equivalent to p + (r − p) = r and r satisfying the \((\mathcal {I}, {\varepsilon }/(d+1), {\varepsilon })\)-inequalities, therefore this constraint is satisfied.

To see that constraint (d) is satisfied we recall some of the analysis of Arora and Khot [5, Section 3]. This analysis shows that since F is a cumulative distribution function (and in particular a function bounded between 0 and 1 on I) each of its Chebychev coefficients is at most \(\sqrt {2}\) in magnitude.

To conclude the proof of the lemma we need to argue that τ ≤opt1,d .Since \(w_{\ell } = {\int }_{[i_{\ell },i_{\ell +1})} ({r-p})\)it is easy to see that \(\tau = {\sum }_{0 \leq \ell < {z}} y_{\ell } = {\sum }_{0 \leq \ell < {z}} |w_{\ell }| \leq {\lVert {p-r}{\rVert }}_1\),and hence indeed τ ≤opt1,d as required. □

Having established that the LP is indeed feasible, henceforth we let τ denote the optimal solution to the LP and F, f, \(w_{\ell}\), \(c_{i}\), \(y_{\ell}\) denote the values in the optimal solution. A simple argument (see e.g. the proof of [5, Theorem 8]) gives that \({\lVert {F}{\rVert }}_{\infty }\leq 2\). Given this bound on \({\lVert {F}{\rVert }}_{\infty }\), the Bernstein–Markov inequality implies that \({\lVert {f}{\rVert }}_{\infty } = {\lVert {F^{\prime }}{\rVert }}_{\infty }\leq O((d+1)^{2})\). Together with (f) this implies that f(z) ≥−ε/2 for all z ∈ [−1, 1). Consequently q(z) ≥ 0 for all z ∈ [−1, 1), and

$${\int}_{-1}^{1} q(x) dx = {\varepsilon} + (1 - {\varepsilon}) {\int}_{-1}^{1} f(x)dx = {\varepsilon} + (1-{\varepsilon})(F(1)-F(-1)) = 1. $$

So q(x) is indeed a degree-d pdf. To prove Theorem 4.13 it remains to show that \({\lVert {p-q}{\rVert }}_1 \leq 3 {\textsc {opt}}_{1,d} + O({\varepsilon }).\)

We sketch the argument that we shall use to bound \({\lVert {p-q}{\rVert }}_1\). A key step in achieving this bound is to bound the \(\lVert \cdot {\rVert }_{\mathcal {A}}\) distance between f and p + w, where \({\mathcal {A}} = {\mathcal A_{d+1}}\) is the class of all unions of d + 1 intervals and w is a function based on the \(w_{\ell}\) values (see (9) below). If we can bound \(\lVert (p+w)- f{\rVert }_{\mathcal {A}} \leq O({\varepsilon })\) then it will not be difficult to show that \(\lVert r - f{\rVert }_{\mathcal {A}} \leq {\textsc {opt}}_{1,d} + O({\varepsilon })\). Since r and f are both degree-d polynomials we have \({\lVert {r - f}{\rVert }}_1 = 2\lVert r - f{\rVert }_{\mathcal {A}} \leq 2 {\textsc {opt}}_{1,d} + O({\varepsilon })\), so the triangle inequality (recalling that \({\lVert {p-r}{\rVert }}_1 = {\textsc {opt}}_{1,d}\)) gives \({\lVert {p-f}{\rVert }}_1 \leq 3 {\textsc {opt}}_{1,d}+O({\varepsilon }).\) From this point a simple argument (Proposition C.5) gives that \({\lVert {p-q}{\rVert }}_1 \leq {\lVert {p-f}{\rVert }}_1 + O({\varepsilon })\), which gives the theorem.

We will use the following lemma that translates \((\mathcal {I}, \eta ,{\varepsilon })\)-inequalities into a bound on \(\mathcal A_{d+1}\) distance.

Lemma C.4

Let \(\mathcal {I} = \{I_{0}=[i_{0}, i_{1}), \dots , I_{z-1}=[i_{z-1}, i_{z})\}\) be a (p, η)-uniform partition of I, possibly augmented with singleton intervals. If \(h\colon I\to {\mathbb {R}}\) and p satisfy the \((\mathcal {I}, \eta , {\varepsilon })\) -inequalities, then

$${\lVert p-h{\rVert}_{\mathcal A_{{d+1}}}^{(I)} \leq \sqrt{{\varepsilon} z {(d+1)}}\cdot \eta + \text{error},} $$

where error = O((d + 1)η).

Proof

To analyze \(\lVert p-h{\rVert }_{\mathcal A_{d+1}}\), consider any union of d + 1 disjoint non-overlapping intervals \(S = J_{1} \cup {\dots } \cup J_{d+1}\). We will bound \(\lVert p - h {\rVert }_{\mathcal A_{d+1}}\) by bounding \(\left \lvert p(S) - h(S) \right \rvert \).

We lengthen intervals in S slightly to obtain \(T = J^{\prime }_{1} \cup {\dots } \cup J^{\prime }_{{d+1}}\) so that each \(J^{\prime }_{j}\) is a union of intervals of the form \([i_{\ell} ,i_{\ell +1})\). Formally, if \(J_{j} = [a, b)\), then \(J^{\prime }_{j} = [a^{\prime },b^{\prime })\), where \(a^{\prime } = \max _{\ell } \left \{\; i_{\ell } \;\colon \; i_{\ell } \leq a \; \right \} \) and \(b^{\prime } = \min _{\ell } \left \{\; i_{\ell } \;\colon \; i_{\ell } \geq b \; \right \} \). We claim that

$$ \left\lvert p(S) - h(S) \right\rvert \leq O({(d+1)}\eta) + \left\lvert p(T) - h(T) \right\rvert . $$
(7)

Indeed, consider any interval of the form \(J = [i_{\ell} ,i_{\ell +1})\) such that \(J \cap S \neq J \cap T\) (in particular, such an interval cannot be one of the singletons). We have

$$ \left\lvert p(J \cap S) - p(J \cap T) \right\rvert \leq p(J) \leq {O(\eta)}, $$
(8)

where the first inequality uses non-negativity of p and the second inequality follows from the bound \(p([i_{\ell} ,i_{\ell +1})) \leq \eta\). The \((\mathcal {I}, \eta , {\varepsilon })\)-inequalities (between h and p) imply that the inequalities in (8) also hold with h in place of p. Now (7) follows by adding (8) across all \(J = [i_{\ell} ,i_{\ell +1})\) such that \(J\cap S\neq J\cap T\) (there are at most 2(d + 1) such intervals J), since each interval \(J_{j}\) in S can change at most two such J’s when lengthened.

Now rewrite T as a disjoint union of \(s \leq d + 1\) intervals \([i_{L_{1}}, i_{R_{1}}) \cup {\dots } \cup [i_{L_{s}}, i_{R_{s}})\). We have

$$\left\lvert p(T) - h(T) \right\rvert \leq \sum\limits_{j=1}^{s} \sqrt{R_{j} - L_{j}} \cdot \sqrt {\varepsilon}\eta $$

by the \((\mathcal {I}, \eta , {\varepsilon })\)-inequalities between p and h. Now observing that \(0 \leq L_{1} \leq R_{1} \leq {\dots } \leq L_{s} \leq R_{s} \leq z = O((d+1)/{\varepsilon})\), we get that the largest possible value of \({\sum }_{j=1}^{s} \sqrt {R_{j} - L_{j}}\) is at most \(\sqrt {sz} \leq {\sqrt {{(d+1)}z}}\), so the RHS of (7) is at most \(O({(d+1)}\eta ) + {\sqrt { {(d+1)}z{\varepsilon }}\eta }\), as desired. □

Recall from above that F, f, \(w_{\ell}\), \(c_{i}\), \(y_{\ell}\), τ denote the values in the optimal solution. We claim that

$$ \lVert (p+ w) - f {\rVert}_{\mathcal{A}} = O({\varepsilon}) , $$
(9)

where w is the subdistribution which is constant on each \([i_{\ell} ,i_{\ell +1})\) and has weight \(w_{\ell}\) there, so in particular \({\lVert {w}{\rVert }}_1 \leq \tau \leq {\textsc {opt}}_{1,d}\). Indeed, this equality follows by applying Lemma C.4 with h = f − w. The lemma requires h and p to satisfy \((\mathcal {I}, \eta , {\varepsilon })\)-inequalities, which follows from constraint (b) (\((\mathcal {I}, \eta , {\varepsilon })\)-inequalities between p + w and f) and observing that (p + w) − f = p − (f − w). We have also used η = Θ(ε/(d + 1)) to bound the error term of the lemma by O(ε).

Next, by the triangle inequality we have (writing \({\mathcal {A}}\) for \({\mathcal {A}}_{d+1}\))

$$\lVert r - f {\rVert}_{\mathcal{A}} \leq \lVert r - (p+w) {\rVert}_{\mathcal{A}} + \lVert (p+w) - f {\rVert}_{\mathcal{A}}. $$

The last term on the RHS has just been shown to be O(ε). The first term is bounded by

$$\| r-(p+w)\|_{\mathcal{A}} \leq \frac{1}{2}{\lVert{ r-(p+w) }{\rVert}}_1 \leq \frac{1}{2}({\lVert{r-p}{\rVert}}_1 + {\lVert{w}{\rVert}}_1) \leq {\textsc{opt}}_{1,d}. $$

Altogether, we get that \(\lVert r - f {\rVert }_{\mathcal {A}} \leq {\textsc {opt}}_{1,d}+ O({\varepsilon })\).

Since r and f are degree-d polynomials, \({\lVert { r - f }{\rVert }}_1 = 2\lVert r - f {\rVert }_{\mathcal {A}} \leq 2{\textsc {opt}}_{1,d}+ O(\varepsilon )\). This implies \({\lVert { p - f }{\rVert }}_1 \leq {\lVert {p-r}{\rVert }}_1 + {\lVert { r - f }{\rVert }}_1 \leq 3{\textsc {opt}}_{1,d} + O({\varepsilon })\). Finally, we turn our quasidistribution f, which has value ≥−ε/2 everywhere, into a distribution q (which is nonnegative) by redistributing the weight. The following simple proposition bounds the error incurred.

Proposition C.5

Let f and p be sub-quasidistributions on I. If \(q = {{\varepsilon } f(I)/\left \lvert I \right \rvert + (1- {\varepsilon })f}\), then \(\lVert q - p{\rVert }_{1} \leq \lVert f - p{\rVert }_{1} + {{\varepsilon }(f(I)+p(I))}\).

Proof

We have

$$q - p = {{\varepsilon}(f(I)/\left\lvert I \right\rvert - p) + (1-{\varepsilon})(f - p)}. $$

Therefore

$${\lVert{ q - p }{\rVert}}_1 \leq { {\varepsilon} \lVert f(I)/|I| - p{\rVert}_{1} + (1-{\varepsilon}) \lVert f - p {\rVert}_{1} \leq {\varepsilon}(f(I)+p(I)) + \lVert f - p {\rVert}_{1} } \;. $$

We now have \({\lVert { p - q }{\rVert }}_1 \leq {\lVert { p-f }{\rVert }}_1 + O({\varepsilon })\) by Proposition C.5, concluding the proof of Theorem 4.13. □
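As a small numerical illustration of this last rescaling step (our own example, not taken from the paper):

```python
# A quasi-pdf f on I = [-1, 1) that dips slightly below zero (but stays above
# -eps/2) is replaced by q = eps * f(I)/|I| + (1 - eps) * f, as in
# Proposition C.5; q is a nonnegative pdf, at an l1 cost of O(eps).
import numpy as np

eps = 0.2
xs, dx = np.linspace(-1, 1, 2000, endpoint=False, retstep=True)
f = 0.5 - 0.55 * np.cos(np.pi * xs)      # integrates to 1, but its minimum is -0.05 >= -eps/2
q = eps * 1.0 / 2.0 + (1 - eps) * f      # here f(I) = 1 and |I| = 2
print(np.sum(f * dx), np.sum(q * dx))    # both total masses are (numerically) 1
print(f.min(), q.min())                  # f dips below 0; q does not
print(np.sum(np.abs(q - f) * dx))        # l1 distance between q and f, of order eps
```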

C.2 Proof of Lemma 4.14

Lemma 4.14 (Monotone Hazard Rate)

There exists a procedure \(\textsc {ProjectionDist}_{{\mathcal {MHR}}}^{\ast }\) that, on input n as well as the full specification of a k-histogram distribution D on [n] and of an \(\ell\)-histogram distribution \(D^{\prime }\) on [n], runs in time poly(n, 1/ε), and satisfies the following.

  • If there is \(P\in {\mathcal {MHR}}\) such that \({\lVert {{D}-P}{\rVert }}_1 \leq {\varepsilon }\) and \({\lVert {D}^{\prime } - P{\rVert }_{\text {Kol}}} \leq {\varepsilon }^{3}\), then the procedure returns yes;

  • If \(\ell _{1}({D},{\mathcal {MHR}}) > 100{\varepsilon }\), then the procedure returns no.

Proof

For convenience, let \(\alpha \overset {\text {def}}{=} {\varepsilon }^{3}\);we also write [i, j]instead of \(\{i,\dots ,j\}\).

First, we note that it is easy to reduce our problem to the case where, in the completeness case, we have \(P\in {\mathcal {MHR}}\) such that \({\lVert {{D}-P}{\rVert }}_1 \leq 2{\varepsilon }\) and \({\lVert {D} - P{\rVert }_{\text {Kol}}} \leq 2\alpha \), while in the soundness case \(\ell _{1}({D},{\mathcal {MHR}}) \geq 99{\varepsilon }\). Indeed, this can be done with a linear program on poly(k, \(\ell\)) variables, asking to find a (k + \(\ell\))-histogram \(D^{\prime \prime }\) on a refinement of D and \(D^{\prime }\) minimizing the \(\ell_1\) distance to D, under the constraint that the Kolmogorov distance to \(D^{\prime }\) be bounded by α. (In the completeness case, clearly a feasible solution exists, as P is one.) We therefore follow with this new formulation: either

  • (a) D is ε-close to a monotone hazard rate distribution P (in \(\ell_1\) distance) and D is α-close to P (in Kolmogorov distance); and

  • (b) D is 32ε-far from monotone hazard rate,

where D is a (k + \(\ell\))-histogram.

We then proceed by observing the following easy fact: suppose P is an MHR distribution on [n], i.e. such that the quantity \(h_{i} \overset {\text {def}}{=} \frac {P(i)}{{\sum }_{j=i}^{n} P(j)}\), i ∈ [n], is non-increasing. Then, we have

$$ P(i) = h_{i} \prod\limits_{j=1}^{i-1} (1-h_{j}), \qquad i\in[n]. $$
(10)

and there is a bijective correspondence between P and (h i ) i∈[n].

We will write a linear program with variables \(y_{1},\dots ,y_{n}\), with the correspondence \(y_{i}\overset {\text {def}}{=}\ln (1-h_{i})\). Note that with this parameterization, we get that if the \((y_{i})_{i\in [n]}\) correspond to an MHR distribution P, then for i ∈ [n]

$$P([i,n]) = \prod\limits_{j=1}^{i-1} e^{y_{j}} = e^{{\sum}_{j=1}^{i-1} y_{j}} $$

and asking that \(\ln (1-{\varepsilon }) \leq {\sum }_{j=1}^{i-1} y_{j} - \ln {D}([i,n]) \leq \ln (1+{\varepsilon })\) amounts to requiring

$$P([i,n]) \in [1\pm{\varepsilon}] {D}([i,n]). $$
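A quick numerical check (our own, not part of the proof) of this parameterization: the hazard rates determine P through (10), and the tail probabilities are exponentials of prefix sums of the \(y_{i}\)’s, so multiplicative constraints on the tails are linear in the \(y_{i}\)’s.

```python
# Verify formula (10) and the y_i = ln(1 - h_i) reparameterization on a toy
# distribution (any distribution with full support works for the bijection).
import numpy as np

P = np.array([0.4, 0.3, 0.2, 0.1])                    # a distribution on [n], n = 4
tails = np.cumsum(P[::-1])[::-1]                      # P([i, n]) for each i
h = P / tails                                         # hazard rates h_i
recovered = h * np.concatenate(([1.0], np.cumprod(1 - h)[:-1]))   # formula (10)
assert np.allclose(recovered, P)

y = np.log(1 - h[:-1])                                # y_i = ln(1 - h_i) for i < n (h_n = 1)
assert np.allclose(np.exp(np.concatenate(([0.0], np.cumsum(y)))), tails)
```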

We focus first on the completeness case, to provide intuition for the linear program. Suppose there exists \(P\in {\mathcal {MHR}}\) such that \({\lVert {{D}-P}{\rVert }}_1 \leq {\varepsilon }\) and \({\lVert {D}^{\prime } - P{\rVert }_{\text {Kol}}} \leq \alpha \). This implies that for all i ∈ [n], \(\left \lvert P([i,n]) - {D}([i,n]) \right \rvert \leq 2\alpha \). Define \(I=\{b+1,\dots ,n\}\) to be the longest interval such that \(D(\{b+1,\dots ,n\})\leq \frac {{\varepsilon }}{2}\). It follows that for every i ∈ [n] ∖ I,

$$ \frac{P([i,n])}{{D}([i,n])} \leq \frac{{D}([i,n])+2\alpha}{{D}([i,n])} \leq 1+\frac{2\alpha}{{\varepsilon}/2} = 1+4{\varepsilon}^{2} \leq 1+{\varepsilon} $$
(11)

and similarly \(\frac {P([i,n])}{{D}([i,n])} \geq \frac {D([i,n])-2\alpha }{D([i,n])} \geq 1-{\varepsilon } \). This means that for the points i in [n] ∖ I, we can write constraints asking for multiplicative closeness (within 1 ± ε) between \(e^{{\sum }_{j=1}^{i-1} y_{j}}\) and D([i, n]), which is very easy to write down as linear constraints on the \(y_{i}\)’s.

The Linear Program

Let T and S be respectively the sets of “light” and “heavy” points, defined as \(T= \left \{\; i\in \{1,\dots ,b\} \;\colon \; {D}(i) \leq {\varepsilon }^{2} \; \right \} \) and \(S= \left \{\; i\in \{1,\dots ,b\} \;\colon \; {D}(i) > {\varepsilon }^{2} \; \right \} \), where b is as above. (In particular, \(\left \lvert S \right \rvert \leq 1/{\varepsilon }^{2}\).)

[The linear program (with constraints (14)–(18) referenced below) appears as a figure in the original article.]

Given a solution to the linear program above, define \(\tilde {P}\) (a non-normalized probability distribution) by setting \(\tilde {P}(i) = (1-e^{y_{i}})e^{{\sum }_{j=1}^{i-1} y_{j}}\) for \(i\in \{1,\dots ,b\}\), and \(\tilde {P}(i) = 0\) for \(i\in I = \{b+1,\dots , n\}\). A MHR distribution is then obtained by normalizing \(\tilde {P}\).

Completeness

Suppose \(P\in {\mathcal {MHR}}\) is as promised. In particular, by the Kolmogorov distance assumption we know that every iT has P(i) ≤ ε 2 + 2α < 2ε 2.

  • For any iT, we have that \(\frac {P(i)}{P[i,n]} \leq \frac {2{\varepsilon }^{2}}{(1-{\varepsilon }){\varepsilon }} \leq 4{\varepsilon }\), and

    $$ \frac{{D}(i)-{\varepsilon}_{i}}{(1+{\varepsilon}){D}[i,n]} \leq \frac{P(i)}{P[i,n]} \leq \underbrace{-\ln(1-\frac{P(i)}{P[i,n]})}_{-y_{i}} \leq (1+4{\varepsilon})\frac{P(i)}{P[i,n]} \leq (1+4{\varepsilon})\frac{{D}(i)+{\varepsilon}_{i}}{P[i,n]} \leq \frac{1+4{\varepsilon}}{1-{\varepsilon}}\frac{{D}(i)+{\varepsilon}_{i}}{{D}[i,n]} $$
    (12)

    where we used (11) for the two outer inequalities; and so (15), (16), and (17) would follow from setting \(\varepsilon _{i} \overset {\text {def}}{=} \left \lvert {P(i)-{D}(i)} \right \rvert \) (along with the guarantees on \(\ell_1\) and Kolmogorov distances between P and D).

  • For iS, Constraint (18) is also met, as \(\frac {P(i)}{P([i,n])} \in \left [\frac {{D}(i)-2\alpha }{P([i,n])},\frac {{D}(i)+2\alpha }{P([i,n])}\right ] \subseteq \left [\frac {{D}(i)-2\alpha }{(1+{\varepsilon }){D}([i,n])},\frac {{D}(i)+2\alpha }{(1-{\varepsilon }){D}([i,n])}\right ]\).

Soundness

Assume a feasible solution to the linear program is found. We argue that this implies D is \({O\left ({\varepsilon } \right )}\)-close to some MHR distribution, namely to the distribution obtained by renormalizing \(\tilde {P}\).

In order to do so, we bound separately the 1 distance between D and \(\tilde {P}\), from I, S, and T. First, \({\sum }_{i\in I} \left \lvert {D}(i) - \tilde {P}(i) \right \rvert = {\sum }_{i\in I} {D}(i) \leq \frac {{\varepsilon }}{2}\) by construction. For iT, we have \(\frac {{D}(i)}{{D}[i,n]} \leq {\varepsilon }\), and

$$\begin{array}{@{}rcl@{}} \tilde{P}(i) = (1-e^{y_{i}}) e^{{\sum}_{j=1}^{i-1} y_{j}} \in \left[1\pm {\varepsilon}\right] (1-e^{y_{i}}) {D}([i,n]). \end{array} $$

Now,

$$\begin{array}{@{}rcl@{}} 1-(1-{\varepsilon})\frac{{D}(i)-{\varepsilon}_{i}}{(1+{\varepsilon}){D}[i,n]} \geq e^{-\frac{{D}(i)-{\varepsilon}_{i}}{(1+{\varepsilon}){D}[i,n]}} \geq e^{y_{i}} \geq e^{-(1+4{\varepsilon})\frac{{D}(i)+{\varepsilon}_{i}}{(1-{\varepsilon}){D}[i,n]}} \geq 1-(1+4{\varepsilon})\frac{{D}(i)+{\varepsilon}_{i}}{(1-{\varepsilon}){D}[i,n]} \end{array} $$

so that

$$(1-{\varepsilon})\frac{(1-{\varepsilon})}{(1+{\varepsilon})}({D}(i)-{\varepsilon}_{i}) \leq \tilde{P}(i) \leq (1+4{\varepsilon})\frac{(1+{\varepsilon})}{(1-{\varepsilon})}({D}(i)+{\varepsilon}_{i}) $$

which implies

$$(1-10{\varepsilon})({D}(i)-{\varepsilon}_{i}) \leq \tilde{P}(i) \leq (1+10{\varepsilon})({D}(i)+{\varepsilon}_{i}) $$

so that \({\sum }_{i\in T} \left \lvert {D}(i) - \tilde {P}(i) \right \rvert \leq 10{\varepsilon } {\sum }_{i\in T} {D}(i) + (1+10{\varepsilon }){\sum }_{i\in T} {\varepsilon }_{i} \leq 10\varepsilon + (1+10\varepsilon )\varepsilon \leq 20\varepsilon \) where the last inequality follows from Constraint (16).

To analyze the contribution from S, we observe that Constraint (18) implies that, for any iS,

$$\frac{{D}(i)-2\alpha}{(1+{\varepsilon}){D}([i,n])} \leq \frac{\tilde{P}(i)}{\tilde{P}([i,n])} \leq \frac{{D}(i)+2\alpha}{(1-{\varepsilon}){D}([i,n])} $$

which combined with Constraint (14) guarantees

$$\frac{{D}(i)-2\alpha}{(1+{\varepsilon})^{2}\tilde{P}([i,n])} \leq \frac{\tilde{P}(i)}{\tilde{P}([i,n])} \leq \frac{{D}(i)+2\alpha}{(1-{\varepsilon})^{2}\tilde{P}([i,n])} $$

which in turn implies that \(\left \lvert \tilde {P}(i) - {D}(i) \right \rvert \leq 3{\varepsilon }\tilde {P}(i) + 2\alpha \). Recalling that \(\left \lvert S \right \rvert \leq \frac {1}{{\varepsilon }^{2}}\) and α = ε 3, this yields \({\sum }_{i\in S} \left \lvert {D}(i) - \tilde {P}(i) \right \rvert \leq 3{\varepsilon } {\sum }_{i\in S} \tilde {P}(i) + 2{\varepsilon } \leq 3{\varepsilon }(1+{\varepsilon }) + 2{\varepsilon } \leq 8{\varepsilon }\). Summing up, we get \( {\sum }_{i=1}^{n} \left \lvert {D}(i) - \tilde {P}(i) \right \rvert \leq 30{\varepsilon } \) which finally implies by the triangle inequality that the \(\ell_1\) distance between D and the normalized version of \(\tilde {P}\) (a valid MHR distribution) is at most 32ε.

Running Time

The running time is immediate, from executing the two linear programs on poly(n, 1/ε) variables and constraints. □

C.3 Proof of Lemma 4.15

Lemma 4.15 (Log-concavity)

There exists a procedure \(\textsc {ProjectionDist}_{{\mathcal {L}}}^{\ast }\) that, on input n as well as the full specifications of a k-histogram distribution D on [n] and an \(\ell\)-histogram distribution \(D^{\prime }\) on [n], runs in time poly(n, k, \(\ell\), 1/ε), and satisfies the following.

  • If there is \(P\in {\mathcal {L}}\) such that \({\lVert {{D}-P}{\rVert }}_1\leq {\varepsilon }\) and \({\lVert {D}^{\prime } - P{\rVert }_{\text {Kol}}}\leq \frac {\varepsilon ^{2}}{\log ^{2}(1/\varepsilon )}\), then the procedure returns yes;

  • If \(\ell _{1}({D},{\mathcal {L}}) \geq 100{\varepsilon }\), then the procedure returns no.

Proof

We set \(\alpha \overset {\text {def}}{=} \frac {{\varepsilon }^{2}}{\log ^{2}(1/{\varepsilon })}\), \(\beta \overset {\text {def}}{=} \frac {{\varepsilon }^{2}}{\log (1/{\varepsilon })}\), and \(\gamma \overset {\text {def}}{=} \frac {\varepsilon ^{2}}{10}\) (so that \(\alpha \ll \beta \ll \gamma \ll {\varepsilon }\)).

Given the explicit description of a distribution D on [n], which is a k-histogram over a partition \(\mathcal {I}=(I_{1},\dots , I_{k})\) of [n] with \(k=\text {poly}(\log n, 1/\varepsilon )\), and the explicit description of a distribution \(D^{\prime }\) on [n], one must efficiently distinguish between:

  • (a) D is ε-close to a log-concave P (in \(\ell_1\) distance) and \(D^{\prime }\) is α-close to P (in Kolmogorov distance); and

  • (b) D is 100ε-farfrom log-concave.

If we are willing to pay an extra factor of \({O\left (n \right )}\), we can assume without loss of generality that we know the mode of the closest log-concave distribution (which is implicitly assumed in the following: the final algorithm will simply try all possible modes).

Outline

First, we argue that we can simplify to the case where D is unimodal. Then, we reduce to the case where D and \(D^{\prime }\) are a single distribution, satisfying both requirements from the completeness case. Both can be done efficiently (Section C.3.1), and make the rest much easier. Then, we perform some ad hoc partitioning of [n], using our knowledge of D, into \(\tilde {O}\left ({1/\varepsilon ^{2}} \right )\) pieces such that each piece is either a “heavy” singleton, or an interval I with weight very close (multiplicatively) to D(I) under the target log-concave distribution, if it exists (Section C.3.2). This in particular simplifies the type of log-concave distribution we are looking for: it is sufficient to look for distributions putting that very specific weight on each piece, up to a (1 + o(1)) factor. Then, in Section C.3.3, we write and solve a linear program to try and find such a “simplified” log-concave distribution, and reject if no feasible solution exists.

Note that the first two sections allow us to argue that instead of additive (in \(\ell_1\)) closeness, we can enforce constraints on multiplicative (within a (1 + ε) factor) closeness between D and the target log-concave distribution. This is what enables a linear program with variables being the logarithms of the probabilities, which plays very nicely with the log-concavity constraints.
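Concretely (our own restatement, not a constraint taken verbatim from the linear program below), with variables \(x_{i} \overset {\text {def}}{=} \log P(i)\) both types of requirements become linear:

$$ P(i)^{2} \geq P(i-1)\,P(i+1) \iff 2x_{i} \geq x_{i-1} + x_{i+1}, \qquad (1-{\varepsilon}){D}(i) \leq P(i) \leq (1+{\varepsilon}){D}(i) \iff \log\!\left((1-{\varepsilon}){D}(i)\right) \leq x_{i} \leq \log\!\left((1+{\varepsilon}){D}(i)\right). $$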

We will require the following result of Chan, Diakonikolas, Servedio, and Sun:

Theorem C.6 ([15, Lemma 4.1])

Let D be a distribution over [n], log-concave and non-decreasing over \(\{1,\dots ,b\} \subseteq [n]\). Let \(a \leq b\) be such that \(\sigma = D(\{1,\dots ,a-1\}) > 0\), and write \(\tau ={D}(\{a,\dots ,b\})\). Then \(\frac {{D}(b)}{{D}(a)} \leq 1+\frac {\tau }{\sigma }\).

C.3.1 Step 1

Reducing to D Unimodal

Using a linear program, find a closest unimodal distribution \(\tilde {{D}}\) to D (also a k-histogram on \(\mathcal {I}\)) under the constraint that \({\lVert {\tilde {{D}} - {D}^{\prime }}{\rVert }_{\text {Kol}}} \leq \alpha \): this can be done in time poly(k). If \(\|{{D}-\tilde {{D}}}\|_{1} > \varepsilon \), output reject.

  • If D is ε-close to a log-concave distribution P as above, then it is in particular ε-close to unimodal and we do not reject. Moreover, by the triangle inequality \(\|{\tilde {{D}} - P}\|_{1} \leq 2\varepsilon \) and \(\lVert {{\tilde {{D}}} - {P}}{\rVert }_{\text {Kol}} \leq 2\alpha \).

  • If D is 100ε-far from log-concave and we do not reject, then \(\ell _{1}(\tilde {{D}},{\mathcal {L}}) \geq 99{\varepsilon }\).

Reducing to \(D={D}^{\prime }\)

First, we note that it is easy to reduce our problem to the case where, in the completeness case, we have \(P\in {\mathcal {L}}\) such that \({\lVert {{D}-P}{\rVert }}_1 \leq 4{\varepsilon }\) and \({\lVert {D} - P{\rVert }_{\text {Kol}}} \leq 4\alpha \); while in the soundness case \(\ell _{1}({D},{\mathcal {L}}) \geq 97{\varepsilon }\). Indeed, this can be done with a linear program on poly(k, \(\ell\)) variables and constraints, asking to find a (k + \(\ell\))-histogram \(D^{\prime \prime }\) on a refinement of D and \(D^{\prime }\) minimizing the \(\ell_1\) distance to D, under the constraint that the Kolmogorov distance to \(D^{\prime }\) be bounded by 2α. (In the completeness case, clearly a feasible solution exists, as the flattening of P on this (k + \(\ell\))-interval partition is one.) We therefore follow with this new formulation: either

  • (a) D is 4ε-close to a log-concave P (in \(\ell_1\) distance) and D is 4α-close to P (in Kolmogorov distance); and

  • (b) D is 97ε-far from log-concave;

where D is a (k + \(\ell\))-histogram.

This way, we have reduced the problem to a slightly more convenient one, that of Section C.3.2.

Reducing to Knowing the Support [a, b]

The next step is to compute a good approximation of the support of any target log-concave distribution. This is easily obtained in time O(k) as the interval {a,⋯ ,b} such that

  • \(D(\{1,\dots ,a-1\}) \leq \alpha \) but \(D(\{1,\dots ,a\}) > \alpha \); and

  • \(D(\{b+1,\dots ,n\}) \leq \alpha \) but \(D(\{b,\dots ,n\}) > \alpha \).

Any log-concave distribution that is α-close to D must include {a,⋯ ,b} in its support, since otherwise the 1 distance between D and P is already greater than α. Conversely, if P is a log-concave distribution α-close to D, it is easy to see that the distribution obtained by setting P to be zero outside {a,⋯ ,b} and renormalizing the result is still log-concave, and O(α)-close to D.
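A minimal sketch (ours, operating on the explicit probability vector rather than the k-histogram representation, hence O(n) rather than O(k)) of this support computation:

```python
# Find the interval {a, ..., b} outside of which D has mass at most alpha on
# each side, from prefix and suffix sums of the histogram.
import numpy as np

def effective_support(D, alpha):
    """D: probability vector over points 1..n. Returns (a, b), 1-indexed."""
    cdf = np.cumsum(D)                          # D({1, ..., i})
    tail = np.cumsum(D[::-1])[::-1]             # D({i, ..., n})
    a = int(np.sum(cdf <= alpha)) + 1           # D({1,...,a-1}) <= alpha < D({1,...,a})
    b = int(np.sum(tail > alpha))               # D({b,...,n}) > alpha >= D({b+1,...,n})
    return a, b

print(effective_support(np.array([0.001, 0.3, 0.5, 0.198, 0.001]), alpha=0.01))   # (2, 4)
```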

C.3.2 Step 2

Given the explicit description of a unimodal distribution D on [n], which is a k-histogram over a partition \(\mathcal {I}=(I_{1},\dots , I_{k})\) of [n] with \(k=\text {poly}(\log n, 1/\varepsilon )\), one must efficiently distinguish between:

  • (a) D is ε-close to a log-concave P (in ℓ1 distance) and α-close to P (in Kolmogorov distance); and

  • (b) D is 24ε-far from log-concave,

assuming we know the mode of the closest log-concave distribution, which has support [n].

In this stage, we compute a partition \(\mathcal {J}\) of [n] into \(\tilde {O}\left (1/{\varepsilon }^{2} \right )\) intervals (here, we implicitly use the knowledge of the mode of the closest log-concave distribution, in order to apply Theorem C.6 differently on two intervals of the support, corresponding to the non-decreasing and non-increasing parts of the target log-concave distribution).

As D is unimodal, we can efficiently (in \({O\left (\log k \right )}\) time) find the interval S of heavy points, that is

$$S\overset{\text{def}}{=} \left\{\; x \in [n] \;\colon\; {D}(x) \geq \beta \; \right\} . $$

Each point in S will form a singleton interval in our partition. Let \(T\overset {\text {def}}{=} [n]\setminus S\) be its complement (T is the union of at most two intervals T 1,T 2 on which D is monotone, the head and tail of the distribution). For convenience, we focus on only one of these two intervals, without loss of generality the “head” T 1 (on which D is non-decreasing).

  1. Greedily find \(J=\{1,\dots ,a\}\), the smallest prefix of the distribution satisfying \(D(J)\in \left [\frac {\varepsilon }{10}-\beta , \frac {\varepsilon }{10}\right ]\).

  2. Similarly, partition T 1 ∖ J into intervals \(I^{\prime }_{1},\dots ,I^{\prime }_{s}\) (with \(s={O\left ({1/\gamma } \right )}={O\left ({1/\varepsilon ^{2}} \right )}\)) such that \( \frac {\gamma }{10} \leq {D}(I^{\prime }_{j}) \leq \frac {9}{10}\gamma \) for all 1 ≤ j ≤ s − 1, and \(\frac {\gamma }{10} \leq {D}(I^{\prime }_{s}) \leq \gamma \). This is possible as all points not in S have weight less than β, and β ≪ γ (a sketch of this greedy step follows the list).
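A minimal sketch of this greedy step (illustrative only, with names of our choosing), operating on the point masses of the head T 1 listed in increasing order; the actual algorithm performs the equivalent scan over the histogram representation. The guarantees on the interval masses rely on every point of T 1 having mass less than β ≪ γ, as noted above.

def greedy_head_partition(D_head, eps, gamma):
    # 1. prefix J with D(J) <= eps/10 (and > eps/10 - beta, since the next point has mass < beta)
    mass, a = 0.0, 0
    while a < len(D_head) and mass + D_head[a] <= eps / 10:
        mass += D_head[a]
        a += 1
    J = list(range(a))
    # 2. greedily close an interval as soon as its mass reaches gamma/10
    intervals, cur, cur_mass = [], [], 0.0
    for i in range(a, len(D_head)):
        cur.append(i)
        cur_mass += D_head[i]
        if cur_mass >= gamma / 10:
            intervals.append(cur)
            cur, cur_mass = [], 0.0
    if cur:                            # fold a too-light leftover into the last interval
        if intervals:
            intervals[-1].extend(cur)
        else:
            intervals.append(cur)
    return J, intervals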

Discussion: Why Do This?

We focus on the completeness case: let \(P\in {\mathcal {L}}\) be a log-concave distribution such that \({\lVert {{D}-P}{\rVert }}_1 \leq {\varepsilon }\) and \({\lVert {D} - P{\rVert }_{\text {Kol}}} \leq \alpha \). Applying Theorem C.6 on J and the \(I^{\prime }_{j}\)’s, we obtain (using the fact that \(\left \lvert P(I^{\prime }_{j}) - {D}(I^{\prime }_{j}) \right \rvert \leq 2\alpha \)) that:

$$\frac{\max_{x\in I^{\prime}_{j}} P(x)}{\min_{x\in I^{\prime}_{j}} P(x)} \leq 1+\frac{{D}(I^{\prime}_{j})+2\alpha}{{D}(J)-2\alpha} \leq 1 + \frac{\gamma+2\alpha}{\frac{{\varepsilon}}{10}-2\alpha} = 1+ {\varepsilon} + {O\left( \frac{{\varepsilon}^{2}}{\log^{2}(1/{\varepsilon})} \right)} \overset{\text{def}}{=} 1+\kappa. $$

Moreover, we also get that each resulting interval \(I^{\prime }_{j}\) will satisfy

$${D}(I^{\prime}_{j})(1-\kappa_{j}) = {D}(I^{\prime}_{j})-2\alpha \leq P(I^{\prime}_{j}) \leq {D}(I^{\prime}_{j})+2\alpha = {D}(I^{\prime}_{j})(1+\kappa_{j}) $$

with \(\kappa _{j} \overset {\text {def}}{=} \frac {2\alpha }{{D}(I^{\prime }_{j})} = {\Theta \left (1/\log ^{2}(1/{\varepsilon }) \right )}\).

Summing up, we have a partition of [n] into \(\left \lvert S \right \rvert +2 = \tilde {O}\left (1/{\varepsilon }^{2} \right )\) intervals such that:

  • The (at most) two end intervals have \(D(J)\in \left [\frac {{\varepsilon }}{10}-\beta , \frac {{\varepsilon }}{10}\right ]\), and thus \(P(J)\in \left [\frac {\varepsilon }{10}-\beta -2\alpha , \frac {\varepsilon }{10}+2\alpha \right ]\);

  • the \(\tilde {O}\left (1/{\varepsilon }^{2} \right )\) singleton-intervals from S are points x with D(x) ≥ β, so that \(P(x) \geq \beta -2\alpha \geq \frac {\beta }{2}\);

  • each other interval \(I=I^{\prime }_{j}\) satisfies

    $$ (1-\kappa_{j}) {D}(I) \leq P(I) \leq (1+\kappa_{j}) {D}(I) $$
    (20)

    with \(\kappa _{j}={O\left (1/\log ^{2}(1/{\varepsilon }) \right )}\); and

    $$ \frac{\max_{x\in I}P(x)}{\min_{x\in I}P(x)} \leq 1+\kappa < 1+\frac{3}{2}{\varepsilon}. $$
    (21)

We will use in the constraints of the linear program the fact that \((1+\frac {3}{2}{\varepsilon })(1+\kappa _{j}) \leq 1+2{\varepsilon }\), and \(\frac {1-\kappa _{j}}{1+\frac {3}{2}\varepsilon } \geq \frac {1}{1+2\varepsilon }\).

C.3.3 Step 3

We start by computing the partition \(\mathcal {J}=(J_{1},\dots ,J_{\ell })\) as in Section C.3.2, with \(\ell =\tilde {O}\left ({1/\varepsilon ^{2}} \right )\), and write \(J_{j}=\{a_{j},\dots ,b_{j}\}\) for all j ∈ [ℓ]. We further denote by S and T the sets of heavy and light points, following the notations from Section C.3.2, and let \(T^{\prime } \overset {\text {def}}{=} T_{1}\cup T_{2}\) be the set obtained by removing the two “end intervals” (called J in the previous section) from T.

[Algorithm 8: the linear program over the variables \(x_{i}\) (and the slack variables ε i , i ∈ S), with constraints (22)–(27); the figure is not reproduced here.]
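Since the figure is not reproduced, the sketch below (ours, not the paper's Algorithm 8) only illustrates the kind of feasibility problem involved, using scipy, with the constraint forms inferred from Lemmas C.7 and C.8 below; in particular, constraint (24) is rendered here as a per-point band around the average mass of each interval of \(T^{\prime }\), and all function and parameter names are assumptions of this sketch.

import numpy as np
from scipy.optimize import linprog

def lp_feasible(D, S, Tprime_intervals, eps, alpha):
    # Variables: x_i = ln p_i for i in [n] (0-indexed here), plus one slack e_i per heavy point i in S.
    n = len(D)
    S = sorted(S)
    col = {i: n + t for t, i in enumerate(S)}       # column index of the slack e_i
    nvar = n + len(S)
    A, b = [], []
    def leq(pairs, rhs):                            # add a constraint  sum_j coeff_j * var_j <= rhs
        row = np.zeros(nvar)
        for j, coeff in pairs:
            row[j] = coeff
        A.append(row)
        b.append(rhs)
    for i in range(1, n - 1):                       # (23) log-concavity: x_{i-1} - 2 x_i + x_{i+1} <= 0
        leq([(i - 1, 1.0), (i, -2.0), (i + 1, 1.0)], 0.0)
    for J in Tprime_intervals:                      # (24) every x_i within ln(1 + 2 eps) of ln(D(J)/|J|)
        avg = np.log(sum(D[i] for i in J) / len(J))
        for i in J:
            leq([(i, 1.0)], avg + np.log(1 + 2 * eps))
            leq([(i, -1.0)], -(avg - np.log(1 + 2 * eps)))
    for i in S:                                     # (25) -4 e_i / D(i) <= x_i - ln D(i) <= 2 e_i / D(i)
        leq([(i, 1.0), (col[i], -2.0 / D[i])], np.log(D[i]))
        leq([(i, -1.0), (col[i], -4.0 / D[i])], -np.log(D[i]))
    leq([(col[i], 1.0) for i in S], eps)            # (26) the slacks sum to at most eps
    bounds = [(None, 0.0)] * n + [(0.0, 2 * alpha)] * len(S)   # (22) x_i <= 0; (27) 0 <= e_i <= 2 alpha
    res = linprog(np.zeros(nvar), A_ub=np.array(A), b_ub=np.array(b),
                  bounds=bounds, method="highs")
    return res.status == 0                          # status 0: feasible optimum found; 2: infeasible

In the driver of Section C.3.4, such a check would be run once for every candidate mode, with the partition of Section C.3.2 recomputed accordingly.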

Lemma C.7 (Soundness)

If the linear program (Algorithm 8) has a feasible solution, then \(\ell _{1}({D}, {\mathcal {L}})\leq {O\left (\varepsilon \right )}\) .

Proof

A feasible solution to this linear program will define (setting \(p_{i}=e^{x_{i}}\)) a sequence \(p=(p_{1},\dots ,p_{n}) \in (0, 1]^{n}\) such that

  • p takes values in (0, 1] (from (22));

  • p is log-concave (from (23));

  • p is “ (1 + O(ε))-multiplicatively constant” on each interval J j (from (24));

  • p puts roughly the right amount of weight on each J i :

    • weight (1 ± O(ε))D(J) on every J from \(T^{\prime }\) (from (24)), so that the ℓ1 distance between D and p coming from \(T^{\prime }\) is at most O(ε);

    • it puts weight approximately D(J) on every singleton J from S, i.e. such that D(J) ≥ β. To see why, observe that each ε i is in [0, 2α] by constraint (27). In particular, this means that \(\frac {\varepsilon _{i}}{{D}(i)} \leq 2\frac {\alpha }{\beta } \ll 1\), and we have

      $${D}(i) - 4{\varepsilon}_{i} \leq {D}(i)\cdot e^{-4\frac{{\varepsilon}_{i}}{{D}(i)}} \leq p_{i} = e^{x_{i}} \leq {D}(i)\cdot e^{2\frac{{\varepsilon}_{i}}{{D}(i)}} \leq {D}(i)+4{\varepsilon}_{i} $$

      and together with (26) this guarantees that the ℓ1 distance between D and p coming from S is at most ε.

Note that the solution obtained this way may not sum to one, i.e., it is not necessarily a probability distribution. However, it is easy to renormalize p to obtain a bona fide probability distribution \(\tilde {P}\) as follows: set \(\tilde {P}(i) = \frac {p(i)}{{\sum }_{j\in S\cup T^{\prime }} p(j)}\) for all \(i\in S\cup T^{\prime }\), and \(\tilde {P}(i) = 0\) for \(i\in T\setminus T^{\prime }\).

Since by the above discussion we know that \(p(S\cup T^{\prime })\) is within \({O\left ({\varepsilon } \right )}\) of \(D(S\cup T^{\prime })\) (itself in \([1-\frac {9\varepsilon }{5}, 1+\frac {9\varepsilon }{5}]\) by construction of \(T^{\prime }\)), \(\tilde {P}\) is a log-concave distribution such that \(\|{\tilde {P}-{D}}\|_{1} = {O\left ({\varepsilon } \right )}\). □

Lemma C.8 (Completeness)

If there is P in \({\mathcal {L}}\) such that \({\lVert {{D}-P}{\rVert }}_1\leq {\varepsilon }\) and \({\lVert {D} - P{\rVert }_{\text {Kol}}}\leq \alpha \), then the linear program (Algorithm 8) has a feasible solution.

Proof

Let \(P\in {\mathcal {L}}\) be such that \({\lVert {{D} - P}{\rVert }}_1\leq {\varepsilon }\) and \({\lVert {D} - P{\rVert }_{\text {Kol}}}\leq \alpha \). Define \(x_{i}\overset {\text {def}}{=} \ln P(i)\) for all i ∈ [n]. Constraints (22) and (23) are immediately satisfied, since P is log-concave. By the discussion from Section C.3.2 (more specifically, (20) and (21)), constraint (24) holds as well.

Letting \({\varepsilon }_{i}\overset {\text {def}}{=} \left \lvert P(i)-{D}(i) \right \rvert \) for i ∈ S, we also immediately have (26) and (27) (since \({\lVert {P-{D}}{\rVert }}_1 \leq {\varepsilon }\) and \({\lVert {D} - P{\rVert }_{\text {Kol}}}\leq \alpha \) by assumption). Finally, to see why (25) is satisfied, we rewrite

$$x_{i} - \ln{D}(i) = \ln\frac{P(i)}{{D}(i)} = \ln\frac{{D}(i)\pm{\varepsilon}_{i}}{{D}(i)} = \ln\left( 1\pm \frac{{\varepsilon}_{i}}{{D}(i)}\right) $$

and use the fact that \(\ln (1+x) \leq x\) and \(\ln (1-x) \geq -2x\) (the latter for \(x < \frac {1}{2}\), along with \(\frac {{\varepsilon }_{i}}{{D}(i)} \leq \frac {2\alpha }{\beta } \ll 1\)). □
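For concreteness, if constraint (25) is written in the two-sided form suggested by the exponents in the soundness argument above (its exact form is part of Algorithm 8), this last step amounts to the chain

$$-\frac{4{\varepsilon}_{i}}{{D}(i)} \leq -\frac{2{\varepsilon}_{i}}{{D}(i)} \leq \ln\left( 1-\frac{{\varepsilon}_{i}}{{D}(i)}\right) \leq x_{i}-\ln {D}(i) \leq \ln\left( 1+\frac{{\varepsilon}_{i}}{{D}(i)}\right) \leq \frac{{\varepsilon}_{i}}{{D}(i)} \leq \frac{2{\varepsilon}_{i}}{{D}(i)}, $$

valid for every i ∈ S since \(\frac{{\varepsilon}_{i}}{{D}(i)} < \frac{1}{2}\).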

C.3.4 Putting it All Together: Proof of Lemma 4.15

The algorithm is as follows (keeping the notations from Sections C.3.1 to C.3.3):

  • Set α, β, γ as above.

  • Follow Section C.3.1 to reduce to the case where D is unimodal and satisfies the conditions on Kolmogorov and ℓ1 distance, and where a good approximation [a, b] of the support is known.

  • For each of the \({O\left (n \right )}\) possible modes c ∈ [a, b]:

    • Run the linear program (Algorithm 8), and return accept if a feasible solution is found.

  • If none of the linear programs was feasible, return reject.

The correctness comes from Lemmas C.7 and C.8 and the discussions in Sections C.3.1 to C.3.3; as for the claimed running time, it is immediate from the algorithm and the fact that the linear program executed at each step has poly(n, 1/ε) constraints and variables. □

Appendix D: Proof of Theorem 6.3

In this section, we establish our lower bound for tolerant testing of the Binomial distribution, restated below:

Theorem 6.3

There exists an absolute constant ε 0 > 0 such that the following holds. Any algorithm which, given sampling access to an unknown distribution D on Ω and parameter ε ∈ (0,ε 0), distinguishes with probability at least 2/3 between (i) \({\lVert {{D}-{\operatorname {Bin}\!\left (n, 1/2 \right )}}{\rVert }}_1 \leq {\varepsilon }\) and (ii) \({\lVert {{D}-{\operatorname {Bin}\!\left (n, 1/2 \right )}}{\rVert }}_1 \geq 100{\varepsilon }\) must use \({\Omega \left (\frac {1}{{\varepsilon }}\frac {\sqrt {n}}{\log n} \right )}\) samples.

The theorem will be a consequence of the (slightly) more general result below:

Theorem D.1

There exist absolute constants ε 0 > 0 and λ > 0 such that the following holds. Any algorithm which, given sample access to an unknown distribution D on Ω and parameter ε ∈ (0,ε 0), distinguishes with probability at least 2/3 between (i) \({\lVert {{D}-{\operatorname {Bin}\!\left (n, \frac {1}{2} \right )}}{\rVert }}_1 \leq {\varepsilon }\) and (ii) \({\lVert {{D}-{\operatorname {Bin}\!\left (n, \frac {1}{2} \right )}}{\rVert }}_1 \geq \lambda {\varepsilon }^{1/3}-{\varepsilon }\) must use \({\Omega \left ({\varepsilon }\frac {\sqrt {n}}{\log ({\varepsilon } n)} \right )}\) samples.

By choosing a suitable ε and working out the corresponding parameters, this for instance enables us to derive the following:

Corollary D.2

There exists an absolute constant ε 0 ∈ (0, 1/1000) such that the following holds. Any algorithm which, given sample access to an unknown distribution D on Ω, distinguishes with probability at least 2/3 between (i) \({\lVert {{D}-{\operatorname {Bin}\!\left (n, \frac {1}{2} \right )}}{\rVert }}_1 \leq {\varepsilon }_{0}\) and (ii) \({\lVert {{D}-{\operatorname {Bin}\!\left (n, \frac {1}{2} \right )}}{\rVert }}_1 \geq 100{\varepsilon }_{0}\) must use \({\Omega \left (\frac {\sqrt {n}}{\log n} \right )}\) samples.

By standard techniques, this will in turn imply Theorem 6.3 (see Footnote 11).

Proof of Theorem D.1

Hereafter, we write for convenience \(B_{n}\overset {\text {def}}{=} {\operatorname {Bin}\!\left (n, \frac {1}{2} \right )}\). To prove this lower bound, we will rely on the following:

Theorem D.3([50, Theorem 1])

For any constant ϕ ∈ (0, 1/4), the following holds. Any algorithm which, given sample access to an unknown distribution D on \(\{1,\dots ,N\}\), distinguishes with probability at least 2/3 between (i) \({\lVert {{D} - {\mathcal {U}}_{N}}{\rVert }}_1 \leq \phi \) and (ii) \({\lVert {{D} - {\mathcal {U}}_{N}}{\rVert }}_1 \geq \frac {1}{2}-\phi \), must have sample complexity at least \(\frac {\phi }{32}\frac {N}{\log N}\).

Without loss of generality, assume n is even (so that B n has only one mode located at \(\frac {n}{2}\)). For c > 0, we write I n, c for the interval \(\{\frac {n}{2}-c\sqrt {n},\dots ,\frac {n}{2}+c\sqrt {n}\}\) and \(J_{n,c}\overset {\text {def}}{=}{\Omega }\setminus I_{n,c}\).

Fact D.4

For any c > 0,

$$\frac{B_{n}(\frac{n}{2} + c\sqrt{n})}{B_{n}({n}/{2})}, \frac{B_{n}(\frac{n}{2} - c\sqrt{n})}{B_{n}({n}/{2})} \operatorname*{\sim}_{n\to\infty} e^{-2c^{2}} $$

and

$$B_{n}(I_{n,c}) \in (1\pm o(1))\cdot[e^{-2c^{2}},1]\cdot 2c\sqrt{\frac{2}{\pi}} = {\Theta\left( c \right)}\,. $$
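Fact D.4 follows from standard estimates on Binomial coefficients; as a quick numerical sanity check of the first statement (an illustration only, using scipy.stats.binom, with arbitrary values of n and c):

import numpy as np
from scipy.stats import binom

n, c = 10**6, 0.7
m, k = n // 2, int(c * np.sqrt(n))
ratio = binom.pmf(m + k, n, 0.5) / binom.pmf(m, n, 0.5)
print(ratio, np.exp(-2 * c ** 2))      # both are approximately 0.375 for these values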

The reduction proceeds as follows: given sampling access to D on [N], we can simulate sampling access to a distribution \(D^{\prime }\) on [n] (where \(n={\Theta \left ({N^{2}} \right )}\)) such that

  • if \({\lVert {{D} - {\mathcal {U}}_{N}}{\rVert }}_1 \leq \phi \), then \({\lVert {{D}^{\prime } - B_{n}}{\rVert }}_1 < {\varepsilon }\);

  • if \({\lVert {{D} - {\mathcal {U}}_{N}}{\rVert }}_1 \geq \frac {1}{2}-\phi \), then \({\lVert {{D}^{\prime } - B_{n}}{\rVert }}_1 > {\varepsilon }^{\prime } - {\varepsilon }\)

for \({\varepsilon } \overset {\text {def}}{=} {\Theta }(\phi ^{3/2})\) and \({\varepsilon }^{\prime } \overset {\text {def}}{=} {\Theta }(\phi ^{\frac {1}{2}})\); in a way that preserves the sample complexity. The high-level idea is that (by the above fact) the Binomial distribution over Ω is almost uniform on the middle \(O(\sqrt {n})\) elements, and has a constant fraction of its probability mass there: we can therefore “embed” the tolerant uniformity testing lower bound (for support \(O(\sqrt {n})\)) into this middle interval.

More precisely, define \(c\overset {\text {def}}{=} \sqrt {\frac {1}{2}\ln \frac {1}{1-\phi }} ={\Theta \left (\sqrt {\phi } \right )}\) (so that \(\phi = 1 - e^{-2c^{2}}\)) and n such that \(\left \lvert {I_{n,c}} \right \rvert = N\) (that is, \(n=(N/(2c))^{2} = {\Theta \left ({N^{2}/\phi } \right )}\)). From now on, we can therefore identify [N] with I n, c in the obvious way, and see a draw from D as an element of I n, c .

Let \(p\overset {\text {def}}{=} B_{n}(I_{n,c}) = {\Theta \left (\sqrt {\phi } \right )}\), and let B n, c , \(\bar {B}_{n,c}\) respectively denote the conditional distributions induced by B n on I n, c and J n, c . Intuitively, we want D to be mapped to the conditional distribution of \(D^{\prime }\) on I n, c , and the conditional distribution of \(D^{\prime }\) on J n, c to be exactly \(\bar {B}_{n,c}\). This is achieved by defining \(D^{\prime }\) by the process below (a sampling sketch, for illustration, follows the two bullets):

  • with probability p, we draw a sample from D (seen as an element of I n, c );

  • with probability 1 − p, we draw a sample from \(\bar {B}_{n,c}\).
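For illustration only (this is not from the paper), one draw from \(D^{\prime }\) can be simulated as follows, under the identification of [N] with I n, c described above; here sample_D is a hypothetical zero-argument sampler for D on \(\{1,\dots ,N\}\), rng is a numpy Generator (e.g. np.random.default_rng()), and the conditional \(\bar {B}_{n,c}\) is produced by rejection sampling.

import numpy as np
from scipy.stats import binom

def sample_D_prime(sample_D, n, c, rng):
    lo = n // 2 - int(c * np.sqrt(n))               # I_{n,c} = {lo, ..., hi}
    hi = n // 2 + int(c * np.sqrt(n))
    p = binom.cdf(hi, n, 0.5) - binom.cdf(lo - 1, n, 0.5)   # p = B_n(I_{n,c})
    if rng.random() < p:
        return lo + (sample_D() - 1)                # a draw from D, embedded into I_{n,c}
    while True:                                     # rejection sampling from B_n conditioned on J_{n,c}
        x = rng.binomial(n, 0.5)
        if x < lo or x > hi:
            return x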

Let \(\tilde {B}_{n}\) be defined as the distribution which exactly matches B n on J n, c , but is uniform on I n, c :

$$\begin{array}{@{}rcl@{}} \tilde{B}_{n}(i) = \left\{\begin{array}{ll} \frac{p}{\left\lvert I_{n,c} \right\rvert} & i\in I_{n,c}\\ B_{n}(i) & i\in J_{n,c}\\ \end{array}\right. \end{array} $$

From the above, we have that \({\|{{D}^{\prime } - \tilde {B}_{n}}\|}_{1} = p\cdot {\lVert {{D} - {\mathcal {U}}_{N}}{\rVert }}_1\). Furthermore, by Fact D.4, Lemma 2.9 and the definition of I n, c , we get that \(\|{B_{n} - \tilde {B}_{n}}\|_{1} = p\cdot \|{(B_{n})_{I_{n,c}} - {\mathcal {U}}_{I_{n,c}}}\|_{1} \leq p\cdot \phi \). Putting it all together,

  • If \({\lVert {{D} - {\mathcal {U}}_{N}}{\rVert }}_1 \leq \phi \), then by the triangle inequality \({\lVert {{D}^{\prime } - B_{n}}{\rVert }}_1 \leq p(\phi + \phi ) = 2p\phi \);

  • If \({\lVert {{D} - {\mathcal {U}}_{N}}{\rVert }}_1 \geq \frac {1}{2}-\phi \), then similarly \({\lVert {{D}^{\prime } - B_{n}}{\rVert }}_1 \geq p(\frac {1}{2}-\phi -\phi ) = \frac {p}{2}-2p\phi \).

Recalling that \(p= {\Theta \left (\sqrt {\phi } \right )}\) and setting \({\varepsilon } \overset {\text {def}}{=} 2p\phi \) concludes the reduction. From Theorem D.3, we conclude that

$$\frac{\phi}{32}\frac{N}{\log N} = {\Omega\left( \phi\frac{\sqrt{\phi n}}{\log(\phi n)} \right)} = {\Omega\left( {\varepsilon}\frac{\sqrt{n}}{\log({\varepsilon} n)} \right)} $$

samples are necessary. □

Proof of Corollary D.2

The corollary follows from the proof of Theorem D.1, by choosing ε 0 > 0 sufficiently small so that \(\frac {\lambda \varepsilon _{0}^{1/3}-\varepsilon _{0}}{\varepsilon _{0}} \geq 100\). □


Cite this article

Canonne, C.L., Diakonikolas, I., Gouleakis, T. et al. Testing Shape Restrictions of Discrete Distributions. Theory Comput Syst 62, 4–62 (2018). https://doi.org/10.1007/s00224-017-9785-6
