
Optimal Bounds for Estimating Entropy with PMF Queries

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9235)

Abstract

Let p be an unknown probability distribution on \([n] := \{1, 2, \dots, n\}\) that we can access via two kinds of queries: a SAMP query takes no input and returns \(x \in [n]\) with probability p[x]; a PMF query takes as input \(x \in [n]\) and returns the value p[x]. We consider the task of estimating the entropy of p to within \(\pm \varDelta\) (with high probability). For the usual Shannon entropy H(p), we show that \(\varOmega(\log^2 n/\varDelta^2)\) queries are necessary, matching a recent upper bound of Canonne and Rubinfeld. For the Rényi entropy \(H_\alpha(p)\), where \(\alpha > 1\), we show that \(\varTheta(n^{1-1/\alpha})\) queries are necessary and sufficient. This complements recent work of Acharya et al. in the SAMP-only model, which showed that \(O(n^{1-1/\alpha})\) queries suffice when \(\alpha\) is an integer, but \(\widetilde{\varOmega}(n)\) queries are necessary when \(\alpha\) is a noninteger. All of our lower bounds also extend easily to the model where CDF queries (given x, return \(\sum_{y \le x} p[y]\)) are allowed.
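The query model in the abstract can be illustrated with a small sketch. The oracle names `samp` and `pmf`, and the toy uniform distribution, are mine, not the paper's; the estimator shown is just the standard empirical average of \(\log_2(1/p[x])\), which is unbiased because \(\mathbb{E}[\log_2(1/p[X])] = H(p)\), and is not the paper's matching-upper-bound algorithm:

```python
import math
import random

# Toy distribution on [n] = {1, ..., n}; here uniform on n = 8 points
# (hypothetical example data, chosen only to exercise the query model).
n = 8
p = {x: 1.0 / n for x in range(1, n + 1)}

def samp():
    """SAMP query: takes no input, returns x in [n] with probability p[x]."""
    return random.choices(list(p.keys()), weights=list(p.values()))[0]

def pmf(x):
    """PMF query: given x in [n], returns the value p[x]."""
    return p[x]

def estimate_shannon_entropy(m):
    """Empirical estimate of H(p): draw m SAMP queries and average
    log2(1/p[x]), looking up each p[x] with one PMF query."""
    total = 0.0
    for _ in range(m):
        x = samp()
        total += math.log2(1.0 / pmf(x))
    return total / m

# For the uniform distribution every sample contributes exactly log2(8) = 3,
# so the estimate equals H(p) = 3 bits regardless of the random draws.
print(estimate_shannon_entropy(100))  # → 3.0
```

For non-uniform p the estimate concentrates around H(p) as m grows; the paper's point is how small m can be made as a function of n and \(\varDelta\).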

R. O’Donnell—Work performed while the author was at the Boğaziçi University Computer Engineering Department, supported by Marie Curie International Incoming Fellowship project number 626373.


Notes

  1. In this paper, \(\log\) denotes \(\log_2\).

  2. In this paper, PMF, CDF and SAMP are abbreviations for probability mass function, cumulative distribution function, and sampling, respectively.

  3. They actually state \(O(\frac{\log^2 (n/\varDelta)}{\varDelta^2})\), but this is the same as \(O(\frac{\log^2 n}{\varDelta^2})\) because the range of interest is \(\frac{1}{\sqrt{n}} \le \varDelta \le \log n\).

  4. Note that a PMF query can be simulated by two CDF queries.
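The simulation in note 4 is the telescoping identity \(p[x] = \mathrm{CDF}(x) - \mathrm{CDF}(x-1)\). A minimal sketch, with an illustrative distribution and helper names of my own choosing:

```python
# Footnote 4 as code: simulate one PMF query with two CDF queries.
# Toy distribution on [n] = {1, ..., n}; index 0 is unused padding.
n = 8
p = [0.0] + [1.0 / n] * n

def cdf(x):
    """CDF query: given x, return sum of p[y] over y <= x (0 if x < 1)."""
    return sum(p[1:x + 1]) if x >= 1 else 0.0

def pmf_via_cdf(x):
    """PMF simulated by two CDF queries: p[x] = CDF(x) - CDF(x - 1)."""
    return cdf(x) - cdf(x - 1)

print(abs(pmf_via_cdf(3) - p[3]) < 1e-12)  # → True
```

This is why the paper's lower bounds for the PMF model carry over to the CDF model at the cost of at most a factor of two in query count.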

References

  1. Acharya, J., Orlitsky, A., Suresh, A.T., Tyagi, H.: The complexity of estimating Rényi entropy. In: Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms (2015)

  2. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999)

  3. Batu, T., Dasgupta, S., Kumar, R., Rubinfeld, R.: The complexity of approximating the entropy. SIAM J. Comput. 35(1), 132–150 (2005)

  4. Bhuvanagiri, L., Ganguly, S.: Estimating entropy over data streams. In: Azar, Y., Erlebach, T. (eds.) ESA 2006. LNCS, vol. 4168, pp. 148–159. Springer, Heidelberg (2006)

  5. Canonne, C.: A survey on distribution testing: your data is big. But is it blue? Technical Report TR15-063, ECCC (2015)

  6. Canonne, C., Rubinfeld, R.: Testing probability distributions underlying aggregated data. Technical Report 1402.3835, arXiv (2014)

  7. Chakrabarti, A., Cormode, G., McGregor, A.: A near-optimal algorithm for computing the entropy of a stream, pp. 328–335 (2007)

  8. Chakrabarti, A., Ba, K.D., Muthukrishnan, S.: Estimating entropy and entropy norm on data streams. Internet Math. 3(1), 63–78 (2006)

  9. Guha, S., McGregor, A., Venkatasubramanian, S.: Streaming and sublinear approximation of entropy and information distances. In: Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 733–742. ACM (2006)

  10. Harvey, N., Nelson, J., Onak, K.: Sketching and streaming entropy via approximation theory. In: Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science, pp. 489–498 (2008)

  11. Kearns, M., Mansour, Y., Ron, D., Rubinfeld, R., Schapire, R., Sellie, L.: On the learnability of discrete distributions. In: Proceedings of the 26th Annual ACM Symposium on Theory of Computing, pp. 273–282 (1994)

  12. Lall, A., Sekar, V., Ogihara, M., Xu, J., Zhang, H.: Data streaming algorithms for estimating entropy of network traffic. In: Proceedings of ACM SIGMETRICS, pp. 145–156 (2006)

  13. Paninski, L.: Estimation of entropy and mutual information. Neural Comput. 15(6), 1191–1253 (2003)

  14. Paninski, L.: Estimating entropy on \(m\) bins given fewer than \(m\) samples. IEEE Trans. Inf. Theory 50(9), 2200–2203 (2004)

  15. Rubinfeld, R.: Taming big probability distributions. XRDS: Crossroads ACM Mag. Stud. 19(1), 24–28 (2012)

  16. Tsallis, C.: Possible generalization of Boltzmann-Gibbs statistics. Technical Report CBPF-NF-062/87, CBPF (1987)

  17. Valiant, G., Valiant, P.: A CLT and tight lower bounds for estimating entropy. Technical Report TR10-179, ECCC (2011)

  18. Valiant, G., Valiant, P.: Estimating the unseen: an \(n/\log(n)\)-sample estimator for entropy and support size, shown optimal via new CLTs. In: Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, pp. 685–694 (2011)

  19. Valiant, P.: Testing symmetric properties of distributions. SIAM J. Comput. 40(6), 1927–1968 (2011)


Acknowledgments

We thank Clément Canonne for his assistance with our questions about the literature, and an anonymous reviewer for helpful remarks on a previous version of this manuscript.

Author information

Corresponding author: Cafer Caferov.


Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Caferov, C., Kaya, B., O’Donnell, R., Say, A.C.C. (2015). Optimal Bounds for Estimating Entropy with PMF Queries. In: Italiano, G., Pighizzini, G., Sannella, D. (eds) Mathematical Foundations of Computer Science 2015. MFCS 2015. Lecture Notes in Computer Science, vol. 9235. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48054-0_16

Download citation

  • DOI: https://doi.org/10.1007/978-3-662-48054-0_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-48053-3

  • Online ISBN: 978-3-662-48054-0

