Density estimation with distribution element trees

Meyer, Daniel W.

doi:10.1007/s11222-017-9751-9

Published: 16 May 2017

Volume 28, pages 609–632, (2018)

Statistics and Computing

Abstract

The estimation of probability densities based on available data is a central task in many statistical applications. Especially in the case of large ensembles with many samples or high-dimensional sample spaces, computationally efficient methods are needed. We propose a new method that is based on a decomposition of the unknown distribution in terms of so-called distribution elements (DEs). These elements enable an adaptive and hierarchical discretization of the sample space, with large elements in regions where the density varies smoothly and small elements in regions where it is highly variable. The novel refinement strategy that we propose is based on statistical goodness-of-fit tests and pairwise (as an approximation to mutual) independence tests that evaluate the local approximation of the distribution in terms of DEs. The capabilities of the new method are examined on several examples of different dimensionality and compared favorably with other state-of-the-art density estimators.
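The refinement idea can be illustrated with a deliberately simplified sketch. The code below is not the DET algorithm of the article: it assumes a one-dimensional sample space, piecewise-constant elements, a Pearson chi-square test on four equal-width bins as the goodness-of-fit criterion, and bisection as the splitting rule. The names `looks_uniform` and `build_elements`, the significance level, and the minimum sample count are choices made only for this illustration.

```python
import random

# Chi-square critical value for 3 degrees of freedom at significance 0.001
# (assumed test setup for this sketch: 4 equal-width bins).
CHI2_CRIT_DF3 = 16.266


def looks_uniform(samples, lo, hi, bins=4):
    """Pearson chi-square test of the samples against a uniform density on [lo, hi)."""
    counts = [0] * bins
    for x in samples:
        k = min(int((x - lo) / (hi - lo) * bins), bins - 1)
        counts[k] += 1
    expected = len(samples) / bins
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    return chi2 <= CHI2_CRIT_DF3


def build_elements(samples, lo, hi, n_total, min_samples=40):
    """Recursively bisect [lo, hi) until each element passes the uniformity test.

    Returns a list of (lo, hi, density) tuples with piecewise-constant density.
    """
    if len(samples) < min_samples or looks_uniform(samples, lo, hi):
        density = len(samples) / (n_total * (hi - lo))
        return [(lo, hi, density)]
    mid = 0.5 * (lo + hi)
    left = [x for x in samples if x < mid]
    right = [x for x in samples if x >= mid]
    return (build_elements(left, lo, mid, n_total, min_samples)
            + build_elements(right, mid, hi, n_total, min_samples))


# Samples from the density p(x) = 2x on [0, 1), drawn by CDF inversion.
rng = random.Random(1)
samples = [rng.random() ** 0.5 for _ in range(20000)]
elements = build_elements(samples, 0.0, 1.0, len(samples))
print(len(elements), "elements")
```

Because every sample ends up in exactly one element, the element densities integrate to one by construction. Elements that fail the uniformity test are split further, which is the adaptive behavior described above in its simplest form.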

Notes

Fixing the domain bounds based on the data range leads to bounds that are almost certainly too narrow. Accordingly, the resulting density estimates will be biased toward values that are too high.
In the singular case of linearly dependent components $x_1$ and $x_2$, a DET estimator produces many small DEs that resolve the probability peak along the diagonal of the $x_1$-$x_2$ space.
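The bias mentioned in the first note can be made concrete with a small experiment. The sketch below assumes samples drawn uniformly on $[0,1)$, whose true density is 1, and estimates that density as the reciprocal of the observed data range. The function name `range_based_density` and the sample and trial counts are illustrative choices, not taken from the article.

```python
import random


def range_based_density(samples):
    """Constant density estimate on a domain fixed to the observed data range."""
    return 1.0 / (max(samples) - min(samples))


rng = random.Random(0)
results = {}
for n in (10, 100, 1000):
    estimates = [
        range_based_density([rng.random() for _ in range(n)])
        for _ in range(2000)
    ]
    # Store the smallest and the average estimate over all trials.
    results[n] = (min(estimates), sum(estimates) / len(estimates))

for n, (smallest, average) in results.items():
    print(n, round(smallest, 4), round(average, 4))
```

Since the observed range can never exceed the true domain width, every single estimate is at least as large as the true density, and the overshoot shrinks only as the ensemble grows.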


Author information

Authors and Affiliations

Institute of Fluid Dynamics, ETH Zürich, Zürich, Switzerland
Daniel W. Meyer

Corresponding author

Correspondence to Daniel W. Meyer.

Additional information

The author is grateful to Marco Weibel for his help during the preparation of this manuscript. Very valuable feedback from an associate editor and two reviewers and helpful input from Oliver Brenner and Florian Müller are gratefully acknowledged. Moreover, the author acknowledges helpful comments from Nina Roth and feedback on the initial version of this manuscript from Patrick Jenny, both from ETH Zürich. The author has been financially supported by ETH Zürich.

Appendix: Derivation of MMSE slope estimator

Writing without loss of generality the linear marginal PDF (4) in a simpler form with $x_i\in [0,1]$ and the subscripts skipped, we obtain

$$\begin{aligned} p(x|\theta ) = \left( x - {\textstyle \frac{1}{2}}\right) \theta + 1. \end{aligned}$$

Calculating the mean of the random variable $X$ from this PDF gives $\langle X\rangle = \frac{1}{12}(6 + \theta )$, so the slope parameter can be expressed in terms of this mean as

$$\begin{aligned} \theta = 6(2\langle X\rangle - 1). \end{aligned} \tag{20}$$
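Relation (20) can be checked numerically. The sketch below assumes $|\theta| \le 2$, so that the linear PDF stays non-negative on $[0,1]$, draws samples by inverting the CDF $F(x) = \frac{\theta}{2}x^2 + \left(1 - \frac{\theta}{2}\right)x$, and inserts the sample average into (20). The helper names `sample_linear_pdf` and `slope_from_mean` and the chosen slope value are illustrative only.

```python
import math
import random


def sample_linear_pdf(theta, rng):
    """Draw one sample from p(x|theta) = (x - 1/2) * theta + 1 on [0, 1].

    Uses inversion of the CDF F(x) = (theta/2) x^2 + (1 - theta/2) x,
    which requires |theta| <= 2.
    """
    u = rng.random()
    if abs(theta) < 1e-12:
        return u  # theta = 0 is the uniform density
    b = 1.0 - 0.5 * theta
    return (-b + math.sqrt(b * b + 2.0 * theta * u)) / theta


def slope_from_mean(mean):
    """Slope parameter expressed through the mean, relation (20)."""
    return 6.0 * (2.0 * mean - 1.0)


rng = random.Random(42)
theta_true = 1.2
samples = [sample_linear_pdf(theta_true, rng) for _ in range(200000)]
theta_hat = slope_from_mean(sum(samples) / len(samples))
print(theta_true, round(theta_hat, 3))
```

With a large ensemble the recovered slope should lie close to the prescribed one. For small ensembles the scatter of the sample mean becomes significant.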

In the case of a finite ensemble, we estimate the mean with $\langle X\rangle _n = \frac{1}{n}\sum _{j = 1}^n x_j$ and the slope by

$$\begin{aligned} \hat{\theta } = 6c(2\langle X\rangle _n - 1). \end{aligned}$$

Here, c is a correction factor that is determined by minimizing the mean square error (MSE) expressed as

$$\begin{aligned}
\langle (\hat{\theta } - \theta )^2\rangle
&= \left\langle \left[ 6c\left( \frac{2}{n}\sum_{j = 1}^n x_j - 1\right) - \theta \right]^2\right\rangle \\
&= \left\langle 36c^2\left( \frac{2}{n}\sum_{j = 1}^n x_j - 1\right)^2 - 12c\left( \frac{2}{n}\sum_{j = 1}^n x_j - 1\right) \theta + \theta^2\right\rangle \\
&= 36c^2\left\langle \frac{4}{n^2}\sum_{j = 1}^n\sum_{k = 1}^n x_j x_k - \frac{4}{n}\sum_{j = 1}^n x_j + 1\right\rangle - 12c(2\langle X\rangle - 1)\theta + \theta^2 \\
&= 36c^2\left( \frac{4}{n^2}\left\langle \sum_{j = 1}^n\sum_{k = 1}^n x_j x_k\right\rangle - 4\langle X\rangle + 1\right) - 12c(2\langle X\rangle - 1)\theta + \theta^2 \\
&= 36c^2\left( \frac{4}{n^2}\left\langle \sum_{j = 1}^n\sum_{\substack{k = 1\\ k \ne j}}^n x_j x_k + \sum_{j = 1}^n x_j^2\right\rangle - 4\langle X\rangle + 1\right) - 12c(2\langle X\rangle - 1)\theta + \theta^2 \\
&= 36c^2\left[ \frac{4(n-1)}{n}\langle X\rangle^2 + \frac{4}{n}\langle X^2\rangle - 4\langle X\rangle + 1\right] - 12c(2\langle X\rangle - 1)\theta + \theta^2.
\end{aligned}$$

To determine the minimum MSE, we set

$$\begin{aligned} \frac{\mathrm{d}}{\mathrm{d}c}\langle (\hat{\theta } - \theta )^2\rangle = 72c\left( \frac{4(n-1)}{n}\langle X\rangle^2 + \frac{4}{n}\langle X^2\rangle - 4\langle X\rangle + 1\right) - 12(2\langle X\rangle - 1)\theta = 0, \end{aligned}$$

which yields the correction factor

$$\begin{aligned}
c &= \frac{(2\langle X\rangle - 1)n\theta }{6[4(n-1)\langle X\rangle^2 + 4\langle X^2\rangle - 4n\langle X\rangle + n]} \\
&= \frac{(2\langle X\rangle - 1)n\theta }{6[4n\langle X\rangle^2 - 4\langle X\rangle^2 + 4\langle X^2\rangle - 4n\langle X\rangle + n]} \\
&= \frac{(2\langle X\rangle - 1)n\theta }{6[n(2\langle X\rangle - 1)^2 + 4\langle X^{\prime 2}\rangle ]} \\
&= \frac{6n(2\langle X\rangle - 1)^2}{6[n(2\langle X\rangle - 1)^2 + 4\langle X^{\prime 2}\rangle ]}.
\end{aligned} \tag{21}$$

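In the third line, $\langle X^{\prime 2}\rangle = \langle X^2\rangle - \langle X\rangle^2$ is the variance of $X$, and the last line uses (20). For $n\rightarrow \infty $ the correction factor $c$ goes to one, provided that $\theta \ne 0$.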
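The result can be verified numerically under the linear PDF of this appendix. The sketch below evaluates the last line of the MSE expansion and expression (21), using the exact moments $\langle X\rangle = \frac{1}{12}(6 + \theta )$ and $\langle X^2\rangle = \frac{\theta }{12} + \frac{1}{3}$; the second moment is not stated above but follows from integrating $x^2 p(x|\theta )$ over $[0,1]$. The function names are illustrative, and the true slope is treated as known here purely to check the algebra.

```python
def moments(theta):
    """Exact first and second moments of p(x|theta) = (x - 1/2) * theta + 1 on [0, 1]."""
    mean = (6.0 + theta) / 12.0
    second = theta / 12.0 + 1.0 / 3.0
    return mean, second


def mse(c, theta, n):
    """Mean square error of the slope estimator, last line of the expansion."""
    mean, second = moments(theta)
    bracket = (4.0 * (n - 1) / n * mean ** 2 + 4.0 / n * second
               - 4.0 * mean + 1.0)
    return (36.0 * c ** 2 * bracket
            - 12.0 * c * (2.0 * mean - 1.0) * theta + theta ** 2)


def correction_factor(theta, n):
    """Correction factor c from expression (21)."""
    mean, second = moments(theta)
    variance = second - mean ** 2
    signal = n * (2.0 * mean - 1.0) ** 2
    return signal / (signal + 4.0 * variance)


theta, n = 0.5, 20
c_opt = correction_factor(theta, n)
print("c =", round(c_opt, 4))
print("MSE with c = 1:", round(mse(1.0, theta, n), 4))
print("MSE with optimal c:", round(mse(c_opt, theta, n), 4))
print("c for large n:", correction_factor(theta, 10 ** 9))
```

For small ensembles and moderate slopes the optimal $c$ is well below one, which shrinks the estimate toward zero and lowers the MSE relative to the uncorrected estimator with $c = 1$.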

About this article

Cite this article

Meyer, D.W. Density estimation with distribution element trees. Stat Comput 28, 609–632 (2018). https://doi.org/10.1007/s11222-017-9751-9

Received: 06 November 2016
Accepted: 02 May 2017
Published: 16 May 2017
Issue Date: May 2018
DOI: https://doi.org/10.1007/s11222-017-9751-9
