# Minimum distance histograms with universal performance guarantees


## Abstract

We present a data-adaptive multivariate histogram estimator of an unknown density *f* based on *n* independent samples from it. Such histograms are based on binary trees called regular pavings (RPs). RPs represent a computationally convenient class of simple functions that remain closed under addition and scalar multiplication. Unlike other density estimation methods, including various regularization and Bayesian methods based on the likelihood, the minimum distance estimate (MDE) is guaranteed to be within an \(L_1\) distance bound from *f* for a given *n*, no matter what the underlying *f* happens to be, and is thus said to have universal performance guarantees (Devroye and Lugosi, Combinatorial methods in density estimation. Springer, New York, 2001). Using a form of tree matrix arithmetic with RPs, we obtain the first generic constructions of an MDE, prove that it has universal performance guarantees and demonstrate its performance with simulated and real-world data. Our main contribution is a constructive implementation of an MDE histogram that can handle large multivariate data bursts using a tree-based partition that is computationally conducive to subsequent statistical operations.

## Keywords

Rooted planar binary tree · Yatracos class · Tree matrix arithmetic · Model selection · Regular paving · Density estimation

## 1 Introduction

Suppose a random vector *X* has an unknown density *f* on \(\mathbb {R}^d\); then for all Borel sets \(A \subseteq \mathbb {R}^d\), \(P(X \in A) = \int _A f \, d\lambda \), where \(\lambda \) denotes the Lebesgue measure. Our goal is to estimate *f* from an independent and identically distributed (IID) sample \(X_1,X_2,\ldots ,X_n\) drawn from *f*. Density estimation is often the first step in many learning tasks, including anomaly detection, classification, regression and clustering. A natural measure of the error of an estimate \(f_n\) is the \(L_1\) distance \(\int |f_n - f| \, d\lambda \), which, by Scheffé's identity, relates the error of \(f_n\) from *f*, in the absolute scale of [0, 1], to the total variation distance between them: \(\sup _{A} \left| \int _A f_n \, d\lambda - \int _A f \, d\lambda \right| = \frac{1}{2} \int |f_n - f| \, d\lambda \).

A non-parametric density estimator is said to have universal performance guarantees when the underlying *f* is allowed to be any density in \(L_1\) (Devroye and Lugosi 2001, p. 1). Histograms and kernel density estimators can approximate *f* in this universal sense in an asymptotic setting, i.e., as the number of data points *n* approaches infinity (the so-called asymptotic consistency of the estimator \(f_n\)). For a fixed *n*, however large but finite, classical studies of the rate of convergence of \(f_n\) to *f* require additional assumptions on the smoothness class (to solve the so-called smoothing problem), such as \(f \in L_2 \ne L_1\) or \(f \in C^k\), the set of *k* times differentiable functions, as opposed to letting *f* simply belong to the set where densities exist, i.e., \(f \in L_1\); such assumptions violate the universality property.

Universal performance guarantees are provided by the minimum distance estimate (MDE) due to Devroye and Lugosi (2001, 2004). Their fundamentally combinatorial approach combined ideas from Yatracos (1985, 1988) on minimum distance methods and from Vapnik and Chervonenkis (1971) on uniform convergence of empirical probability measures over classes of sets. See Devroye and Lugosi (2001) for a self-contained introduction to combinatorial methods in density estimation. Unlike likelihood-based methods, the MDE gives universal performance guarantees: it does not assume that *f* is in \(L_2\) in order to address the smoothing problem for the given sample of size *n*. Instead, it directly minimizes the \(L_1\) distance over the so-called Yatracos class, a certain class of subsets of the support set induced by the partitions of each ordered pair of histograms in the set of candidate histograms from which one has to choose the optimally smoothed histogram (Devroye and Lugosi 2001).

The Yatracos class is not trivial to represent for the purposes of concretely obtaining the MDE in a nonparametric multivariate setting involving large sample sizes. The particular class of MDEs studied in Devroye and Lugosi (2001, 2004) was limited to kernel estimates and histograms under simpler partitioning rules. Inspired by this, here we develop an MDE over statistical regular pavings using tree-based partitioning strategies to produce a much more general nonparametric MDE that has (1) data-adaptive partitions, (2) in *d* dimensions, with (3) partitioning structures imbued with arithmetic for downstream statistical operations. Briefly, our approach exploits a recursive arithmetic using nodes imbued with recursively computable statistics and a specialized collator structure to compute the supremal deviation of the held-out empirical measure over the Yatracos class of the candidate densities.

Unlike other tree-based partitions, our regular paving structure restricts partitioning to bisecting a box only along its first widest coordinate; this makes the countable set of such trees closed under addition and scalar multiplication and thereby allows for computationally efficient computer arithmetic over a dense set of simple functions. See Harlow et al. (2012) for statistical applications of this arithmetic, including conditional density regression and multivariate tail probability computations for anomaly detection. Although a more efficient algorithm (up to pre-processing the \(L_1\) distances for each pair of densities) is characterized in Mahalanabis and Stefankovic (2008), we are not aware of any publicly available implementations of the MDE using data-adaptive multivariate histograms for bursts of data common in many industrial applications today, especially for downstream statistical operations with the density estimate, including anomaly detection (with \(n \approx 10^7\) in dimensions up to 6, for instance, in a non-distributed computational setting on one commodity machine).

To the best of our knowledge, the accompanying code of this paper in mrs2 Sainudiin et al. (2008–2019) is the only publicly available implementation of such an MDE estimator. Our main contribution in this work is a rigorous implementation of the minimum distance estimate proposed by Devroye and Lugosi (2001) for the nonparametric multivariate setting that can handle large bursts of data. The estimator has been successfully used in industry-scale problems where one needs to construct a multivariate density estimate in a “batch” setting and use this estimate for producing anomaly scores.

In the next two sections, we give the definitions, algorithms, theorems and proofs needed for our minimum distance estimator. Three core algorithms are given in the Appendix for completeness. We finally conclude after evaluating the performance of the estimator on simulated and real-world datasets.

## 2 Regular pavings and histograms

A box of dimension *d* with coordinates in \(\varDelta := \{1,2,\ldots ,d\}\) is an interval vector \({\varvec{x}} := [\underline{x}_{1}, \overline{x}_{1}] \times \cdots \times [\underline{x}_{d}, \overline{x}_{d}]\) with \(\iota \) as the first coordinate of maximum width: \(\iota := \min \left( \mathop {\mathrm {argmax}}_{i \in \varDelta } (\overline{x}_{i} - \underline{x}_{i}) \right) \). Let \(\mathbb {IR}^d\) denote the set of all such boxes of dimension *d*. A *bisection* or *split* of \({\varvec{x}}\) perpendicularly at the mid-point along this first widest coordinate \(\iota \) gives the left and right child boxes of \({\varvec{x}}\): \({\varvec{x}}_{{\mathsf {L}}} := [\underline{x}_{1}, \overline{x}_{1}] \times \cdots \times [\underline{x}_{\iota }, (\underline{x}_{\iota } + \overline{x}_{\iota })/2] \times \cdots \times [\underline{x}_{d}, \overline{x}_{d}]\) and \({\varvec{x}}_{{\mathsf {R}}} := [\underline{x}_{1}, \overline{x}_{1}] \times \cdots \times [(\underline{x}_{\iota } + \overline{x}_{\iota })/2, \overline{x}_{\iota }] \times \cdots \times [\underline{x}_{d}, \overline{x}_{d}]\). The result of a finite sequence of such bisections, applied recursively to a root box \({\varvec{x}}_{\rho }\) and its child boxes, is known as a *regular paving* (Kieffer et al. 2001) or *n*-tree (Samet 1990) of \({\varvec{x}}_{\rho }\). A regular paving of \({\varvec{x}}_{\rho }\) can also be seen as a binary tree formed by recursively bisecting the box \({\varvec{x}}_{\rho }\) at the root node. Each node in the binary tree has either no children or two children. These trees are known as plane binary trees in enumerative combinatorics (Stanley 1999, Ex. 6.19(d), p. 220) and as finite, rooted binary trees (frb-trees) in geometric group theory (Meier 2008, Chap. 10). The relationship of trees, labels and partitions is illustrated in Fig. 1 via a sequence of bisections of a square (2-dimensional) root box by always bisecting on the first widest coordinate.
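As a minimal illustration of this splitting rule (a plain-Python sketch, not the mrs2 implementation; the names `Box` and `bisect` are introduced here for exposition), the following bisects a box at the mid-point of its first widest coordinate, breaking ties in favor of the first coordinate:

```python
from typing import List, Tuple

Box = List[Tuple[float, float]]  # one (lower, upper) interval per coordinate


def bisect(box: Box) -> Tuple[Box, Box]:
    """Split a box at the mid-point of its first widest coordinate."""
    widths = [hi - lo for lo, hi in box]
    iota = widths.index(max(widths))  # first coordinate of maximum width
    lo, hi = box[iota]
    mid = (lo + hi) / 2.0
    left, right = list(box), list(box)
    left[iota] = (lo, mid)
    right[iota] = (mid, hi)
    return left, right


# Bisecting the rectangle [0,1] x [0,2] splits coordinate 1 (the widest):
l, r = bisect([(0.0, 1.0), (0.0, 2.0)])
```

Because `widths.index(max(widths))` returns the first index attaining the maximum, the rule deterministically picks \(\iota \) as defined above.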

Let the *j*th interval of a box \({\varvec{x}}_{{{\rho \mathsf {v}}}}\) be \([\underline{x}_{{{\rho \mathsf {v}}},j}, \overline{x}_{{{\rho \mathsf {v}}},j}]\) and the volume of a *d*-dimensional box \({\varvec{x}}_{{{\rho \mathsf {v}}}}\) be \({\mathrm{vol\,}}({\varvec{x}}_{{{\rho \mathsf {v}}}}) = \prod _{j=1}^d (\overline{x}_{{{\rho \mathsf {v}}},j} - \underline{x}_{{{\rho \mathsf {v}}},j})\). Let the set of all nodes, leaf nodes and internal nodes (or splits) of a regular paving \(s\) be \(\mathbb {V}(s) := \rho \cup \{ \rho \{{{\mathsf {L}}}, {{\mathsf {R}}} \}^j : j \in \mathbb {N}\}\), \(\mathbb {L}(s)\) and \(\breve{\mathbb {V}}(s) :=\mathbb {V}(s)\setminus \mathbb {L}(s)\), respectively. The set of leaf boxes of a regular paving \(s\) with root box \({\varvec{x}}_{\rho }\) is denoted by \({\varvec{x}}_{\mathbb {L}(s)}\); it specifies a partition of the root box \({\varvec{x}}_{\rho }\). Let \(\mathbb {S}_k\) be the set of all regular pavings with root box \({\varvec{x}}_{\rho }\) made of *k* splits. Note that the number of leaf nodes is \(m = |\mathbb {L}(s)| = k+1\) if \(s \in \mathbb {S}_k\). The number of distinct binary trees with *k* splits is equal to the Catalan number \(C_k = \frac{1}{k+1}\binom{2k}{k}\). Let \(\mathbb {S}_{i:j} := \bigcup _{k=i}^{j} \mathbb {S}_k\) be the set of all regular pavings with *k* splits where \(k \in \{i,i+1,\ldots ,j\}\). Let the set of all regular pavings be \(\mathbb {S}_{0:\infty } := \lim _{j \rightarrow \infty } \mathbb {S}_{0:j}\).
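The Catalan number admits the closed form \(C_k = \frac{1}{k+1}\binom{2k}{k}\); a quick sketch (illustrative, not part of mrs2) checks it against the subtree-size recurrence for plane binary trees, where a root split leaves *j* splits in the left subtree and \(k-1-j\) in the right:

```python
from math import comb


def catalan(k: int) -> int:
    """Number of distinct regular paving trees (plane binary trees) with k splits."""
    return comb(2 * k, k) // (k + 1)


def count_rp_trees(k: int) -> int:
    """Count the same trees by recursing over the split counts of the two subtrees."""
    if k == 0:
        return 1  # the trivial paving: the root box alone
    return sum(count_rp_trees(j) * count_rp_trees(k - 1 - j) for j in range(k))
```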

A statistical regular paving (SRP) denoted by \(s\) is an extension of the RP structure that is able to act as a partitioned ‘container’ and responsive summarizer for multivariate data. An SRP can be used to create a histogram of a data set. A recursively computable statistic (Fisher 1925; Gray and Moore 2003) that an SRP node \({{\rho \mathsf {v}}}\) caches is \(\#{\varvec{x}}_{{{\rho \mathsf {v}}}}\), the count of the number of data points that fell into \({\varvec{x}}_{{{\rho \mathsf {v}}}}\). A leaf node \({{\rho \mathsf {v}}}\) with \(\#{\varvec{x}}_{{{\rho \mathsf {v}}}} > 0\) is a non-empty leaf node. The set of non-empty leaves of an SRP \(s\) is \(\mathbb {L}^{+}(s) := \{{{\rho \mathsf {v}}}\in \mathbb {L}(s) : \#{\varvec{x}}_{{{\rho \mathsf {v}}}} > 0\} \subseteq \mathbb {L}(s)\).

The SRP histogram density estimate \(f_{n}\) is obtained from the *n* data points that fell into \({\varvec{x}}_{\rho }\) of SRP \(s\) as follows: \(f_{n}(x) = \sum _{{{\rho \mathsf {v}}}\in \mathbb {L}(s)} \frac{\#{\varvec{x}}_{{{\rho \mathsf {v}}}}}{n \, {\mathrm{vol\,}}({\varvec{x}}_{{{\rho \mathsf {v}}}})} \, \mathbb {1}_{{\varvec{x}}_{{{\rho \mathsf {v}}}}}(x)\), where we suppress the dependence on \(s\) for notational convenience. SRP histograms have some similarities to dyadic histograms [see, e.g., Klemelä (2009, Chap. 18), Lu et al. (2013)]. Both are binary tree-based and partition so that a box may only be bisected at the mid-point of one of its coordinates, but the RP structure restricts partitioning further by only bisecting a box on its first widest coordinate in order to make \(\mathbb {S}_{0:\infty }\) closed under addition and scalar multiplication, thereby allowing for computationally efficient computer arithmetic over a dense set of simple functions [see Harlow et al. (2012) for statistical applications of this arithmetic]. Crucially, when data bursts have large sample sizes, this restrictive partitioning does not affect the \(L_1\) errors when compared to a computationally more expensive Bayes estimator (see Sect. 4).
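The histogram height on each leaf box is its count divided by \(n\) times the box volume, so the estimate integrates to one. A toy sketch with a hand-built partition (illustrative names, not mrs2's API):

```python
def volume(box):
    """Volume of a box given as a list of (lower, upper) intervals."""
    prod = 1.0
    for lo, hi in box:
        prod *= hi - lo
    return prod


def srp_histogram(leaves, n):
    """leaves: list of (box, count) pairs; returns list of (box, height) pairs,
    where height = count / (n * volume(box))."""
    return [(box, cnt / (n * volume(box))) for box, cnt in leaves]


# A partition of [0,4] into [0,2), [2,3), [3,4) holding 2, 1 and 1 of n=4 points:
leaves = [([(0.0, 2.0)], 2), ([(2.0, 3.0)], 1), ([(3.0, 4.0)], 1)]
hist = srp_histogram(leaves, n=4)
total_mass = sum(volume(box) * h for box, h in hist)
```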

A statistically equivalent block (SEB) partition of a sample space is a partitioning scheme that results in equal numbers of data points in each element (block) of the partition (Tukey 1947). The output of \(\texttt {SEBTreeMC}(s, \overline{\#}, \overline{m})\) of Algorithm 1 is \([s(0),s(1),\ldots ,s(T)]\), a sequence of SRP states visited by a sample path of the Markov chain \(\{S(t)\}_{t \in \mathbb {Z}_+}\) on \(\mathbb {S}_{0:\overline{m}-1}\), such that \(\mathbb {L}^{\bigtriangledown }(s(T)) = \emptyset \), or \(\#{\varvec{x}}_{{{\rho \mathsf {v}}}} \le \overline{\#}\) for all \({{\rho \mathsf {v}}}\in \mathbb {L}^{\bigtriangledown }(s(T))\), or \(|\mathbb {L}(s(T))| = \overline{m}\), where *T* is a corresponding random stopping time. As the initial state \(S(t=0)\) is the root \(s\in \mathbb {S}_0\), the Markov chain \(\{S(t)\}_{t \in \mathbb {Z}_+}\) on \(\mathbb {S}_{0:\overline{m}-1}\) satisfies \(S(t) \in \mathbb {S}_{t}\) for each \(t \in \mathbb {Z}_+\), i.e., the state at time *t* has \(t+1\) leaves or *t* splits. The operation may only be considered successful if \(|\mathbb {L}(s)| \le \overline{m}\) and \(\#{\varvec{x}}_{{{\rho \mathsf {v}}}} \le \overline{\#}\) for all \({{\rho \mathsf {v}}}\in \mathbb {L}^{\bigtriangledown }(s)\). Therefore, the sequence of SRP histogram states visited by \(\texttt {SEBTreeMC}\) that successfully terminates at stopping time *T* will have a terminal histogram with at most \(\overline{\#}\) of the *n* data points in each of its leaf nodes and at most \(\overline{m}\) leaf nodes.
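The stopping logic of \(\texttt {SEBTreeMC}\) can be sketched as a loop that keeps bisecting a most-populated leaf until every leaf holds at most \(\overline{\#}\) points or \(\overline{m}\) leaves exist (a simplified plain-Python sketch; the actual Algorithm 1, including its randomized choice among most-populated leaves, is in mrs2):

```python
def bisect(box):
    """Split a box at the mid-point of its first widest coordinate."""
    widths = [hi - lo for lo, hi in box]
    iota = widths.index(max(widths))
    lo, hi = box[iota]
    mid = (lo + hi) / 2.0
    left, right = list(box), list(box)
    left[iota] = (lo, mid)
    right[iota] = (mid, hi)
    return iota, mid, left, right


def seb_tree_mc(points, root, count_max, leaf_max):
    """Grow an SRP-like partition by repeatedly bisecting a most-populated leaf.

    points: list of d-dimensional tuples; root: list of (lo, hi) intervals.
    Stops when the fullest leaf holds <= count_max points or leaf_max leaves exist.
    Returns the (box, points-in-box) leaves of the terminal state."""
    leaves = [(root, list(points))]
    while len(leaves) < leaf_max:
        i = max(range(len(leaves)), key=lambda j: len(leaves[j][1]))
        box, pts = leaves[i]
        if len(pts) <= count_max:
            break  # every leaf already satisfies the count threshold
        iota, mid, lbox, rbox = bisect(box)
        lpts = [p for p in pts if p[iota] < mid]
        rpts = [p for p in pts if p[iota] >= mid]
        leaves[i:i + 1] = [(lbox, lpts), (rbox, rpts)]
    return leaves


# 100 evenly spaced points on [0,1]: split until each leaf holds at most 5 points
pts = [(i / 100.0,) for i in range(100)]
leaves = seb_tree_mc(pts, [(0.0, 1.0)], count_max=5, leaf_max=64)
```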

## 3 Minimum distance estimation using statistical regular pavings

In this section, we show that the SRP histogram estimates produced by \(\texttt {SEBTreeMC}\) are asymptotically consistent, provided that the parameters \(\overline{\#}\) and \(\overline{m}\) scale with the sample size *n* at appropriate rates. This is done by proving the three conditions in Theorem 1 of Lugosi and Nobel (1996). We will need to show that as the number of sample points increases linearly, the following conditions are met:

- 1.
the number of leaf boxes grows sub-linearly;

- 2.
the partition grows sub-exponentially in terms of a combinatorial complexity measure;

- 3.
and the volume of the leaf boxes in the partition is shrinking.

Let \({\mathcal {L}}_{n}\) be the collection of partitions \(\mathbb {L}(\dot{s})\) of the root box reachable by \(\texttt {SEBTreeMC}\), and fix *n* points \(\{x_{1}, \ldots , x_{n}\} \in \left( \mathbb {R}^d\right) ^n\). Let \(\varPi \left( {\mathcal {L}}_{n}, \{x_{1}, \ldots , x_{n}\}\right) \) be the number of distinct partitions of the finite set \(\{x_{1}, \ldots , x_{n}\}\) that are induced by partitions \(\mathbb {L}(\dot{s}) \in {\mathcal {L}}_{n}\). Maximizing over all *n* points, the growth function of \({\mathcal {L}}_{n}\) is then \(\varPi ^{*}\left( {\mathcal {L}}_{n}\right) := \max _{\{x_{1}, \ldots , x_{n}\}} \varPi \left( {\mathcal {L}}_{n}, \{x_{1}, \ldots , x_{n}\}\right) \). The diameter of a set *A* is the maximum Euclidean distance between any two points of *A*, i.e., \(\mathrm{diam}(A) := \sup _{x,y\in A} \sqrt{\sum _{i=1}^d (x_i-y_i)^2}\). Thus, for a box \({\varvec{x}}= [\underline{x}_{1}, \overline{x}_{1}] \times \cdots \times [\underline{x}_{d}, \overline{x}_{d}]\), \(\mathrm{diam}({\varvec{x}}) = \sqrt{\sum _{i=1}^d(\overline{x}_i-\underline{x}_i)^2}\).

### Theorem 1

*Let \(X_{1}, X_{2}, \ldots \) be independent and identically distributed random vectors in \({\mathbb {R}}^{d}\) whose common distribution \(\mu \) has a non-atomic density f, i.e., \(\mu \ll \lambda \). Let \(\{S_{n}(i)\}_{i =0}^{\dot{I}}\) on \(\mathbb {S}_{0:\infty }\) be the Markov chain formed using \(\texttt {SEBTreeMC}\) (Algorithm 1) with terminal state \(\dot{s}\) and histogram estimate \(f_{n,{\dot{s}}}\) over the collection of partitions \({\mathcal {L}}_n\). As \(n \rightarrow \infty \), if \(\overline{\#}\rightarrow \infty \), \(\overline{\#}/n \rightarrow 0\), \(\overline{m}\ge n/\overline{\#}\) and \(\overline{m}/n \rightarrow 0\), then the density estimate \(f_{n,{\dot{s}}}\) is asymptotically consistent in \(L_{1}\), i.e., \(\int |f_{n,{\dot{s}}} - f| \, d\lambda \rightarrow 0\) almost surely.*

### *Proof*

Condition (a) is satisfied by the assumption that \(\overline{m}/n \rightarrow 0\) since \(m({\mathcal {L}}_n) \le \overline{m}\).

For condition (b), the number of distinct partitions of any *n*-point subset of \({\mathbb {R}}^{d}\) that are induced by the partitions in \({\mathcal {L}}_{n}\) is upper bounded by the size of the collection of partitions \({\mathcal {L}}_{n} \subseteq \mathbb {S}_{0:\overline{m}-1}\), i.e., \(\varPi ^{*}({\mathcal {L}}_{n}) \le |\mathbb {S}_{0:\overline{m}-1}| = \sum _{k=0}^{\overline{m}-1} C_k \le 4^{\overline{m}}\), where *k* is the number of splits and the last inequality uses \(C_k \le 4^k\). Taking logarithms and dividing by *n* on both sides of the above two equations, and using the assumption that \(\overline{m}/n \rightarrow 0\) as \(n \rightarrow \infty \), we can see that condition (b) is satisfied: \(\frac{1}{n} \log \varPi ^{*}({\mathcal {L}}_{n}) \le \frac{\overline{m}}{n} \log 4 \rightarrow 0\).

For condition (c), fix any \(\xi > 0\). Since \(\mu \ll \lambda \), there is a large enough *M*, such that \(\mu (\check{{\varvec{x}}}^{c}) < \xi \), where \(\check{{\varvec{x}}}^{c} := \mathbb {R}^d \setminus [-M,M]^d\). Consequently, it suffices to show that the leaf boxes intersecting \([-M,M]^d\) shrink in diameter. Because each split bisects a box along its first widest coordinate, for *i* large enough, we can upper bound \(m_{\gamma }\), the number of leaf boxes of diameter at least \(\gamma \) intersecting \([-M,M]^d\), by \((2M\sqrt{d}/\gamma )^d\), a quantity that is independent of *n*, such that the total \(\mu \)-measure of the leaf boxes that fail to shrink vanishes. Finally, boxes in \(\mathbb {IR}^d\) have Vapnik–Chervonenkis dimension 2*d* and shatter coefficient \(S(\mathbb {IR}^d, n) \le (e n/2d)^{2d}\) (Devroye et al. 1996, Thms. 12.5, 13.3 and p. 220), so the empirical measure \(\mu _n\) converges uniformly to \(\mu \) over boxes. Since the shatter coefficient bound is polynomial in *n* for each fixed *d*, the right-hand side of the Vapnik–Chervonenkis inequality can be made arbitrarily small for *n* large enough. This convergence in probability is equivalent to the following almost sure convergence by the bounded difference inequality: \(\sup _{{\varvec{x}}\in \mathbb {IR}^d} |\mu _n({\varvec{x}}) - \mu ({\varvec{x}})| \rightarrow 0\) almost surely. \(\square \)

Let \(\varTheta\) index a set of finitely many density estimates: \(\{f_{n, \theta }: \theta \in \varTheta \}\), such that \(\int f_{n,\theta } = 1\) for each \(\theta \in \varTheta\). We can index the SRP trees by \(\{s_{\theta } : \theta \in \varTheta \}\), where \(\theta\) is the sequence of leaf node depths that uniquely identifies the SRP tree, and denote the density estimate corresponding to \(s_{\theta }\) by \(f_{n,s_{\theta }}\) or simply by \(f_{n,\theta }\). Now, consider the asymptotically consistent path taken by the Markov chain of \(\texttt {SEBTreeMC}\). For a fixed sample size *n*, let \(\{ s_{\theta } : \theta \in \varTheta \}\) be an ordered subset of states visited by the Markov chain, with \(s_{\theta } \prec s_{\vartheta }\) if \(s_{\vartheta }\) is a refinement of \(s_{\theta }\), i.e., if \(s_{\theta }\) is visited before \(s_{\vartheta }\). The goal is to select the optimal estimate from \(|\varTheta |\) many candidates.

Given the *n* data points, use \(n-\varphi n\) points as the training set and the remaining \(\varphi n\) points as the validation set (by \(\varphi n\) we mean \(\lfloor \varphi n \rfloor \)). Denote the set of training data by \({\mathcal {T}} := \{x_1, \ldots , x_{n - \varphi n}\}\) and the set of validation data by \({\mathcal {V}} := \{x_{n - \varphi n + 1}, \ldots , x_{n}\} = \{y_1, \ldots , y_{\varphi n}\}\). For an ordered pair \((\theta , \vartheta ) \in \varTheta ^2\), with \(\theta \ne \vartheta \), the set \(A_{\theta ,\vartheta } := \left\{ x : f_{n-\varphi n,\theta }(x) > f_{n-\varphi n,\vartheta }(x) \right\} \) is called a *Scheffé set*. The *Yatracos class* (Yatracos 1985) is the collection of all such Scheffé sets over \(\varTheta \): \({\mathcal {A}}_{\varTheta } := \left\{ A_{\theta ,\vartheta } : (\theta , \vartheta ) \in \varTheta ^2, \theta \ne \vartheta \right\} \). The *minimum distance estimate* or MDE \(f_{n-\varphi n,\theta ^*}\) is the density estimate \(f_{n-\varphi n, \theta }\) constructed from the training set \({\mathcal {T}}\) with the smallest index \(\theta ^*\) that minimizes

\(\varDelta _{\theta } := \sup _{A \in {\mathcal {A}}_{\varTheta }} \left| \int _A f_{n-\varphi n, \theta } \, d\lambda - \mu _{\varphi n}(A) \right| , \qquad (3)\)

where \(\mu _{\varphi n}(A) := \frac{1}{\varphi n} \sum _{i=1}^{\varphi n} \mathbb {1}_{A}(y_i)\) is the empirical measure of the validation data \({\mathcal {V}}\).

Our approach to obtaining the MDE \(f_{n-\varphi n,\theta ^*}\) with optimal SRP \(s_{\theta ^*}\) exploits the partition refinement order in \(\{ s_{\theta } : \theta \in \varTheta \}\), a subset of states along the path taken by the \(\texttt {SEBTreeMC}\). Using nodes imbued with recursively computable statistics for both training and validation data, and a specialized collation according to SRPCollate (Algorithm 3) over SRPs, we compute the objective \(\varDelta _{\theta }\) in (3) using GetDelta (Algorithm 2) via a dynamically grown Yatracos Matrix with pointers to all Scheffé sets constituting the Yatracos class according to GetYatracos (Algorithm 4). We briefly outline the core ideas in these three algorithms next [see Appendix for their pseudocode and mrs2 Sainudiin et al. (2008–2019) for details].
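As a concrete toy version of this objective, the following sketch (one-dimensional, piecewise-constant candidates on a shared grid; the function names are introduced here for exposition and are far simpler than Algorithms 2–4, which operate on SRP trees) enumerates the Scheffé sets and evaluates \(\varDelta _{\theta }\) for each candidate:

```python
def delta_theta(heights, theta, validation, grid):
    """Delta_theta: sup over Scheffe sets A_{a,b} of
    | integral of f_theta over A  -  empirical validation measure of A |.

    heights[t][i] is candidate t's density height on cell i of a shared 1D grid;
    grid is a list of (lo, hi) cells; validation is a list of held-out points."""
    m = len(validation)
    best = 0.0
    for a in range(len(heights)):
        for b in range(len(heights)):
            if a == b:
                continue
            # Scheffe set A_{a,b}: cells where candidate a is taller than b
            cells = [i for i in range(len(grid)) if heights[a][i] > heights[b][i]]
            integral = sum(heights[theta][i] * (grid[i][1] - grid[i][0])
                           for i in cells)
            emp = sum(1 for y in validation
                      if any(grid[i][0] <= y < grid[i][1] for i in cells)) / m
            best = max(best, abs(integral - emp))
    return best


grid = [(0.0, 0.5), (0.5, 1.0)]
heights = [[1.0, 1.0],   # theta = 0: uniform on [0,1]
           [1.6, 0.4]]   # theta = 1: left-heavy histogram
validation = [0.1, 0.3, 0.6, 0.8]  # balanced across the two cells
deltas = [delta_theta(heights, t, validation, grid) for t in range(2)]
```

Here the uniform candidate matches the validation measure on every Scheffé set, so the MDE selects \(\theta = 0\).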

The objective \(\varDelta _{\theta }\) is computed with the help of a *collator regular paving* (CRP), where the space of CRP trees is also \(\mathbb {S}_{0:\infty }\). Consider now two SRPs \(s_{\theta }\) and \(s_{\vartheta }\) for which the corresponding histogram estimates \(f_{n, \theta }\) and \(f_{n, \vartheta }\) are computed. Both SRPs \(s_{\theta }\) and \(s_{\vartheta }\) have the same root box \({\varvec{x}}_{\rho }\). By collating the two SRPs, we get a CRP *c* with the same root box and the tree obtained from a union of \(s_{\theta }\) and \(s_{\vartheta }\). Unlike the union operation over RPs (Harlow et al. 2012, Algorithm 1), each node \(\rho v\) of the SRP collator *c* stores \(f_{n, {\theta }}\) and \(f_{n, {\vartheta }}\) as a vector \({\varvec{f}}_{n, c}(\rho v) := (f_{n, \theta }(\rho v), f_{n, \vartheta }(\rho v))\). The empirical measure of the validation data \(\mu _{\varphi n}({\varvec{x}}_{\rho v})\) is also stored at each node \(\rho v\) and can be easily accessed via pointers. Figure 5 shows how CRP *c* can collate two SRPs \(s_{\theta }\) and \(s_{\vartheta }\) using SRPCollate.
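A flat stand-in for this collation (illustrative and one-dimensional; the real CRP is a tree with node pointers, per SRPCollate) refines two piecewise-constant histograms to their common partition and stores both heights as a vector per cell:

```python
def collate(hist_a, hist_b):
    """Collate two 1D piecewise-constant histograms onto one partition whose
    cells carry the vector of both heights (a flat stand-in for a CRP tree).

    Each histogram is a list of (lo, hi, height) cells covering [lo, hi).
    Returns a list of (lo, hi, (height_a, height_b))."""
    cuts = sorted({lo for lo, _, _ in hist_a} | {hi for _, hi, _ in hist_a}
                  | {lo for lo, _, _ in hist_b} | {hi for _, hi, _ in hist_b})

    def height(hist, x):
        for lo, hi, h in hist:
            if lo <= x < hi:
                return h
        return 0.0

    out = []
    for lo, hi in zip(cuts, cuts[1:]):
        mid = (lo + hi) / 2.0  # sample each refined cell at its mid-point
        out.append((lo, hi, (height(hist_a, mid), height(hist_b, mid))))
    return out


# Two histograms on [0,1): one split at 0.5, the other at 0.25
a = [(0.0, 0.5, 1.6), (0.5, 1.0, 0.4)]
b = [(0.0, 0.25, 2.0), (0.25, 1.0, 2 / 3)]
c = collate(a, b)
total_a = sum((hi - lo) * ha for lo, hi, (ha, hb) in c)
total_b = sum((hi - lo) * hb for lo, hi, (ha, hb) in c)
```

Collation preserves both densities exactly on the refined partition, so each coordinate of the height vectors still integrates to one.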

We now use Theorem 10.1 of Devroye and Lugosi (2001, p. 99) and Theorem 6.6 of Devroye and Lugosi (2001, p. 54) to obtain the \(L_1\)-error bound of the minimum distance estimate \(f_{n-\varphi n,\theta ^*}\), with \(\theta ^*\in \varTheta\) and \(|\varTheta | < \infty\).

### Theorem 2

*If \(\int f_{n-\varphi n, \theta } = 1\) for all \(\theta \in \varTheta \), then for the minimum distance estimate \(f_{n-\varphi n, \theta ^*}\) obtained by minimizing \(\varDelta _{\theta }\) in (3), we have \(\int \left| f_{n-\varphi n, \theta ^*} - f \right| \, d\lambda \le 3 \min _{\theta \in \varTheta } \int \left| f_{n-\varphi n, \theta } - f \right| \, d\lambda + 4 \varDelta ,\) where \(\varDelta := \sup _{A \in {\mathcal {A}}_{\varTheta }} \left| \int _A f \, d\lambda - \mu _{\varphi n}(A) \right| .\)*

Theorem 2 can be proved directly by a conditional application of Theorem 6.3 of Devroye and Lugosi (2001, p. 54) and is nothing but the finite-\(\varTheta \) version of their Theorem 10.1 (Devroye and Lugosi 2001, p. 99) without the additional 3/*n* term, owing to \(|\varTheta |<\infty \).

Since *f* is unknown, \(\varDelta \) cannot be computed directly; when \(2^n > |{\mathcal {A}}_{\varTheta }|\), \(\varDelta \) may be approximated using the cardinality bound (Devroye et al. 1996, Theorem 13.6, p. 219) for the shatter coefficient of \({\mathcal {A}}_{\varTheta }\). Given \(\{x_1, \ldots , x_n\}\), the *n*th shatter coefficient of \({\mathcal {A}}_{\varTheta }\) is defined as \(S({\mathcal {A}}_{\varTheta }, n) := \max _{\{x_1, \ldots , x_n\}} \left| \left\{ \{x_1, \ldots , x_n\} \cap A : A \in {\mathcal {A}}_{\varTheta } \right\} \right| \). Since \(\varTheta \) is finite, the *n*th shatter coefficient is bounded as follows: \(S({\mathcal {A}}_{\varTheta }, n) \le |{\mathcal {A}}_{\varTheta }| \le |\varTheta | \left( |\varTheta | - 1\right) \).

### Theorem 3

*Let \(0< \varphi < 1/2\) and \(n < \infty \). Let the finite set \(\varTheta \) determine a class of adaptive multivariate histograms based on statistical regular pavings with \(\int f_{n-\varphi n, \theta } = 1\) for all \(\theta \in \varTheta \). Let \(f_{n, \theta ^*}\) be the minimum distance estimate. Then for all n, \(\varphi n\), \(\varTheta \) and \(f \in L_1\):*

\(\mathbb {E}\int \left| f_{n-\varphi n, \theta ^*} - f \right| \, d\lambda \le 3 \min _{\theta \in \varTheta } \mathbb {E}\int \left| f_{n-\varphi n, \theta } - f \right| \, d\lambda + 8 \sqrt{\frac{\log \left( 2\, S({\mathcal {A}}_{\varTheta }, \varphi n)\right) }{2 \varphi n}}.\)

### *Proof*

## 4 Performance evaluation

### 4.1 Practical minimum distance estimation

To effectively use the error bound, we need to ensure that \(|\varTheta |\) is not too large and the densities in \(\varTheta\) are close to the true density *f*. Next, we highlight the effectiveness and limitations of our MDE.

The size of \(\varTheta\) is kept small (typically less than 100) and independent of *n* by an adaptive search. Note that \(|\varTheta |\) is upper-bounded by \(\overline{m}\) if we were to exhaustively consider each SRP state along the entire path of the \(\texttt {SEBTreeMC}\) in \(\varTheta\), our set of candidate SRP partitions. Such an exhaustive approach is computationally inefficient as the Yatracos matrix that updates the Scheffé sets grows quadratically with \(|\varTheta |\). We take a simple adaptive search approach by considering only *k* (typically \(10 \le k \le 20\)) SRP states in each iteration. In the initial iteration, we add *k* states to \(\varTheta\) by picking uniformly spaced states from a long-enough \(\texttt {SEBTreeMC}\) path that starts from the root node and ends at a state with a large number of leaves and a significantly higher \(\varDelta _{\theta }\) score than its preceding states. Then, we simply zoom-in around the states with the lowest \(\varDelta _{\theta }\) values and add another *k* states along the same \(\texttt {SEBTreeMC}\) path close to such optimal states from the first iteration. We repeat this adaptive search process until we are unable to zoom-in further. Typically, we are able to find nearly optimal states within 5 or fewer iterations. By Theorem 1, we know that the histogram partitioning strategy of \(\texttt {SEBTreeMC}\) is asymptotically consistent. Thus, the adaptive search set \(\varTheta\) that is selected iteratively from the set of histogram states along the path of \(\texttt {SEBTreeMC}\) with optimal \(\varDelta _{\theta }\) values will naturally contain densities that approach *f* as *n* increases. 
However, the rate at which the \(L_1\) distance between the best density in \(\varTheta \) and *f* approaches 0 will depend on the complexity of *f*, in terms of the number of leaves needed to uniformly approximate *f* using simple functions with SRP partitions; this class of simple functions is dense in \({\mathcal {C}}({\varvec{x}}_{\rho },\mathbb {R})\), the algebra of real-valued continuous functions over the root box \({\varvec{x}}_{\rho }\), by the Stone–Weierstrass Theorem (Harlow et al. 2012, Theorem 4.1). This dependence on the structural complexity of *f* is evaluated next.
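The zoom-in schedule of the adaptive search can be sketched as follows (a simplified sketch; the hypothetical `score` callback stands in for evaluating \(\varDelta _{\theta }\) at a state along the \(\texttt {SEBTreeMC}\) path):

```python
def adaptive_search(path_len, score, k=10, iters=5):
    """Iteratively zoom in on the lowest-scoring state along a split path.

    score(i) returns the objective of the i-th state on a path of path_len
    states. Starts from k roughly uniformly spaced states, then re-samples k
    states inside a shrinking window around the current best state."""
    lo, hi = 0, path_len - 1
    best = None
    for _ in range(iters):
        step = max(1, (hi - lo) // (k - 1))
        states = list(range(lo, hi + 1, step))
        best = min(states, key=score)          # state with lowest objective
        lo = max(lo, best - step)              # zoom the window in around it
        hi = min(hi, best + step)
        if step == 1:
            break  # cannot zoom in further
    return best


# Example: a convex score minimized at state 137 on a path of 1000 states
best = adaptive_search(1000, score=lambda i: (i - 137) ** 2, k=10)
```

With \(k = 10\) probes per iteration, the window shrinks geometrically, so a handful of iterations suffices, matching the five-or-fewer iterations reported above.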

### 4.2 Simulations

When the dimension *d* of the uniform density on \([0,1]^d\) ranges in \(\{1,10,100,1000\}\), the true density is given by the root box, i.e., the first candidate density indexed by \(\varTheta \). Based on the mean integrated absolute errors (MIAE) shown in Table 1 for each *d* and *n* in \(\{10^2, 10^3, 10^4, 10^5, 10^6, 10^7\}\), the MDE exhibits dimension-free performance for such a target density that immediately belongs to the set of candidate densities indexed by \(\varTheta \). The sample mean of the integrated absolute errors was taken over five replicate simulations, with standard error less than half of the MIAE values. When the sample size is \(10^7\) and the dimension is 1000, the data cannot be represented in a machine with 32 GB of memory (as indicated by the '–' entry in Table 1).

The MIAE for MDE with different sample sizes *n* for the 1D-, 10D-, 100D- and 1000D-Uniform densities

| Dimension | \(10^7\) | \(10^6\) | \(10^5\) | \(10^4\) | \(10^3\) | \(10^2\) |
|---|---|---|---|---|---|---|
| 1 | \(3.439\mathrm{{e}}-04\) | \(1.981\mathrm{{e}}-03\) | \(2.866\mathrm{{e}}-03\) | \(1.405\mathrm{{e}}-02\) | \(3.237\mathrm{{e}}-02\) | \(1.000\mathrm{{e}}-01\) |
| 10 | \(3.606\mathrm{{e}}-04\) | \(1.803\mathrm{{e}}-03\) | \(2.689\mathrm{{e}}-03\) | \(1.156\mathrm{{e}}-02\) | \(3.191\mathrm{{e}}-02\) | \(1.470\mathrm{{e}}-01\) |
| 100 | \(3.446\mathrm{{e}}-04\) | \(1.908\mathrm{{e}}-03\) | \(2.953\mathrm{{e}}-03\) | \(1.540\mathrm{{e}}-02\) | \(2.898\mathrm{{e}}-02\) | \(1.520\mathrm{{e}}-01\) |
| 1000 | – | \(1.720\mathrm{{e}}-03\) | \(2.576\mathrm{{e}}-03\) | \(1.619\mathrm{{e}}-02\) | \(2.998\mathrm{{e}}-02\) | \(1.125\mathrm{{e}}-01\) |

We also evaluate the MDE on standard Gaussian and Rosenbrock target densities in *d* dimensions for various sample sizes:

The MIAE for MDE and posterior mean estimates with different sample sizes for the 1D-, 2D-, and 5D-Gaussian densities, as well as the 2D- and 5D-Rosenbrock densities

Minimum distance estimate's mean \(L_1(f_{n,\theta ^*},f)\), \(L_1(f_{n,\theta ^*},f)-\min _{\theta \in \varTheta } L_1(f_{n,\theta },f)\):

| *n* | 1D Gaussian | 2D Gaussian | 5D Gaussian | 2D Rosenbrock | 5D Rosenbrock |
|---|---|---|---|---|---|
| \(10^2\) | 0.4154, 0.0348 | 0.6018, 0.0325 | 1.4944, 0.1093 | 1.1843, 0.0208 | 1.6853, 0.0424 |
| \(10^3\) | 0.2643, 0.0091 | 0.3515, 0.0144 | 0.8521, 0.0053 | 0.7533, 0.0119 | 1.3323, 0.0061 |
| \(10^4\) | 0.0888, 0.0058 | 0.2038, 0.0044 | 0.6764, 0.0020 | 0.4502, 0.0050 | 1.0154, 0.0018 |
| \(10^5\) | 0.0504, 0.0046 | 0.1140, 0.0014 | 0.4744, 0.0006 | 0.2476, 0.0024 | 0.7278, 0.0060 |
| \(10^6\) | 0.0204, 0.0014 | 0.0656, 0.0014 | 0.3310, 0.0006 | 0.1430, 0.0006 | 0.4772, 0.0034 |
| \(10^7\) | 0.0100, 0.0004 | 0.0376, 0.0002 | 0.2548, 0.0014 | 0.0828, 0.0012 | 0.2661, 0.0016 |

MCMC posterior mean estimate's MIAE (standard error):

| *n* | 1D Gaussian | 2D Gaussian | 5D Gaussian | 2D Rosenbrock | 5D Rosenbrock |
|---|---|---|---|---|---|
| \(10^4\) | 0.0565 (0.0053) | 0.1673 (0.0046) | 0.6467 (0.0051) | 0.3717 (0.0103) | 1.0190 (0.0059) |
| \(10^5\) | 0.0274 (0.0011) | 0.0932 (0.0002) | 0.4655 (0.0020) | 0.1982 (0.0067) | 0.7250 (0.0011) |
| \(10^6\) | 0.0129 (0.0006) | 0.0533 (0.0005) | 0.3274 (0.0009) | 0.1102 (0.0006) | 0.4812 (0.0012) |
| \(10^7\) | 0.0060 (0.0001) | 0.0304 (0.0002) | 0.2292 (0.0034) | 0.0608 (0.0049) | 0.3302 (0.0004) |

The sample standard deviations about the mean integrated absolute errors (MIAEs) for the MDE method, i.e., \(L_1(f_{n,\theta ^*},f)\) (shown in the top panel of Table 2), based on ten trials, are below \(10^{-3}\) and \(10^{-4}\) for values of *n* in \(\{10^4,10^5\}\) and \(\{10^6,10^7\}\), respectively; these standard errors are therefore not shown. However, the excess \(L_1\) error of the MDE over the best estimate in the candidate set \(\varTheta \), \(L_1(f_{n,\theta ^*},f)-\min _{\theta \in \varTheta } L_1(f_{n,\theta },f)\), is shown in Table 2 for each density and sample size. For comparison, as shown in the bottom panel of Table 2, we used the Bayes estimator given by the posterior mean histograms (see Sainudiin et al. 2013 for details on this evaluation). Note how the \(L_1\) errors decrease with the sample size and how the errors are comparable between the methods, although the MDE method is at least an order of magnitude faster than the posterior mean histogram, which, like most density estimators, does not provide universal performance guarantees.

Due to the use of space-partitioning regularly paved trees, our MDE histograms cannot provide small \(L_1\) errors for highly structured densities beyond 10 or so dimensions on the basis of sample sizes in the order of millions. The reason is simply due to the large \(L_1\) distance between the best candidate density in \(\varTheta\) based on a reasonable maximal number of splits. However, using modern dimensionality reduction techniques including auto-encoders we can often project high dimensional data nonlinearly to a lower dimensional space and use the MDE histograms to construct a density estimate and do further statistical processing as we show below.

All experiments were performed on the same physical machine, which is currently considered commodity hardware (see Sainudiin et al. 2013 for machine specifications and CPU times).

### 4.3 Detecting bot flows using MDE tail probabilities

We apply the MDE histogram on the real-world scenario 11 of the CTU-13 dataset of botnet traffic on a computer network (Garcia et al. 2014). The dataset captured 8164 real botnet traffic flows mixed with 99087 normal and background traffic flows. These flows were augmented into 80 dimensions using Word2Vec embeddings of the flows (datapoints) and reduced to 8 dimensions by training a deep auto-encoder with a bottleneck layer of eight nodes by Ramström (2019), who gives domain-specific details on how the raw data was augmented and fitted to an auto-encoder. For the purpose of our application, it suffices to note that a deep auto-encoder was trained on appropriately augmented normal flows in order to non-linearly reduce the dimensions from 80 to 8. We then used the \(n=99087\) samples in 8 dimensions to obtain the MDE histogram. For each data point *x* from the normal as well as the botnet flow, we computed its multivariate tail probability (Harlow et al. 2012, Algorithm 9). Briefly, this is given by 1 minus the sum of the probability mass of all leaf nodes in the MDE histogram whose density (“height”) is larger than that of the leaf node whose box contains *x*. The tail probability can be directly used as a score of how unlikely an event is under the density estimate constructed with the normal flows. We obtain these tail probabilities for all 107251 flows (mixed with 7.6% botnet flows) and sort them by their tail probabilities. Our histogram estimate was able to identify 87% and 99.1% of the botnet flows, i.e., 7115 and 8090 out of 8164 botnet flows were within the lowest 7.6% and 10% of the tail probabilities, respectively. Thus, using the tail probabilities of the MDE histogram estimated from the normal flows was extremely effective in identifying the botnet flows.
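The tail-probability score can be sketched as follows (a minimal stand-in for Algorithm 9 of Harlow et al. (2012), with illustrative names; the real version traverses the SRP tree rather than a flat leaf list):

```python
def tail_probability(leaves, x):
    """Anomaly score of point x under a histogram given as (box, height) leaves.

    Returns 1 minus the total probability mass of all leaves whose density
    exceeds that of the leaf containing x; small values flag unlikely events."""
    def contains(box, x):
        return all(lo <= xi < hi for (lo, hi), xi in zip(box, x))

    def vol(box):
        v = 1.0
        for lo, hi in box:
            v *= hi - lo
        return v

    hx = next(h for box, h in leaves if contains(box, x))
    mass_above = sum(h * vol(box) for box, h in leaves if h > hx)
    return 1.0 - mass_above


# Histogram on [0,3) with masses 0.6, 0.3, 0.1 on its three unit-width leaves
leaves = [([(0.0, 1.0)], 0.6), ([(1.0, 2.0)], 0.3), ([(2.0, 3.0)], 0.1)]
scores = [tail_probability(leaves, (0.5,)), tail_probability(leaves, (2.5,))]
```

A point in the tallest leaf scores 1 (completely typical), while a point in the lowest-density leaf scores the mass of that highest-density region's complement, which is why sorting flows by this score surfaces the botnet traffic.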

## 5 Conclusion and future directions

Thus, using the collator regular paving (CRP), we obtain the minimum distance estimate (MDE) with universal performance guarantees. All the methods are implemented and available in mrs2 Sainudiin et al. (2008–2019), including the downstream statistical operations for anomaly detection and conditional density regression (Harlow et al. 2012). We limited our minimum distance estimate (MDE) to the candidate set given by the SRP histograms visited along the path of the Markov chain \(\texttt {SEBTreeMC}\). This was done to take advantage of the structure of consecutive refinements of the tree partitions along a single path of \(\texttt {SEBTreeMC}\).

However, obtaining the MDE from an arbitrary set of SRP histograms taken from \(\mathbb {S}_{0:\infty }\) will need more sophisticated collators. Initial experiments using the Scheffé tournament approach (as opposed to the MDE) to find the best estimate in a candidate set of arbitrary SRP histograms (not just those along a path in \(\mathbb {S}_{0:\infty }\)) look feasible. Such a Scheffé tournament will allow us to compare estimates from entirely different methodological schools (Bayesian, penalized likelihood, etc.). Finally, the pure tree structure allows one to possibly extend this MDE to a distributed fault-tolerant computational setting such as Zaharia et al. (2016) as the sample size becomes too large for the memory of a single machine.

## Notes

### Acknowledgements

RS and GT proved the theorems, and GT implemented the three MDE algorithms based on code by Jennifer Harlow and RS in mrs2. This research began from a conversation RS had with Luc Devroye at the World Congress in Probability and Statistics in 2008. It was partly supported by RS’s external consulting revenues from the New Zealand Ministry of Tourism, a UC College of Engineering Sabbatical Grant and a Visiting Scholarship at the Department of Mathematics, Cornell University, Ithaca, NY, USA, and completed through the project CORCON: Correctness by Construction, Seventh Framework Programme of the European Union, Marie Curie Actions-People, International Research Staff Exchange Scheme (IRSES), with counter-part funding from the Royal Society of New Zealand. The application to botnet detection was partially supported by Combient Mix AB and the Department of Mathematics, Uppsala University.

### Funding

This study was funded by FP7 People: Marie-Curie Actions (project CORCON) (Grant number: 612638).

## References

- Devroye, L., Györfi, L., & Lugosi, G. (1996). *A probabilistic theory of pattern recognition*. New York: Springer-Verlag.
- Devroye, L., & Lugosi, G. (2001). *Combinatorial methods in density estimation*. New York: Springer-Verlag.
- Devroye, L., & Lugosi, G. (2004). Bin width selection in multivariate histograms by the combinatorial method. *TEST*, *13*(1), 129–145.
- Fisher, R. A. (1925). Theory of statistical estimation. *Mathematical Proceedings of the Cambridge Philosophical Society*, *22*, 700–725.
- Garcia, S., Grill, M., Stiborek, H., & Zunino, A. (2014). An empirical comparison of botnet detection methods. *Computers and Security Journal*, *45*, 100–123.
- Gray, A. G., & Moore, A. W. (2003). Nonparametric density estimation: Towards computational tractability. In *SIAM international conference on data mining* (pp. 203–211). San Francisco, CA: SIAM.
- Harlow, J., Sainudiin, R., & Tucker, W. (2012). Mapped regular pavings. *Reliable Computing*, *16*, 252–282.
- Kieffer, M., Jaulin, L., Braems, I., & Walter, E. (2001). Guaranteed set computation with subpavings. In W. Kraemer & J. Gudenberg (Eds.), *Scientific computing, validated numerics, interval methods: Proceedings of SCAN 2000* (pp. 167–178). New York: Kluwer Academic Publishers.
- Klemelä, J. (2009). *Smoothing of multivariate data: Density estimation and visualization*. Chichester: Wiley.
- Lu, L., Jiang, H., & Wong, W. H. (2013). Multivariate density estimation by Bayesian sequential partitioning. *Journal of the American Statistical Association*, *108*(504), 1402–1410. https://doi.org/10.1080/01621459.2013.813389
- Lugosi, G., & Nobel, A. (1996). Consistency of data-driven histogram methods for density estimation and classification. *The Annals of Statistics*, *24*(2), 687–706.
- Mahalanabis, S., & Stefankovic, D. (2008). Density estimation in linear time. In R. A. Servedio & T. Zhang (Eds.), *21st annual conference on learning theory—COLT 2008* (pp. 503–512). Helsinki, Finland: Omnipress.
- Mattarei, S. (2010). Asymptotics of partial sums of central binomial coefficients and Catalan numbers. arXiv:0906.4290v3.
- Meier, J. (2008). *Groups, graphs and trees: An introduction to the geometry of infinite groups*. Cambridge: Cambridge University Press.
- Ramström, K. (2019). Botnet detection on flow data using the reconstruction error from autoencoders trained on Word2Vec network embeddings. MSc thesis, Uppsala University.
- Sainudiin, R., Teng, G., Harlow, J., & Lee, D. S. (2013). Posterior expectation of regularly paved random histograms. *ACM Transactions on Modeling and Computer Simulation*, *23*(26), 6:1–6:20.
- Sainudiin, R., York, T., Harlow, J., Teng, G., Tucker, W., & George, D. (2008–2019). MRS 2.0, a C++ class library for statistical set processing and computer-aided proofs in statistics. https://github.com/lamastex/mrs2
- Samet, H. (1990). *The design and analysis of spatial data structures*. Boston: Addison-Wesley Longman.
- Stanley, R. P. (1999). *Enumerative combinatorics, Vol. 2*. Cambridge Studies in Advanced Mathematics, vol. 62. Cambridge: Cambridge University Press.
- Tukey, J. W. (1947). Non-parametric estimation II. Statistically equivalent blocks and tolerance regions—The continuous case. *The Annals of Mathematical Statistics*, *18*(4), 529–539.
- Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. *Theory of Probability and Its Applications*, *16*, 264–280.
- Yatracos, Y. G. (1985). Rates of convergence of minimum distance estimators and Kolmogorov’s entropy. *The Annals of Statistics*, *13*(2), 768–774.
- Yatracos, Y. G. (1988). A note on \(L_1\) consistent estimation. *The Canadian Journal of Statistics*, *16*(3), 283–292.
- Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., et al. (2016). Apache Spark: A unified engine for big data processing. *Communications of the ACM*, *59*(11), 56–65. https://doi.org/10.1145/2934664

## Copyright information

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.