On the positivity and magnitudes of Bayesian quadrature weights
Abstract
This article reviews and studies the properties of Bayesian quadrature weights, which strongly affect stability and robustness of the quadrature rule. Specifically, we investigate conditions that are needed to guarantee that the weights are positive or to bound their magnitudes. First, it is shown that the weights are positive in the univariate case if the design points locally minimise the posterior integral variance and the covariance kernel is totally positive (e.g. Gaussian and Hardy kernels). This suggests that gradient-based optimisation of design points may be effective in constructing stable and robust Bayesian quadrature rules. Second, we show that magnitudes of the weights admit an upper bound in terms of the fill distance and separation radius if the RKHS of the kernel is a Sobolev space (e.g. Matérn kernels), suggesting that quasi-uniform points should be used. A number of numerical examples demonstrate that significant generalisations and improvements appear to be possible, manifesting the need for further research.
Keywords
Bayesian quadrature · Probabilistic numerics · Gaussian processes · Chebyshev systems · Stability

1 Introduction
This question is important both conceptually and practically. On the conceptual side, positive weights are more natural, given that the weighted sample \((w_i,{\varvec{x}}_i)_{i=1}^n\) can be interpreted as an approximation of the positive probability measure \(\nu \); in fact, the Bayesian quadrature weights provide the best approximation of the representer of \(\nu \) in the reproducing kernel Hilbert space (RKHS) of the covariance kernel, provided that \({\varvec{x}}_1,\dots ,{\varvec{x}}_n\) are fixed (see Sect. 2.2). Thus, if the weights are positive, then each weight \(w_i\) can be interpreted as representing the “importance” of the associated point \({\varvec{x}}_i\) for approximating \(\nu \). This interpretation may be more acceptable to users familiar with Monte Carlo methods, encouraging them to adopt Bayesian quadrature.
On the practical side, quadrature rules with positive weights enjoy the advantage of being numerically more stable against errors in integrand evaluations. In fact, besides Monte Carlo methods, many other practically successful or in some sense optimal rules have positive weights. Some important examples include Gaussian (Gautschi 2004, Section 1.4.2) and Clenshaw–Curtis quadrature (Clenshaw and Curtis 1960) and their tensor product extensions. Other domains besides subsets of \({\mathbb {R}}^d\) have also received their share of attention. For instance, positive-weight rules on the sphere are constructed in Mhaskar et al. (2001) and interesting results connecting fill distance and positivity of the weights of quadrature rules on compact Riemannian manifolds appear in Breger et al. (2018). It is also known that, in some typical function classes, such as Sobolev spaces, optimal rates of convergence can be achieved by considering only positive-weight quadrature rules; see for instance Novak (1999, Section 1) and references therein. Therefore, if one can find conditions under which Bayesian quadrature weights are positive, then these conditions may be used as guidelines in the construction of numerically stable Bayesian quadrature rules.
This article reviews existing, and derives new, results on the properties of Bayesian quadrature weights, focusing in particular on their positivity and magnitude. One of our principal aims is to stimulate new research on quadrature weights in the context of probabilistic numerics. While convergence rates of Bayesian quadrature rules have been studied extensively in recent years (Briol et al. 2019; Kanagawa et al. 2016, 2019), analysis of the weights themselves has not attracted much attention. On the other hand, the early work on kernel-based quadrature by Larkin (1970), Richter-Dyn (1971a) and Barrar and Loeb (1976) in the 1970s [see Oettershagen (2017) for a recent review] already revealed certain interesting properties of the Bayesian quadrature weights. These results seem not to be well known in the statistics and machine learning communities. Moreover, there are some useful results in the literature on scattered data approximation (De Marchi and Schaback 2010) that can be used to analyse the properties of Bayesian quadrature weights. The basics of Bayesian quadrature are reviewed in Sect. 2, while the main contents of the article, including simulation results, are presented in Sects. 3 and 4.
In Sect. 3, we present results concerning positivity of the Bayesian quadrature weights. We discuss results on the number of the weights that must be positive, focusing on the univariate case and totally positive kernels (Definition 2). Corollary 1, the main result of this section, states that all the weights are positive if the design points are locally optimal. A practically relevant consequence is that the weights may then be expected to be positive when the design points are obtained by gradient descent, which is guaranteed to provide locally optimal points [see e.g. Lee et al. (2016)].
Section 4 focuses on results on the magnitudes of the weights. More specifically, we discuss the behaviour of the sum of absolute weights, \(\sum_{i=1}^n \big| w_{X,i}^\text{BQ} \big|\), which strongly affects the stability and robustness of Bayesian quadrature. If this quantity is small, the quadrature rule is robust against misspecification of the Gaussian process prior (Kanagawa et al. 2019) and against errors in integrand evaluations (Förster 1993) and in the kernel means (Sommariva and Vianello 2006a, pp. 298–300). This quantity is also related to the numerical stability of the quadrature rule. Using a result on the stability of kernel interpolants by De Marchi and Schaback (2010), we derive an upper bound on the sum of absolute weights for some typical cases in which the Gaussian process has a finite degree of smoothness and the RKHS induced by the covariance kernel is norm-equivalent to a Sobolev space.
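The quantities above are straightforward to compute once the kernel mean is available in closed form. The following sketch (Python; the Gaussian kernel, a uniform \(\nu\) on [0, 1] and the length-scale are illustrative choices, not taken from this article) solves the kernel linear system for the weights and evaluates the resulting sum of absolute weights.

```python
import numpy as np
from math import erf, sqrt, pi

ELL = 0.3  # illustrative length-scale

def kernel(x, y):
    # Gaussian (squared exponential) covariance kernel
    return np.exp(-(x - y) ** 2 / (2 * ELL ** 2))

def kernel_mean(xi):
    # z_i = int_0^1 k(x, x_i) dx for uniform nu on [0, 1], via erf
    c = ELL * sqrt(pi / 2)
    return c * (erf((1 - xi) / (sqrt(2) * ELL)) + erf(xi / (sqrt(2) * ELL)))

def bq_weights(X):
    # Bayesian quadrature weights solve the linear system K_X w = z
    K = kernel(X[:, None], X[None, :])
    z = np.array([kernel_mean(xi) for xi in X])
    return np.linalg.solve(K, z)

X = np.linspace(0.05, 0.95, 7)
w = bq_weights(X)
stability = np.abs(w).sum()  # sum of absolute weights
```

Since the rule approximately integrates constants, the (signed) sum of the weights is close to \(\nu(\varOmega) = 1\), while the sum of absolute weights exceeds it whenever any weight is negative.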
2 Bayesian quadrature
This section defines a Bayesian quadrature rule as the integral of the posterior of a Gaussian process used to model the integrand. We also discuss the equivalent characterisation of this quadrature rule as the worst-case optimal integration rule in the RKHS \({\mathcal {H}}(k)\) induced by the covariance kernel k of the Gaussian process.
2.1 Basics of Bayesian quadrature
2.2 Reproducing kernel Hilbert spaces
An alternative interpretation of Bayesian quadrature weights is that they are, for the given points, the worstcase optimal weights in the reproducing kernel Hilbert space \({\mathcal {H}}(k)\) induced by the covariance kernel k. The material of this section is contained in, for example, Briol et al. (2019, Section 2), Oettershagen (2017, Section 3.2) and Karvonen and Särkkä (2018a, Section 2). For a comprehensive introduction to RKHSs, see the monograph of Berlinet and ThomasAgnan (2011).
3 Positivity
This section reviews existing results on the positivity of the weights of Bayesian quadrature that can be derived in one dimension when the covariance kernel is totally positive. This assumption, given in Definition 2, is stronger than positive-definiteness but is satisfied by, for example, the Gaussian kernel. For most of the section, we assume that \(d = 1\) and \(\varOmega = [a,b]\) for \(a < b\). Furthermore, the measure \(\nu \) is typically assumed to admit a density function with respect to the Lebesgue measure,^{1} an assumption that implies \(I_\nu (f) > 0\) if \(f(x) > 0\) for almost every \(x \in \varOmega \).

Theorem 1: At least one half of the weights of any Bayesian quadrature rule are positive.

Corollary 1: All the weights are positive when the points are selected so that the integral posterior variance in (2) is locally minimised in the sense that each of its n partial derivatives with respect to the integration points vanishes (Definition 3).
As no multivariate extension of the theory used to prove the aforementioned results appears to have been developed, we do not provide any general theoretical results on the weights in higher dimensions. However, some special cases based on, for example, tensor products are discussed in Sects. 3.7 and 3.9 and two numerical examples are used to provide some evidence for the conjectures that multivariate versions of Theorem 1 and Corollary 1 hold.
It will turn out that optimal Bayesian quadrature rules are analogous to classical Gaussian quadrature rules in the sense that, in addition to being exact for kernel interpolants [recall (4)], they also exactly integrate Hermite interpolants (see Sect. 3.2.2). We thus begin by reviewing the argument used to establish positivity of the Gaussian quadrature weights.
3.1 Gaussian quadrature
Proposition 1
Assume that \(\nu \) admits a Lebesgue density. Then the weights \(w_1,\dots ,w_n\) of the Gaussian quadrature (7) are positive.
Proof
This proof may appear to rest on the closedness of the set of polynomials under squaring. Closer analysis, however, reveals a structure that can later be generalised.
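Proposition 1 is easy to confirm numerically in the classical case of the Lebesgue measure on \([-1, 1]\), for which the Gaussian quadrature rules are the Gauss–Legendre rules. A quick check (Python, an illustration not from this article):

```python
import numpy as np

# Gauss-Legendre rules are the Gaussian quadrature rules for the
# Lebesgue measure on [-1, 1]; by Proposition 1 their weights are positive,
# and exactness for f = 1 forces the weights to sum to 2.
rules = {n: np.polynomial.legendre.leggauss(n) for n in range(1, 30)}
```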
3.2 Chebyshev systems and generalised Gaussian quadrature
The argument presented above works essentially unchanged when the polynomials are replaced with generalised polynomials and the Gaussian quadrature rule with a generalised Gaussian quadrature rule. Much of the following material is covered by the introductory chapters of the monograph by Karlin and Studden (1966). In the following, \(C^m([a,b])\) stands for the set of functions that are m times continuously differentiable on the open interval (a, b).
Definition 1
(Chebyshev system) A collection of functions \(\{\phi _i\}_{i=1}^m \subset C^{m-1}([a,b])\) constitutes an (extended) Chebyshev system if every nontrivial linear combination of the functions, called a generalised polynomial, has at most \(m-1\) zeroes, counting multiplicities.
Remark 1
Some of the results we later present, such as Proposition 3, remain valid under a less restrictive definition of a Chebyshev system that does not require differentiability of the \(\phi _i\); in that case the definition is of course not stated in terms of multiple zeroes. The above definition is used here to simplify the presentation. The simplest relaxation requires only that the \(\phi _i\) are continuous and that no nontrivial linear combination vanishes at more than \(m-1\) points.
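The relaxed characterisation is convenient numerically: a collection of continuous functions is a Chebyshev system in this sense exactly when every collocation matrix at distinct ordered points is nonsingular. The sketch below (Python, an illustration not from this article) checks this for monomials, whose collocation matrices are Vandermonde matrices with determinant \(\prod_{i<j}(t_j - t_i) > 0\).

```python
import numpy as np

def collocation_det(phis, points):
    # V[j, i] = phi_i(t_j); the system is Chebyshev (relaxed sense) iff
    # V is nonsingular for every choice of distinct points t_1 < ... < t_m
    V = np.array([[phi(t) for phi in phis] for t in points])
    return np.linalg.det(V)

m = 4
monomials = [lambda x, i=i: x ** i for i in range(m)]  # {1, x, x^2, x^3}

rng = np.random.default_rng(0)
dets = []
for _ in range(200):
    pts = np.sort(rng.uniform(-1.0, 1.0, size=m))
    if np.min(np.diff(pts)) > 1e-2:  # keep well-separated distinct points
        dets.append(collocation_det(monomials, pts))
```

For sorted points the Vandermonde determinant is a product of positive gaps, so every determinant in `dets` should come out strictly positive.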
By selecting \(\phi _i(x) = x^{i-1}\), we see that the polynomials form a Chebyshev system. Perhaps the simplest nontrivial Chebyshev system is given in the following example.
Example 1
3.2.1 Interpolation using a Chebyshev system
3.2.2 Hermite interpolants
3.2.3 Generalised Gaussian quadrature
Proposition 2
Assume that \(\nu \) admits a Lebesgue density. Then the weights \(w_1,\dots ,w_n\) of the generalised Gaussian quadrature rule (10) are positive.
Proof
Next we turn our attention to kernels whose translates and their derivatives constitute Chebyshev systems.
3.3 Totally positive kernels
Definition 2
The class of totally positive kernels is smaller than that of positive-definite kernels. In the simplest case, \(q=1\) and \(m = n\), the total positivity condition states that the kernel translates \(k_{x_1},\ldots ,k_{x_n}\) constitute a Chebyshev system. This implies that the \(n \times n\) matrix \({\varvec{K}}_{Y,X}\), defined elementwise by \(({\varvec{K}}_{Y,X})_{ji} {:}{=}k(y_j,x_i)\), which is just the matrix \({\varvec{V}}_Y\) considered in Sect. 3.2 for the Chebyshev system \(\phi _i = k_{x_i}\), is invertible for any \({Y = \{y_1,\ldots ,y_n\} \subset [a,b]}\). Positive-definiteness of k only guarantees that \({\varvec{K}}_{Y,X}\) is invertible when \(Y = X\).
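For the Gaussian kernel this can be probed empirically: total positivity implies that \(\det {\varvec{K}}_{Y,X}\) is strictly positive whenever both point sets are increasingly ordered, even with \(Y \ne X\). A hedged sketch (Python; the length-scale and domain are illustrative choices):

```python
import numpy as np

def cross_matrix(Y, X, ell=0.5):
    # (K_{Y,X})_{ji} = k(y_j, x_i) for the Gaussian kernel
    return np.exp(-(Y[:, None] - X[None, :]) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(1)
signs = []
for _ in range(100):
    X = np.sort(rng.uniform(0.0, 1.0, 5))
    Y = np.sort(rng.uniform(0.0, 1.0, 5))
    # skip nearly coalescing draws so the determinant sign is well resolved
    if min(np.diff(X).min(), np.diff(Y).min()) > 0.05:
        sign, _ = np.linalg.slogdet(cross_matrix(Y, X))
        signs.append(sign)
```

Every recorded determinant sign should be \(+1\), in line with total positivity of the Gaussian kernel; a merely positive-definite kernel carries no such guarantee for \(Y \ne X\).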
3.4 General result on weights
The following special case of the theory developed in Karlin and Studden (1966, Chapter 2) appears in, for instance, Richter-Dyn (1971a, Lemma 2). Its proof is a generalisation of the proof for the case \(m=2n\) discussed in Sect. 3.2.
Proposition 3
Suppose that \(\{\phi _i\}_{i=1}^m \subset C^{m-1}([a,b])\) constitute a Chebyshev system, that \(\nu \) admits a Lebesgue density and that \(Q(f) {:}{=}\sum _{i=1}^n w_i f(x_i)\) for \(x_1,\ldots ,x_n \in \varOmega \) is a quadrature rule such that \({Q(\phi _i) = I_\nu (\phi _i)}\) for each \(i = 1,\ldots ,m\). Then at least \(\lfloor (m+1)/2 \rfloor \) of the weights \(w_1, \ldots , w_n\) are positive.
An immediate consequence of this proposition is that a Bayesian quadrature rule based on a totally positive kernel has at least one half of its weights positive.
Theorem 1
Suppose that the kernel \(k \in C^\infty ([a,b]^2)\) is totally positive of order 1. Then, for any points, at least \(\lfloor (n+1)/2 \rfloor \) of the Bayesian quadrature weights \(w_{X,1}^\text { BQ}, \ldots , w_{X,n}^\text { BQ}\) are positive.
3.5 Weights for locally optimal points
Definition 3
When the kernel is totally positive of any order, it has been shown that any local minimiser of \({\mathbb {V}}_X^\text { BQ}\) is locally optimal in the sense of the above definition. That is, no point in a point set that locally minimises the variance can be located on the boundary of the integration interval, nor can any two points in the set coalesce.^{3} These results, the origins of which can be traced to the 1970s (Barrar et al. 1974; Barrar and Loeb 1976; Bojanov 1979), have been recently collated by Oettershagen (2017, Corollary 5.13).
Proposition 4
Proof
Remark 2
Theorem 2
Let \(k \in C^\infty ([a,b]^2)\) be a totally positive kernel of order 2 and \(m \le n\). Suppose that the point set \(X \in {\mathcal {S}}^n\) is locally moptimal with an index set \({\mathcal {I}}_m^* \subset \{1, \ldots , n\}\) and that the weights associated with \(q \le m\) indices in \({\mathcal {I}}_m^*\) are nonzero. Then at least \(\lfloor (n+2m-q+1)/2 \rfloor \) of the weights are nonnegative, and q must satisfy \(2m-n \le q\).
Proof
By (14), the Bayesian quadrature rule in the statement is exact for n kernel translates and q of their derivatives. By the total positivity of the kernel, the collection of these \(n+q\) functions constitutes a Chebyshev system. By Proposition 3, at least \(\lfloor (n+q+1)/2 \rfloor \) of the weights are positive. Since the weights associated with \(m-q\) indices in \({\mathcal {I}}_m^*\) are zero, it follows that at least \(\lfloor (n+q+1)/2 \rfloor + m - q = \lfloor (n+2m-q+1)/2 \rfloor \) of the weights are nonnegative. The lower bound for q follows because \( \lfloor (n+2m-q+1)/2 \rfloor \le n\) requires that \(n+2m-q+1 \le 2n+1\). \(\square \)
The main result of this section follows by setting \(m=n\) in the preceding theorem and observing that then \(q = n\), since \(n = 2m-n \le q \le m = n\); consequently, there can be no zero weights.
Corollary 1
If \(k \in C^\infty ([a,b]^2)\) is totally positive of order 2 and \(X \in {\mathcal {S}}^n\) is locally optimal, then all the Bayesian quadrature weights \(w_{X,1}^\text { BQ},\ldots ,w_{X,n}^\text { BQ}\) are positive.
Remark 3
A key consequence of Corollary 1 is the following: If \(w_{X,1}^\text { BQ},\ldots ,w_{X,n}^\text { BQ}\) contain negative values, then the design points X are not locally optimal. In other words, in this case there is still room for improvement by optimising these points using, for example, gradient descent. In this way, the signs of the weights can provide information about the quality of the design point set.
3.6 Greedily selected points
Proposition 5
Suppose that \(k \in C^\infty ([a,b]^2)\) is totally positive of order 2. If \(X_n \cup \{x_{n+1}\} \in {\mathcal {S}}^{n+1}\), then at least \(\lfloor (n+3)/2 \rfloor \) of the weights of an \((n+1)\)-point sequential Bayesian quadrature rule are positive.
3.7 Other kernels and point sets

The GP posterior mean for the Brownian motion kernel \(k(x,x') = \min (x,x')\) on [0, 1] is a piecewise linear interpolant. As this implies that the Lagrange cardinal functions \(u_{X,i}\) are nonnegative, it follows from the identity \(w_{X,i}^\text { BQ} = I_\nu (u_{X,i})\) that the weights are positive. See Diaconis (1988) and Ritter (2000, Lemma 8 in Section 3.2, Chapter 2) for more discussion.

Suitably selected priors give rise to Bayesian quadrature rules whose posterior mean coincides with a classical rule, such as Gaussian quadrature (Karvonen and Särkkä 2017; Karvonen et al. 2018b). Analysis of the weights and their positivity then reduces to that of the reproduced classical rule.

There is convincing numerical evidence that the weights are positive if the nodes for the Gaussian kernel and measure on \({\mathbb {R}}\) are obtained by suitably scaling the classical Gauss–Hermite nodes (Karvonen and Särkkä 2019).

Uniform weighting (i.e. \(w_{X,i}^\text { BQ} = 1/n\)) can be achieved when certain quasi-Monte Carlo point sets and shift-invariant kernels are used (Jagadeeswaran and Hickernell 2019).
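The first of the cases above is simple enough to verify directly. The sketch below (Python; the uniform \(\nu\) on [0, 1] and the particular point set are illustrative choices) computes the weights for the Brownian motion kernel, whose kernel mean is available in closed form, and recovers the hat-function integrals predicted by the piecewise linear interpretation.

```python
import numpy as np

def brownian_bq_weights(X):
    # k(x, x') = min(x, x'); for uniform nu on [0, 1] the kernel mean is
    # z_i = int_0^1 min(x, x_i) dx = x_i - x_i**2 / 2
    K = np.minimum(X[:, None], X[None, :])
    z = X - X ** 2 / 2
    return np.linalg.solve(K, z)

X = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
w = brownian_bq_weights(X)
# the weights equal the integrals of the piecewise linear cardinal functions
# (hat functions pinned to zero at the origin): [0.2, 0.2, 0.2, 0.2, 0.1]
```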
3.8 Upper bound on the sum of weights
We summarise below a simple yet generic result that has an important consequence for the stability of Bayesian quadrature in Sect. 4.
Lemma 1
Proof
The claim immediately follows from the property (4) that \(\sum _{i=1}^n w_{X,i}^\text { BQ} k_{{\varvec{x}}_j}({\varvec{x}}_i) = I_\nu (k_{{\varvec{x}}_j})\) for each \(j=1,\ldots ,n\). \(\square \)
Combined with Corollary 1, we get a bound on the sum of absolute weights \(\sum_{i=1}^n \big| w_{X_n,i}^\text{BQ} \big|\), which is the main topic of discussion in Sect. 4.
Corollary 2
Most importantly, Corollary 2 applies to the Gaussian kernel, for which the upper bound is finite. This result is discussed in more detail in Sect. 4.4. Supporting evidence can be seen in Fig. 2, where the sum of weights appears to converge to a value close to 1.
3.9 Higher dimensions
As far as we are aware, the theory of Chebyshev systems has not been extended to higher dimensions. Consequently, little can be said about positivity of the weights when \(d > 1\). Some simple cases can nevertheless be analysed.
 (i)
the point set X is now a Cartesian product of the one-dimensional set \(X_1 = \{x_1^1,\ldots ,x_n^1\}\subset \varOmega _1\): \(X = X_1^d\);
 (ii)
the kernel is of product form: \(k({\varvec{x}},{\varvec{x}}') = \prod _{i=1}^d k_1(x_i,x_i')\) for some kernel \(k_1\) on \(\varOmega _1\).
Locally optimal points First, we investigated positivity of the weights for locally optimal points. We set \(\ell = 1\) and \(d = 2\) and used a gradient-based quasi-Newton optimisation method (MATLAB's fminunc) to find points that locally minimise the integral variance for \(n=2,\ldots ,20\). Optimisation was initialised with a set of random points. The point set output by the optimiser was then randomly perturbed and the optimisation repeated 20 times, each time initialising with the point set giving the smallest Bayesian quadrature variance so far. The weights were always computed directly from (3). However, to improve numerical stability, the kernel matrix \({\varvec{K}}_X\) was replaced by \({\varvec{K}}_X + 10^{-6} {\varvec{I}}\), where \({\varvec{I}}\) is the \(n \times n\) identity matrix, during point optimisation. Some point sets generated using the same algorithm have appeared in Särkkä (2016, Section IV) [for other examples of optimal points in dimension two, see O'Hagan (1992), Minka (2000) and, in particular, Oettershagen (2017, Chapter 6)]. The point sets we obtained appear sensible and all of them are associated with positive weights; four sets and their weights are depicted in Fig. 2. For \(n = 20\), the maximal value of a partial derivative of \({\mathbb {V}}_X^\text { BQ}\) at the computed points was \(9 \times 10^{-10}\).
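An analogous experiment can be sketched in Python with SciPy's quasi-Newton optimiser standing in for MATLAB's fminunc (the univariate Gaussian kernel, uniform \(\nu\) on [0, 1], length-scale and number of points below are all illustrative choices). Since the double integral \(\int\!\!\int k \,\mathrm{d}\nu\,\mathrm{d}\nu\) does not depend on X, minimising the integral variance is equivalent to maximising \(z(X)^\top {\varvec{K}}_X^{-1} z(X)\).

```python
import numpy as np
from math import erf, sqrt, pi
from scipy.optimize import minimize

ELL, N = 0.4, 5  # illustrative length-scale and number of points

def kernel_matrix(X, jitter=0.0):
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * ELL ** 2))
    return K + jitter * np.eye(len(X))

def kernel_mean(X):
    # int_0^1 k(x, x_i) dx for uniform nu on [0, 1]
    c = ELL * sqrt(pi / 2)
    return np.array([c * (erf((1 - x) / (sqrt(2) * ELL))
                          + erf(x / (sqrt(2) * ELL))) for x in X])

def neg_variance_term(X):
    # minimising the BQ variance over X is equivalent to maximising
    # z^T K^{-1} z; a small jitter stabilises the solve, as in the
    # experiment described above
    z = kernel_mean(X)
    return -z @ np.linalg.solve(kernel_matrix(X, jitter=1e-6), z)

res = minimize(neg_variance_term, np.linspace(0.1, 0.9, N), method="BFGS")
X_opt = np.sort(res.x)
w = np.linalg.solve(kernel_matrix(X_opt), kernel_mean(X_opt))
# Corollary 1 suggests the weights at a local optimum should all be positive
```

Inspecting the signs of `w` after optimisation gives exactly the diagnostic of Remark 3: negative weights indicate that the points are not yet locally optimal.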
4 Magnitudes of weights and the stability
It is clear from Lemma 1 that if the weights are positive for every n, the stability constant remains uniformly bounded. However, the results on positivity in the preceding section are valid only when \(d = 1\) and the kernel is totally positive. This section uses a different technique to analyse the stability constant. The results are based on those in De Marchi and Schaback (2010), which are applicable to kernels that induce Sobolev-equivalent RKHSs (e.g. Matérn kernels). Accordingly, we mainly focus on such kernels in this section. We begin by reviewing basic properties of Sobolev spaces in Sect. 4.1 and convergence results for Bayesian quadrature in Sect. 4.2. The main results, Theorem 5 and Corollary 3, on the magnitudes of quadrature weights and the stability constant appear in Sect. 4.3. We discuss a relevant stability issue, known as the Runge phenomenon, for infinitely smooth kernels such as the Gaussian kernel in Sect. 4.4. Finally, simulation results in Sect. 4.5 demonstrate that the obtained upper bound is conservative; there is much room for improving the results.
4.1 Kernels inducing Sobolev-equivalent RKHSs
The Sobolev space \(H^r(\varOmega )\) on a general measurable domain \(\varOmega \subset {\mathbb {R}}^d\) can be defined as the restriction of \(H^r({\mathbb {R}}^d)\) onto \(\varOmega \). The kernel k satisfying (16), when seen as a kernel on \(\varOmega \), then induces an RKHS that is normequivalent to \(H^r (\varOmega )\) (Wendland 2005, Theorems 10.12, 10.46 and 10.47).^{6}
4.2 Convergence for Sobolev-equivalent kernels
Assumption 3
The set \(\varOmega \subset {\mathbb {R}}^d\) is a bounded open set that satisfies an interior cone condition and has a Lipschitz boundary.
Theorem 4
The following simple result is an immediate consequence of this theorem.
Proposition 6
Suppose that the assumptions of Theorem 4 are satisfied. Then \(\big| 1 - \sum_{i=1}^n w_{X_n,i}^\text{BQ} \big| \lesssim h_{X_n,\varOmega}^r\) when the fill distance is sufficiently small.
Proof
Under the assumptions, constant functions are in \(H^{r}(\varOmega )\). Setting \(f \equiv 1\) in (17) verifies the claim. \(\square \)
Note that the same argument can be used whenever a general rate of convergence for functions in an RKHS is known and constant functions are contained in the RKHS. However, this is not always the case; for example, the RKHS of the Gaussian kernel (12) does not contain polynomials (Minh 2010, Theorem 2).
Of course, it is the stability constant \(\varLambda_{X_n}^\text{BQ} = \sum_{i=1}^n \big| w_{X_n,i}^\text{BQ} \big|\), analysed next, whose behaviour is typically more consequential. However, the above proposition may occasionally be of interest if one wishes to interpret Bayesian quadrature as a weighted Dirac approximation \(\nu _\text { BQ} {:}{=}\sum _{i=1}^n w_{X_n,i}^\text { BQ} \delta _{{\varvec{x}}_i} \approx \nu \) of a probability measure (i.e. \(\nu _\text { BQ}(\varOmega ) \approx 1\)). Note that there is also a simple way to ensure that the weights sum to one, namely the inclusion of a nonzero prior mean function in the Gaussian process prior; see O'Hagan (1991) and Karvonen et al. (2018b, Section 2.3).
The condition \(\varLambda _{X_n}^\text { BQ} \lesssim n^c\) means that the stability constant \(\varLambda _{X_n}^\text { BQ}\) should not grow quickly as n increases. The bound (18) shows that the error in the misspecified setting becomes small if c is small. This implies that if the stability constant \(\varLambda _{X_n}^\text { BQ}\) does not increase quickly, then the quadrature rule becomes robust against the misspecification of a prior. This provides a third motivation for understanding the behaviour of \(\varLambda _{X_n}^\text { BQ}\).
4.3 Upper bounds for absolute weights
We now analyse magnitudes of individual weights and the stability constant (15). We first derive an upper bound on the magnitude of each weight \(w_{X_n,i}^\text { BQ}\). The proof of this result is based on an upper bound on the \(L^2(\varOmega )\) norm of Lagrange functions derived in De Marchi and Schaback (2010).
Theorem 5
Proof
An important consequence of Theorem 5 is that the magnitudes of the quadrature weights decrease uniformly to zero as n increases if the design points are quasi-uniform and \(\nu \) has a density. In other words, no design point retains a weight that fails to decay. This is similar to importance sampling, where the weights decay uniformly at the rate 1/n. As a direct corollary of Theorem 5 we obtain bounds on the stability constant \(\varLambda _{X_n}^\text { BQ}\).
Corollary 3
While the bounds of Corollary 3 are somewhat conservative (as will be demonstrated in Sect. 4.5), they are still useful for understanding the factors affecting the stability and robustness of Bayesian quadrature. In particular, inequality (22) shows that the stability constant can be kept small if the ratio \(h_{X_n,\varOmega } / q_{X_n}\) is kept small; this is possible if the point set is sufficiently uniform.
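A hedged numerical illustration of this behaviour (Python; the Matérn-1/2, i.e. exponential, kernel, whose RKHS is norm-equivalent to \(H^1\), uniform \(\nu\) on [0, 1] and equispaced points are illustrative choices not taken from this article): as n grows, the largest weight magnitude for quasi-uniform points decays, and the stability constant stays bounded because the weights remain positive.

```python
import numpy as np

def matern12_weights(n, ell=0.5):
    # k(x, x') = exp(-|x - x'| / ell) (Matern-1/2); for uniform nu on [0, 1]
    # the kernel mean is z_i = ell * (2 - exp(-x_i/ell) - exp(-(1 - x_i)/ell))
    X = np.linspace(0.0, 1.0, n)
    K = np.exp(-np.abs(X[:, None] - X[None, :]) / ell)
    z = ell * (2.0 - np.exp(-X / ell) - np.exp(-(1.0 - X) / ell))
    return np.linalg.solve(K, z)

w5, w20, w80 = (matern12_weights(n) for n in (5, 20, 80))
```

The Markov property of this kernel makes the cardinal functions effectively local, so each weight is roughly the integral of a bump of width comparable to the spacing, shrinking at rate close to 1/n.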
4.4 On infinitely smooth kernels
While the theoretical results of this section only concern kernels of finite smoothness, we make a few remarks on the stability of Bayesian quadrature when using infinitely smooth kernels, such as the Gaussian kernel. When using such a kernel, Bayesian quadrature rules suffer from the famous Runge phenomenon: if equispaced points are used, then Lebesgue constants and the stability constants grow rapidly; see Oettershagen (2017, Section 4.3), Platte and Driscoll (2005) and Platte et al. (2011). This effect is demonstrated in Fig. 4, and can be seen also in Sommariva and Vianello (2006b, Table 1).
A key point is that the Runge phenomenon typically occurs precisely when the design points are quasi-uniform (e.g. equispaced). This means that quasi-uniformity of the points does not ensure stability of Bayesian quadrature when the kernel is infinitely smooth, and care has to be taken if a numerically stable Bayesian quadrature rule is to be constructed with such a kernel. One possibility is to use the locally optimal design points of Sect. 3.5. Corollary 2 then guarantees uniform boundedness of the stability constant, at least when \(d=1\).
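The instability is easy to probe numerically. The sketch below (Python; the length-scale, uniform \(\nu\) on [0, 1] and point counts are illustrative choices) computes the stability constant for equispaced points with the Gaussian kernel; one typically observes rapid growth of \(\varLambda\) once the fill distance is small relative to the length-scale, although at larger n the computed weights also reflect the extreme ill-conditioning of the kernel matrix.

```python
import numpy as np
from math import erf, sqrt, pi

ELL = 0.25  # illustrative length-scale

def equispaced_bq_weights(n):
    X = np.linspace(0.0, 1.0, n)
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * ELL ** 2))
    c = ELL * sqrt(pi / 2)
    z = np.array([c * (erf((1 - x) / (sqrt(2) * ELL))
                       + erf(x / (sqrt(2) * ELL))) for x in X])
    return np.linalg.solve(K, z)

# stability constant Lambda_n = sum of absolute weights
lam = {n: np.abs(equispaced_bq_weights(n)).sum() for n in (5, 10, 15, 20)}
```

For small n the weights are benign and \(\varLambda_n \approx 1\); comparing the entries of `lam` as n grows reproduces the qualitative growth discussed above.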
4.5 A numerical example
Footnotes
 1.
This can be usually relaxed to \(I_\nu \) being a positive linear functional: \(I_\nu (f) > 0\) whenever f is almost everywhere positive.
 2.
This can be generalised to the cumulative distribution function having infinitely many points of increase.
 3.
Coalescence is possible because \({\mathbb {V}}_X^\text { BQ}\) is in fact a continuous function of X defined on the whole of \(\varOmega ^n\), not merely on \({\mathcal {S}}^n\) (Oettershagen 2017, Proposition 5.5). Coalescence of some of the points would result in a quadrature rule that also uses evaluations of derivatives of the integrand.
 4.
Note that a tensor product rule based on an optimal onedimensional point set need not be locally optimal for \(\varOmega \), \(\nu \) and k.
 5.
Note that the smoothness parametrisation \(\rho = r\) is often used. With this parametrisation \(k_\rho \) would satisfy (16) with the exponent \((r+d/2)\) and its RKHS would be normequivalent to \(H^{r+d/2}({\mathbb {R}}^d)\).
 6.
The reader may ask whether \(\varOmega \) needs to have a Lipschitz boundary for this normequivalence, but this assumption is indeed not needed. The assumption that \(\varOmega \) has a Lipschitz boundary is required when using Stein’s extension theorem (Stein 1970, p. 181) for Sobolev spaces defined using weak derivatives [see the proof of Wendland (2005, Corollary 10.48)]. On the other hand, we consider here a Sobolev space defined in terms of the Fourier transform, and the normequivalence follows from the extension and restriction theorems for a generic RKHS (Wendland 2005, Theorems 10.46 and 10.47) and the expression of the RKHS norm in terms of Fourier transforms (Wendland 2005, Theorem 10.12).
Notes
Acknowledgements
Open access funding provided by Aalto University. TK was supported by the Aalto ELEC Doctoral School. MK acknowledges support by the European Research Council (StG Project PANAMA). SS was supported by the Academy of Finland project 313708. This material was developed, in part, at the Prob Num 2018 workshop hosted by the Lloyd's Register Foundation programme on Data-Centric Engineering at the Alan Turing Institute, UK, and supported by the National Science Foundation, USA, under Grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the above-named funding bodies and research institutions.
References
 Arcangéli, R., de Silanes, M.C.L., Torrnes, J.J.: An extension of a bound for functions in Sobolev spaces, with applications to \((m, s)\)spline interpolation and smoothing. Numer. Math. 108(2), 181–211 (2007)MathSciNetzbMATHGoogle Scholar
 Atkinson, K.E.: An Introduction to Numerical Analysis, 2nd edn. Wiley, Amsterdam (1989)zbMATHGoogle Scholar
 Barrar, R.B., Loeb, H.L.: Multiple zeroes and applications to optimal linear functionals. Numer. Math. 25(3), 251–262 (1976)zbMATHGoogle Scholar
 Barrar, R.B., Loeb, H.L., Werner, H.: On the existence of optimal integration formulas for analytic functions. Numer. Math. 23(2), 105–117 (1974)MathSciNetzbMATHGoogle Scholar
 Barrow, D.L.: On multiple node Gaussian quadrature formulae. Math. Comput. 32(142), 431–439 (1978)MathSciNetzbMATHGoogle Scholar
 Berlinet, A., ThomasAgnan, C.: Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer, New York (2011)zbMATHGoogle Scholar
 Bojanov, B.D.: On the existence of optimal quadrature formulae for smooth functions. Calcolo 16(1), 61–70 (1979)MathSciNetzbMATHGoogle Scholar
 Breger, A., Ehler, M., Gräf, M.: Points on manifolds with asymptotically optimal covering radius. J. Complex. 48, 1–14 (2018)MathSciNetCrossRefGoogle Scholar
 Briol, F.X., Oates, C. J., Cockayne, J., Chen, W. Y., Girolami, M.: On the sampling problem for kernel quadrature. In: Proceedings of the 34th International Conference on Machine Learning, pp. 586–595 (2017)Google Scholar
 Briol, F.X., Oates, C.J., Girolami, M., Osborne, M.A., Sejdinovic, D.: Probabilistic integration: a role in statistical computation? Stat. Sci. 34(1), 1–22 (2019)MathSciNetzbMATHGoogle Scholar
 Burbea, J.: Total positivity of certain reproducing kernels. Pac. J. Math. 67(1), 101–130 (1976)MathSciNetzbMATHGoogle Scholar
 Chai, H., Garnett, R.: An improved Bayesian framework for quadrature of constrained integrands. arXiv:1802.04782 (2018)
 Clenshaw, C.W., Curtis, A.R.: A method for numerical integration on an automatic computer. Numer. Math. 2(1), 197–205 (1960)MathSciNetzbMATHGoogle Scholar
 Cockayne, J., Oates, C. J., Sullivan, T., Girolami, M.: Bayesian probabilistic numerical methods. SIAM Rev. arxiv:1702.03673 (2019)
 Cook, T. D., Clayton, M. K.: Sequential Bayesian quadrature. Technical report, Department of Statistics, University of Wisconsin (1998)Google Scholar
 De Marchi, S., Schaback, R.: Stability constants for kernelbased interpolation processes. Technical Report 59/08, Universita degli Studi di Verona (2008)Google Scholar
 De Marchi, S., Schaback, R.: Stability of kernelbased interpolation. Adv. Comput. Math. 32(2), 155–161 (2010)MathSciNetzbMATHGoogle Scholar
 Diaconis, P.: Bayesian numerical analysis. In: Gupta, S.S., Berger, J.O. (eds.) Statistical Decision Theory and Related Topics IV, vol. 1, pp. 163–175. SpringerVerlag, New York (1988)Google Scholar
 Fasshauer, G.E.: Meshfree Approximation Methods with MATLAB. Number 6 in Interdisciplinary Mathematical Sciences. World Scientific, Singapore (2007)Google Scholar
 Förster, K.J.: Variance in quadrature—a survey. In: Brass, H., Hämmerlin, G. (eds.) Numerical Integration IV, vol. 112, pp. 91–110. Birkhäuser, Basel (1993)Google Scholar
 Gautschi, W.: Orthogonal Polynomials: Computation and Approximation. Numerical Mathematics and Scientific Computation. Oxford University Press, Oxford (2004)zbMATHGoogle Scholar
 Gavrilov, A.V.: On best quadrature formulas in the reproducing kernel Hilbert space. Sib. Zhurnal Vychislitelnoy Mat. 1(4), 313–320 (1998). (In Russian)
 Gavrilov, A.V.: On optimal quadrature formulas. J. Appl. Ind. Math. 1(2), 190–192 (2007)
 Gunter, T., Osborne, M.A., Garnett, R., Hennig, P., Roberts, S.J.: Sampling for inference in probabilistic models with fast Bayesian quadrature. Adv. Neural Inf. Process. Syst. 27, 2789–2797 (2014)
 Hennig, P., Osborne, M.A., Girolami, M.: Probabilistic numerics and uncertainty in computations. Proc. R. Soc. Lond. A: Math. Phys. Eng. Sci. 471(2179), 20150142 (2015)
 Huszár, F., Duvenaud, D.: Optimally-weighted herding is Bayesian quadrature. In: 28th Conference on Uncertainty in Artificial Intelligence, pp. 377–385 (2012)
 Jagadeeswaran, R., Hickernell, F.J.: Fast automatic Bayesian cubature using lattice sampling. Stat. Comput. (2019). https://doi.org/10.1007/s11222-019-09895-9
 Kanagawa, M., Sriperumbudur, B.K., Fukumizu, K.: Convergence guarantees for kernel-based quadrature rules in misspecified settings. Adv. Neural Inf. Process. Syst. 29, 3288–3296 (2016)
 Kanagawa, M., Sriperumbudur, B.K., Fukumizu, K.: Convergence analysis of deterministic kernel-based quadrature rules in misspecified settings. Found. Comput. Math. (2019). https://doi.org/10.1007/s10208-018-09407-7
 Karlin, S.: Total Positivity, vol. 1. Stanford University Press, Palo Alto (1968)
 Karlin, S., Studden, W.J.: Tchebycheff Systems: With Applications in Analysis and Statistics. Interscience Publishers, New York (1966)
 Karvonen, T., Särkkä, S.: Classical quadrature rules via Gaussian processes. In: 27th IEEE International Workshop on Machine Learning for Signal Processing (2017)
 Karvonen, T., Särkkä, S.: Fully symmetric kernel quadrature. SIAM J. Sci. Comput. 40(2), A697–A720 (2018)
 Karvonen, T., Särkkä, S.: Gaussian kernel quadrature at scaled Gauss–Hermite nodes. BIT Numer. Math. (2019). https://doi.org/10.1007/s10543-019-00758-3
 Karvonen, T., Oates, C.J., Särkkä, S.: A Bayes–Sard cubature method. Adv. Neural Inf. Process. Syst. 31, 5882–5893 (2018)
 Larkin, F.M.: Optimal approximation in Hilbert spaces with reproducing kernel functions. Math. Comput. 24(112), 911–921 (1970)
 Larkin, F.M.: Gaussian measure in Hilbert space and applications in numerical analysis. Rocky Mt. J. Math. 2(3), 379–421 (1972)
 Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent only converges to minimizers. In: 29th Annual Conference on Learning Theory, pp. 1246–1257 (2016)
 Mhaskar, H.N., Narcowich, F.J., Ward, J.D.: Spherical Marcinkiewicz–Zygmund inequalities and positive quadrature. Math. Comput. 70(235), 1113–1130 (2001)
 Minh, H.Q.: Some properties of Gaussian reproducing kernel Hilbert spaces and their implications for function approximation and learning theory. Constr. Approx. 32(2), 307–338 (2010)
 Minka, T.: Deriving quadrature rules from Gaussian processes. Technical report, Microsoft Research, Statistics Department, Carnegie Mellon University (2000)
 Novak, E.: Intractability results for positive quadrature formulas and extremal problems for trigonometric polynomials. J. Complex. 15(3), 299–316 (1999)
 Oates, C.J., Niederer, S., Lee, A., Briol, F.X., Girolami, M.: Probabilistic models for integration error in the assessment of functional cardiac models. Adv. Neural Inf. Process. Syst. 30, 109–117 (2017)
 Oettershagen, J.: Construction of optimal cubature algorithms with applications to econometrics and uncertainty quantification. Ph.D. thesis, Institut für Numerische Simulation, Universität Bonn (2017)
 O’Hagan, A.: Bayes–Hermite quadrature. J. Stat. Plann. Inference 29(3), 245–260 (1991)
 O’Hagan, A.: Some Bayesian numerical analysis. Bayesian Stat. 4, 345–363 (1992)
 Osborne, M., Garnett, R., Ghahramani, Z., Duvenaud, D.K., Roberts, S.J., Rasmussen, C.E.: Active learning of model evidence using Bayesian quadrature. Adv. Neural Inf. Process. Syst. 25, 46–54 (2012)
 Platte, R.B., Driscoll, T.A.: Polynomials and potential theory for Gaussian radial basis function interpolation. SIAM J. Numer. Anal. 43(2), 750–766 (2005)
 Platte, R.B., Trefethen, L.N., Kuijlaars, A.B.: Impossibility of fast stable approximation of analytic functions from equispaced samples. SIAM Rev. 53(2), 308–318 (2011)
 Prüher, J., Särkkä, S.: On the use of gradient information in Gaussian process quadratures. In: 26th IEEE International Workshop on Machine Learning for Signal Processing (2016)
 Rasmussen, C.E., Ghahramani, Z.: Bayesian Monte Carlo. Adv. Neural Inf. Process. Syst. 15, 505–512 (2002)
 Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006)
 Richter, N.: Properties of minimal integration rules. SIAM J. Numer. Anal. 7(1), 67–79 (1970)
 Richter-Dyn, N.: Properties of minimal integration rules. II. SIAM J. Numer. Anal. 8(3), 497–508 (1971a)
 Richter-Dyn, N.: Minimal interpolation and approximation in Hilbert spaces. SIAM J. Numer. Anal. 8(3), 583–597 (1971b)
 Ritter, K.: Average-Case Analysis of Numerical Problems. Number 1733 in Lecture Notes in Mathematics. Springer, New York (2000)
 Särkkä, S., Hartikainen, J., Svensson, L., Sandblom, F.: On the relation between Gaussian process quadratures and sigma-point methods. J. Adv. Inf. Fusion 11(1), 31–46 (2016)
 Smola, A., Gretton, A., Song, L., Schölkopf, B.: A Hilbert space embedding for distributions. In: International Conference on Algorithmic Learning Theory, pp. 13–31. Springer (2007)
 Sommariva, A., Vianello, M.: Numerical cubature on scattered data by radial basis functions. Computing 76(3–4), 295–310 (2006a)
 Sommariva, A., Vianello, M.: Meshless cubature by Green’s formula. Appl. Math. Comput. 183(2), 1098–1107 (2006b)
 Stein, E.M.: Singular Integrals and Differentiability Properties of Functions. Princeton University Press, Princeton (1970)
 Steinwart, I., Christmann, A.: Support Vector Machines. Information Science and Statistics. Springer, New York (2008)
 Wendland, H.: Scattered Data Approximation. Number 28 in Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, Cambridge (2005)
 Wendland, H., Rieger, C.: Approximate interpolation with applications to selecting smoothing parameters. Numer. Math. 101(4), 729–748 (2005)
 Wu, A., Aoi, M.C., Pillow, J.W.: Exploiting gradients and Hessians in Bayesian optimization and Bayesian quadrature. Preprint. arXiv:1704.00060 (2018)
Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.