Nonparametric Bayesian Inference with Kernel Mean Embedding

Modern Methodology and Applications in Spatial-Temporal Modeling

Part of the book series: SpringerBriefs in Statistics (JSSRES)

Abstract

Kernel methods have been successfully used in many machine learning problems, with favorable performance in extracting nonlinear structure of high-dimensional data. Recently, nonparametric inference methods with positive definite kernels have been developed, employing the kernel mean expression of distributions. In this approach, the distribution of a variable is represented by its kernel mean, the mean element of the random feature vector defined by the kernel function, and relations among variables are expressed by covariance operators. This article gives an introduction to this new approach, called kernel Bayesian inference, in which Bayes’ rule is realized by computing kernel means and covariance expressions to estimate the kernel mean of the posterior [11]. This approach provides a novel nonparametric way of Bayesian inference, expressing a distribution as a weighted sample and computing the posterior with simple matrix calculations. As an example of a problem to which kernel Bayesian inference applies effectively, the nonparametric state-space model is discussed, in which the state transition and observation model are assumed to be neither known nor estimable with a simple parametric model. This article gives detailed explanations of the intuitions, derivations, and implementation issues of kernel Bayesian inference.
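As a concrete illustration of the "weighted sample plus matrix calculation" viewpoint, the following sketch estimates the kernel mean of a conditional distribution, i.e., the kernel sum rule with a point-mass prior, which is a special case of the framework in [11]. It is an illustrative sketch rather than the chapter's implementation: the Gaussian kernel, the regularization constant eps, and the function and variable names (gauss_gram, conditional_kernel_mean_weights, sigma, the toy data) are all assumptions made here.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    # Gram matrix k(a_i, b_j) for a Gaussian kernel (illustrative choice).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def conditional_kernel_mean_weights(Y, y0, eps=1e-2, sigma=1.0):
    """Weights w such that the kernel mean of P(X | Y = y0) is estimated by
    sum_i w_i k(., x_i); this is the regularized estimator
    C_XY (C_YY + eps I)^{-1} k(., y0) written in Gram-matrix form."""
    n = Y.shape[0]
    G_Y = gauss_gram(Y, Y, sigma)                    # (n, n) Gram matrix of Y
    k_y0 = gauss_gram(Y, y0[None, :], sigma)[:, 0]   # vector (k(y_i, y0))_i
    return np.linalg.solve(G_Y + n * eps * np.eye(n), k_y0)

# Toy usage: estimate E[f(X) | Y = y0] as a weighted sample average (f(x) = x).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Y = X + 0.1 * rng.normal(size=(200, 1))              # noisy observation of X
w = conditional_kernel_mean_weights(Y, y0=np.array([0.5]))
print("estimated E[X | Y=0.5] ~", float(w @ X[:, 0]))
```

The weight vector \(w=(G_Y+n\varepsilon I)^{-1}\mathbf {k}_Y(y_0)\) represents the estimated posterior kernel mean as \(\sum _i w_i k(\cdot ,x_i)\), so that expectations of RKHS functions are obtained as weighted sample averages; the full kernel Bayes’ rule of [11] additionally incorporates a prior through a second set of weights.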


Notes

  1. As the kernel mean depends on \(k\), it should rigorously be written as \(m_X^k\). We will, however, generally write \(m_X\) for simplicity where there is no ambiguity.

  2. These conditions guarantee the existence of the covariance operator. Note also that \(E[k(X,X)]<\infty \) is stronger than the condition for the kernel mean, \(E[\sqrt{k(X,X)}]<\infty \); indeed, \(E[\sqrt{k(X,X)}]\le \sqrt{E[k(X,X)]}\) by the Cauchy–Schwarz inequality.

  3. Some previous works derived convergence rates under unrealistic assumptions. For example, Theorem 6 in [30] assumes \(k(\cdot ,y_0)\in \mathscr {R}(C_{YY})\) to achieve the rate \(n^{-1/4}\), but in typical cases there is no function \(f\in {\mathscr {H}_\mathscr {Y}}\) that satisfies \(\int k(y,z)f(z)dP_Y(z)=k(y,y_0)\). Theorem 1.3.2 shows that if the eigenvalues decay sufficiently fast, the rate approaches \(n^{-1/4}\). As a related result, Theorem 11 in [11] shows a convergence rate for the kernel sum rule. Although the conditional kernel mean is a special case of the kernel sum rule with the prior given by Dirac's delta function at x, the faster rate shown there (\(n^{-1/3}\) at best) is not achievable by Theorem 1.3.2, since Theorem 11 assumes that \(\pi /p_X\) is a function in the RKHS and is smooth enough.

  4. Although the samples are not i.i.d., we assume an appropriate mixing condition so that the empirical covariances converge to the covariances with respect to the stationary distribution as \(T\rightarrow \infty \).

References

  1. Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68(3), 337–404 (1950)

  2. Baker, C.: Joint measures and cross-covariance operators. Trans. Am. Math. Soc. 186, 273–289 (1973)

  3. Berlinet, A., Thomas-Agnan, C.: Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers (2004)

  4. Caponnetto, A., De Vito, E.: Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 7(3), 331–368 (2007)

  5. Doucet, A., de Freitas, N., Gordon, N.: Sequential Monte Carlo Methods in Practice. Springer (2001)

  6. Fine, S., Scheinberg, K.: Efficient SVM training using low-rank kernel representations. J. Mach. Learn. Res. 2, 243–264 (2001)

  7. Fukumizu, K., Bach, F., Jordan, M.: Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res. 5, 73–99 (2004)

  8. Fukumizu, K., Bach, F., Jordan, M.: Kernel dimension reduction in regression. Ann. Stat. 37(4), 1871–1905 (2009)

  9. Fukumizu, K., Gretton, A., Sun, X., Schölkopf, B.: Kernel measures of conditional dependence. In: Advances in Neural Information Processing Systems 20, pp. 489–496. MIT Press (2008)

  10. Fukumizu, K., Bach, F.R., Jordan, M.I.: Kernel dimension reduction in regression. Technical Report 715, Department of Statistics, University of California, Berkeley (2006)

  11. Fukumizu, K., Song, L., Gretton, A.: Kernel Bayes’ rule: Bayesian inference with positive definite kernels. J. Mach. Learn. Res. 14, 3753–3783 (2013)

  12. Fukumizu, K., Sriperumbudur, B.K., Gretton, A., Schölkopf, B.: Characteristic kernels on groups and semigroups. Adv. Neural Inf. Process. Syst. 20, 473–480 (2008)

  13. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. In: Advances in Neural Information Processing Systems 19, pp. 513–520. MIT Press (2007)

  14. Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773 (2012)

  15. Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B.: A fast, consistent kernel two-sample test. Adv. Neural Inf. Process. Syst. 22, 673–681 (2009)

  16. Gretton, A., Fukumizu, K., Sriperumbudur, B.: Discussion of: Brownian distance covariance. Ann. Appl. Stat. 3(4), 1285–1294 (2009)

  17. Gretton, A., Fukumizu, K., Teo, C.H., Song, L., Schölkopf, B., Smola, A.: A kernel statistical test of independence. In: Advances in Neural Information Processing Systems 20, pp. 585–592. MIT Press (2008)

  18. Haeberlen, A., Flannery, E., Ladd, A.M., Rudys, A., Wallach, D.S., Kavraki, L.E.: Practical robust localization over large-scale 802.11 wireless networks. In: Proceedings of the 10th International Conference on Mobile Computing and Networking (MobiCom ’04), pp. 70–84 (2004)

  19. Kanagawa, M., Fukumizu, K.: Recovering distributions from Gaussian RKHS embeddings. J. Mach. Learn. Res. W&CP 33, 457–465 (2014)

  20. Kanagawa, M., Nishiyama, Y., Gretton, A., Fukumizu, K.: Monte Carlo filtering using kernel embedding of distributions. In: Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI-14), pp. 1897–1903 (2014)

  21. Kwok, J.Y., Tsang, I.: The pre-image problem in kernel methods. IEEE Trans. Neural Networks 15(6), 1517–1525 (2004)

  22. McCalman, L.: Function embeddings for multi-modal Bayesian inference. Ph.D. thesis, School of Information Technology, The University of Sydney (2013)

  23. McCalman, L., O’Callaghan, S., Ramos, F.: Multi-modal estimation with kernel embeddings for learning motion models. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 2845–2852 (2013)

  24. Mika, S., Schölkopf, B., Smola, A., Müller, K.R., Scholz, M., Rätsch, G.: Kernel PCA and de-noising in feature spaces. In: Advances in Neural Information Processing Systems 11, pp. 536–542. MIT Press (1999)

  25. Monbet, V., Ailliot, P., Marteau, P.: \(l^1\)-convergence of smoothing densities in non-parametric state space models. Stat. Infer. Stoch. Process. 11, 311–325 (2008)

  26. Moulines, E., Bach, F.R., Harchaoui, Z.: Testing for homogeneity with kernel Fisher discriminant analysis. In: Advances in Neural Information Processing Systems 20, pp. 609–616. Curran Associates, Inc. (2008)

  27. Quigley, M., Stavens, D., Coates, A., Thrun, S.: Sub-meter indoor localization in unmodified environments with inexpensive sensors. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2010), pp. 2039–2046 (2010)

  28. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press (2002)

  29. Song, L., Fukumizu, K., Gretton, A.: Kernel embeddings of conditional distributions: a unified kernel framework for nonparametric inference in graphical models. IEEE Sig. Process. Mag. 30(4), 98–111 (2013)

  30. Song, L., Huang, J., Smola, A., Fukumizu, K.: Hilbert space embeddings of conditional distributions with applications to dynamical systems. In: Proceedings of the 26th International Conference on Machine Learning (ICML 2009), pp. 961–968 (2009)

  31. Sriperumbudur, B.K., Fukumizu, K., Lanckriet, G.: Universality, characteristic kernels and RKHS embedding of measures. J. Mach. Learn. Res. 12, 2389–2410 (2011)

  32. Sriperumbudur, B.K., Gretton, A., Fukumizu, K., Schölkopf, B., Lanckriet, G.: Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res. 11, 1517–1561 (2010)

  33. Steinwart, I., Hush, D., Scovel, C.: Optimal rates for regularized least squares regression. In: Proceedings of the Conference on Learning Theory (COLT 2009), pp. 79–93 (2009)

  34. Thrun, S., Langford, J., Fox, D.: Monte Carlo hidden Markov models: learning non-parametric models of partially observable stochastic processes. In: Proceedings of the International Conference on Machine Learning (ICML 1999), pp. 415–424 (1999)

  35. Wan, E., van der Merwe, R.: The unscented Kalman filter for nonlinear estimation. In: Adaptive Systems for Signal Processing, Communications, and Control Symposium (AS-SPCC 2000), pp. 153–158. IEEE (2000)

  36. Widom, H.: Asymptotic behavior of the eigenvalues of certain integral equations. Trans. Am. Math. Soc. 109, 278–295 (1963)

  37. Widom, H.: Asymptotic behavior of the eigenvalues of certain integral equations II. Arch. Ration. Mech. Anal. 17, 215–229 (1964)

  38. Williams, C.K.I., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Advances in Neural Information Processing Systems 13, pp. 682–688. MIT Press (2001)

Acknowledgments

The author has been supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas 25120012.

Author information

Correspondence to Kenji Fukumizu.

Appendix: Proof of Theorem 1.3.2

We first show a lemma used to derive a convergence rate of the conditional kernel mean.

Lemma 1.5.1

Assume that the kernels are measurable and bounded. Let \(N(\varepsilon ):=\mathrm {Tr}[C_{YY}(C_{YY}+\varepsilon I)^{-1}]\), and let \(\varepsilon _n>0\) be a regularization constant such that \(\varepsilon _n\rightarrow 0\) as \(n\rightarrow \infty \). Then,

$$ \left\| (\widehat{C}^{(n)}_{YY}-C_{YY})(C_{YY}+\varepsilon _n I)^{-1}\right\| = O_p\left( \frac{1}{\varepsilon _n n} + \sqrt{\frac{N(\varepsilon _n)}{\varepsilon _n n}}\right) $$

and

$$ \left\| (\widehat{C}^{(n)}_{XY}-C_{XY})(C_{YY}+\varepsilon _n I)^{-1}\right\| = O_p\left( \frac{1}{\varepsilon _n n} + \sqrt{\frac{N(\varepsilon _n)}{\varepsilon _n n}}\right) $$

as \(n\rightarrow \infty \).

Proof

The first result is shown in [4] (p. 349). The proof of the second is similar, but it is given below for completeness.

Let \(\xi _{yx}\) be an element in \({\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {X}}\) defined by

$$ \xi _{yx}:=\bigl \{(C_{YY}+\varepsilon _n I)^{-1} k(\cdot ,y)\bigr \}\otimes k(\cdot ,x). $$

With the identification between \({\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {X}}\) and the Hilbert–Schmidt operators from \({\mathscr {H}_\mathscr {X}}\) to \({\mathscr {H}_\mathscr {Y}}\),

$$ E[\xi _{YX}]=(C_{YY}+\varepsilon _n I)^{-1}C_{YX}. $$

Take \(a>0\) such that \(k(x,x)\le a^2\) and \(k(y,y)\le a^2\) for all \(x\) and \(y\). It follows from \(\Vert f\otimes g\Vert =\Vert f\Vert \,\Vert g\Vert \) and \(\Vert (C_{YY}+\varepsilon _n I)^{-1}\Vert \le 1/\varepsilon _n\) that

$$ \Vert \xi _{yx}\Vert = \bigl \Vert (C_{YY}+\varepsilon _n I)^{-1} k(\cdot ,y) \bigr \Vert \bigl \Vert k(\cdot ,x)\bigr \Vert \le \frac{1}{\varepsilon _n} \Vert k(\cdot ,y)\Vert \,\Vert k(\cdot ,x)\Vert \le \frac{a^2}{\varepsilon _n}, $$

and

$$\begin{aligned} E\Vert \xi _{YX}\Vert ^2&= E\bigl \Vert \{(C_{YY}+\varepsilon _n I)^{-1} k(\cdot ,Y)\}\otimes k(\cdot ,X)\bigr \Vert ^2 \\&= E\Vert k(\cdot ,X)\Vert ^2 \,\bigl \Vert (C_{YY}+\varepsilon _n I)^{-1} k(\cdot ,Y)\bigr \Vert ^2 \\&\le a^2 E\bigl \Vert (C_{YY}+\varepsilon _n I)^{-1} k(\cdot ,Y)\bigr \Vert ^2 \\&= a^2 E\bigl \langle (C_{YY}+\varepsilon _n I)^{-2}k(\cdot ,Y),k(\cdot ,Y)\bigr \rangle \\&= a^2 E \mathrm {Tr}\bigl [ (C_{YY}+\varepsilon _n I)^{-2}(k(\cdot ,Y)\otimes k(\cdot ,Y)^*)\bigr ] \\&=a^2 \mathrm {Tr}\bigl [ (C_{YY}+\varepsilon _n I)^{-2}C_{YY}\bigr ] \\&\le \frac{a^2}{\varepsilon _n }\mathrm {Tr}\bigl [ (C_{YY}+\varepsilon _n I)^{-1}C_{YY}\bigr ] = \frac{a^2}{\varepsilon _n } N(\varepsilon _n). \end{aligned}$$

Here \(k(\cdot ,Y)^*\) is the dual element of \(k(\cdot ,Y)\), and \(k(\cdot ,Y)\otimes k(\cdot ,Y)^*\) is regarded as an operator on \({\mathscr {H}_\mathscr {Y}}\). In the last inequality, one factor of \((C_{YY}+\varepsilon _n I)^{-1}\) in the trace is replaced by its upper bound \(\varepsilon _n^{-1} I\). Since \(\frac{1}{n}\sum _{i=1}^n \xi _{Y_i X_i} = (C_{YY}+\varepsilon _n I)^{-1}\widehat{C}^{(n)}_{YX}\), it follows from Proposition 2 in [4] that for all \(n\in {\mathbb {N}}\) and \(0<\eta <1\)

$$\begin{aligned} \Pr \biggl ( \Bigg \Vert (C_{YY}+\varepsilon _n I)^{-1}\widehat{C}^{(n)}_{YX} - (C_{YY}+\varepsilon _n I)^{-1}&C_{YX} \Bigg \Vert \\&\ge 2\biggl (\frac{2a^2}{n\varepsilon _n} + \sqrt{\frac{a^2 N(\varepsilon _n)}{\varepsilon _n n}}\biggr )\log \frac{2}{\eta } \bigg ) \le \eta , \end{aligned}$$

which proves the assertion.\(\square \)
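As a computational aside, the quantity \(N(\varepsilon )\) appearing in the lemma (an effective dimension) can be evaluated from data by plugging in the empirical covariance operator \(\widehat{C}^{(n)}_{YY}\), whose nonzero eigenvalues coincide with those of \(G_Y/n\) for the Gram matrix \(G_Y=(k(y_i,y_j))_{ij}\). The sketch below is an illustration added here, not code from the chapter; the Gaussian kernel and the names gauss_gram, effective_dimension, eps, and sigma are assumptions.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    # Gram matrix k(a_i, b_j) for a Gaussian kernel (illustrative choice).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def effective_dimension(Y, eps, sigma=1.0):
    """Empirical N(eps) = Tr[C_YY (C_YY + eps I)^{-1}], with C_YY replaced by the
    empirical covariance operator; its nonzero eigenvalues are those of G_Y / n."""
    n = Y.shape[0]
    lam = np.linalg.eigvalsh(gauss_gram(Y, Y, sigma)) / n
    lam = np.clip(lam, 0.0, None)          # guard against tiny negative eigenvalues
    return float(np.sum(lam / (lam + eps)))

rng = np.random.default_rng(0)
Y = rng.normal(size=(300, 1))
for eps in (1e-1, 1e-2, 1e-3):
    print(eps, effective_dimension(Y, eps))  # N(eps) grows as eps decreases
```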

Proof of Theorem 1.3.2. First, we have

$$\begin{aligned}&\bigl \Vert \widehat{C}^{(n)}_{XY}(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0) - E[k_\mathscr {X}(\cdot ,X)|Y=y_0] \bigr \Vert _{\mathscr {H}_\mathscr {X}}\nonumber \\&\le \bigl \Vert \widehat{C}^{(n)}_{XY}(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0) - C_{XY}(C_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0) \Vert _{\mathscr {H}_\mathscr {X}}\end{aligned}$$
(1.19)
$$\begin{aligned}&\qquad + \bigl \Vert C_{XY}(C_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0)-E[k_\mathscr {X}(\cdot ,X)|Y=y_0] \bigr \Vert _{\mathscr {H}_\mathscr {X}}. \end{aligned}$$
(1.20)

Using the general formula \(A^{-1}-B^{-1}=A^{-1}(B-A)B^{-1}\) for invertible operators \(A\) and \(B\), the first term on the right-hand side of the above inequality, Eq. (1.19), is upper bounded by

$$\begin{aligned}&\bigl \Vert (\widehat{C}^{(n)}_{XY}-C_{XY})(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0)\bigr \Vert _{\mathscr {H}_\mathscr {X}}\nonumber \\&\qquad + \bigl \Vert C_{XY}(C_{YY}+\varepsilon _n I)^{-1}(C_{YY}-\widehat{C}^{(n)}_{YY})(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0)\bigr \Vert _{\mathscr {H}_\mathscr {X}}\\ \le&\bigl \Vert (\widehat{C}^{(n)}_{XY}-C_{XY})(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}\bigr \Vert \,\bigl \Vert k_\mathscr {Y}(\cdot ,y_0)\bigr \Vert _{\mathscr {H}_\mathscr {Y}}\\&\qquad + \frac{1}{\sqrt{\varepsilon _n}}\Vert C_{XX}\Vert ^{1/2} \bigl \Vert (\widehat{C}^{(n)}_{YY}-C_{YY})(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}\bigr \Vert \, \bigl \Vert k_\mathscr {Y}(\cdot ,y_0)\bigr \Vert _{\mathscr {H}_\mathscr {Y}}, \end{aligned}$$

where in the second inequality the decomposition \(C_{XY}=C_{XX}^{1/2}W_{XY}C_{YY}^{1/2}\) with some \(W_{XY}:{\mathscr {H}_\mathscr {Y}}\rightarrow {\mathscr {H}_\mathscr {X}}\) (\(\Vert W_{XY}\Vert \le 1\)) [2] is used. It follows from Lemma 1.5.1 that

$$\begin{aligned} \bigl \Vert \widehat{C}^{(n)}_{XY}(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0) - C_{XY}(C_{YY}&+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0) \Vert _{\mathscr {H}_\mathscr {X}}\\&=O_p\left( \varepsilon _n^{-1/2}\left\{ \frac{1}{\varepsilon _n n}+\sqrt{\frac{N(\varepsilon _n)}{\varepsilon _n n}}\right\} \right) , \end{aligned}$$

as \(n\rightarrow \infty \). It is known (Proposition 3, [4]) that, under the assumption on the decay rate of the eigenvalues, \(N(\varepsilon )\le \frac{b\beta }{b-1}\varepsilon ^{-1/b}\) holds with some \(\beta \ge 0\). Since \(\varepsilon _n^{-3/2}n^{-1} \ll \varepsilon _n^{-1-\frac{1}{2b}}n^{-1/2}\) for \(b>1\) and \(n\varepsilon _n \rightarrow \infty \), we have

$$\begin{aligned} \bigl \Vert \widehat{C}^{(n)}_{XY}(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0) - C_{XY}(C_{YY}+\varepsilon _n I)^{-1}&k_\mathscr {Y}(\cdot ,y_0) \Vert _{\mathscr {H}_\mathscr {X}}\nonumber \\&=O_p\left( \varepsilon _n^{-1-\frac{1}{2b}}n^{-1/2}\right) , \end{aligned}$$
(1.21)

as \(n\rightarrow \infty \).
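In more detail, the bound on \(N(\varepsilon _n)\) gives

$$\begin{aligned} \varepsilon _n^{-1/2}\sqrt{\frac{N(\varepsilon _n)}{\varepsilon _n n}} \le \sqrt{\tfrac{b\beta }{b-1}}\;\varepsilon _n^{-\frac{1}{2}}\,\varepsilon _n^{-\frac{1}{2b}}\,\varepsilon _n^{-\frac{1}{2}}\,n^{-\frac{1}{2}} = \sqrt{\tfrac{b\beta }{b-1}}\;\varepsilon _n^{-1-\frac{1}{2b}}\,n^{-\frac{1}{2}}, \end{aligned}$$

while the other term satisfies \(\varepsilon _n^{-1/2}\cdot \frac{1}{\varepsilon _n n}=\varepsilon _n^{-3/2}n^{-1}=\varepsilon _n^{-1-\frac{1}{2b}}n^{-1/2}\bigl (n\varepsilon _n^{1-1/b}\bigr )^{-1/2}\), which is of smaller order because \(n\varepsilon _n^{1-1/b}\ge n\varepsilon _n\rightarrow \infty \) for \(b>1\) and \(\varepsilon _n\le 1\).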

For the second term, Eq. (1.20), let \(\varTheta :=E[k(X,\tilde{X})|Y=\cdot ,\tilde{Y}=*]\in \mathscr {R}(C_{YY}\otimes C_{YY})\). Note that for any \(\varphi \in {\mathscr {H}_\mathscr {Y}}\) we have

$$\begin{aligned} \langle C_{XY}\varphi ,&C_{XY}\varphi \rangle =E[k(X,\tilde{X})\varphi (Y)\varphi (\tilde{Y})]\\&\quad =E\bigl [ E[k(X,\tilde{X})|Y,\tilde{Y}]\varphi (Y)\varphi (\tilde{Y})\bigr ] =\langle (C_{YY}\otimes C_{YY})\varTheta ,\varphi \otimes \varphi \rangle _{{\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}}. \end{aligned}$$

Similarly,

$$\begin{aligned} \langle C_{XY}\varphi , E[k(\cdot ,X)|Y=y_0]\rangle _{\mathscr {H}_\mathscr {X}}=\langle E[k&(X,\tilde{X})|Y=y_0,\tilde{Y}=*], C_{YY} \varphi \rangle _{{\mathscr {H}_\mathscr {Y}}} \\&=\langle (I\otimes C_{YY})\varTheta ,k(\cdot ,y_0)\otimes \varphi \rangle _{{\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}}. \end{aligned}$$

It follows from these equalities with \(\varphi =(C_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0)\) that

$$\begin{aligned}&\bigl \Vert C_{XY}(C_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0)-E[k_\mathscr {X}(\cdot ,X)|Y=y_0] \bigr \Vert _{\mathscr {H}_\mathscr {X}}^2 \\&=\bigl \langle \bigl \{ (C_{YY}+\varepsilon _n I)^{-1}C_{YY}\otimes (C_{YY}+\varepsilon _n I)^{-1}C_{YY} -I\otimes (C_{YY}+\varepsilon _n I)^{-1}C_{YY}\\&\qquad -(C_{YY}+\varepsilon _n I)^{-1}C_{YY}\otimes I + I\otimes I\bigr \}\varTheta , k_\mathscr {Y}(\cdot ,y_0)\otimes k_\mathscr {Y}(*,y_0)\bigr \rangle _{{\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}}. \end{aligned}$$

From the assumption \(\varTheta \in \mathscr {R}(C_{YY}\otimes C_{YY})\), there is \(\varPsi \in {\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}\) such that \(\varTheta = (C_{YY}\otimes C_{YY}) \varPsi \). Let \(\{\phi _i\}\) be the eigenvectors of \(C_{YY}\) with eigenvalues \(\lambda _1\ge \lambda _2\ge \cdots \ge 0\). Since the eigenvectors and eigenvalues of \(C_{YY}\otimes C_{YY}\) are given by \(\{\phi _i\otimes \phi _j\}_{ij}\) and \(\{\lambda _i\lambda _j\}_{ij}\), respectively, using the fact \((C_{YY}+\varepsilon _n I)^{-1}C_{YY}^2\phi _i=\bigl (\lambda _i^2/(\lambda _i+\varepsilon _n)\bigr )\phi _i\) and Parseval’s theorem, we have

$$\begin{aligned}&\bigl \Vert \bigl \{(C_{YY}+\varepsilon _n I)^{-1}C_{YY}\otimes (C_{YY}+\varepsilon _n I)^{-1}C_{YY} -I\otimes (C_{YY}+\varepsilon _n I)^{-1}C_{YY}\\&\qquad -(C_{YY}+\varepsilon _n I)^{-1}C_{YY}\otimes I + I\otimes I\bigr \}\varTheta \bigr \Vert _{{\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}}^2 \\&= \sum _{i,j}\Bigl \{ \frac{\lambda _i^2}{\lambda _i+\varepsilon _n}\frac{\lambda _j^2}{\lambda _j+\varepsilon _n}- \frac{\lambda _i^2\lambda _j}{\lambda _i+\varepsilon _n}-\frac{\lambda _i\lambda _j^2}{\lambda _j+\varepsilon _n}+\lambda _i\lambda _j\Bigr \}^2 \langle \phi _i\otimes \phi _j,\varPsi \rangle _{{\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}}^2 \\&= \varepsilon _n^4 \sum _{i,j}\Bigl \{\frac{\lambda _i\lambda _j}{(\lambda _i+\varepsilon _n)(\lambda _j+\varepsilon _n)}\Bigr \}^2 \langle \phi _i\otimes \phi _j,\varPsi \rangle _{{\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}}^2 \le \varepsilon _n^4 \Vert \varPsi \Vert _{{\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}}^2, \end{aligned}$$

which shows

$$\begin{aligned} \bigl \Vert C_{XY}(C_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0)-E[k_\mathscr {X}(\cdot ,X)|Y=y_0] \bigr \Vert _{\mathscr {H}_\mathscr {X}}= O(\varepsilon _n). \end{aligned}$$
(1.22)

By balancing Eqs. (1.21) and (1.22), the assertion is obtained with \(\varepsilon _n=n^{-b/(4b+1)}\).
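Indeed, equating the two rates in Eqs. (1.21) and (1.22),

$$ \varepsilon _n^{-1-\frac{1}{2b}}\,n^{-1/2} = \varepsilon _n \quad \Longleftrightarrow \quad \varepsilon _n^{\,2+\frac{1}{2b}} = n^{-1/2} \quad \Longleftrightarrow \quad \varepsilon _n = n^{-\frac{b}{4b+1}}, $$

with which both terms are of order \(n^{-b/(4b+1)}\); this rate approaches \(n^{-1/4}\) as \(b\rightarrow \infty \) (cf. Note 3).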

\({}\square \)

Copyright information

© 2015 The Author(s)

Cite this chapter

Fukumizu, K. (2015). Nonparametric Bayesian Inference with Kernel Mean Embedding. In: Peters, G., Matsui, T. (eds) Modern Methodology and Applications in Spatial-Temporal Modeling. SpringerBriefs in Statistics. Springer, Tokyo. https://doi.org/10.1007/978-4-431-55339-7_1
