Nested Kriging predictions for datasets with a large number of observations

Statistics and Computing

Abstract

This work falls within the context of predicting the value of a real function at some input locations, given a limited number of observations of this function. The Kriging interpolation technique (or Gaussian process regression) is often used to tackle such a problem, but the method suffers from its computational burden when the number of observation points is large. In this article we introduce nested Kriging predictors, which are constructed by aggregating sub-models based on subsets of the observation points. This approach is proven to have better theoretical properties than other aggregation methods found in the literature; in particular, contrary to some of these methods, the proposed aggregation method can be shown to be consistent. Finally, the practical interest of the method is illustrated on simulated datasets and on an industrial test case with \(10^4\) observations in a 6-dimensional space.



Acknowledgements

Part of this research was conducted within the frame of the Chair in Applied Mathematics OQUAIDO, gathering partners in technological research (BRGM, CEA, IFPEN, IRSN, Safran, Storengy) and academia (Ecole Centrale de Lyon, Mines Saint-Etienne, University of Grenoble, University of Nice, University of Toulouse and CNRS) around advanced methods for Computer Experiments. The authors would like to warmly thank Dr. Géraud Blatman and EDF R&D for providing the industrial test case. They also thank the editor and the reviewers for their precise and constructive comments on this paper. This paper was finished during a stay of D. Rullière at the Vietnam Institute for Advanced Study in Mathematics; he thanks the VIASM institute and the DAMI research chair (Data Analytics & Models for Insurance) for their support.

Author information


Corresponding author

Correspondence to Nicolas Durrande.

Appendix: Proof of Proposition 4

Complexities Under the chosen assumptions on the \(\alpha \) and \(\beta \) coefficients, for a regular tree and in the case of simple Kriging sub-models, \(\mathcal {C}_\alpha =\sum _{\nu =1}^{\bar{\nu }} \sum _{i=1}^{n_\nu } \alpha c_\nu ^3 =\alpha \sum _{\nu =1}^{\bar{\nu }} c_\nu ^3 n_\nu \) and \(\mathcal {C}_\beta =\sum _{\nu =1}^{\bar{\nu }} \sum _{i=2}^{n_\nu } \sum _{j=1}^{i-1}\beta c^2_{\nu } =\frac{\beta }{2} \sum _{\nu =1}^{\bar{\nu }} n_\nu (n_{\nu }-1) c^2_{\nu }\). Notice that the sums start from \(\nu =1\) in order to include the sub-model calculations.

Equilibrated tree complexities In a constant child number setting, where \(c_\nu =c\) for all \(\nu \), the tree structure ensures that \(n_{\nu }=n/c^{\nu }\). Thus, as \(c=n^{1/\bar{\nu }}\), we get when \(n \rightarrow +\infty \) that \(\mathcal {C}_\alpha \sim \alpha n^{1+\frac{2}{\bar{\nu }}}\) and \(\mathcal {C}_\beta \sim \frac{\beta }{2} n^2\). The result for the equilibrated two-layer tree, where \(\bar{\nu }=2\), derives directly from this one: in that case \(\mathcal {C}_\alpha \sim \alpha n^{2}\) and \(\mathcal {C}_\beta \sim \frac{\beta }{2} n^2\) (it also follows from the expressions of \(\mathcal {C}_\alpha \) and \(\mathcal {C}_\beta \) when \(c_1=c_2=\sqrt{n}\), \(n_1=\sqrt{n}\), \(n_2=1\)).

Optimal tree complexities One easily shows that under the chosen assumptions \(\mathcal {C}_\beta \sim \frac{\beta }{2}n^2\). Thus, it is indeed not possible to reduce the whole complexity to orders lower than \(O(n^2)\). However, one can choose the tree structure so as to reduce the complexity \(\mathcal {C}_\alpha \). For a regular tree, \(n_\nu =n/(c_1 \cdots c_{\nu })\), so that \(\frac{\partial }{\partial c_k} n_{\nu } = -\mathbf {1}_{{\lbrace {\nu \ge k}\rbrace }} n_\nu /c_k\). Using a Lagrange multiplier \(\ell \), one defines \(\xi (k)=c_k \frac{\partial }{\partial c_k} \left( \mathcal {C}_\alpha - \ell (c_1 \cdots c_{\bar{\nu }} -n) \right) = 3\alpha c_k^3 n_k - \alpha \sum _{\nu =k}^{\bar{\nu }}c_\nu ^3 n_\nu - \ell c_1 \cdots c_{\bar{\nu }}\). The tree structure that minimizes \(\mathcal {C}_\alpha \) is such that \(\xi (k)=\xi (k+1)=0\) for all \(k<\bar{\nu }\). Using \(c_{k+1} n_{k+1}=n_k\), one gets \(3c_{k+1}^2=2 c_{k}^3\) for all \(k<\bar{\nu }\), and imposing \(c_1\cdots c_{\bar{\nu }}=n\) yields \(c_\nu = \delta \left( \delta ^{-\bar{\nu }} n\right) ^{\frac{\delta ^{\nu -1}}{2(\delta ^{\bar{\nu }}-1)}}\), \(\nu =1, \ldots , \bar{\nu }\), with \(\delta =\frac{3}{2}\). Setting \(\gamma =\frac{27}{4}\delta ^{-\frac{\bar{\nu }}{\delta ^{\bar{\nu }}-1}}\left( 1-\delta ^{-\bar{\nu }}\right) \), some direct calculations show that this tree structure corresponds to the complexities \(\mathcal {C}_\alpha = \gamma \alpha n^{1+\frac{1}{\delta ^{\bar{\nu }}-1}}\) and \(\mathcal {C}_\beta \sim \frac{\beta }{2}n^2\). In a two-layer setting one gets \(c_1=\left( \frac{3}{2}\right) ^{1/5} n^{2/5}\) and \(c_2=\left( \frac{3}{2}\right) ^{-1/5} n^{3/5}\), which leads to \(\mathcal {C}_\alpha = \gamma \alpha n^{9/5}\) and \(\mathcal {C}_\beta = \frac{\beta }{2} n^2 - \frac{\beta }{2} \left( \frac{3}{2}\right) ^{\frac{1}{5}}n^{\frac{7}{5}}\), where \(\gamma =(\frac{2}{3})^{-2/5}+(\frac{2}{3})^{3/5}\simeq 1.96\). Notice finally that even for values of n of order \(10^5\), terms of order \(n^{9/5}\) are not necessarily negligible compared to those of order \(n^2\); moreover, \(\mathcal {C}_\beta \) is only slightly affected by the choice of the tree structure, but the global complexity benefits from the optimization of \(\mathcal {C}_\alpha \).
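To make these orders concrete, here is a minimal numerical sketch, assuming \(\alpha =\beta =1\). It is not the paper's code; the helper names tree_complexities and optimal_child_counts are hypothetical, and child counts are treated as real numbers for the asymptotic check. It evaluates \(\mathcal {C}_\alpha \) and \(\mathcal {C}_\beta \) for the equilibrated and optimal two-layer trees with \(n=10^4\).

```python
# Numerical check of the complexity orders above, assuming alpha = beta = 1.
# Sketch only; helper names are hypothetical, not from the paper's code.

def tree_complexities(child_counts, n, alpha=1.0, beta=1.0):
    """Return (C_alpha, C_beta) for a regular tree with child counts
    c_1, ..., c_nubar and n observations, where n_nu = n / (c_1 ... c_nu)."""
    c_alpha, c_beta, n_nu = 0.0, 0.0, float(n)
    for c in child_counts:                    # levels nu = 1, ..., nubar
        n_nu /= c                             # n_nu = n / (c_1 ... c_nu)
        c_alpha += alpha * c**3 * n_nu        # C_alpha = alpha * sum c_nu^3 n_nu
        c_beta += beta / 2.0 * n_nu * (n_nu - 1.0) * c**2
    return c_alpha, c_beta

def optimal_child_counts(n, nubar, delta=1.5):
    """c_nu = delta * (delta^-nubar * n)^(delta^(nu-1) / (2 (delta^nubar - 1)))."""
    return [delta * (n / delta**nubar) ** (delta**(nu - 1) / (2.0 * (delta**nubar - 1.0)))
            for nu in range(1, nubar + 1)]

n, nubar = 10**4, 2
equilibrated = [n ** (1.0 / nubar)] * nubar    # c_1 = c_2 = sqrt(n)
optimal = optimal_child_counts(n, nubar)       # c_1 = (3/2)^(1/5) n^(2/5), c_2 = (3/2)^(-1/5) n^(3/5)
print(tree_complexities(equilibrated, n))      # C_alpha ~ n^2, C_beta ~ n^2 / 2
print(tree_complexities(optimal, n))           # C_alpha ~ 1.96 n^(9/5), noticeably below n^2
```

With \(n=10^4\), the optimized tree brings \(\mathcal {C}_\alpha \) from roughly \(10^8\) down to roughly \(3\times 10^7\), consistent with \(\gamma \alpha n^{9/5}\).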

Storage footprint First, covariances can be stored in triangular matrices, so the temporary objects M, k and K in Algorithm 1 require the storage of \(c_{\max }(c_{\max }+5)/2\) real values. For a given step \(\nu \), \(\nu \ge 2\), building all vectors \(\alpha _i\) requires the storage of \(\sum _{i=1}^{n_{\nu }} c_i^{\nu }=n_{\nu -1}\) values. Finally, for a given step \(\nu \), we simultaneously need the objects \(M_{\nu -1}, K_{\nu -1}, M_{\nu }, K_{\nu }\), which require the storage of \(n_{\nu -1}(n_{\nu -1}+3)/2 + n_{\nu }(n_{\nu }+3)/2\) real values. In a regular tree, as \(n_\nu \) is decreasing in \(\nu \), the storage footprint is \(\mathcal {S} = (c_{\max }(c_{\max }+5) + n_1(n_1+5) + n_2(n_2+3))/2\). Hence the equivalents of \(\mathcal {S}\) for the different tree structures: \(\mathcal {S}\sim n\) for the two-layer equilibrated tree, \(\mathcal {S}\sim \frac{1}{2}n^{2-2/\bar{\nu }}\) for the \(\bar{\nu }\)-layer tree with \(\bar{\nu }>2\), and the indicated result for the optimal tree. The proposition states simple orders only, which avoids separating the case \(\bar{\nu }=2\) and carrying a cumbersome constant for the optimal tree.
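A similar back-of-the-envelope sketch for \(\mathcal {S}\), with the same caveats (hypothetical function name, child counts treated as real numbers), evaluates the footprint for the two- and three-layer equilibrated trees.

```python
# Storage footprint S from the appendix, for a regular tree.
# Sketch only; the function name is hypothetical, not from the paper's code.

def storage_footprint(child_counts, n):
    """S = (c_max(c_max + 5) + n_1(n_1 + 5) + n_2(n_2 + 3)) / 2, with
    n_nu = n / (c_1 ... c_nu) and n_nu decreasing in nu."""
    c_max = max(child_counts)
    n1 = n / child_counts[0]                  # n_1 = n / c_1
    n2 = n1 / child_counts[1]                 # n_2 = n / (c_1 c_2)
    return (c_max * (c_max + 5) + n1 * (n1 + 5) + n2 * (n2 + 3)) / 2.0

n = 10**4
print(storage_footprint([n**0.5] * 2, n))       # two-layer equilibrated: S ~ n
print(storage_footprint([n**(1/3.0)] * 3, n))   # three-layer: S ~ n^(4/3) / 2
```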



Cite this article

Rullière, D., Durrande, N., Bachoc, F. et al. Nested Kriging predictions for datasets with a large number of observations. Stat Comput 28, 849–867 (2018). https://doi.org/10.1007/s11222-017-9766-2

