Vecchia Approximations of Gaussian-Process Predictions

Abstract

Gaussian processes (GPs) are highly flexible function estimators used for geospatial analysis, nonparametric regression, and machine learning, but they are computationally infeasible for large datasets. Vecchia approximations of GPs have been used to enable fast evaluation of the likelihood for parameter inference. Here, we study Vecchia approximations of spatial predictions at observed and unobserved locations, including obtaining joint predictive distributions at large sets of locations. We consider a general Vecchia framework for GP predictions, which contains some novel and some existing special cases. We study the accuracy and computational properties of these approaches theoretically and numerically, proving that our new methods exhibit linear computational complexity in the total number of spatial locations. We show that certain choices within the framework can have a strong effect on uncertainty quantification and computational cost, which leads to specific recommendations on which methods are most suitable for various settings. We also apply our methods to a satellite dataset of chlorophyll fluorescence, showing that the new methods are faster or more accurate than existing methods and reduce unrealistic artifacts in prediction maps. Supplementary materials accompanying this paper appear on-line.

This is a preview of subscription content, log in to check access.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10

References

  1. Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2004). Hierarchical Modeling and Analysis for Spatial Data. Chapman & Hall, Boca Raton.

    Google Scholar 

  2. Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society, Series B, 70(4):825–848.

    MathSciNet  MATH  Google Scholar 

  3. Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society, Series B, 70(1):209–226.

    MathSciNet  MATH  Google Scholar 

  4. Cressie, N. and Wikle, C. K. (2011). Statistics for Spatio-Temporal Data. Wiley, Hoboken, NJ.

    Google Scholar 

  5. Curriero, F. C. and Lele, S. (1999). A composite likelihood approach to semivariogram estimation. Journal of Agricultural, Biological, and Environmental Statistics, 4(1):9–28.

    MathSciNet  Google Scholar 

  6. Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812.

    MathSciNet  Google Scholar 

  7. Du, J., Zhang, H., and Mandrekar, V. S. (2009). Fixed-domain asymptotic properties of tapered maximum likelihood estimators. The Annals of Statistics, 37:3330–3361.

    MathSciNet  MATH  Google Scholar 

  8. Eidsvik, J., Shaby, B. A., Reich, B. J., Wheeler, M., and Niemi, J. (2014). Estimation and prediction in spatial models with block composite likelihoods using parallel computing. Journal of Computational and Graphical Statistics, 23(2):295–315.

    MathSciNet  Google Scholar 

  9. Erisman, A. M. and Tinney, W. F. (1975). On computing certain elements of the inverse of a sparse matrix. Communications of the ACM, 18(3):177–179.

    MathSciNet  MATH  Google Scholar 

  10. Finley, A. O., Datta, A., Cook, B. C., Morton, D. C., Andersen, H. E., and Banerjee, S. (2019). Efficient algorithms for Bayesian nearest neighbor Gaussian processes. Journal of Computational and Graphical Statistics, 28(2):401–414.

    MathSciNet  Google Scholar 

  11. Finley, A. O., Sang, H., Banerjee, S., and Gelfand, A. E. (2009). Improving the performance of predictive process modeling for large datasets. Computational Statistics & Data Analysis, 53(8):2873–2884.

    MathSciNet  MATH  Google Scholar 

  12. Foreman-Mackey, D., Agol, E., Ambikasaran, S., and Angus, R. (2017). Fast and scalable Gaussian process modeling with applications to astronomical time series. The Astronomical Journal, 154:220.

    Google Scholar 

  13. Furrer, R., Genton, M. G., and Nychka, D. (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 15(3):502–523.

    MathSciNet  Google Scholar 

  14. Gneiting, T. and Katzfuss, M. (2014). Probabilistic forecasting. Annual Review of Statistics and Its Application, 1(1):125–151.

    Google Scholar 

  15. Gramacy, R. B. and Apley, D. W. (2015). Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578.

    MathSciNet  Google Scholar 

  16. Guan, K., Berry, J. A., Zhang, Y., Joiner, J., Guanter, L., Badgley, G., and Lobell, D. B. (2016). Improving the monitoring of crop productivity using spaceborne solar-induced fluorescence. Global Change Biology, 22(2):716–726.

    Google Scholar 

  17. Guinness, J. (2018). Permutation methods for sharpening Gaussian process approximations. Technometrics, 60(4):415–429.

    MathSciNet  Google Scholar 

  18. Guinness, J. (2019). Spectral density estimation for random fields via periodic embeddings. Biometrika, 106(2):267–286.

    MathSciNet  MATH  Google Scholar 

  19. Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D. W., Sun, F., and Zammit-Mangion, A. (2019). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological, and Environmental Statistics, 24(3):398–425.

    MathSciNet  MATH  Google Scholar 

  20. Higdon, D. (1998). A process-convolution approach to modelling temperatures in the North Atlantic Ocean. Environmental and Ecological Statistics, 5(2):173–190.

    Google Scholar 

  21. Jones, D. R., Schonlau, M., and W. J. Welch (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492.

    MathSciNet  MATH  Google Scholar 

  22. Jurek, M. and Katzfuss, M. (2018). Multi-resolution filters for massive spatio-temporal data. arXiv:1810.04200.

  23. Katzfuss, M. (2017). A multi-resolution approximation for massive spatial datasets. Journal of the American Statistical Association, 112(517):201–214.

    MathSciNet  Google Scholar 

  24. Katzfuss, M. and Cressie, N. (2011). Spatio-temporal smoothing and EM estimation for massive remote-sensing data sets. Journal of Time Series Analysis, 32(4):430–446.

    MathSciNet  MATH  Google Scholar 

  25. Katzfuss, M. and Gong, W. (2019). A class of multi-resolution approximations for large spatial datasets. Statistica Sinica. https://doi.org/10.5705/ss.202018.0285.

  26. Katzfuss, M. and Guinness, J. (2019). A general framework for Vecchia approximations of Gaussian processes. Statistical Science, accepted.

  27. Katzfuss, M., Guinness, J., and Lawrence, E. (2020a). Scaled Vecchia approximation for fast computer-model emulation. arXiv:2005.00386.

  28. Katzfuss, M., Jurek, M., Zilber, D., Gong, W., Guinness, J., Zhang, J., and Schäfer, F. (2020b). GPvecchia: Fast Gaussian-process inference using Vecchia approximations. R package version 0.1.3.

  29. Kaufman, C. G., Schervish, M. J., and Nychka, D. W. (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association, 103(484):1545–1555.

    MathSciNet  MATH  Google Scholar 

  30. Kelly, B. C., Becker, A. C., Sobolewska, M., Siemiginowska, A., and Uttley, P. (2014). Flexible and scalable methods for quantifying stochastic variability in the era of massive time-domain astronomical data sets. Astrophysical Journal, 788(1):33.

    Google Scholar 

  31. Kennedy, M. C. and O’Hagan, A. (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B, 63(3):425–464.

    MathSciNet  MATH  Google Scholar 

  32. Le, N. D. and Zidek, J. V. (2006). Statistical Analysis of Environmental Space-Time Processes. Springer, Berlin.

    Google Scholar 

  33. Li, S., Ahmed, S., Klimeck, G., and Darve, E. (2008). Computing entries of the inverse of a sparse matrix using the FIND algorithm. Journal of Computational Physics, 227(22):9408–9427.

    MathSciNet  MATH  Google Scholar 

  34. Lin, L., Yang, C., Meza, J., Lu, J., Ying, L., and Weinan, E. (2011). SelInv—An algorithm for selected inversion of a sparse symmetric matrix. ACM Transactions on Mathematical Software, 37(4):40.

    MathSciNet  MATH  Google Scholar 

  35. Lindgren, F., Rue, H., and Lindström, J. (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B, 73(4):423–498.

    MathSciNet  MATH  Google Scholar 

  36. Nychka, D. W., Bandyopadhyay, S., Hammerling, D., Lindgren, F., and Sain, S. R. (2015). A multi-resolution Gaussian process model for the analysis of large spatial data sets. Journal of Computational and Graphical Statistics, 24(2):579–599.

    MathSciNet  Google Scholar 

  37. OCO-2 Science Team, Gunson, M., and Eldering, A. (2015). OCO-2 Level 2 bias-corrected solar-induced fluorescence and other select fields from the IMAP-DOAS algorithm aggregated as daily files, retrospective processing V7r. https://disc.gsfc.nasa.gov/datacollection/OCO2_L2_Lite_SIF_7r.html

  38. Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959.

    MathSciNet  MATH  Google Scholar 

  39. Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge.

    Google Scholar 

  40. Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Applications. CRC Press, Boca Raton.

    Google Scholar 

  41. Sang, H., Jun, M., and Huang, J. Z. (2011). Covariance approximation for large multivariate spatial datasets with an application to multiple climate model errors. Annals of Applied Statistics, 5(4):2519–2548.

    MathSciNet  MATH  Google Scholar 

  42. Schäfer, F., Katzfuss, M., and Owhadi, H. (2020). Sparse Cholesky factorization by Kullback–Leibler minimization. arXiv:2004.14455.

  43. Schäfer, F., Sullivan, T. J., and Owhadi, H. (2017). Compression, inversion, and approximate PCA of dense kernel matrices at near-linear computational complexity. arXiv:1706.02205.

  44. Snelson, E. and Ghahramani, Z. (2007). Local and global sparse Gaussian process approximations. In Artificial Intelligence and Statistics 11 (AISTATS).

  45. Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems.

  46. Stein, M. L., Chi, Z., and Welty, L. (2004). Approximating likelihoods for large spatial data sets. Journal of the Royal Statistical Society: Series B, 66(2):275–296.

    MathSciNet  MATH  Google Scholar 

  47. Stroud, J. R., Stein, M. L., and Lysen, S. (2017). Bayesian and maximum likelihood estimation for Gaussian processes on an incomplete lattice. Journal of Computational and Graphical Statistics, 26(1):108–120.

    MathSciNet  Google Scholar 

  48. Sun, Y., Frankenberg, C., Jung, M., Joiner, J., Guanter, L., Köhler, P., and Magney, T. (2018). Overview of solar-induced chlorophyll fluorescence (sif) from the orbiting carbon observatory-2: Retrieval, cross-mission comparison, and global monitoring for gpp. Remote Sensing of Environment, 209:808–823.

    Google Scholar 

  49. Sun, Y., Frankenberg, C., Wood, J. D., Schimel, D. S., Jung, M., Guanter, L., Drewry, D., Verma, M., Porcar-Castell, A., Griffis, T. J., et al. (2017). Oco-2 advances photosynthesis observation from space via solar-induced chlorophyll fluorescence. Science, 358(6360):eaam5747.

    Google Scholar 

  50. Sun, Y. and Stein, M. L. (2016). Statistically and computationally efficient estimating equations for large spatial datasets. Journal of Computational and Graphical Statistics, 25(1):187–208.

    MathSciNet  Google Scholar 

  51. Tzeng, S. and Huang, H.-C. (2018). Resolution adaptive fixed rank kriging. Technometrics, 60(2):198–208.

    MathSciNet  Google Scholar 

  52. Vecchia, A. (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society, Series B, 50(2):297–312.

    MathSciNet  Google Scholar 

  53. Vecchia, A. (1992). A new method of prediction for spatial regression models with correlated errors. Journal of the Royal Statistical Society, Series B, 54(3):813–830.

    MathSciNet  MATH  Google Scholar 

  54. Wang, Y., Khardon, R., and Protopapas, P. (2012). Nonparametric Bayesian estimation of periodic light curves. Astrophysical Journal, 756(1):67.

    Google Scholar 

  55. Wikle, C. K. and Cressie, N. (1999). A dimension-reduced approach to space-time Kalman filtering. Biometrika, 86(4):815–829.

    MathSciNet  MATH  Google Scholar 

  56. Zhang, B., Sang, H., and Huang, J. Z. (2019). Smoothed full-scale approximation of Gaussian process models for computation of large spatial datasets. Statistica Sinica, 29:1711–1737.

    MathSciNet  MATH  Google Scholar 

  57. Zilber, D. and Katzfuss, M. (2019). Vecchia-Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data. arXiv:1906.07828.

Download references

Acknowledgements

Katzfuss’ research was partially supported by National Science Foundation (NSF) Grant DMS–1521676 and NSF CAREER Grant DMS–1654083. Guinness’ research was partially supported by NSF Grant DMS–1613219 and NIH Grant No. R01ES027892. The authors would like to thank Anirban Bhattacharya, David Jones, Jennifer Hoeting, and several anonymous reviewers for helpful comments and suggestions. Jingjie Zhang and Marcin Jurek contributed to the R package GPvecchia, and Florian Schäfer provided C code for the exact maxmin ordering.

Author information

Affiliations

Authors

Corresponding author

Correspondence to Matthias Katzfuss.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1544 KB)

Supplementary material 2 (zip 11555 KB)

Appendices

Vector and Indexing Notation

As an example, define \(\mathbf{y} = (\mathbf{y} _1,\mathbf{y} _2,\mathbf{y} _3,\mathbf{y} _4,\mathbf{y} _5)\) as a vector of vectors. Vectors of indices are used for defining subvectors. For example, if \(o = (4,1,2)\) is a vector of indices, then \(\mathbf{y} _o = (\mathbf{y} _4,\mathbf{y} _1,\mathbf{y} _2)\). Unions of vectors are vectors and are defined when the two vectors have the same type and when the ordering is specified. For example, if \(\mathbf{z} = (\mathbf{z} _1,\mathbf{z} _2)\), then \(\mathbf{y} _o \cup \mathbf{z} = (\mathbf{y} _4,\mathbf{y} _1, \mathbf{z} _1, \mathbf{y} _2, \mathbf{z} _2)\) defines the union of \(\mathbf{y} _o\) and \(\mathbf{z} \). Intersection is defined similarly and uses the \(\cap \) notation. The ordering of the elements of the union or intersection can be defined alternatively via an index function \(\#\) taking in an element and a vector and returning the index occupied by the element in the vector. Continuing the example above, \(\#(\mathbf{y} _4, \mathbf{y} ) = 4\), whereas \(\#(\mathbf{y} _4, \mathbf{y} _o \cup \mathbf{z} ) = 1\). A full description of the vector and indexing notation can be found in Katzfuss and Guinness (2019, App. A).

Computing \(\mathbf{U }\)

We recapitulate here the formulas for computing \(\mathbf{U }\) from Katzfuss and Guinness (2019). Let g(i) denote the vector of indices of the elements in \(\mathbf{x} \) on which \(x_i\) conditions. Also define \(C(x_i,x_j)\) as the covariance between \(x_i\) and \(x_j\) implied by the true model in Sect. 2; that is, \(C(y_i,y_j) = C(z_i,y_j) = K(\mathbf{s} _i,\mathbf{s} _j)\) and \(C(z_i,z_j) = K(\mathbf{s} _i,\mathbf{s} _j) + \mathbb {1}_{i=j} \tau ^2_i\). Then, the (ji)th element of \(\mathbf{U }\) can be calculated as

$$\begin{aligned} \mathbf{U }_{ji} = {\left\{ \begin{array}{ll} d_i^{-1/2}, &{} i=j,\\ -b_{i}^{(j)} d_i^{-1/2}, &{} j \in g(i), \\ 0, &{}\text {otherwise}, \end{array}\right. } \end{aligned}$$
(9)

where \(\mathbf{b} _i'= C(x_i,\mathbf{x} _{g(i)}) C(\mathbf{x} _{g(i)},\mathbf{x} _{g(i)})^{-1}\), \(d_i = C(x_i,x_i) - \mathbf{b} _i' C(\mathbf{x} _{g(i)},x_i)\), and \(b_i^{(j)}\) denotes the kth element of \(\mathbf{b} _i\) if j is the kth element in g(i) (i.e., \(b_i^{(j)}\) is the element of \(\mathbf{b} _i\) corresponding to \(x_j\)).

Prediction with Unknown Parameters

In practice, most GP models depend on an unknown parameter vector \(\varvec{\theta }\), which we will make explicit here. Likelihood approximation for parameter inference is discussed in detail in Katzfuss and Guinness (2019), but we will review it briefly here. Integration of \(\widehat{f}(\mathbf{x} |\varvec{\theta })\) in (3) with respect to \(\mathbf{y} \) results in the following Vecchia likelihood (Katzfuss and Guinness 2019, Prop. 2):

$$\begin{aligned} \textstyle -2 \log \widehat{f}(\mathbf{z} _o|\varvec{\theta })= & {} -2 \sum _{i} \log \mathbf{U }_{ii} + 2\sum _{i} \log \mathbf{V} _{ii} + {\tilde{\mathbf{z }}}'{\tilde{\mathbf{z }}} \nonumber \\&- (\mathbf{V} ^{-1}\mathbf{U }_{\ell ,\bullet }{\tilde{\mathbf{z }}})'(\mathbf{V} ^{-1}\mathbf{U }_{\ell ,\bullet }{\tilde{\mathbf{z }}}) + n \log (2\pi ), \end{aligned}$$
(10)

where \(\mathbf{U }\) and \(\mathbf{V} \) implicitly depend on \(\varvec{\theta }\), and \({\tilde{\mathbf{z }}} = \mathbf{U }_{r,\bullet }'\mathbf{z} _o\). The computational cost for evaluating this Vecchia likelihood is often low, and Katzfuss and Guinness (2019) provide conditions on the g(i) under which the cost is guaranteed to be linear in n.

This allows for various forms of likelihood-based parameter inference. In a frequentist setting, we can compute \({\hat{\varvec{\theta }}} = \hbox {arg max}_{\varvec{\theta }} \log \widehat{f}(\mathbf{z} _o|\varvec{\theta })\) and then compute summaries of the posterior predictive distribution \(\widehat{f}(\mathbf{y} |\mathbf{z} _o) = \mathcal {N}_n(\varvec{\mu }({\hat{\varvec{\theta }}}),\varvec{\Sigma }({\hat{\varvec{\theta }}}))\) as described in Sect. 3.3. This often has low computational cost but ignores uncertainty in \({\hat{\varvec{\theta }}}\). An example is given in Sect. 6.

In a Bayesian setting, given a prior distribution \(f(\varvec{\theta })\), a Metropolis–Hastings sampler can be used for parameters whose posterior or full-conditional distribution are not available in closed form. At the \((l+1)\)th step of the algorithm, one would propose a new value \(\varvec{\theta }^{(P)} \sim q(\varvec{\theta }|\varvec{\theta }^{(l)})\) and accept it with probability \(\min (1,h(\varvec{\theta }^{(P)},\varvec{\theta }^{(l)})/h(\varvec{\theta }^{(l)},\varvec{\theta }^{(P)}))\), where \(h(\varvec{\theta },{\tilde{\varvec{\theta }}}) = f(\varvec{\theta })\widehat{f}(\mathbf{z} _o|\varvec{\theta }) q({\tilde{\varvec{\theta }}}|\varvec{\theta })\). After burn-in and thinning, this results in a sample, say, \(\varvec{\theta }^{(1)},\ldots ,\varvec{\theta }^{(L)}\), leading to a Gaussian-mixture prediction: \(\widehat{f}(\mathbf{y} |\mathbf{z} _o) = (1/L) \sum _{l=1}^L \mathcal {N}_n(\varvec{\mu }(\varvec{\theta }^{(l)}),\varvec{\Sigma }(\varvec{\theta }^{(l)}))\). See Katzfuss and Guinness (2019, App. E) for an example for the use of RF-full in this setting. For more complicated Bayesian hierarchical models, inference can be carried out using a Gibbs sampler in which \(\mathbf{y} \) is sampled from its full-conditional distribution as described in item 4. in Sect. 3.3.

Numerical Nonzeros in \(\mathbf{V} \)

For simplicity, we focus here on RF-full, although numerical nonzeros can similarly occur for RF-stand, LF-full, and LF-ind (see Figure S1). For RF-full, the upper triangle of \(\mathbf{W} \) is at least as dense as the upper triangle of \(\mathbf{V} \). Specifically, for \(j<i\), \(\mathbf{V} _{ji}=\mathbf{U }_{\ell _j\ell _i}=0\) unless \(j \in q_y(i)\). From Katzfuss and Guinness (Katzfuss and Guinness (2019), Prop. 3.2), we have that \(\mathbf{W} _{ji} =0\) unless \(j \in q_y(i)\) or \(\exists k > i\) such that \(i,j \in q_y(k)\). Thus, for any pair \(j<i\) such that \(j \notin q_y(i)\) but \(i,j \in q_y(k)\) for some \(k>i\), we generally have \(\mathbf{V} _{ji} = 0\) and \(\mathbf{W} _{ji} \ne 0\).

From the standard Cholesky algorithm, we can derive that the algorithm for \(\mathbf{V} = {{\,\mathrm{rchol}\,}}(\mathbf{W} )\) computes \(\mathbf{V} _{ji}\) as

$$\begin{aligned} \textstyle \mathbf{V} _{ji} = (\mathbf{W} _{ji} - \sum _{k=i+1}^n\mathbf{V} _{ik}\mathbf{V} _{jk})/\mathbf{V} _{ii}. \end{aligned}$$

Thus, for any pair \(j<i\) such that \(\mathbf{V} _{ji}=0\) but \(\mathbf{W} _{ji} \ne 0\), we know that \(\mathbf{W} _{ji} = \sum _{k=i+1}^n\mathbf{V} _{ik}\mathbf{V} _{jk}\) theoretically, but due to potential numerical error it is not guaranteed that this equation holds exactly. A numerical nonzero is introduced in \(\mathbf{V} \) whenever a rounding error occurs in \(\sum _{k=i+1}^n\mathbf{V} _{ik}\mathbf{V} _{jk}\), which relies on (potentially many) previous calculations in the Cholesky algorithm. Such numerical nonzeros are avoided by extracting \(\mathbf{V} =\mathbf{U }_{\ell \ell }\) as a submatrix of \(\mathbf{U }\) (as proposed in Sect. 4.1), instead of explicitly carrying out the Cholesky factorization \(\mathbf{V} = {{\,\mathrm{rchol}\,}}(\mathbf{W} )\).

Implications for Selected Inverse

When \(\mathbf{V} \) is computed by copying a submatrix from \(\mathbf{U }\) to avoid numerical nonzeros, the selected inverse of this \(\mathbf{V} \) is not guaranteed to return the exact posterior variances of \(\mathbf{y} \), unless \(\mathbf{V} \) is “padded” with zeros, which results in additional costs. This is because the selected inverse operates on the symbolic nonzero elements; that is, it operates on all elements that have to be computed in the Cholesky, even if they cancel to zero numerically (which is the case for many entries in our case). Denoting by \(\mathbf{S} \) the selected inverse of \(\mathbf{W} \) based on \(\mathbf{V} \), a close look at the Takahashi recursions reveals that for all jk with \(j,k \in q_y(i)\), we need \(\mathbf{S} _{ji}\) and \(\mathbf{S} _{kj}\). The latter element is only calculated if \(j \in q_y(k)\). However, if \(j \notin q_y(k)\), \(\mathbf{S} _{kj} ={{\,\mathrm{cov}\,}}(y_j,y_k|\mathbf{z} )\) will typically be very small (if m is reasonably large), because their corresponding locations will likely be far away from each other and data can be observed in between. In our experiments, the additional approximation error introduced by the selected inverse was negligible relative to the error introduced by the Vecchia approximation itself. When m becomes large enough for the Vecchia approximation to be accurate, the additional approximation error introduced by SelInv goes to zero as well.

If the exact variances implied by the Vecchia approximation \(\widehat{f}(\mathbf{y} |\mathbf{z} _o)\) are desired, they can be computed as \({{\,\mathrm{diag}\,}}({{\,\mathrm{var}\,}}(\mathbf{y} _p|\mathbf{z} _o)) = ((\mathbf{V} ^{-1}\mathbf{I} _{\bullet p})\circ (\mathbf{V} ^{-1}\mathbf{I} _{\bullet p}))'\mathbf 1 _{n}\). For RF-full and RF-stand, this requires \(\mathcal {O}(nmn_P)\) time, as described in Sect. 4.3.3, and so the overall computational complexity would not be increased if \(n_P = \mathcal {O}(m^2)\).

Proofs

In this section, we provide proofs for the propositions stated throughout the article.

Proof of Proposition 1

Part 1 of this proposition is equivalent to Thm. 1 in Guinness (2018). We prove the statement here again in a different way, which can be easily extended to prove Part 2.

First, consider a generic Vecchia approximation of a vector \(\mathbf{x} \) of length n: \(\widehat{f}(\mathbf{x} ) = \prod _{i=1}^n f(x_i|\mathbf{x} _{g(i)}) = \prod _{i=1}^n \mathcal {N}(x_i|\mu _{i|g(i)},\sigma ^2_{i|g(i)})\). Then,

$$\begin{aligned} \textstyle (-2) {{\,\mathrm{\mathbb {E}}\,}}^\mathbf{x} \log \widehat{f}(\mathbf{x} ) = \sum _{i=1}^n \log {{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{g(i)}) + \sum _{i=1}^n {{\,\mathrm{\mathbb {E}}\,}}^\mathbf{x} w_i^2 + n \log (2\pi ), \end{aligned}$$

where \(w_i = (x_i - \mu _{i|g(i)})/\sigma _{i|g(i)}\). We have \({{\,\mathrm{\mathbb {E}}\,}}(w_i) = 0\), because \({{\,\mathrm{\mathbb {E}}\,}}(\mu _{i|g(i)}) = {{\,\mathrm{\mathbb {E}}\,}}{{\,\mathrm{\mathbb {E}}\,}}(x_i|\mathbf{x} _{g(i)}) = {{\,\mathrm{\mathbb {E}}\,}}(x_i)\). We also have \({{\,\mathrm{var}\,}}(w_i)=1\), because

$$\begin{aligned} {{\,\mathrm{var}\,}}(x_i-\mu _{i|g(i)})= & {} {{\,\mathrm{var}\,}}{{\,\mathrm{\mathbb {E}}\,}}\big (x_i-{{\,\mathrm{\mathbb {E}}\,}}(x_i|\mathbf{x} _{g(i)})|\mathbf{x} _{g(i)}\big ) + {{\,\mathrm{var}\,}}\big (x_i-{{\,\mathrm{\mathbb {E}}\,}}(x_i|\mathbf{x} _{g(i)})|\mathbf{x} _{g(i)}\big ) \\= & {} 0 + \sigma ^2_{i|g(i)}. \end{aligned}$$

Hence, \(w_i \sim \mathcal {N}(0,1)\), and so \({{\,\mathrm{\mathbb {E}}\,}}^\mathbf{x} w_i^2 = 1\) and

$$\begin{aligned} \textstyle (-2) {{\,\mathrm{\mathbb {E}}\,}}^\mathbf{x} \log \widehat{f}(\mathbf{x} ) = \sum _{i=1}^n \log {{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{g(i)}) + n + n \log (2\pi ). \end{aligned}$$

Because the exact density \(f(\mathbf{x} )\) is a special case of Vecchia with \(g(i) = (1,\ldots ,i-1)\), we have

$$\begin{aligned} \textstyle \text {KL}\big (f(\mathbf{x} )\Vert \widehat{f}(\mathbf{x} )\big ) = {{\,\mathrm{\mathbb {E}}\,}}^\mathbf{x} \log f(\mathbf{x} ) - {{\,\mathrm{\mathbb {E}}\,}}^\mathbf{x} \log \widehat{f}(\mathbf{x} ) = \frac{1}{2} \sum _{i=1}^n \log \frac{{{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{g(i)})}{{{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{1:i-1})}. \end{aligned}$$
(11)

For \(g_1(i) \subset g_2(i)\), we can write, say, \(g_2(i) = g_1(i) \cup c(i)\). Using the law of total variance,

$$\begin{aligned} {{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{g_1(i)}) = {{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{g_1(i)}, \mathbf{x} _{c(i)}) + {{\,\mathrm{var}\,}}({{\,\mathrm{\mathbb {E}}\,}}(x_i|\mathbf{x} _{g_1(i)}, \mathbf{x} _{c(i)})|\mathbf{x} _{c(i)}) \ge {{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{g_2(i)}).\nonumber \\ \end{aligned}$$
(12)

Now, Part 1 follows by combining (11) and (12) with the assumption that \(g_1(i) \subset g_2(i)\) for all i.

For Part 2, we consider \(\mathbf{x} =(\mathbf{x} ^{(1)},\mathbf{x} ^{(2)})\) with \(\mathbf{x} ^{(1)} = (x_1,\ldots ,x_{n_1})\) and \(\mathbf{x} ^{(2)} = (x_{n_1+1},\ldots ,x_{n_1+n_2})\). Then,

$$\begin{aligned} \textstyle&\text {CKL}\big (f(\mathbf{x} ^{(2)}|\mathbf{x} ^{(1)})\Vert \widehat{f}(\mathbf{x} ^{(2)}|\mathbf{x} ^{(1)})\big )\\&\quad = \int f(\mathbf{x} ^{(1)}) \int f(\mathbf{x} ^{(2)}|\mathbf{x} ^{(1)}) \log \big (f(\mathbf{x} ^{(2)}|\mathbf{x} ^{(1)})/\widehat{f}(\mathbf{x} ^{(2)}|\mathbf{x} ^{(1)})\big ) d\mathbf{x} ^{(2)} d\mathbf{x} ^{(1)} \\&\quad = {{\,\mathrm{\mathbb {E}}\,}}^\mathbf{x} \log f(\mathbf{x} ^{(2)}|\mathbf{x} ^{(1)}) - {{\,\mathrm{\mathbb {E}}\,}}^\mathbf{x} \log \widehat{f}(\mathbf{x} ^{(2)}|\mathbf{x} ^{(1)}) \\&\quad = \frac{1}{2} \sum _{i=n_1+1}^{n_1+n_2} \log \frac{{{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{g(i)})}{{{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{1:i-1})}, \end{aligned}$$

where the last equality can be shown almost identically to (11) above, noting that \(\widehat{f}(\mathbf{x} ^{(2)}|\mathbf{x} ^{(1)}) = \widehat{f}(\mathbf{x} )/\widehat{f}(\mathbf{x} ^{(1)}) = \prod _{i=n_1+1}^{n_1+n_2} f(x_i|\mathbf{x} _{g(i)})\). Part 2 follows by combining this result and (12) with the assumption that \(g_1(i) \subset g_2(i)\) for all \(i=n_1+1,\ldots ,n_1+n_2\). \(\square \)

Proof of Proposition 2

For all parts of this proposition, note that all variables in the model are conditionally independent of \(z_j\) given \(y_j\), and so conditioning on \(y_j\) is equivalent to conditioning on \(y_j\) and \(z_j\). For Part 1, we can thus verify easily that g(i) is equivalent to \((1,\ldots ,i-1)\), for all i and all methods under consideration, and so Part 1 follows from (11). Using Proposition 1, the proof for all other parts simply consists of showing that certain conditioning vectors contain certain other conditioning vectors (similar to Katzfuss and Guinness 2019, Prop. 4). For example, for Part 3, all response-first methods are based on the ordering \(\mathbf{x} = (\mathbf{z} _o,\mathbf{y} _o,\mathbf{y} _p)\) with nearest neighbor conditioning (under some restrictions for RF-stand and RF-ind), and so it is easy to see that \(g_m(i) \subset g_{m+1}(i)\) for all \(i \in \ell _p=(n+1,\ldots ,n+n_P)\). LF-full and LF-ind can equivalently be defined based on the ordering \((\mathbf{y} _o,\mathbf{z} _o,\mathbf{y} _p)\), in which case we also have \(g_m(i) \subset g_{m+1}(i)\) for all \(i \in \ell _p=(n+1,\ldots ,n+n_P)\). For Part 5, note that in RF-stand and RF-ind, \(\mathbf{y} _p\) does not condition on \(\mathbf{y} _o\). Hence, the distribution \(\widehat{f}(\mathbf{y} _p|\mathbf{z} _o)\) is equivalent to the distribution obtained under a simplified Vecchia approximation based on \(\mathbf{x} =(\mathbf{z} _o,\mathbf{y} _p)\) (i.e., with \(\mathbf{y} _o\) removed completely). It is straightforward to show that Part 5 holds for this simplified approximation. For Part 6, note that g(i) is the same for RF-full and RF-stand for all \(i=1,\ldots ,n\). For \(i \in p\), letting \(a(i) = q_y^{\text {RF-full}}(i) \cap o\), we have \(\mathbf{x} _{g^{\text {RF-full}}(i)}=(\mathbf{y} _{q_y^{\text {RF-stand}}},\mathbf{y} _{a(i)})=(\mathbf{y} _{q_y^{\text {RF-stand}}},\mathbf{y} _{a(i)},\mathbf{z} _{a(i)})\) and \(\mathbf{x} _{g^{\text {RF-stand}}(i)}=(\mathbf{y} _{q_y^{\text {RF-stand}}},\mathbf{z} _{a(i)})\), and so \(g^{\text {RF-stand}}(i) \subset g^{\text {RF-full}}(i)\) for all \(i \in p\). For Part 7, note that a GP with exponential covariance in 1-D is a Markov process, and so in (6) with \(q_y(i)=i-1\) and left-to-right ordering, we have \(f(y_i|\mathbf{y} _{q_y(i)})=f(y_i|y_{1},\ldots ,y_{i-1})\) and hence \(\widehat{f}(\mathbf{x} ) = f(\mathbf{x} )\). \(\square \)

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Katzfuss, M., Guinness, J., Gong, W. et al. Vecchia Approximations of Gaussian-Process Predictions. JABES (2020). https://doi.org/10.1007/s13253-020-00401-7

Download citation

Keywords

  • Computational complexity
  • Kriging
  • Large datasets
  • Sparsity
  • Spatial statistics