Abstract
Spatial prediction and variable selection for the study area are both important issues in geostatistics. If spatially varying means exist among different subareas, globally fitting a spatial regression model to observations over the whole study area may not be suitable. To alleviate deviations from spatial model assumptions, this paper proposes a methodology for locally selecting variables in each subarea based on a locally empirical conditional Akaike information criterion. In this way, the global spatial dependence of observations is taken into account and the local characteristics of each subarea are also identified. The result is a composite spatial predictor that provides more accurate spatial prediction of the response variables of interest in terms of mean squared prediction errors. Further, the corresponding prediction variance is evaluated based on a resampling method. Statistical inferences of the proposed methodology are justified both theoretically and numerically. Finally, a mercury data set for lakes in Maine, USA is analyzed for illustration.
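The core idea of the abstract, choosing a possibly different set of covariates for each subarea by minimizing a local information criterion, can be sketched with a toy example. This is a minimal illustration, not the paper's method: the function name `local_ic_select`, the plain ordinary-least-squares fits, and the simple AIC-style score are all assumptions; the paper's local conditional AIC additionally accounts for the spatial dependence of the observations.

```python
import numpy as np

def local_ic_select(X, y, subarea, candidates, penalty=2.0):
    """For each subarea, fit every candidate column subset by OLS and keep
    the subset minimizing an AIC-like score n_a*log(RSS/n_a) + penalty*p.
    (Toy simplification: spatial dependence is ignored here.)"""
    chosen = {}
    for a in np.unique(subarea):
        idx = subarea == a
        best, best_score = None, np.inf
        for cols in candidates:
            Xa = X[idx][:, cols]
            beta, *_ = np.linalg.lstsq(Xa, y[idx], rcond=None)
            rss = np.sum((y[idx] - Xa @ beta) ** 2)
            n_a = int(idx.sum())
            score = n_a * np.log(rss / n_a + 1e-12) + penalty * len(cols)
            if score < best_score:
                best, best_score = cols, score
        chosen[a] = best
    return chosen

# Two subareas whose true mean structures use different covariates.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
subarea = np.repeat([0, 1], n // 2)
y = np.where(subarea == 0, 3 * X[:, 0], 3 * X[:, 1]) + 0.1 * rng.normal(size=n)
candidates = [[0], [1], [0, 1]]
sel = local_ic_select(X, y, subarea, candidates)
```

Each subarea ends up with its own selected subset, which is the ingredient combined into the composite predictor described in the abstract.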
Acknowledgements
We thank the Editor, an associate editor, and two anonymous referees for their helpful comments and suggestions. This work was supported by the Ministry of Science and Technology of Taiwan under Grant MOST 104-2118-M-018-002-MY2.
Appendix
Proof of Theorem 1 Taking the expectation on both sides of (14), we have the following results for any \(M\in \mathcal {M}\) and \(a=1,\dots ,A\).
where \([\varvec{H}_M(\varvec{\theta })]_{ii}\) denotes the \(i\)th diagonal element of the matrix \(\varvec{H}_M(\varvec{\theta })\), \(k_a\equiv \sum\nolimits _{\{i:\varvec{s}_i\in D_a\}}\left[ \varvec{H}_M(\varvec{\theta })\right] _{ii}\) is associated with subarea \(D_a\) and is a constant when the model parameters are known, and \(n_a>0\) denotes the number of observations in subarea \(D_a\). The fifth equality is based on \(Z(\varvec{s}_i)=S(\varvec{s}_i)+\varepsilon (\varvec{s}_i)\), and the sixth equality follows from the independence of \(\varepsilon (\varvec{s}_i)\sim N(0,\sigma ^2_{\varepsilon })\) and \(S(\varvec{s}_i)\). Hence, to complete the proof of (15), it remains to show
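The quantity \(k_a\) is a partial trace: the sum of the diagonal entries of \(\varvec{H}_M(\varvec{\theta })\) over the sites falling in subarea \(D_a\), so the \(k_a\) sum to \(\mathrm{tr}\,\varvec{H}_M(\varvec{\theta })\). A small numerical sketch, assuming only that \(\varvec{H}_M\) is the \(n\times n\) smoothing matrix mapping observations to fitted values (a GLS hat matrix with an exponential covariance serves here as an illustrative stand-in, not the paper's exact \(\varvec{H}_M\)):

```python
import numpy as np

def partial_traces(H, subarea):
    """k_a = sum of diagonal entries of H over the sites in subarea D_a."""
    d = np.diag(H)
    return {a: d[subarea == a].sum() for a in np.unique(subarea)}

rng = np.random.default_rng(1)
n, p = 40, 2
s = rng.uniform(0, 1, size=(n, 2))                 # site coordinates
X = np.column_stack([np.ones(n), s[:, 0]])         # design matrix, p = 2
D = np.linalg.norm(s[:, None] - s[None, :], axis=2)
V = np.exp(-D)                                     # exponential covariance
Vi = np.linalg.inv(V)
H = X @ np.linalg.solve(X.T @ Vi @ X, X.T @ Vi)    # GLS hat matrix
subarea = (np.arange(n) < n // 2).astype(int)      # arbitrary split into two subareas
k = partial_traces(H, subarea)
```

For a GLS hat matrix the full trace equals \(p\), so the per-subarea constants \(k_a\) recover \(p\) when summed over all subareas.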
Because \(Z(\varvec{s}_i)=S(\varvec{s}_i)+\varepsilon (\varvec{s}_i)\), we have
where the last equality follows from \(Cov\left( S(\varvec{s}_i),\hat{S}_M(\varvec{s}_i;\varvec{\theta })\big |\varvec{S}\right) =0\). In addition, because \(\varvec{Z}|\varvec{S}\sim N(\varvec{S},\sigma ^2_{\varepsilon }\varvec{I})\), we have the following result for all \(i=1,\dots ,n\).
It follows from (11), (12), (35), and (36) that
where \(\left[ \varvec{H}_M(\varvec{\theta })\right] _i\) denotes the ith row of matrix \(\varvec{H}_M(\varvec{\theta })\), the third equality is based on \(\varvec{Z}=\varvec{S}+\varvec{\varepsilon }\), and the fourth equality follows from \(E\left( \left[ \varvec{H}_M(\varvec{\theta })\right] _i \varvec{\varepsilon }\big |\varvec{S}\right) =0\). Thus, we obtain the desired result based on (34) and (37). This completes the proof.
Proof of Corollary 1 From the definitions of \(CAIC(M;\varvec{\theta })\) and \(LCAIC(M;D_a;\varvec{\theta })\) in (13) and (14), we know that
Taking the expectation on both sides of the above equality, we have
where the second equality is based on the result of (15) in Theorem 1, \(n_a>0\) is the number of observations in subarea \(D_a\) for \(a=1,\dots ,A\), and \( \sum\nolimits _{a=1}^A n_a=n\) is the total number of observations in the study area \(D\). Thus, we obtain the desired result, which completes the proof.
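In summary, the relation established above can be sketched as follows. This display is a hedged reconstruction from the surrounding text, under the assumption, consistent with the proof, that the local criteria partition the global criterion by subarea:

```latex
\[
\mathrm{CAIC}(M;\varvec{\theta })
  \;=\; \sum_{a=1}^{A}\mathrm{LCAIC}(M;D_a;\varvec{\theta }),
\qquad
E\!\left[\mathrm{CAIC}(M;\varvec{\theta })\right]
  \;=\; \sum_{a=1}^{A}E\!\left[\mathrm{LCAIC}(M;D_a;\varvec{\theta })\right].
\]
```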
Cite this article
Chen, CS., Chen, CS. A composite spatial predictor via local criteria under a misspecified model. Stoch Environ Res Risk Assess 32, 341–355 (2018). https://doi.org/10.1007/s00477-017-1438-4