Abstract
Many imputation methods are based on a statistical model that assumes the variable of interest is a noisy observation of a function of the auxiliary variables or covariates. Misspecification of this function may lead to severe errors in estimation and to misleading conclusions. Imputation techniques can therefore benefit from flexible formulations that can capture a wide range of patterns. We consider the use of smoothing splines within an additive model framework to estimate the functional dependence between the variable of interest and the auxiliary variables. The estimator obtained allows us to build an imputation model in the case of multiple auxiliary variables. The performance of our method is assessed via numerical experiments involving simulated and real data.
References
Andreis F, Conti PL, Mecatti F (2018) On the role of weights rounding in applications of resampling based on pseudopopulations. Stat Neerl
Andridge RR, Little RJA (2010) A review of hot deck imputation for survey non-response. Int Stat Rev 78:40–64
Bellman R (1961) Adaptive control processes: a guided tour. Princeton University Press, Princeton
Berg E, Kim J-K, Skinner C (2016) Imputation under informative sampling. J Surv Stat Methodol 4(4):436–462
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Central Statistical Office (1993) Family expenditure survey, 1992 [computer file]. Technical report, Colchester, Essex: UK Data Archive [distributor]. SN: 3064. https://doi.org/10.5255/UKDA-SN-3064-1
Chauvet G, Deville J-C, Haziza D (2011) On balanced random imputation in surveys. Biometrika 98:459–471
Da Silva DN, Opsomer JD (2006) A kernel smoothing method of adjusting for unit non-response in sample surveys. Can J Stat 34(4):563–579
Da Silva DN, Opsomer JD (2009) Nonparametric propensity weighting for survey nonresponse through local polynomial regression. Surv Methodol 35(2):165–176
Eubank RL (1999) Nonparametric regression and spline smoothing, 2nd edn. Marcel Dekker, New York
Giommi A (1987) Nonparametric methods for estimating individual response probabilities. Surv Methodol 13(2):127–134
Green PJ, Silverman BW (1994) Nonparametric regression and generalized linear models. Chapman & Hall, Boca Raton
Gross ST (1980) Mean estimation in sample surveys. In: Proceedings of the survey research methods section. American Statistical Association, pp 181–184
Hastie TJ, Tibshirani RJ (1986) Generalized additive models. Stat Sci 1(3):297–318
Hastie TJ, Tibshirani RJ (1990) Generalized additive models. Chapman & Hall, Boca Raton
Haziza D (2009) Imputation and inference in the presence of missing data. In: Rao C (ed) Handbook of statistics, vol 29. Elsevier, Amsterdam, pp 215–246
Haziza D, Rao JNK (2005) Inference for domain means and totals under imputation for missing data. Can J Stat 33:149–161
Lee TCM (2003) Smoothing parameter selection for smoothing splines: a simulation study. Comput Stat Data Anal 42(1):139–148
Mashreghi Z, Léger C, Haziza D (2014) Bootstrap methods for imputed data from regression, ratio and hot-deck imputation. Can J Stat 42(1):142–167
Ning J, Cheng P (2012) A comparison study of nonparametric imputation methods. Stat Comput 22:273–285
Niyonsenga T (1994) Nonparametric estimation of response probabilities in sampling theory. Surv Methodol 20(2):177–184
Niyonsenga T (1997) Response probability estimation. J Stat Plan Inference 59:111–126
Qin J, Leung D, Shao J (2002) Estimation with survey data under nonignorable nonresponse or informative sampling. J Am Stat Assoc 97(457):193–200
Rubin DB (1976) Inference and missing data. Biometrika 63:581–592
Särndal C-E (1992) Methods for estimating the precision of survey estimates when imputation has been used. Surv Methodol 18(2):241–252
Shao J, Sitter RR (1996) Bootstrap for imputed survey data. J Am Stat Assoc 91:1278–1288
Sitter RR (1992a) Comparing three bootstrap methods for survey data. Can J Stat 20:135–154
Sitter RR (1992b) A resampling procedure for complex survey data. J Am Stat Assoc 87(416):755–765
Stekhoven DJ (2013) missForest: nonparametric missing value imputation using random forest. R package version 1.4
Stekhoven D, Buehlmann P (2012) Missforest—nonparametric missing value imputation for mixed-type data. Bioinformatics 28(1):112–118
Stone CJ (1985) Additive regression and other nonparametric models. Ann Stat 13(2):689–705
Wang Y (2011) Smoothing splines: methods and applications. Chapman & Hall, Boca Raton
Wood S (2003) Thin plate regression splines. J R Stat Soc Ser B (Stat Methodol) 65(1):95–114
Wood S (2008) Fast stable direct fitting and smoothness selection for generalized additive models. J R Stat Soc Ser B (Stat Methodol) 70(3):495–518
Wood S (2014) mgcv: mixed GAM computation vehicle with GCV/AIC/REML smoothness estimation. R package version 1.7-28. http://CRAN.R-project.org/package=mgcv
Zhang G, Christensen F, Zheng W (2013) Nonparametric regression estimators in complex surveys. J Stat Comput Simul 85(5):1026–1034
Acknowledgements
The authors thank Yves Tillé for his constructive suggestions. This research was supported by the Swiss National Science Foundation and the Natural Sciences and Engineering Research Council of Canada.
Additional information
Caren Hasler’s address when the research was conducted is “Institute of Statistics, University of Neuchâtel, Av. de Bellevaux 51, 2000 Neuchâtel, Switzerland”.
Appendix: Bootstrap variance when a randomization is applied
We repeated the bootstrap variance simulations of Sect. 6.1 with sampling fraction \(f = 0.3\) to study the impact of randomization on the quality of the variance estimates. Under SRSWOR, Procedure 1 (MMB) was applied with a sample of size \(n_h' = f \cdot n_h = 900\), \(h=1\), selected in step 1 and a randomization applied in step 2; Procedure 3 (extended BWO) was applied with a randomization in step 1, because k was non-integer. Under stratified sampling, Procedure 1 (MMB) was applied with a sample of size \(n_h' = \lfloor f \cdot n_h \rfloor = 187\) selected in each stratum h in step 1, where \(\lfloor \cdot \rfloor \) is the floor function, and a randomization applied in step 2; Procedure 3 (extended BWO) was applied with a randomization in step 1. Randomization was thus applied in all four cases.
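The randomization used to handle a non-integer quantity can be illustrated with a minimal sketch. It assumes the simplest unbiased scheme, in which a value is rounded up with probability equal to its fractional part and rounded down otherwise. The function name, the seed and the value 2.4 are illustrative and do not come from the paper; the selection probabilities actually used in Procedures 1 and 3 may differ.

```python
import math
import random


def randomized_round(x, rng):
    """Round x to floor(x) or floor(x) + 1 so that the expected value equals x.

    Assumed scheme (not taken from the paper): round up with probability
    equal to the fractional part of x, round down otherwise.
    """
    lower = math.floor(x)
    frac = x - lower
    # An integer x has frac == 0 and is therefore returned unchanged.
    return lower + (1 if rng.random() < frac else 0)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
k = 2.4                 # hypothetical non-integer number of replications
draws = [randomized_round(k, rng) for _ in range(10000)]
mean = sum(draws) / len(draws)
print(sorted(set(draws)), round(mean, 2))
```

Each draw is one of the two integers bracketing k, and the average over many draws is close to k. The rounded value nevertheless varies from one bootstrap replicate to the next, and this is the extra source of variability discussed in this appendix.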
Table 5 shows the results. Under SRSWOR, whether the functional dependence between the variable of interest and the auxiliary variables is additive (populations 1 and 2) or not (populations 3, 4 and 5), the bootstrap variance is close to the variance obtained by simulation and leads to very good coverage rates (between 92% and 94%) across all five populations considered. Under stratified sampling, the bootstrap variance is greater than the variance obtained by simulation in four of the five populations considered. The difference is larger when the functional dependence between the variable of interest and the auxiliary variables is additive and strong (populations 1 and 2). We explain this phenomenon in what follows.
When a randomization is applied to round the non-integer \(k_h\) and/or \(n_h'\), as is the case here, the bootstrap variance contains two parts: the variance due to the randomization and the variance of the total estimator. When there is a strong additive functional dependence between the variable of interest and the auxiliary variables, the variance of the total estimator is small. A substantial portion of the bootstrap variance is then due to randomization, so the bootstrap variance overestimates the variance of the total estimator. As the additive functional dependence between the variable of interest and the auxiliary variables weakens, the variance of the total estimator increases, the portion of the bootstrap variance due to randomization decreases, and the bootstrap variance gets closer to the variance of the total estimator. Under stratified sampling, the portion of the variance due to randomization may be particularly large because randomization is applied within each stratum. This explains the difference between the bootstrap variance and the variance obtained by simulation under stratified sampling in Table 5. The simulations run on the real data of Sect. 6.2 confirm this explanation. In that setting, there is a moderate additive functional dependence between the variable of interest and the auxiliary variables, stratified sampling was used, and the randomization procedure was applied to round the non-integer quantities. The resulting bootstrap variance is close to the variance obtained by simulation and yields a coverage rate of 94%.
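One way to make the two parts explicit is the law of total variance. The notation below is ours and is only a sketch: \(\hat{t}^*\) denotes the bootstrap version of the total estimator, \(R\) the outcome of the randomization (the rounded values of \(k_h\) and/or \(n_h'\)), and the subscript \(*\) the bootstrap resampling. Reading the first term as the part tied to the variance of the total estimator and the second as the part due to randomization is our interpretation of the text, not a result stated in the paper.

\[
\mathrm{Var}_*(\hat{t}^*) = \mathrm{E}_R\left\{ \mathrm{Var}_*(\hat{t}^* \mid R) \right\} + \mathrm{Var}_R\left\{ \mathrm{E}_*(\hat{t}^* \mid R) \right\}
\]

Under this reading, the second term does not shrink when the variance of the total estimator is small, which is consistent with the overestimation observed for populations 1 and 2 under stratified sampling.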
As these results show, randomization affects the quality of the variance estimates. We refer the reader to Andreis et al. (2018) for a discussion of weight-rounding problems in resampling. We also repeated the simulations of this section with the non-integer \(k_h\) and \(n_h'\) rounded to the nearest integer instead of randomized, which yields very similar results.
Cite this article
Hasler, C., Craiu, R.V. Nonparametric imputation method for nonresponse in surveys. Stat Methods Appl 29, 25–48 (2020). https://doi.org/10.1007/s10260-019-00458-w
DOI: https://doi.org/10.1007/s10260-019-00458-w