Nonparametric imputation method for nonresponse in surveys

Hasler, Caren; Craiu, Radu V.

doi:10.1007/s10260-019-00458-w

Original Paper
Published: 04 April 2019

Statistical Methods & Applications, volume 29, pages 25–48 (2020)

Abstract

Many imputation methods are based on a statistical model that assumes the variable of interest is a noisy observation of a function of the auxiliary variables or covariates. Misspecification of this function may lead to severe errors in estimation and to misleading conclusions. Imputation techniques can therefore benefit from flexible formulations that can capture a wide range of patterns. We consider the use of smoothing splines within an additive model framework to estimate the functional dependence between the variable of interest and the auxiliary variables. The estimator obtained allows us to build an imputation model in the case of multiple auxiliary variables. The performance of our method is assessed via numerical experiments involving simulated and real data.


References

Andreis F, Conti PL, Mecatti F (2018) On the role of weights rounding in applications of resampling based on pseudopopulations. Stat Neerl
Andridge RR, Little RJA (2010) A review of hot deck imputation for survey non-response. Int Stat Rev 78:40–64
Bellman R (1961) Adaptive control processes: a guided tour. Princeton University Press, Princeton
Berg E, Kim J-K, Skinner C (2016) Imputation under informative sampling. J Surv Stat Methodol 4(4):436–462
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Central Statistical Office (1993) Family expenditure survey, 1992 [computer file]. Technical report, Colchester, Essex: UK Data Archive [distributor]. SN: 3064. https://doi.org/10.5255/UKDA-SN-3064-1
Chauvet G, Deville J-C, Haziza D (2011) On balanced random imputation in surveys. Biometrika 98:459–471
Da Silva DN, Opsomer JD (2006) A kernel smoothing method of adjusting for unit non-response in sample surveys. Can J Stat 34(4):563–579
Da Silva DN, Opsomer JD (2009) Nonparametric propensity weighting for survey nonresponse through local polynomial regression. Surv Methodol 35(2):165–176
Eubank RL (1999) Nonparametric regression and spline smoothing, 2nd edn. Marcel Dekker, New York
Giommi A (1987) Nonparametric methods for estimating individual response probabilities. Surv Methodol 13(2):127–134
Green PJ, Silverman BW (1994) Nonparametric regression and generalized linear models. Chapman & Hall, Boca Raton
Gross ST (1980) Mean estimation in sample surveys. In: Proceedings of the survey research methods section. American Statistical Association, pp 181–184
Hastie TJ, Tibshirani RJ (1986) Generalized additive models. Stat Sci 1(3):297–318
Hastie TJ, Tibshirani RJ (1990) Generalized additive models. Chapman & Hall, Boca Raton
Haziza D (2009) Imputation and inference in the presence of missing data. In: Rao C (ed) Handbook of statistics, vol 29. Elsevier, Amsterdam, pp 215–246
Haziza D, Rao JNK (2005) Inference for domain means and totals under imputation for missing data. Can J Stat 33:149–161
Lee TCM (2003) Smoothing parameter selection for smoothing splines: a simulation study. Comput Stat Data Anal 42(1):139–148
Mashreghi Z, Léger C, Haziza D (2014) Bootstrap methods for imputed data from regression, ratio and hot-deck imputation. Can J Stat 42(1):142–167
Ning J, Cheng P (2012) A comparison study of nonparametric imputation methods. Stat Comput 22:273–285
Niyonsenga T (1994) Nonparametric estimation of response probabilities in sampling theory. Surv Methodol 20(2):177–184
Niyonsenga T (1997) Response probability estimation. J Stat Plan Inference 59:111–126
Qin J, Leung D, Shao J (2002) Estimation with survey data under nonignorable nonresponse or informative sampling. J Am Stat Assoc 97(457):193–200
Rubin DB (1976) Inference and missing data. Biometrika 63:581–592
Särndal C-E (1992) Methods for estimating the precision of survey estimates when imputation has been used. Surv Methodol 18(2):241–252
Shao J, Sitter RR (1996) Bootstrap for imputed survey data. J Am Stat Assoc 91:1278–1288
Sitter RR (1992a) Comparing three bootstrap methods for survey data. Can J Stat 20:135–154
Sitter RR (1992b) A resampling procedure for complex survey data. J Am Stat Assoc 87(416):755–765
Stekhoven DJ (2013) missForest: nonparametric missing value imputation using random forest. R package version 1:4
Stekhoven D, Buehlmann P (2012) MissForest: nonparametric missing value imputation for mixed-type data. Bioinformatics 28(1):112–118
Stone CJ (1985) Additive regression and other nonparametric models. Ann Stat 13(2):689–705
Wang Y (2011) Smoothing splines: methods and applications. Chapman & Hall, Boca Raton
Wood S (2003) Thin plate regression splines. J R Stat Soc Ser B (Stat Methodol) 65(1):95–114
Wood S (2008) Fast stable direct fitting and smoothness selection for generalized additive models. J R Stat Soc Ser B (Stat Methodol) 70(3):495–518
Wood S (2014) mgcv: mixed GAM computation vehicle with GCV/AIC/REML smoothness estimation. R package version 1.7-28. http://CRAN.R-project.org/package=mgcv
Zhang G, Christensen F, Zheng W (2013) Nonparametric regression estimators in complex surveys. J Stat Comput Simul 85(5):1026–1034

Acknowledgements

The authors thank Yves Tillé for his constructive suggestions. This research was supported by the Swiss National Science Foundation and the Natural Sciences and Engineering Research Council of Canada.

Author information

Caren Hasler
Present address: Department of Computer and Mathematical Sciences, University of Toronto Scarborough, 1265 Military Trail, Toronto, ON, M1C 1A4, Canada

Authors and Affiliations

Institute of Statistics, University of Neuchâtel, Av. de Bellevaux 51, 2000, Neuchâtel, Switzerland
Caren Hasler
Department of Statistical Sciences, University of Toronto, 100 St. Georges Street, Toronto, ON, M5S 3G3, Canada
Radu V. Craiu

Corresponding author

Correspondence to Caren Hasler.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Caren Hasler’s address when the research was conducted is “Institute of Statistics, University of Neuchâtel, Av. de Bellevaux 51, 2000 Neuchâtel, Switzerland”.

Appendix: Bootstrap variance when a randomization is applied

We repeated the simulations for the bootstrap variance of Sect. 6.1 with sampling fraction \(f = 0.3\) to study the impact of randomization on the quality of the variance estimates. Under SRSWOR, Procedure 1 (MMB) selected a sample of size 900 in step 1, that is \(n_h' = f \cdot n_h = 900\) with \(h=1\), and applied a randomization in step 2; Procedure 3 (extended BWO), for which k was non-integer, applied a randomization in step 1. Under stratified sampling, Procedure 1 (MMB) selected a sample of size 187 in each stratum h in step 1, that is \(n_h' = \lfloor f \cdot n_h \rfloor = 187\), where \(\lfloor \cdot \rfloor \) is the floor function, and applied a randomization in step 2; Procedure 3 (extended BWO) again applied a randomization in step 1. Randomization was therefore applied in all four cases.
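The randomization step itself is not spelled out in this appendix. A common way to round a non-integer quantity at random is to round up with probability equal to its fractional part, so that the rounded value is unbiased for the original one. The Python sketch below illustrates that idea only; the function name `randomized_round` and the value 4.3 are hypothetical and do not come from the paper, and the paper's Procedures 1 and 3 may randomize differently.

```python
import math
import random


def randomized_round(x, rng):
    """Round x to floor(x) or floor(x) + 1 so that the expected result equals x.

    Assumption: this is a generic unbiased rounding rule, not necessarily
    the randomization used in the paper's bootstrap procedures.
    """
    lower = math.floor(x)
    frac = x - lower
    # Round up with probability equal to the fractional part of x.
    return lower + (1 if rng.random() < frac else 0)


rng = random.Random(2019)  # fixed seed so the illustration is reproducible
draws = [randomized_round(4.3, rng) for _ in range(20000)]  # 4.3 is a toy value
mean_draw = sum(draws) / len(draws)
print(sorted(set(draws)), round(mean_draw, 2))
```

An integer input has fractional part zero and is returned unchanged, which matches the SRSWOR case above where \(f \cdot n_h = 900\) needs no rounding.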

Table 5 Monte Carlo variance of the total, Monte Carlo expectation of the bootstrap variance, and coverage rate associated with AM imputation for two different sampling designs and five populations

Table 5 shows the results. Under SRSWOR, whether the functional dependence between the variable of interest and the auxiliary variables is additive (populations 1 and 2) or not (populations 3, 4 and 5), the bootstrap variance is close to the variance obtained by simulation and leads to very good coverage rates (between 92% and 94%) across all five populations considered. Under stratified sampling, the bootstrap variance is greater than the variance obtained by simulation in four of the five populations considered. This difference is greater when the functional dependence between the variable of interest and the auxiliary variables is additive and strong (populations 1 and 2). We explain this phenomenon in what follows.
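For readers who want to reproduce summaries of this kind, the sketch below shows one plausible way to compute the three quantities reported in Table 5 from a set of simulation runs. It assumes a normal-based 95% confidence interval and a Monte Carlo variance with divisor m - 1; neither choice is stated in this appendix, and the function name `monte_carlo_summary` and the toy numbers are hypothetical.

```python
import math


def monte_carlo_summary(estimates, variance_estimates, true_total, z=1.96):
    """Summarize simulation runs of a total estimator.

    Returns (Monte Carlo variance of the estimates,
             Monte Carlo expectation of the variance estimates,
             coverage rate of the intervals estimate +/- z * sqrt(variance)).
    Assumption: normal-based intervals and divisor m - 1 for the variance.
    """
    m = len(estimates)
    mean_est = sum(estimates) / m
    mc_variance = sum((e - mean_est) ** 2 for e in estimates) / (m - 1)
    mean_boot_variance = sum(variance_estimates) / m
    covered = sum(
        1
        for e, v in zip(estimates, variance_estimates)
        if abs(e - true_total) <= z * math.sqrt(v)
    )
    return mc_variance, mean_boot_variance, covered / m


# Toy inputs, not taken from the paper: four runs with a true total of 100.
estimates = [98.0, 101.0, 103.0, 99.0]
variance_estimates = [4.0, 4.0, 1.0, 4.0]
mc_var, mean_boot_var, coverage = monte_carlo_summary(
    estimates, variance_estimates, true_total=100.0
)
print(mc_var, mean_boot_var, coverage)
```

A bootstrap variance that is on average larger than the Monte Carlo variance, as reported under stratified sampling, shows up here as `mean_boot_var > mc_var`.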

When a randomization is applied to round the non-integer \(k_h\) and/or \(n_h'\), as is the case here, the bootstrap variance contains two parts: the variance due to the randomization and the variance of the total estimator. When there is a strong additive functional dependence between the variable of interest and the auxiliary variables, the variance of the total estimator is small. A large portion of the bootstrap variance is then due to randomization, and the bootstrap variance overestimates the variance of the total. As the additive functional dependence between the variable of interest and the auxiliary variables weakens, the variance of the total estimator increases and the portion of the bootstrap variance due to randomization decreases, so the bootstrap variance gets closer to the variance of the total. Under stratified sampling, the portion of the variance due to randomization may be particularly large because randomization is applied within each stratum. This explains the difference between the bootstrap variance and the variance obtained by simulation under stratified sampling in Table 5. The simulations run on the real data of Sect. 6.2 confirm this explanation. In that setting, there is a moderate additive functional dependence between the variable of interest and the auxiliary variables. Stratified sampling was used and the randomization procedure was applied to round the non-integer quantities. The resulting bootstrap variance is close to the variance obtained by simulation and yields a coverage rate of 94%.
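One way to read the two-part decomposition described above is through the law of total variance. The notation below is an assumption introduced for illustration and does not appear in the paper: \(\hat{t}^{*}\) denotes the bootstrap version of the total estimator and \(R\) the outcome of the rounding randomization. The identity is generic and is not a formula taken from the paper.

```latex
\[
\operatorname{V}\bigl(\hat{t}^{*}\bigr)
  = \operatorname{E}_{R}\bigl\{\operatorname{V}\bigl(\hat{t}^{*} \mid R\bigr)\bigr\}
  + \operatorname{V}_{R}\bigl\{\operatorname{E}\bigl(\hat{t}^{*} \mid R\bigr)\bigr\}
\]
```

Under this reading, the first term is the resampling variance for a fixed rounding outcome, which is the part meant to track the variance of the total estimator, and the second term is the extra variability introduced by the randomization. The second term does not shrink when the first one does, which is consistent with the overestimation reported for populations 1 and 2.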

As these results show, randomization affects the quality of the variance estimates. We refer the reader to Andreis et al. (2018) for a discussion of weight-rounding problems in resampling. We also repeated the simulation in this section with the non-integer \(k_h\) and \(n_h'\) rounded to the nearest integer instead of randomized, which yields very similar results.

About this article

Cite this article

Hasler, C., Craiu, R.V. Nonparametric imputation method for nonresponse in surveys. Stat Methods Appl 29, 25–48 (2020). https://doi.org/10.1007/s10260-019-00458-w

Accepted: 23 March 2019
Published: 04 April 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s10260-019-00458-w
