Abstract
National statistical agencies and other data custodians collect and hold large volumes of survey and census data containing information vital for research and policy analysis. However, allowing analysis of these data while protecting respondent confidentiality has proved challenging. In this paper we focus on the remote analysis approach, under which a confidential dataset is held in a secure environment under the direct control of the data custodian agency. A computer system within the secure environment accepts a query from an analyst, runs it on the data, and returns the results to the analyst. In particular, the analyst has no direct access to the data and cannot view any microdata records. We further focus on fitting linear regression models to confidential data in the presence of outliers and influential points, such as are often present in business data. We propose a new method for protecting confidentiality in linear regression via a remote analysis system that provides additional confidentiality protection for outliers and influential points in the data. The method described in this paper was designed for the prototype DataAnalyser system developed by the Australian Bureau of Statistics; however, it would be suitable for similar remote analysis systems.
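The remote analysis workflow described in the abstract can be illustrated with a minimal Python sketch. This is not the method proposed in the paper: the perturbation step is omitted because the abstract does not specify it, and the sketch is not the DataAnalyser implementation. It only shows a server-side object that holds the data, fits a Huber-type robust line by iteratively reweighted least squares, and returns coefficients but no records. The class name `RemoteAnalysisServer`, the function names, the tuning constant `k = 1.345`, and the MAD-based scale estimate are all assumptions made for illustration.

```python
from statistics import median


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def huber_line_fit(x, y, k=1.345, max_iter=100, tol=1e-10):
    """Huber-type robust line fit via iteratively reweighted least squares.

    Illustrative assumption: the scale is re-estimated at each iteration
    from the median absolute deviation (MAD) of the residuals.
    """
    weights = [1.0] * len(x)
    intercept, slope = weighted_line_fit(x, y, weights)  # ordinary LS start
    for _ in range(max_iter):
        resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
        centre = median(resid)
        scale = median([abs(r - centre) for r in resid]) / 0.6745
        if scale == 0.0:
            break  # at least half of the residuals coincide; stop iterating
        # Huber weights: 1 inside the band, shrinking as k*scale/|r| outside.
        weights = [1.0 if abs(r) <= k * scale else k * scale / abs(r)
                   for r in resid]
        new_intercept, new_slope = weighted_line_fit(x, y, weights)
        converged = (abs(new_intercept - intercept) < tol
                     and abs(new_slope - slope) < tol)
        intercept, slope = new_intercept, new_slope
        if converged:
            break
    return intercept, slope


class RemoteAnalysisServer:
    """Hypothetical server: holds the data and answers model queries only."""

    def __init__(self, x, y):
        # Kept server-side; the leading underscore is a Python convention,
        # not an access control. A real system enforces the boundary itself.
        self._x = list(x)
        self._y = list(y)

    def fit_robust_line(self):
        """Run the query on the held data and return coefficients only."""
        intercept, slope = huber_line_fit(self._x, self._y)
        return {"intercept": intercept, "slope": slope}


# Example query: data on the line y = 1 + 2x with one gross outlier.
x_demo = list(range(10))
y_demo = [1.0 + 2.0 * xi for xi in x_demo]
y_demo[9] = 100.0
demo_result = RemoteAnalysisServer(x_demo, y_demo).fit_robust_line()
```

In this toy example the analyst receives only the fitted intercept and slope. The robust fit down-weights the outlying point, which is why robust estimation is a natural starting point when outliers and influential points need extra protection. Unperturbed coefficients can still leak information, which is the problem the proposed method addresses.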
Acknowledgments
This work was done partly while Christine O’Keefe was on secondment to the Australian Bureau of Statistics, and partly while Soonmin Kwon and Soomin Song were Industrial Trainees in CSIRO. The authors thank Mark Westcott for reviewing an earlier draft and prompting a correction. The authors also thank the anonymous referee for comments and questions that have led to an improved exposition and paper. Views expressed in this paper are those of the authors and do not necessarily represent those of the Australian Bureau of Statistics. Where quoted or used, they should be attributed clearly to the authors.
Cite this article
O’Keefe, C.M., Ayre, T., Lucie, S. et al. Perturbed robust linear estimating equations for confidentiality protection in remote analysis. Stat Comput 27, 775–787 (2017). https://doi.org/10.1007/s11222-016-9653-2
DOI: https://doi.org/10.1007/s11222-016-9653-2