Perturbed robust linear estimating equations for confidentiality protection in remote analysis

Statistics and Computing

Abstract

National statistical agencies and other data custodians collect and hold a vast amount of survey and census data containing information vital for research and policy analysis. However, allowing analysis of these data while protecting respondent confidentiality has proved challenging. In this paper we focus on the remote analysis approach, under which a confidential dataset is held in a secure environment under the direct control of the data custodian agency. A computer system within the secure environment accepts a query from an analyst, runs it on the data, then returns the results to the analyst. In particular, the analyst has no direct access to the data and cannot view any microdata records. We further focus on fitting linear regression models to confidential data in the presence of outliers and influential points, such as are often present in business data. We propose a new method for protecting confidentiality in linear regression via a remote analysis system that provides additional confidentiality protection for outliers and influential points in the data. The method described in this paper was designed for the prototype DataAnalyser system developed by the Australian Bureau of Statistics; however, it would be suitable for similar remote analysis systems.
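
As an illustration only, the following minimal Python sketch shows the general pattern described above: a robust regression is fitted inside the secure environment, and noise is added to the estimating equations before the coefficients are released. It is not the authors' algorithm. The Huber weight function, the MAD-based scale estimate, the Gaussian noise, the restriction to a single covariate, and all names (`huber_weight`, `solve_weighted`, `robust_fit`, `noise_sd`, `seed`) are assumptions made for this example. The sketch down-weights large residuals only and does not treat high-leverage points separately.

```python
import random
import statistics


def huber_weight(r, c=1.345):
    """Huber weight for a scaled residual: 1 inside [-c, c], c/|r| outside."""
    a = abs(r)
    return 1.0 if a <= c else c / a


def solve_weighted(xs, ys, ws, e0=0.0, e1=0.0):
    """Solve the weighted normal equations for y = b0 + b1*x.

    (e0, e1) is an optional perturbation added to the right-hand side,
    so the released estimate solves a perturbed estimating equation.
    """
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys)) + e0
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys)) + e1
    det = sw * swxx - swx * swx
    b0 = (swy * swxx - swx * swxy) / det
    b1 = (sw * swxy - swx * swy) / det
    return b0, b1


def robust_fit(xs, ys, noise_sd=0.0, seed=None, iters=50):
    """Huber M-estimate via iteratively reweighted least squares (IRLS).

    After the weights have settled, Gaussian noise with standard deviation
    noise_sd (an illustrative choice) perturbs the final estimating equation.
    """
    rng = random.Random(seed)
    ws = [1.0] * len(xs)
    b0, b1 = solve_weighted(xs, ys, ws)  # ordinary least squares start
    for _ in range(iters):
        res = [y - b0 - b1 * x for x, y in zip(xs, ys)]
        # Robust scale: median absolute residual, rescaled for normal errors.
        scale = statistics.median(abs(r) for r in res) / 0.6745 or 1.0
        ws = [huber_weight(r / scale) for r in res]
        b0, b1 = solve_weighted(xs, ys, ws)
    e0 = rng.gauss(0.0, noise_sd)
    e1 = rng.gauss(0.0, noise_sd)
    return solve_weighted(xs, ys, ws, e0, e1)


if __name__ == "__main__":
    # Toy data: y is roughly 2 + 3x, with one gross outlier at x = 9.
    xs = [float(i) for i in range(10)]
    ys = [2.0 + 3.0 * x for x in xs[:9]] + [100.0]
    print("least squares:", solve_weighted(xs, ys, [1.0] * len(xs)))
    print("robust, unperturbed:", robust_fit(xs, ys))
    print("robust, perturbed:", robust_fit(xs, ys, noise_sd=1.0, seed=1))
```

In a remote analysis setting, only the returned coefficients would leave the secure environment. Fixing `seed` per query is one possible way to give the same answer to a repeated query; that choice is likewise an assumption of this sketch.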

Acknowledgments

This work was done partly while Christine O’Keefe was on secondment to the Australian Bureau of Statistics, and partly while Soonmin Kwon and Soomin Song were Industrial Trainees in CSIRO. The authors thank Mark Westcott for reviewing an earlier draft and prompting a correction. The authors also thank the anonymous referee for comments and questions that have led to an improved exposition and paper. Views expressed in this paper are those of the authors and do not necessarily represent those of the Australian Bureau of Statistics. Where quoted or used, they should be attributed clearly to the authors.

Author information

Corresponding author

Correspondence to Christine M. O’Keefe.

About this article

Cite this article

O’Keefe, C.M., Ayre, T., Lucie, S. et al. Perturbed robust linear estimating equations for confidentiality protection in remote analysis. Stat Comput 27, 775–787 (2017). https://doi.org/10.1007/s11222-016-9653-2

  • DOI: https://doi.org/10.1007/s11222-016-9653-2
