Information preserving regression-based tools for statistical disclosure control

Published in Statistics and Computing

Abstract

This paper presents a unified framework for regression-based statistical disclosure control for microdata. A basic method, known as information preserving statistical obfuscation (IPSO), produces synthetic data that preserve variances, covariances and fitted values. The data are then generated conditionally according to the multivariate normal distribution. Generalizations of the IPSO method are described in the literature, and these methods aim to generate data more similar to the original data. This paper describes these methods in a concise and interpretable way that is close to an efficient implementation. Decomposing the residual data into orthogonal scores and corresponding loadings is an essential part of the framework. Both QR decomposition (Gram–Schmidt orthogonalization) and singular value decomposition (principal components) may be used. Within this framework, new and generalized methods are presented. In particular, a method is described by means of which the correlations to the original principal component scores can be controlled exactly. It is shown that a suggested method of random orthogonal matrix masking can be implemented without generating an orthogonal matrix. Generalized methodology for hierarchical categories is presented within the context of microaggregation. Some information can then be preserved at the lowest level and more information at higher levels. The presented methodology is also applicable to tabular data. One possibility is to replace the content of primary and secondary suppressed cells with generated values. It is proposed to replace suppressed cell frequencies with decimal numbers, and it is argued that this can be a useful method.
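
As an illustration of the basic idea, the following is a minimal sketch of one way to generate IPSO-type synthetic data with QR decompositions. It is not taken from the paper: the function name `ipso_sketch`, the variable names and the simulated example data are assumptions, and the sketch presumes that the regressor matrix has full column rank and that there are enough observations for the residual space.

```python
import numpy as np

def ipso_sketch(X, Y, rng):
    """Return synthetic Y with the same fitted values and residual cross-products as Y.

    Assumes X (n x p, including an intercept column) has full column rank,
    Y is n x k, and n >= p + k so that enough residual dimensions exist.
    """
    Qx, _ = np.linalg.qr(X)            # orthonormal basis for the column space of X
    Y_hat = Qx @ (Qx.T @ Y)            # fitted values from regressing Y on X
    E = Y - Y_hat                      # original residuals
    _, R_E = np.linalg.qr(E)           # residual loadings: E = Q_E R_E

    Z = rng.normal(size=Y.shape)       # simulated normal data
    Z_res = Z - Qx @ (Qx.T @ Z)        # simulated residuals, orthogonal to X
    Q_Z, _ = np.linalg.qr(Z_res)       # new orthonormal scores

    return Y_hat + Q_Z @ R_E           # synthetic data

# Hypothetical example data: intercept plus two regressors, two response variables.
rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ rng.normal(size=(3, 2)) + rng.normal(size=(n, 2))
Y_syn = ipso_sketch(X, Y, rng)
```

In this sketch the new scores replace the original ones while the loadings are kept, so `Y_syn` differs from `Y` but has the same regression coefficients on `X` and the same covariance matrix.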

References

  • Benedetto, G., Stinson, M.H., Abowd, J.M.: The Creation and Use of the SIPP Synthetic Beta. Technical Report, United States Census Bureau (2013)

  • Burridge, J.: Information preserving statistical obfuscation. Stat. Comput. 13(4), 321–327 (2003). https://doi.org/10.1023/A:1025658621216

  • Calvino, A.: A simple method for limiting disclosure in continuous microdata based on principal component analysis. J. Off. Stat. 33(1), 15–41 (2017). https://doi.org/10.1515/JOS-2017-0002

  • Chan, T.F.: Rank revealing QR factorizations. Linear Algebra Appl. 88–89, 67–82 (1987). https://doi.org/10.1016/0024-3795(87)90103-0

  • de Wolf, P.P., Giessing, S.: Adjusting the tau-ARGUS modular approach to deal with linked tables. Data Knowl. Eng. 68(11), 1160–1174 (2009). https://doi.org/10.1016/j.datak.2009.06.005

  • Demmel, J., Gu, M., Eisenstat, S., Slapnicar, I., Veselic, K., Drmac, Z.: Computing the singular value decomposition with high relative accuracy. Linear Algebra Appl. 299(1–3), 21–80 (1999). https://doi.org/10.1016/S0024-3795(99)00134-2

  • Domingo-Ferrer, J., Gonzalez-Nicolas, U.: Hybrid microdata using microaggregation. Inf. Sci. 180(15), 2834–2844 (2010). https://doi.org/10.1016/j.ins.2010.04.005

  • Domingo-Ferrer, J., Mateo-Sanz, J.M.: Practical data-oriented microaggregation for statistical disclosure control. IEEE Trans. Knowl. Data Eng. 14(1), 189–201 (2002)

  • Drechsler, J.: Synthetic Datasets for Statistical Disclosure Control. Springer, New York (2011)

  • Duncan, G.T., Pearson, R.W.: Enhancing access to microdata while protecting confidentiality: prospects for the future. Stat. Sci. 6(3), 219–239 (1991)

  • Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Nordholt, E.S., Spicer, K., de Wolf, P.P.: Statistical Disclosure Control. Wiley, Hoboken (2012). https://doi.org/10.1002/9781118348239.ch1

  • Hundepool, A., de Wolf, P.P., Bakker, J., Reedijk, A., Franconi, L., Polettini, S., Capobianchi, A., Domingo, J.: mu-ARGUS User’s Manual, Version 5.1. Technical Report, Statistics Netherlands (2014)

  • Jarmin, R.S., Louis, T.A., Miranda, J.: Expanding the role of synthetic data at the U.S. Census Bureau. Stat. J. IAOS 30(1–3), 117–121 (2014)

  • Jolliffe, I.: Principal Component Analysis, 2nd edn. Springer, New York (2002)

  • Klein, M.D., Datta, G.S.: Statistical disclosure control via sufficiency under the multiple linear regression model. J. Stat. Theory Pract. 12(1), 100–110 (2018)

  • Langsrud, Ø.: Rotation tests. Stat. Comput. 15(1), 53–60 (2005). https://doi.org/10.1007/s11222-005-4789-5

  • Loong, B., Rubin, D.B.: Multiply-imputed synthetic data: advice to the imputer. J. Off. Stat. 33(4), 1005–1019 (2017). https://doi.org/10.1515/JOS-2017-0047

  • Mateo-Sanz, J., Martinez-Balleste, A., Domingo-Ferrer, J.: Fast generation of accurate synthetic microdata. In: Domingo-Ferrer, J., Torra, V. (eds.) Privacy in Statistical Databases, Proceedings, Conference on Privacy in Statistical DataBases (PSD 2004), Barcelona, Spain, 09–11 June 2004, vol. 3050, pp. 298–306 (2004)

  • Muralidhar, K., Sarathy, R.: Generating sufficiency-based non-synthetic perturbed data. Trans. Data Priv. 1(1), 17–33 (2008)

  • Reiter, J.P., Raghunathan, T.E.: The multiple adaptations of multiple imputation. J. Am. Stat. Assoc. 102(480), 1462–1471 (2007). https://doi.org/10.1198/016214507000000932

  • Salazar-Gonzalez, J.J.: Statistical confidentiality: optimization techniques to protect tables. Comput. Oper. Res. 35(5), 1638–1651 (2008). https://doi.org/10.1016/j.cor.2006.09.007

  • Strang, G.: Linear Algebra and Its Applications, 3rd edn. Harcourt Brace Jovanovich, San Diego (1988)

  • Templ, M., Meindl, B.: Robustification of microdata masking methods and the comparison with existing methods. In: Domingo-Ferrer, J., Saygın, Y. (eds.) Privacy in Statistical Databases, Proceedings, UNESCO Chair in Data Privacy International Conference (PSD 2008), Istanbul, Turkey, 24–26 Sept 2008, pp. 113–126. Springer, Berlin (2008)

  • Templ, M., Kowarik, A., Meindl, B.: Statistical disclosure control for micro-data using the R Package sdcMicro. J. Stat. Softw. 67(4), 1–37 (2015)

  • Ting, D., Fienberg, S.E., Trottini, M.: Random orthogonal matrix masking methodology for microdata release. Int. J. Inf. Comput. Secur. 2(1), 86–105 (2008). https://doi.org/10.1504/IJICS.2008.016823

  • Wedderburn, R.W.M.: Random Rotations and Multivariate Normal Simulation. Research Report, Rothamsted Experimental Station (1975)

Author information

Corresponding author

Correspondence to Øyvind Langsrud.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Generalized QR decomposition

The QR decomposition of an \(n \times m\) matrix \(\varvec{A}\) with rank \(r\) can be written as

$$\begin{aligned} \varvec{A} = \varvec{Q} \varvec{R} \end{aligned}$$
(24)

where \(\varvec{Q}\) is an \(n \times r\) matrix whose columns form an orthonormal basis for the column space of \(\varvec{A}\) and \(\varvec{R}\) is an \(r \times m\) matrix. This decomposition can be viewed as the matrix formulation of the Gram–Schmidt orthogonalization process. Since \(\varvec{Q}^T\!\varvec{Q} = \varvec{I}\), the Cholesky decomposition of \(\varvec{A}^T\!\varvec{A}\) can be read from the QR decomposition of \(\varvec{A}\) as \(\varvec{R}^T\!\varvec{R}\).
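
The relation to the Cholesky decomposition can be checked numerically. The sketch below assumes NumPy and a randomly generated full-rank matrix. Since `np.linalg.qr` does not fix the signs of the diagonal of the triangular factor, the signs are flipped explicitly before comparing with `np.linalg.cholesky`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))        # example matrix, full column rank (r = m = 3)

# Reduced QR: Q is 6 x 3 with orthonormal columns, R is 3 x 3 upper triangular.
Q, R = np.linalg.qr(A, mode="reduced")

# Flip signs so that the diagonal entries of R are positive.
signs = np.sign(np.diag(R))
Q = Q * signs                      # rescale the columns of Q
R = signs[:, None] * R             # rescale the rows of R

# The lower-triangular Cholesky factor of A^T A then equals R^T.
L = np.linalg.cholesky(A.T @ A)
```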

In this paper, in order to allow linearly dependent columns of \(\varvec{A}\) (\(r < m\)), we refer to a generalized variant of QR decomposition. In such cases, a common decomposition (Chan 1987) is

$$\begin{aligned} \varvec{A} \varvec{\tilde{P}} = \varvec{Q} \varvec{\tilde{R}} \end{aligned}$$
(25)

where \(\varvec{\tilde{P}}\) is a permutation matrix that reorders the columns (pivoting) so that \(\varvec{\tilde{R}}\) is upper triangular.

To make the decomposition unique, we require the diagonal entries of \(\varvec{\tilde{R}}\) to be positive. Furthermore, we require \(\varvec{\tilde{P}}\) to keep the order of the columns as close to the original order as possible (minimal pivoting). We now have \(\varvec{A} = \varvec{Q} \varvec{\tilde{R}} \varvec{\tilde{P}}^{T}\) and in the generalized QR decomposition (24) we use

$$\begin{aligned} \varvec{R} = \varvec{\tilde{R}} \varvec{\tilde{P}}^{T} \end{aligned}$$
(26)
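
One way to obtain such a decomposition is Gram–Schmidt orthogonalization in the original column order, where a column that is linearly dependent on the preceding ones is skipped. The following is only a sketch under that interpretation of minimal pivoting: the function name `generalized_qr` and the tolerance `tol` are assumptions, and a production implementation would rather rely on a pivoted QR routine.

```python
import numpy as np

def generalized_qr(A, tol=1e-10):
    """Return Q (n x r) and R (r x m) with A = Q R, skipping dependent columns."""
    A = np.asarray(A, dtype=float)
    n, m = A.shape
    q_cols = []
    for j in range(m):
        v = A[:, j].copy()
        for _ in range(2):                 # two Gram-Schmidt passes for stability
            for q in q_cols:
                v -= (q @ v) * q
        norm = np.linalg.norm(v)
        # Keep the column only if something is left after projection.
        if norm > tol * max(1.0, np.linalg.norm(A[:, j])):
            q_cols.append(v / norm)
    Q = np.column_stack(q_cols) if q_cols else np.zeros((n, 0))
    R = Q.T @ A                            # corresponds to R-tilde times P-tilde^T
    return Q, R

# Example: the third column is the sum of the first two, so the rank is 3.
A = np.array([[1.0, 0.0, 1.0, 2.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 2.0, 1.0],
              [0.0, 0.0, 0.0, 1.0]])
Q, R = generalized_qr(A)
R_pivot = R[:, [0, 1, 3]]                  # columns that entered the basis
```

In this example `R_pivot` is upper triangular with a positive diagonal, while the column of `R` belonging to the dependent third column of `A` simply holds its coordinates in the basis `Q`.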

The QR decomposition of a composite matrix can be written as

$$\begin{aligned} \left[ \varvec{A}_{1}\; \varvec{A}_{2} \right] = \left[ \varvec{Q}_{1}\; \varvec{Q}_{2} \right] \left[ \varvec{R}_{1}^{T}\; \varvec{R}_{2}^{T} \right] ^{T} \end{aligned}$$
(27)

Now \(\varvec{Q}_{1}\) can be computed by QR decomposition of \(\varvec{A}_{1}\). The matrix \(\varvec{Q}_{2}\) can be computed by QR decomposition of \(\varvec{A}_{2} - \varvec{Q}_{1}\varvec{Q}_{1}^{T}\varvec{A}_{2}\), which is the residual part after regressing \(\varvec{A}_{2}\) onto \(\varvec{A}_{1}\).
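
The two-step computation can be sketched as follows, assuming NumPy and randomly generated full-rank blocks (the sizes are arbitrary). Signs are not normalized here, so the triangular parts may have negative diagonal entries.

```python
import numpy as np

rng = np.random.default_rng(1)
A1 = rng.normal(size=(8, 2))
A2 = rng.normal(size=(8, 3))
A = np.hstack([A1, A2])

Q1, _ = np.linalg.qr(A1)             # orthonormal basis for the column space of A1
resid = A2 - Q1 @ (Q1.T @ A2)        # residual after regressing A2 onto A1
Q2, _ = np.linalg.qr(resid)          # orthonormal basis for the residual part

R1 = Q1.T @ A                        # rows of R belonging to Q1
R2 = Q2.T @ A                        # rows of R belonging to Q2
Q = np.hstack([Q1, Q2])
R = np.vstack([R1, R2])
```

Because \(\varvec{Q}_{2}\) is orthogonal to \(\varvec{A}_{1}\), the first two columns of `R2` are zero up to rounding error.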

Appendix 2: The singular value decomposition

The singular value decomposition (SVD) of an \(n \times m\) matrix \(\varvec{A}\) with rank \(r\) can be written as

$$\begin{aligned} \varvec{A} = \varvec{U} \varvec{\varLambda } \varvec{V}^{T} \end{aligned}$$
(28)

where \(\varvec{\varLambda }\) is an \(r \times r\) diagonal matrix of strictly positive singular values in descending order. This is the rank-revealing version of the decomposition (Demmel et al. 1999). Other variants of SVD allow some singular values to be zero, but the corresponding components can be omitted. The columns of \(\varvec{U}\) form an orthonormal basis for the column space of \(\varvec{A}\) and the columns of \(\varvec{V}\) form an orthonormal basis for the row space.

The singular values are the square roots of the nonzero eigenvalues of \(\varvec{A}^T\!\varvec{A}\) and \(\varvec{A}\varvec{A}^T\). The eigendecompositions of these two symmetric matrices can be read directly from the SVD of \(\varvec{A}\) as \(\varvec{V} \varvec{\varLambda }^2 \varvec{V}^{T}\) and \(\varvec{U} \varvec{\varLambda }^2 \varvec{U}^{T}\). It is also worth mentioning that an alternative to the ordinary Cholesky decomposition, \(\varvec{A}^T\!\varvec{A}=\varvec{R}^T\!\varvec{R}\), is to let \(\varvec{\varLambda } \varvec{V}^{T}\) play the role of \(\varvec{R}\).
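
These relations can be verified with a small numerical sketch, assuming NumPy. The rank-deficient example matrix and the relative cut-off `1e-10` for treating a singular value as zero are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(7, 3))
A = np.column_stack([B, B[:, 0] + B[:, 1]])      # 7 x 4 matrix of rank 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-revealing form: drop components with (numerically) zero singular values.
r = int(np.sum(s > 1e-10 * s[0]))
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

Lam2 = np.diag(s**2)                             # squared singular values
AtA_from_svd = Vt.T @ Lam2 @ Vt                  # eigendecomposition of A^T A
AAt_from_svd = U @ Lam2 @ U.T                    # eigendecomposition of A A^T

R_alt = np.diag(s) @ Vt                          # Lambda V^T in the role of R
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]      # eigenvalues in descending order
```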

To make the SVD unique, we can require all column sums of \(\varvec{V}\) to be positive. In cases with equal singular values, the decomposition is not unique regardless.
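
The sign convention can be imposed after the decomposition has been computed. In the sketch below, which assumes NumPy, the helper name `normalize_svd_signs` is hypothetical. Flipping the sign of a column of \(\varvec{U}\) together with the corresponding column of \(\varvec{V}\) leaves the product unchanged.

```python
import numpy as np

def normalize_svd_signs(U, s, Vt):
    """Flip component signs so that every column sum of V is positive."""
    signs = np.sign(Vt.sum(axis=1))      # column sums of V are row sums of V^T
    signs[signs == 0] = 1.0              # leave a component alone if its sum is zero
    return U * signs, s, signs[:, None] * Vt

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U1, s1, Vt1 = normalize_svd_signs(U, s, Vt)

# An SVD with one component flipped is normalized to the same result.
flip = np.array([1.0, -1.0, 1.0])
U2, s2, Vt2 = normalize_svd_signs(U * flip, s, flip[:, None] * Vt)
```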

There is a close relationship between SVD and PCA. In PCA, the variables are usually centered to zero means and in many cases standardized to equal variances prior to decomposition. If \(\varvec{A}\) is such a centered/standardized matrix, then \(\varvec{U} \varvec{\varLambda }\) is the matrix of PCA scores and \(\varvec{V}\) is the matrix of PCA loadings.
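
A short sketch of this relationship, assuming NumPy and simulated data (the mixing matrix is arbitrary and only serves to create correlated variables):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(n, 3)) @ mixing           # correlated example variables

# Center to zero means and standardize to unit variances.
A = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
scores = U * s                                 # U Lambda: PCA scores
loadings = Vt.T                                # V: PCA loadings

# Variances of the scores, which are the eigenvalues of the correlation matrix.
explained = s**2 / (n - 1)
corr_eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
```

The scores are mutually orthogonal, and projecting the standardized data onto the loadings, `A @ loadings`, reproduces them.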

The Moore–Penrose generalized inverse of \(\varvec{A}\) can be written as

$$\begin{aligned} \varvec{A}^{\dagger } = \varvec{V} \varvec{\varLambda }^{-1} \varvec{U}^{T} \end{aligned}$$
(29)

We have

$$\begin{aligned} \varvec{A}^{\dagger } = (\varvec{A}^{T}\varvec{A})^{\dagger }\varvec{A}^{T} = \varvec{A}^{T} (\varvec{A}\varvec{A}^{T})^{\dagger } \end{aligned}$$
(30)

When \(\varvec{A}\) is invertible, \(\varvec{A}^{\dagger }= \varvec{A}^{-1}\). When \(\varvec{A}^{T}\varvec{A}\) or \(\varvec{A}\varvec{A}^{T}\) is invertible, this means, respectively, that \(\varvec{A}^{\dagger } = (\varvec{A}^{T}\varvec{A})^{-1}\varvec{A}^{T}\) or \(\varvec{A}^{\dagger } = \varvec{A}^{T} (\varvec{A}\varvec{A}^{T})^{-1}\).
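
Equations (29) and (30) can be checked numerically as sketched below, assuming NumPy. The example matrices and the relative cut-off `cutoff` used to decide which singular values count as zero are assumptions. The same cut-off is passed to `np.linalg.pinv` so that all computations agree on the rank.

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.normal(size=(6, 3))                      # full column rank
A = np.column_stack([B, B[:, 0] - B[:, 2]])      # 6 x 4 matrix of rank 3

cutoff = 1e-10                                   # assumed relative tolerance
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > cutoff * s[0]))
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T           # equation (29)

# The two expressions in equation (30).
left = np.linalg.pinv(A.T @ A, rcond=cutoff) @ A.T
right = A.T @ np.linalg.pinv(A @ A.T, rcond=cutoff)

# Special case: B^T B is invertible because B has full column rank.
B_pinv = np.linalg.inv(B.T @ B) @ B.T
```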

About this article

Cite this article

Langsrud, Ø. Information preserving regression-based tools for statistical disclosure control. Stat Comput 29, 965–976 (2019). https://doi.org/10.1007/s11222-018-9848-9

  • DOI: https://doi.org/10.1007/s11222-018-9848-9
