Grouping of Variables to Facilitate SDL Methods in Multivariate Data Sets

  • Anna Oganian
  • Ionut Iacob
  • Goran Lesaja
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11126)


Data sets that are subject to Statistical Disclosure Limitation (SDL) often have many variables of different types that need to be altered for disclosure limitation. To produce a good quality public data set, the data protector needs to account for the relationships between the variables. Ideally, therefore, SDL methods should be multivariate, handling many variables at the same time, rather than univariate, treating each variable independently of the others. However, if a data set has many variables, as most government survey data do, the task of developing and implementing a multivariate approach for SDL becomes difficult. In this paper we propose a pre-masking data processing procedure that consists of clustering the variables of high-dimensional data sets, so that different groups of variables can be masked independently, thus reducing the complexity of SDL. We consider different hierarchical clustering methods, including our own version of a hierarchical clustering algorithm, which we call K-Link, and outline how the data protector can define an appropriate number of clusters for these methods. We implemented these methods and applied them to two genuine multivariate data sets. The results of the experiments show that K-Link has the potential to solve this problem efficiently. The success of the method, however, depends on the correlation structure of the data. For data sets where most of the variables are correlated, clustering the variables and then applying SDL methods independently to different clusters may lead to attenuated correlation in the masked data, even for efficient clustering methods. Thus, the proposed approach is a trade-off between the computational complexity of multivariate SDL methods and the data utility loss due to independent treatment of different clusters by SDL methods.
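The general idea of grouping variables before masking can be illustrated with a minimal sketch. It does not reproduce K-Link, whose details are not given in this abstract; it assumes standard average-linkage hierarchical clustering from SciPy, with 1 - |r| (one minus the absolute Pearson correlation) as the dissimilarity between variables. The function name `cluster_variables`, the choice of dissimilarity, and the synthetic data are illustrative assumptions, not the authors' method.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_variables(data, n_clusters, method="average"):
    """Group the columns of `data` into at most `n_clusters` clusters.

    Assumption: dissimilarity between two variables is 1 - |Pearson r|,
    so strongly (positively or negatively) correlated variables are close.
    """
    corr = np.corrcoef(data, rowvar=False)      # variables are columns
    dist = np.clip(1.0 - np.abs(corr), 0.0, None)
    dist = (dist + dist.T) / 2.0                # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)  # condensed distance vector
    tree = linkage(condensed, method=method)    # agglomerative clustering
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Synthetic example (hypothetical data): two independent latent factors,
# each driving its own block of variables.
rng = np.random.default_rng(0)
n = 500
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
data = np.column_stack([
    f1 + 0.3 * rng.normal(size=n),
    f1 + 0.3 * rng.normal(size=n),
    -f1 + 0.3 * rng.normal(size=n),   # anticorrelated with the first block
    f2 + 0.3 * rng.normal(size=n),
    f2 + 0.3 * rng.normal(size=n),
])

labels = cluster_variables(data, n_clusters=2)
print(labels)
```

Under these assumptions, each resulting group of columns could be passed to a multivariate SDL method separately. Correlations between variables that land in different groups are not modeled by that step, which is the source of the attenuation the abstract describes.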


Statistical Disclosure Limitation (SDL), Hierarchical clustering, Dimensionality reduction



The authors would like to thank Van Parsons from the National Center for Health Statistics (US) for providing a cleansed version of the NHIS public use sample file for our experiments and for his valuable suggestions. The authors would also like to express their appreciation to Donald Malec, also from the National Center for Health Statistics, for his careful reading of the paper and many useful suggestions. The findings and conclusions in this paper are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.



Copyright information

© This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply 2018

Authors and Affiliations

  1. National Center for Health Statistics, Hyattsville, USA
  2. Department of Mathematical Sciences, Georgia Southern University, Statesboro, USA
