Privacy Preserving Feature Selection via Voted Wrapper Method for Horizontally Distributed Medical Data

  • Yunmei Lu
  • Yanqing Zhang
Part of the Emerging Topics in Statistics and Biostatistics book series (ETSB)


Feature selection is a crucial step in data mining, mitigating the curse of dimensionality by eliminating irrelevant features. Most existing feature selection approaches were developed for centralized data stored at a single location. In recent years, multi-source biomedical data mining methods have been developed to analyze databases distributed across different locations, such as different hospitals. A major concern in this setting, however, is the privacy of the sensitive personal medical records held by each hospital. As the need for privacy preserving distributed data mining grows, new privacy preserving feature selection algorithms for biomedical data mining are therefore required. In this paper, a privacy preserving feature selection method named "Privacy Preserving Feature Selection algorithm via Voted Wrapper methods (PPFSVW)" is developed. The method was tested on six benchmark datasets under two testing scenarios. Our experimental results indicate that the proposed workflow effectively improves classification accuracy by selecting informative features and genes. Moreover, the proposed method allows the classifier to reach the same or higher classification accuracy with fewer features than sophisticated methods such as SVM-RFE, RSVM and SVM-t. More importantly, individual private information is protected throughout the feature selection process.
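To make the "voted wrapper" idea concrete, the sketch below illustrates one plausible aggregation step: each site (e.g., hospital) runs a wrapper-style feature selection locally and shares only its feature ranking, and a coordinator combines the rankings by a vote to pick the final feature subset. This is a minimal illustration, not the authors' PPFSVW protocol: the Borda-count scoring, the `vote_features` helper, and the toy rankings are all assumptions made for the example, and the privacy-preserving SVM training (PAN-SVM) described in the paper is not modeled here.

```python
from collections import defaultdict

def vote_features(site_rankings, k):
    """Aggregate per-site feature rankings by a Borda-style vote and keep top-k.

    site_rankings: list of lists; each inner list orders feature indices from
    most to least informative at one site. Only these rankings (never the raw
    patient records) leave each site.
    """
    scores = defaultdict(int)
    for ranking in site_rankings:
        n = len(ranking)
        for pos, feat in enumerate(ranking):
            scores[feat] += n - pos  # higher-ranked feature earns more points
    # Sort by total points (descending); break ties by index for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]

# Three hypothetical hospitals, each ranking the same four features locally.
votes = [[2, 0, 1, 3], [2, 1, 0, 3], [0, 2, 3, 1]]
print(vote_features(votes, 2))  # -> [2, 0]
```

Because only rank positions cross site boundaries, the aggregation itself reveals nothing about individual records; protecting the local wrapper runs is the harder part, which the paper addresses with its privacy-aware SVM machinery.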


Keywords: Privacy preserving · Horizontally distributed data mining · Support vector machine (SVM) · Feature selection · PAN-SVM



Acknowledgments

This work is part of the Ph.D. dissertation of Yunmei Lu, who would like to express her great gratitude to all of her committee members, Prof. Yanqing Zhang, Prof. Yi Pan, Prof. Rajshekhar Sunderraman and Prof. Yichuan Zhao; this work would not have been possible without their guidance and support. The authors also thank the reviewers of this paper for their constructive comments and suggestions. Yunmei Lu is grateful for the continued financial support from the Department of Computer Science and the Molecular Basis of Disease (MBD) fellowship at GSU.


References

  1. Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. ACM SIGMOD Record, 29, 439–450.
  2. Bayardo, R. J., & Agrawal, R. (2005). Data privacy through optimal k-anonymization. In Proceedings of the 21st International Conference on Data Engineering (ICDE 2005) (pp. 217–228). Piscataway, NJ: IEEE.
  3. Machanavajjhala, A., Kifer, D., Gehrke, J., & Venkitasubramaniam, M. (2007). l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, 1, 3.
  4. Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
  5. Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422.
  6. Díaz-Uriarte, R., & Alvarez de Andrés, S. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7, 3.
  7. Zhang, X., Lu, X., Shi, Q., Xu, X.-Q., Hon-chiu, E. L., Harris, L. N., et al. (2006). Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data. BMC Bioinformatics, 7, 197.
  8. Sharma, A., Imoto, S., & Miyano, S. (2012). A top-r feature selection algorithm for microarray gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9, 754–764.
  9. Tsai, C.-A., Huang, C.-H., Chang, C.-W., & Chen, C.-H. (2012). Recursive feature selection with significant variables of support vectors. Computational and Mathematical Methods in Medicine, 2012, 12.
  10. Miranda, J., Montoya, R., & Weber, R. (2005). Linear penalization support vector machines for feature selection. In S. K. Pal, S. Bandyopadhyay, & S. Biswas (Eds.), Pattern Recognition and Machine Intelligence: First International Conference, PReMI 2005, Kolkata, India, December 20–22, 2005 (pp. 188–192). Berlin: Springer.
  11. Bradley, P. S., & Mangasarian, O. L. (1998). Feature selection via concave minimization and support vector machines. In Proceedings of the Fifteenth International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann.
  12. Kholod, I., Kuprianov, M., & Petukhov, I. (2016). Distributed data mining based on actors for Internet of Things. In 2016 5th Mediterranean Conference on Embedded Computing (MECO) (pp. 480–484). Piscataway, NJ: IEEE.
  13. Bendechache, M., & Kechadi, M. T. (2015). Distributed clustering algorithm for spatial data mining. In 2015 2nd IEEE International Conference on Spatial Data Mining and Geographical Knowledge Services (ICSDM) (pp. 60–65). Piscataway, NJ: IEEE.
  14. Parmar, K., Vaghela, D., & Sharma, P. (2015). Performance prediction of students using distributed data mining. In 2015 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS) (pp. 1–5). Piscataway, NJ: IEEE.
  15. Lu, Y., & Zhang, Y. (2017). Privacy preserving feature selection on horizontally distributed datasets. In 2017 5th International Conference on Bioinformatics and Computational Biology (ICBCB 2017). Hong Kong, China: ACM.
  16. Lu, Y., Phoungphol, P., & Zhang, Y. (2014). Privacy aware non-linear support vector machine for multi-source big data. In 2014 IEEE 13th International Conference on Trust, Security and Privacy in Computing and Communications (pp. 783–789). Piscataway, NJ: IEEE.
  17. Gavison, R. (1984). Privacy and the limits of law. In Philosophical dimensions of privacy. New York: Cambridge University Press.
  18. Pinkas, B. (2002). Cryptographic techniques for privacy-preserving data mining. ACM SIGKDD Explorations Newsletter, 4, 12–19.
  19. Yao, A. C.-C. (1986). How to generate and exchange secrets. In 27th Annual Symposium on Foundations of Computer Science (pp. 162–167). Piscataway, NJ: IEEE.
  20. Goldreich, O. (2004). Foundations of cryptography: Volume 2, basic applications. New York: Cambridge University Press.
  21. Paillier, P. (1999). Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of the 17th International Conference on Theory and Application of Cryptographic Techniques. Prague, Czech Republic: Springer.
  22. Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., & Zhu, M. Y. (2002). Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations Newsletter, 4, 28–34.
  23. Drineas, P., & Mahoney, M. W. (2005). On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6, 2153–2175.
  24. Zhang, K., Tsang, I. W., & Kwok, J. T. (2008). Improved Nyström low rank approximation and error analysis. In Proceedings of the 25th International Conference on Machine Learning. Helsinki, Finland: ACM.
  25. Kumar, S., Mohri, M., & Talwalkar, A. (2012). Sampling methods for the Nyström method. Journal of Machine Learning Research, 13, 981–1006.
  26. Harbrecht, H., Peters, M., & Schneider, R. (2012). On the low-rank approximation by the pivoted Cholesky decomposition. Applied Numerical Mathematics, 62, 428–440.
  27. Zhang, K., Lan, L., Wang, Z., & Moerchen, F. (2012). Scaling up kernel SVM on limited resources: A low-rank linearization approach. International Conference on Artificial Intelligence and Statistics (AISTATS), 22, 1425–1434.
  28. Franc, V., & Sonnenburg, S. (2009). Optimized cutting plane algorithm for large-scale risk minimization. Journal of Machine Learning Research, 10, 2157–2192.
  29. LIBSVM. (2016). LIBSVM data. Retrieved from
  30. Bache, K., & Lichman, M. (2013). UCI machine learning repository. Retrieved from
  31. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
  32. Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., et al. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America, 96, 6745–6750.
  33. Zhu, Z., Ong, Y. S., & Dash, M. (2007). Markov blanket-embedded genetic algorithm for gene selection. Pattern Recognition, 49, 3236–3248.
  34. Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.
  35. Ambroise, C., & McLachlan, G. J. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences of the United States of America, 99, 6562–6566.
  36. Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M., & Yakhini, Z. (2000, April). Tissue classification with gene expression profiles. Journal of Computational Biology, 7, 559–583.
  37. Ben-Dor, L. B. A., Friedman, N., Nachman, I., Schummer, M., & Yakhini, Z. (2007). Journal of Computational Biology, 7, 559–583.
  38. Furlanello, C., Serafini, M., Merler, S., & Jurman, G. (2003). Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics, 4, 54.
  39. Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 1–27.
  40. Maldonado, S., Weber, R., & Basak, J. (2011). Simultaneous feature selection and classification using kernel-penalized support vector machines. Information Sciences, 181, 115–128.

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Department of Computer Science, Georgia State University, Atlanta, USA
