Privacy Preserving Feature Selection via Voted Wrapper Method for Horizontally Distributed Medical Data

Chapter in Statistical Modeling in Biomedical Research

Part of the book series: Emerging Topics in Statistics and Biostatistics (ETSB)

Abstract

Feature selection is a crucial step in data mining because it mitigates the curse of dimensionality. Many feature selection approaches have been developed for centralized data stored at a single location. In recent years, multi-source biomedical data mining methods have been developed to analyze databases distributed across different locations, such as different hospitals. However, a major concern is the privacy of the sensitive personal medical records held by each hospital. As the need for privacy preserving distributed data mining algorithms grows, it becomes necessary to develop privacy preserving feature selection algorithms for biomedical data mining. In this paper, a privacy preserving feature selection method named “Privacy Preserving Feature Selection algorithm via Voted Wrapper methods (PPFSVW)” is developed. The method was tested on six benchmark datasets under two testing scenarios. Our experimental results indicate that the proposed workflow can effectively improve classification accuracy by selecting informative features and genes. In addition, the proposed method lets the classifier achieve the same or higher classification accuracy with fewer features than established methods such as SVM-RFE, RSVM and SVM-t. More importantly, individual private information can be protected throughout the entire feature selection process.
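
The abstract does not spell out the voting mechanism, so the following is only a minimal sketch of the general idea of voting over locally selected features on horizontally partitioned data. It is not the PPFSVW algorithm itself. Each hypothetical site scores features on its own rows (a simple class-mean difference stands in for a wrapper evaluation), and only the list of top-ranked feature indices leaves the site. The function names, the scoring rule, the toy data, and the vote threshold are all assumptions made for illustration. Sharing only indices avoids exchanging raw records, but it is not by itself a formal privacy guarantee.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], int]  # (feature vector, binary label 0/1)


def local_top_features(rows: List[Row], k: int) -> List[int]:
    """Rank features on one site's rows and return the indices of the top k.

    Assumption: the score is the absolute difference of per-class feature
    means, a stand-in for whatever wrapper evaluation a real site would run.
    """
    n_features = len(rows[0][0])
    scores = []
    for j in range(n_features):
        pos = [x[j] for x, y in rows if y == 1]
        neg = [x[j] for x, y in rows if y == 0]
        # A site holding only one class cannot compare class means.
        if not pos or not neg:
            scores.append(0.0)
            continue
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    # Highest score first; ties broken by the lower feature index.
    ranked = sorted(range(n_features), key=lambda j: (-scores[j], j))
    return ranked[:k]


def vote(site_selections: List[List[int]], min_votes: int) -> List[int]:
    """Keep the features selected by at least min_votes sites."""
    tally = Counter(j for selection in site_selections for j in set(selection))
    return sorted(j for j, count in tally.items() if count >= min_votes)


# Hypothetical toy data: three sites, the same four features, different patients.
site_a = [((5.0, 1.0, 0.0, 3.0), 1), ((1.0, 1.0, 0.0, 1.0), 0)]
site_b = [((4.0, 1.0, 2.0, 2.0), 1), ((1.0, 1.0, 0.0, 2.0), 0)]
site_c = [((6.0, 1.0, 0.0, 4.0), 1), ((2.0, 1.0, 0.0, 1.0), 0)]

# Only these index lists are sent to the coordinator, never the rows.
selections = [local_top_features(s, k=2) for s in (site_a, site_b, site_c)]
selected = vote(selections, min_votes=2)
```

With this toy data, feature 0 is chosen by all three sites and feature 3 by two of them, so both pass the two-vote threshold, while feature 2 receives a single vote and is dropped.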

Acknowledgements

This work is part of the Ph.D. dissertation of Yunmei Lu, who would like to express her gratitude to all of her committee members, Prof. Yanqing Zhang, Prof. Yi Pan, Prof. Rajshekhar Sunderraman and Prof. Yichuan Zhao, for their guidance and support, without which this work would not have been possible. The authors would also like to thank the reviewers of this paper for their constructive comments and suggestions. Yunmei Lu is grateful for the continued financial support from the Department of Computer Science and the Molecular Basis of Disease (MBD) fellowship at GSU.

Corresponding author

Correspondence to Yanqing Zhang.

Copyright information

© 2020 Springer Nature Switzerland AG

Cite this chapter

Lu, Y., Zhang, Y. (2020). Privacy Preserving Feature Selection via Voted Wrapper Method for Horizontally Distributed Medical Data. In: Zhao, Y., Chen, DG. (eds) Statistical Modeling in Biomedical Research. Emerging Topics in Statistics and Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-030-33416-1_8
