Privacy Preserving Feature Selection via Voted Wrapper Method for Horizontally Distributed Medical Data

Lu, Yunmei; Zhang, Yanqing

doi:10.1007/978-3-030-33416-1_8

Part of the book series: Emerging Topics in Statistics and Biostatistics (ETSB)

Abstract

Feature selection is a crucial step in data mining because it mitigates the curse of dimensionality. Most feature selection approaches were developed for centralized data stored at a single location. In recent years, multi-source biomedical data mining methods have been developed to analyze databases distributed across different locations, such as different hospitals. A major concern, however, is the privacy of the sensitive personal medical records held by each hospital. As the need for privacy preserving distributed data mining algorithms grows, new privacy preserving feature selection algorithms for biomedical data mining are needed as well. In this paper, we develop a privacy preserving feature selection method named “Privacy Preserving Feature Selection algorithm via Voted Wrapper methods (PPFSVW)”. The method was tested on six benchmark datasets under two testing scenarios. Our experimental results indicate that the proposed workflow improves classification accuracy by selecting informative features and genes. In addition, with the proposed method the classifier achieves higher or comparable classification accuracy with fewer features than sophisticated methods such as SVM-RFE, RSVM and SVM-t. More importantly, individual private information is protected throughout the feature selection process.
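The abstract does not spell out the PPFSVW protocol, so the following is only a minimal sketch of the general idea of voting over wrapper-selected features on horizontally partitioned data, not the authors' algorithm. It assumes each site runs a greedy forward wrapper search on its own rows and shares nothing but the indices of its chosen features. The nearest-centroid classifier, the leave-one-out scoring, and the names `loo_accuracy`, `local_wrapper_select`, `voted_selection` and `min_votes` are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# One site's private data: (rows, labels). Hypothetical layout for this sketch.
Site = Tuple[Sequence[Sequence[float]], Sequence[int]]


def loo_accuracy(X, y, feats: List[int]) -> float:
    """Leave-one-out accuracy of a nearest-centroid classifier using only `feats`."""
    hits = 0
    for i, row in enumerate(X):
        sums, counts = {}, {}
        for j, (other, label) in enumerate(zip(X, y)):
            if j == i:
                continue  # hold out sample i
            acc = sums.setdefault(label, [0.0] * len(feats))
            for k, f in enumerate(feats):
                acc[k] += other[f]
            counts[label] = counts.get(label, 0) + 1

        def sq_dist(c):
            # Squared distance from the held-out row to the centroid of class c.
            return sum((row[f] - sums[c][k] / counts[c]) ** 2
                       for k, f in enumerate(feats))

        hits += min(sums, key=sq_dist) == y[i]
    return hits / len(X)


def local_wrapper_select(X, y, k: int) -> List[int]:
    """Greedy forward wrapper search, run entirely inside one site."""
    selected: List[int] = []
    remaining = list(range(len(X[0])))
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda f: loo_accuracy(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def voted_selection(sites: Sequence[Site], k: int, min_votes: int) -> List[int]:
    """Keep features chosen by at least `min_votes` sites.

    Assumption: only feature indices leave a site, never raw records.
    """
    votes = Counter()
    for X, y in sites:
        votes.update(local_wrapper_select(X, y, k))
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Toy data: feature 0 separates the classes at sites A and B, feature 2 at site C.
site_a = ([[0.1, 5, 3], [0.2, 1, 4], [0.9, 4, 3], [1.0, 2, 4]], [0, 0, 1, 1])
site_b = ([[0.0, 2, 7], [0.3, 6, 8], [0.8, 3, 7], [1.1, 5, 8]], [0, 0, 1, 1])
site_c = ([[0.5, 1, 0.1], [0.6, 4, 0.2], [0.5, 2, 0.9], [0.6, 3, 1.0]], [0, 0, 1, 1])
consensus = voted_selection([site_a, site_b, site_c], k=1, min_votes=2)
```

Sharing only feature indices is a simplification made for this sketch and is not by itself a formal privacy guarantee; the chapter's actual protection mechanism may differ.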

Acknowledgements

This work is part of the Ph.D. dissertation of Yunmei Lu, who would like to express her gratitude to all of her committee members, Prof. Yanqing Zhang, Prof. Yi Pan, Prof. Rajshekhar Sunderraman and Prof. Yichuan Zhao, for their guidance and support, without which this work would not have been possible. The authors would also like to thank the reviewers of this paper for their constructive comments and suggestions. Yunmei Lu is grateful for the continued financial support from the Department of Computer Science and the Molecular Basis of Disease (MBD) fellowship at GSU.

Author information

Authors and Affiliations

Department of Computer Science, Georgia State University, Atlanta, GA, USA
Yunmei Lu & Yanqing Zhang

Corresponding author

Correspondence to Yanqing Zhang.

Editor information

Editors and Affiliations

Math and Statistics, 1342, Georgia State University, Atlanta, GA, USA
Yichuan Zhao
School of Social Work, University of North Carolina, Chapel Hill, NC, USA
Ding-Geng (Din) Chen

About this chapter

Cite this chapter

Lu, Y., Zhang, Y. (2020). Privacy Preserving Feature Selection via Voted Wrapper Method for Horizontally Distributed Medical Data. In: Zhao, Y., Chen, DG. (eds) Statistical Modeling in Biomedical Research. Emerging Topics in Statistics and Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-030-33416-1_8

DOI: https://doi.org/10.1007/978-3-030-33416-1_8
Published: 20 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33415-4
Online ISBN: 978-3-030-33416-1
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
