Abstract
Feature selection is a crucial step in data mining because it mitigates the curse of dimensionality. Most feature selection approaches were developed for centralized data stored at a single location. In recent years, multi-source biomedical data mining methods have been developed to analyze databases distributed across different locations, such as different hospitals. A major concern in this setting is the privacy of the sensitive personal medical records held by each hospital. As the need for privacy-preserving distributed data mining algorithms grows, so does the need for privacy-preserving feature selection algorithms for biomedical data mining. In this paper, we develop a privacy-preserving feature selection method named “Privacy Preserving Feature Selection algorithm via Voted Wrapper methods (PPFSVW)”. The method was tested on six benchmark datasets under two testing scenarios. The experimental results indicate that the proposed workflow improves classification accuracy by selecting informative features and genes. In addition, the proposed method lets the classifier achieve equal or higher classification accuracy with fewer features than more sophisticated methods such as SVM-RFE, RSVM and SVM-t. More importantly, individual private information can be protected during the whole feature selection process.
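The general idea of voting over locally selected features on horizontally distributed data can be illustrated with a minimal sketch. It is not the PPFSVW algorithm itself, whose details are not given in the abstract. The assumptions are: each site ranks features with a simple filter score (absolute difference of class means) as a stand-in for a wrapper method, each site shares only its set of selected feature indices, and a coordinator keeps the features that reach a vote threshold. The names `local_top_k`, `vote`, `k` and `min_votes` are hypothetical, and sharing index sets alone provides no formal or cryptographic privacy guarantee.

```python
from collections import Counter
from typing import List, Sequence, Set


def local_top_k(X: Sequence[Sequence[float]], y: Sequence[int], k: int) -> Set[int]:
    """Run at one site: score each feature locally and return the top-k indices.

    The score (absolute difference of class means) is a simple stand-in
    for a wrapper-based ranking; it is NOT the chapter's method.
    """
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    if not pos or not neg:
        raise ValueError("each site needs samples from both classes")
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(row[j] for row in pos) / len(pos)
        mean_neg = sum(row[j] for row in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), j))
    # Highest score first; ties are broken by the lower feature index.
    scores.sort(key=lambda t: (-t[0], t[1]))
    return {j for _, j in scores[:k]}


def vote(selections: List[Set[int]], min_votes: int) -> List[int]:
    """Run at the coordinator: keep features chosen by at least min_votes sites."""
    counts = Counter(j for s in selections for j in s)
    return sorted(j for j, c in counts.items() if c >= min_votes)


# Toy data for three hypothetical sites: (samples, binary labels).
site_a = ([[1.0, 0.5, 0.0, 0.2], [1.0, 0.5, 0.0, 0.2],
           [0.0, 0.0, 0.0, 0.2], [0.0, 0.0, 0.0, 0.2]], [1, 1, 0, 0])
site_b = ([[2.0, 0.0, 1.0, 0.3], [2.0, 0.0, 1.0, 0.3],
           [0.0, 0.0, 0.0, 0.3], [0.0, 0.0, 0.0, 0.3]], [1, 1, 0, 0])
site_c = ([[1.5, 1.0, 0.0, 0.1], [0.5, 0.0, 0.0, 0.1]], [1, 0])

# Only these index sets leave the sites; raw records stay local.
local_selections = [local_top_k(X, y, k=2) for X, y in (site_a, site_b, site_c)]
selected = vote(local_selections, min_votes=2)
print(selected)  # [0, 1]
```

In this toy run, feature 0 is selected by all three sites and feature 1 by two of them, so both pass a threshold of two votes, while feature 2 (one vote) is dropped. Raising `min_votes` trades recall for agreement across sites.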
Acknowledgements
This work is part of the Ph.D. dissertation of Yunmei Lu, who would like to express her deep gratitude to her committee members, Prof. Yanqing Zhang, Prof. Yi Pan, Prof. Rajshekhar Sunderraman and Prof. Yichuan Zhao, for their guidance and support, without which this work would not have been possible. The authors would also like to thank the reviewers of this paper for their constructive comments and suggestions. Yunmei Lu is grateful for the continued financial support from the Department of Computer Science and the Molecular Basis of Disease (MBD) fellowship at GSU.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Lu, Y., Zhang, Y. (2020). Privacy Preserving Feature Selection via Voted Wrapper Method for Horizontally Distributed Medical Data. In: Zhao, Y., Chen, DG. (eds) Statistical Modeling in Biomedical Research. Emerging Topics in Statistics and Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-030-33416-1_8
DOI: https://doi.org/10.1007/978-3-030-33416-1_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33415-4
Online ISBN: 978-3-030-33416-1
eBook Packages: Mathematics and Statistics (R0)