Big Data Feature Selection to Achieve Anonymization

  • U. Selvi
  • S. Pushpa
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 637)


In the age of big data, the volume of data is growing rapidly in many fields, and the data that users share is at considerable risk. To preserve individual privacy, anonymization algorithms such as k-anonymity and its variants, as well as differential privacy, have been proposed to ensure that the resulting dataset is free from privacy disclosure. However, most of these anonymization algorithms are applied in isolation, without considering the utility of the data for knowledge tasks, which makes the anonymized dataset less informative. The presence of redundant data also degrades performance and reduces the accuracy of anonymization. Hence, preprocessing before anonymization is required to increase utility and improve accuracy. This paper applies the fast correlation-based filter (FCBF) feature selection method to select the relevant features and remove redundant ones. k-anonymity is then applied to the reduced dataset to achieve data anonymization. Anonymized datasets produced with and without preprocessing are compared on a real-world dataset, and the results are reported.
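
The two-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes small, discrete-valued features held in memory, uses symmetric uncertainty as the correlation measure (as FCBF is usually described), and only checks whether a table satisfies k-anonymity rather than generalizing or suppressing values to enforce it. The function names, the toy columns (`age_band`, `age_copy`, `noise`) and the strict `> threshold` relevance test are hypothetical choices made for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    joint = entropy(list(zip(x, y)))
    mutual_info = hx + hy - joint
    return 2.0 * mutual_info / (hx + hy)


def fcbf(columns, target, threshold=0.0):
    """Select features FCBF-style.

    columns: dict mapping feature name -> list of discrete values.
    target:  list of class labels, same length as each column.
    """
    # Stage 1 (relevance): rank features by SU with the class and
    # discard those at or below the threshold (an assumed cut-off).
    su_class = {name: symmetric_uncertainty(col, target)
                for name, col in columns.items()}
    ranked = sorted((n for n in columns if su_class[n] > threshold),
                    key=lambda n: su_class[n], reverse=True)

    # Stage 2 (redundancy): drop a feature when an already selected
    # feature correlates with it at least as strongly as it
    # correlates with the class.
    selected = []
    for name in ranked:
        redundant = any(
            symmetric_uncertainty(columns[kept], columns[name]) >= su_class[name]
            for kept in selected
        )
        if not redundant:
            selected.append(name)
    return selected


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


# Toy data: one informative feature, an exact duplicate of it, and noise.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "age_band": ["young"] * 4 + ["old"] * 4,
    "age_copy": ["<40"] * 4 + [">=40"] * 4,
    "noise": ["a", "b"] * 4,
}

selected = fcbf(columns, target)
print(selected)  # ['age_band']

# Check k-anonymity on the reduced table, using the selected
# features as quasi-identifiers.
rows = [{name: columns[name][i] for name in selected}
        for i in range(len(target))]
print(is_k_anonymous(rows, selected, k=4))  # True
```

In this toy table the duplicate column is dropped as redundant and the noise column as irrelevant, so the k-anonymity check runs over fewer quasi-identifier combinations. A real anonymizer would still need a generalization or suppression step to reach a chosen k.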


Big data, Anonymization, Data mining, Fast correlation-based feature selection, Data preprocessing, Privacy preservation, MapReduce


Compliance with Ethical Standards

The authors state that there is no conflict of interest. We used our own data. No humans or animals were involved in this research work.



Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  • U. Selvi
    • 1
  • S. Pushpa
    • 1
  1. St. Peter’s Institute of Higher Education and Research, Chennai, India
