A Distributed Multi-source Feature Selection Using Spark

  • Bochra Zaghdoudi
  • Waad Bouaguel
  • Nadia Essoussi
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 921)


Feature selection is one of the key problems in data pre-processing because it has an immediate effect on the data mining algorithm. High-dimensional data sets can often be described by multiple sources, each corresponding to a different knowledge source. Multi-source feature selection is therefore a topic closely tied to large-scale data: learning and selecting features from multiple data sources is becoming more common and is much needed in many real-world applications. In this work, we propose a new multi-source feature selection method based on traditional filters, for the setting in which the data sources contain the same set of instances but different sets of features. The method is implemented using Spark, a parallel framework for large-scale data processing. The experiments conducted confirm the effectiveness of our approach in terms of execution time, while classification accuracy is maintained.
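
The setting described in the abstract (several sources covering the same instances, each with its own features, scored by a filter) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract only says "traditional filters", so the choice of absolute Pearson correlation as the filter, the per-source top-k rule, and all function names (`pearson`, `select_top_k`, `multi_source_select`) are assumptions made for illustration. The sketch also runs sequentially in plain Python; in a Spark implementation the per-source scoring would presumably be distributed across workers.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence is constant (hypothetical convention).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def select_top_k(source, labels, k):
    """Rank one source's features by |correlation with the label|; keep the top k."""
    scores = {name: abs(pearson(col, labels)) for name, col in source.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def multi_source_select(sources, labels, k):
    """Apply the filter to each source independently, then merge the results.

    Assumed merge strategy: the union of the per-source selections.
    """
    selected = []
    for source_name, source in sources.items():
        for feature in select_top_k(source, labels, k):
            selected.append((source_name, feature))
    return selected

# Toy data: two sources describing the same four instances.
labels = [0, 0, 1, 1]
sources = {
    "source_a": {"a1": [1, 2, 3, 4], "a2": [5, 5, 5, 5]},
    "source_b": {"b1": [1, 0, 1, 0], "b2": [0, 0, 1, 1]},
}
result = multi_source_select(sources, labels, k=1)
print(result)
```

Because each source is scored independently of the others, the loop over sources is the natural unit to parallelise; only the labels need to be shared between workers.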


Feature selection, Multi-source features, Distributed, Large data, Spark



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. LARODEC, ISG, University of Tunis, Tunis, Tunisia
  2. College of Business, University of Jeddah, Jeddah, Saudi Arabia
