Generalized feature similarity measure

Abstract

Quantifying the degree of relation between a feature and the target class is one of the key tasks in machine learning. Information gain (IG) and χ² are two of the most widely used measures for feature evaluation. In this paper, we discuss a novel approach to unifying these and other existing feature evaluation measures under a common framework. In particular, we introduce a new generalized family of measures to estimate the similarity between features. We show that the proposed family satisfies all the general criteria for quantifying the relationship between features, and we demonstrate that IG and χ² are special cases of the generalized measure. We also analyze some of the topological and set-theoretic aspects of the family of functions that satisfy the criteria of our generalized measure. Finally, we produce novel feature evaluation measures using our approach and analyze their performance through numerical experiments. We show that a diverse array of measures can be created under our framework, which can be used in applications such as fusion-based feature selection.
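
For readers unfamiliar with the two measures named above, the following is a minimal sketch of how IG and χ² can be computed from a feature-by-class contingency table of counts. It is an illustration only and is not taken from the paper: the function names (`joint_and_marginals`, `information_gain`, `chi_square`) are hypothetical, and the code uses the standard textbook definitions, in which both measures compare the empirical joint distribution with the product of its marginals.

```python
import math


def joint_and_marginals(table):
    """Convert a contingency table of counts into the empirical joint
    distribution and the product of its marginals (independence model)."""
    n = sum(sum(row) for row in table)
    row_p = [sum(row) / n for row in table]        # feature-value marginals
    col_p = [sum(col) / n for col in zip(*table)]  # class marginals
    joint = [[c / n for c in row] for row in table]
    indep = [[r * c for c in col_p] for r in row_p]
    return n, joint, indep


def information_gain(table):
    """IG in bits: KL divergence of the joint from the product of marginals."""
    _, joint, indep = joint_and_marginals(table)
    return sum(
        p * math.log2(p / q)
        for joint_row, indep_row in zip(joint, indep)
        for p, q in zip(joint_row, indep_row)
        if p > 0  # cells with zero probability contribute nothing
    )


def chi_square(table):
    """Pearson chi-square statistic: N times the sum of (p - q)^2 / q."""
    n, joint, indep = joint_and_marginals(table)
    return n * sum(
        (p - q) ** 2 / q
        for joint_row, indep_row in zip(joint, indep)
        for p, q in zip(joint_row, indep_row)
        if q > 0  # skip cells whose row or column is empty
    )


# Rows are feature values, columns are classes (made-up counts).
independent = [[10, 10], [10, 10]]  # feature carries no class information
dependent = [[20, 0], [0, 20]]      # feature determines the class
```

On these two toy tables, both measures are zero when the feature is independent of the class and both are large when the feature determines it, which is the shared behavior that makes it natural to look for a common framework.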

Author information

Correspondence to Firuz Kamalov.

Cite this article

Kamalov, F. Generalized feature similarity measure. Ann Math Artif Intell 88, 987–1002 (2020). https://doi.org/10.1007/s10472-020-09700-8

Keywords

  • Feature evaluation measures
  • Feature selection
  • Information gain
  • χ²
  • Unified framework

Mathematics Subject Classification (2010)

  • 94A17
  • 68T01