
Cost-Sensitive Feature Selection for Class Imbalance Problem

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 655)

Abstract

The class imbalance problem is encountered in real-world applications of machine learning and results in suboptimal performance during data classification. This is especially true when the data are not only imbalanced but also high dimensional. Class imbalance is very often accompanied by high dimensionality of the dataset, and in such cases these problems should be considered together. Traditional feature selection methods usually assign the same weight to samples from different classes when the samples are used to evaluate each feature; therefore, they do not work well enough with imbalanced data. In situations where the costs of misclassifying different classes differ, cost-sensitive learning methods are often applied. These methods are usually used in the classification phase, but we propose to take the cost factors into consideration during feature selection. In this study we analyse whether cost-sensitive feature selection followed by resampling can give good results for the aforementioned problems. To evaluate the tested methods, three imbalanced and multidimensional datasets are considered, and the performance of the chosen feature selection methods and classifiers is analysed.
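The idea of carrying cost information into the feature selection phase can be illustrated with a small sketch. This is an illustration of the general approach, not the paper's exact method: each sample is weighted by its class's misclassification cost, and features are ranked by a cost-weighted correlation with the binary class label. The function names and the weighting scheme are assumptions chosen for clarity.

```python
from collections import Counter

def class_costs(labels):
    """Per-class misclassification cost: majority-class count divided by
    the class's own count, so minority samples weigh proportionally more
    (the costing rule described in the paper's note on cost calculation)."""
    counts = Counter(labels)
    majority = max(counts.values())
    return {c: majority / n for c, n in counts.items()}

def cost_weighted_corr(feature, labels):
    """Score for one feature: absolute weighted Pearson correlation
    between feature values and 0/1 labels, with each sample weighted
    by its class cost (a hypothetical cost-sensitive filter score)."""
    costs = class_costs(labels)
    w = [costs[y] for y in labels]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, feature)) / sw
    my = sum(wi * yi for wi, yi in zip(w, labels)) / sw
    cov = sum(wi * (xi - mx) * (yi - my)
              for wi, xi, yi in zip(w, feature, labels)) / sw
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, feature)) / sw
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, labels)) / sw
    return abs(cov / (vx * vy) ** 0.5)
```

In a cost-sensitive filter built this way, features would be ranked by this score and only the top-ranked ones retained before training the classifier, so that separating the minority class counts for more than separating the majority class.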


Notes

  1. They are known as instance-based filtering methods.

  2. Selection is ‘wrapped’ around the model.

  3. In the Weka data mining tool, the implementation of the C4.5 algorithm is called J48.

  4. In the presented research, the cost for the minority class was calculated as the quotient of the numbers of samples in the majority and minority classes.

  5. In the following paragraphs these settings are denoted as SMOTE 100 and SMOTE 200.

  6. The standard deviation was determined for all of the analysed metrics, but due to volume limitations it is presented only for AUC.

  7. Due to volume limitations we do not present all of the rankings, but only discuss their results in the text.
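The SMOTE 100 and SMOTE 200 settings refer to generating 100% or 200% additional synthetic minority samples. A minimal sketch of the SMOTE idea from Chawla et al. [3] is given below; the function name, parameter names, and fixed random seed are our own choices for illustration, not the exact implementation used in the experiments.

```python
import random

def smote(minority, n_percent=100, k=5, rng=None):
    """SMOTE sketch: for each minority sample, create n_percent/100
    synthetic points by interpolating towards a randomly chosen one of
    its k nearest minority-class neighbours."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    per_sample = n_percent // 100
    synthetic = []
    for x in minority:
        # k nearest minority neighbours by squared Euclidean distance,
        # excluding the sample itself
        neighbours = sorted((y for y in minority if y is not x),
                            key=lambda y: sum((a - b) ** 2
                                              for a, b in zip(x, y)))[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # random point on the segment x -> nb
            synthetic.append(tuple(a + gap * (b - a)
                                   for a, b in zip(x, nb)))
    return synthetic
```

With n_percent=100 the minority class doubles in size; with n_percent=200 it triples, which matches the SMOTE 100 and SMOTE 200 labels used in the notes above.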

References

  1. Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., Bing, G.: Learning from class-imbalanced data. Expert Syst. Appl. 73, 220–239 (2016). doi:10.1016/j.eswa.2016.12.035

  2. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE TKDE 21(9), 1263–1284 (2009)

  3. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: synthetic minority over-sampling technique. AIR J. 16, 321–357 (2002)

  4. Motoda, H., Liu, H.: Feature selection, extraction and construction. Commun. IICM 5, 67–72 (2012)

  5. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)

  6. Kononenko, I.: Estimating attributes: analysis and extension of RELIEF. In: Proceedings of the European Conference on Machine Learning, pp. 171–182 (1994)

  7. Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007)

  8. Hira, Z.M., Gillies, D.F.: A review of feature selection and feature extraction methods applied on microarray data. Adv. Bioinf. 2015, 1–13 (2015). doi:10.1155/2015/198363

  9. Neumann, U., Riemenschneider, M., Sowa, J.P., Baars, T., Kälsch, J., Canbay, A., Heider, D.: Compensation of feature selection biases accompanied with improved predictive performance for binary classification by using a novel ensemble feature selection approach. BioData Min. 9, 36 (2016). doi:10.1186/s13040-016-0114-4

  10. He, Z., Yu, W.: Stable feature selection for biomarker discovery. Comput. Biol. Chem. 34, 215–225 (2010)

  11. Loscalzo, L., Yu, C.D.: Consensus group stable feature selection. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 567–575 (2009)

  12. Ein-Dor, L., Zuk, O., Domany, E.: Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc. Nat. Acad. Sci. U.S.A. 103(15), 5923–5928 (2006)

  13. Yang, P., Liu, W., Zhou, B.B., Chawla, S., Zomaya, A.: Ensemble-based wrapper methods for feature selection and class imbalance learning. In: PAKDD, Advances in Knowledge Discovery and Data Mining. LNCS, vol. 7818, pp. 544–555 (2013)

  14. Werner, A., Bach, M., Pluskiewicz, W.: The study of preprocessing methods’ utility in analysis of multidimensional and highly imbalanced medical data. In: Proceedings of the 11th International Conference Internet in the Information Society 2016, pp. 71–87 (2016). ISBN: 978-83-65621-00-9

  15. Bach, M., Werner, A., Żywiec, J., Pluskiewicz, W.: The study of under- and over-sampling methods’ utility in analysis of highly imbalanced data on osteoporosis. Inf. Sci. 384, 174–190 (2016). doi:10.1016/j.ins.2016.09.038

  16. WEKA download page. http://www.cs.waikato.ac.nz/ml/weka/down-loading.html. Last accessed 10 Apr 2017

  17. The R Project for Statistical Computing. https://www.r-project.org/. Last accessed 10 Apr 2017

  18. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/index.html

  19. Ashari, A., Paryudi, I., et al.: Performance comparison between Naïve Bayes, decision tree and k-nearest neighbor in searching alternative design in an energy simulation tool. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 4, 33–39 (2013). doi:10.14569/IJACSA.2013.041105

  20. John, G.H., Langley, P.: Estimating continuous distributions in Bayesian classifiers. In: Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338–345 (1995)

  21. Aha, D., Kibler, D.: Instance-based learning algorithms. Mach. Learn. 6, 37–66 (1991)

  22. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Francisco (1993). ISBN: 1-55860-238-0

  23. López, V., Fernandez, A., Garcia, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013). doi:10.1016/j.ins.2013.07.007

  24. Kostrzewa, D., Brzeski, R.: Parametric optimization of the selected classifiers in binary classification. In: Advanced Topics in Intelligent Information and Database Systems, pp. 59–69 (2017). doi:10.1007/978-3-319-56660-3_6

  25. Raeder, T., Forman, G., Chawla, N.V.: Learning from imbalanced data: evaluation matters. In: Holmes, D.E., Jain, L.C. (eds.) Data Mining: Foundations & Intelligent Paradigms, ISRL 23, pp. 315–331. Springer-Verlag (2012)


Acknowledgements

Project financed from Polish funds for science in 2017.

Author information

Correspondence to Małgorzata Bach or Aleksandra Werner.


Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Bach, M., Werner, A. (2018). Cost-Sensitive Feature Selection for Class Imbalance Problem. In: Borzemski, L., Świątek, J., Wilimowska, Z. (eds) Information Systems Architecture and Technology: Proceedings of 38th International Conference on Information Systems Architecture and Technology – ISAT 2017. ISAT 2017. Advances in Intelligent Systems and Computing, vol 655. Springer, Cham. https://doi.org/10.1007/978-3-319-67220-5_17


  • DOI: https://doi.org/10.1007/978-3-319-67220-5_17


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67219-9

  • Online ISBN: 978-3-319-67220-5

  • eBook Packages: Engineering, Engineering (R0)
