
Cost-Sensitive Feature Selection for Class Imbalance Problem

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 655)

Abstract

The class imbalance problem is encountered in real-world applications of machine learning and results in suboptimal performance during data classification. This is especially true when the data are not only imbalanced but also high dimensional. Class imbalance is very often accompanied by high dimensionality of the dataset, and in such cases these problems should be considered together. Traditional feature selection methods usually assign the same weight to samples from different classes when the samples are used to evaluate each feature; therefore, they do not work well enough with imbalanced data. In situations where the costs of misclassifying different classes differ, cost-sensitive learning methods are often applied. These methods are usually used in the classification phase, but we propose to take the cost factors into consideration during feature selection. In this study we analyse whether cost-sensitive feature selection followed by resampling can give good results for the aforementioned problems. To evaluate the tested methods, three imbalanced and multidimensional datasets are considered, and the performance of the chosen feature selection methods and classifiers is analysed.
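The idea of carrying cost information into the feature selection phase can be illustrated with a small sketch. This is an illustration of the general approach, not the paper's exact method: each sample is weighted by its class's misclassification cost, and features are ranked by a cost-weighted correlation with the binary class label. The function names and the weighting scheme are assumptions chosen for clarity.

```python
from collections import Counter

def class_costs(labels):
    """Per-class misclassification cost: majority-class count divided by
    the class's own count, so minority samples weigh proportionally more
    (the costing rule described in the paper's note on cost calculation)."""
    counts = Counter(labels)
    majority = max(counts.values())
    return {c: majority / n for c, n in counts.items()}

def cost_weighted_corr(feature, labels):
    """Score for one feature: absolute weighted Pearson correlation
    between feature values and 0/1 labels, with each sample weighted
    by its class cost (a hypothetical cost-sensitive filter score)."""
    costs = class_costs(labels)
    w = [costs[y] for y in labels]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, feature)) / sw
    my = sum(wi * yi for wi, yi in zip(w, labels)) / sw
    cov = sum(wi * (xi - mx) * (yi - my)
              for wi, xi, yi in zip(w, feature, labels)) / sw
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, feature)) / sw
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, labels)) / sw
    return abs(cov / (vx * vy) ** 0.5)
```

In a cost-sensitive filter built this way, features would be ranked by this score and only the top-ranked ones retained before training the classifier, so that separating the minority class counts for more than separating the majority class.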


Notes

  1. They are known as instance-based filtering methods.

  2. Selection is ‘wrapped’ around the model.

  3. In the Weka data mining tool, the implementation of the C4.5 algorithm is called J48.

  4. In the presented research, the cost for the minority class was calculated as the quotient of the numbers of samples in the majority and minority classes.

  5. In the following paragraphs these settings are denoted as SMOTE 100 and SMOTE 200.

  6. The standard deviation was determined for all of the analysed metrics, but due to volume limitations it is presented only for AUC.

  7. Due to volume limitations we do not present all of the rankings, but only discuss their results in the text.
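The SMOTE 100 and SMOTE 200 settings refer to generating 100% or 200% additional synthetic minority samples. A minimal sketch of the SMOTE idea from Chawla et al. [3] is given below; the function name, parameter names, and fixed random seed are our own choices for illustration, not the exact implementation used in the experiments.

```python
import random

def smote(minority, n_percent=100, k=5, rng=None):
    """SMOTE sketch: for each minority sample, create n_percent/100
    synthetic points by interpolating towards a randomly chosen one of
    its k nearest minority-class neighbours."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    per_sample = n_percent // 100
    synthetic = []
    for x in minority:
        # k nearest minority neighbours by squared Euclidean distance,
        # excluding the sample itself
        neighbours = sorted((y for y in minority if y is not x),
                            key=lambda y: sum((a - b) ** 2
                                              for a, b in zip(x, y)))[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # random point on the segment x -> nb
            synthetic.append(tuple(a + gap * (b - a)
                                   for a, b in zip(x, nb)))
    return synthetic
```

With n_percent=100 the minority class doubles in size; with n_percent=200 it triples, which matches the SMOTE 100 and SMOTE 200 labels used in the notes above.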

References

  1. Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., Bing, G.: Learning from class-imbalanced data. Expert Syst. Appl. 73, 220–239 (2016). doi:10.1016/j.eswa.2016.12.035

  2. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE TKDE 21(9), 1263–1284 (2009)

  3. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: synthetic minority over-sampling technique. AIR J. 16, 321–357 (2002)

  4. Motoda, H., Liu, H.: Feature selection, extraction and construction. Commun. IICM 5, 67–72 (2012)

  5. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)

  6. Kononenko, I.: Estimating attributes: analysis and extension of RELIEF. In: Proceedings of the European Conference on Machine Learning, pp. 171–182 (1994)

  7. Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007)

  8. Hira, Z.M., Gillies, D.F.: A review of feature selection and feature extraction methods applied on microarray data. Adv. Bioinf. 2015, 1–13 (2015). doi:10.1155/2015/198363

  9. Neumann, U., Riemenschneider, M., Sowa, J.P., Baars, T., Kälsch, J., Canbay, A., Heider, D.: Compensation of feature selection biases accompanied with improved predictive performance for binary classification by using a novel ensemble feature selection approach. BioData Min. 9, 36 (2016). doi:10.1186/s13040-016-0114-4

  10. He, Z., Yu, W.: Stable feature selection for biomarker discovery. Comput. Biol. Chem. 34, 215–225 (2010)

  11. Loscalzo, L., Yu, C.D.: Consensus group stable feature selection. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 567–575 (2009)

  12. Ein-Dor, L., Zuk, O., Domany, E.: Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc. Nat. Acad. Sci. U.S.A. 103(15), 5923–5928 (2006)

  13. Yang, P., Liu, W., Zhou, B.B., Chawla, S., Zomaya, A.: Ensemble-based wrapper methods for feature selection and class imbalance learning. In: PAKDD, Advances in Knowledge Discovery and Data Mining. LNCS, vol. 7818, pp. 544–555 (2013)

  14. Werner, A., Bach, M., Pluskiewicz, W.: The study of preprocessing methods’ utility in analysis of multidimensional and highly imbalanced medical data. In: Proceedings of the 11th International Conference Internet in the Information Society 2016, pp. 71–87 (2016). ISBN: 978-83-65621-00-9

  15. Bach, M., Werner, A., Żywiec, J., Pluskiewicz, W.: The study of under- and over-sampling methods’ utility in analysis of highly imbalanced data on osteoporosis. Inf. Sci. 384, 174–190 (2016). doi:10.1016/j.ins.2016.09.038

  16. WEKA download page. http://www.cs.waikato.ac.nz/ml/weka/down-loading.html. Last accessed 10 Apr 2017

  17. The R Project for Statistical Computing. https://www.r-project.org/. Last accessed 10 Apr 2017

  18. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/index.html

  19. Ashari, A., Paryudi, I., et al.: Performance comparison between Naïve Bayes, decision tree and k-nearest neighbor in searching alternative design in an energy simulation tool. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 4, 33–39 (2013). doi:10.14569/IJACSA.2013.041105

  20. John, G.H., Langley, P.: Estimating continuous distributions in Bayesian classifiers. In: Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338–345 (1995)

  21. Aha, D., Kibler, D.: Instance-based learning algorithms. Mach. Learn. 6, 37–66 (1991)

  22. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Francisco (1993). ISBN: 1-55860-238-0

  23. López, V., Fernandez, A., Garcia, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013). doi:10.1016/j.ins.2013.07.007

  24. Kostrzewa, D., Brzeski, R.: Parametric optimization of the selected classifiers in binary classification. In: Advanced Topics in Intelligent Information and Database Systems, pp. 59–69 (2017). doi:10.1007/978-3-319-56660-3_6

  25. Raeder, T., Forman, G., Chawla, N.V.: Learning from imbalanced data: evaluation matters. In: Holmes, D.E., Jain, L.C. (eds.) Data Mining: Foundations & Intelligent Paradigms, ISRL 23, pp. 315–331. Springer-Verlag (2012)


Acknowledgements

Project financed from Polish funds for science in 2017.

Author information

Correspondence to Małgorzata Bach or Aleksandra Werner.


Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Bach, M., Werner, A. (2018). Cost-Sensitive Feature Selection for Class Imbalance Problem. In: Borzemski, L., Świątek, J., Wilimowska, Z. (eds) Information Systems Architecture and Technology: Proceedings of 38th International Conference on Information Systems Architecture and Technology – ISAT 2017. ISAT 2017. Advances in Intelligent Systems and Computing, vol 655. Springer, Cham. https://doi.org/10.1007/978-3-319-67220-5_17


  • DOI: https://doi.org/10.1007/978-3-319-67220-5_17


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67219-9

  • Online ISBN: 978-3-319-67220-5

  • eBook Packages: Engineering, Engineering (R0)
