Abstract
Many data sets include an overabundance of features (i.e., attributes). However, not all features contribute equally to the class. Selecting the subset of features most relevant to the class attribute is a necessary step toward a more meaningful and usable model [11]. In addition, two-group classification problems frequently exhibit class imbalance [7, 34]: the examples of one class significantly outnumber the examples of the other. For instance, in software quality classification, fault-prone (fp) modules are typically much less common than not-fault-prone (nfp) modules. Traditional classification algorithms attempt to maximize overall accuracy without considering the relative significance of the different classes, and consequently misclassify many minority-class examples (e.g., fp) as the majority class (e.g., nfp). Such misclassifications are particularly costly in domains such as software quality assurance, where each one represents a lost opportunity to correct a faulty module prior to deployment and operation. A variety of techniques have been proposed to counter the problems associated with class imbalance [30].
Notes
1. WEKA (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato. WEKA is free software available under the GNU General Public License. In this study, all experiments and algorithms were implemented in the WEKA tool.
2. Full data sets can be found at http://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/
References
Berenson, M.L., Goldstein, M., Levine, D.: Intermediate Statistical Methods and Applications: A Computer Package Approach, 2nd edn. Prentice-Hall, Englewood Cliffs, NJ (1983)
Boetticher, G., Menzies, T., Ostrand, T.: Promise repository of empirical software engineering data (2007)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, P.W.: SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Chen, Z., Menzies, T., Port, D., Boehm, B.: Finding the right data for software cost modeling. IEEE Software. 22(6), 38–46 (2005)
Cieslak, D.A., Chawla, N.V., Striegel, A.: Combating imbalance in network intrusion datasets. In: Proceedings of 2006 IEEE International Conference on Granular Computing, pp. 732–737, Athens, Georgia (2006)
Elkan, C.: The foundations of cost-sensitive learning. In: Proceedings of the Seventeenth International Conference on Machine Learning, pp. 239–246 (2001)
Engen, V., Vincent, J., Phalp, K.: Enhancing network based intrusion detection for imbalanced data. Int. J. Knowl. Base. Intell. Eng. Syst. 12(5-6), 357–367 (2008)
Forman, G.: An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 3, 1289–1305 (2003)
Gao, K., Khoshgoftaar, T.M., Van Hulse, J.: An evaluation of sampling on filter-based feature selection methods. In: Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, pp. 416–421, Daytona Beach, FL, USA (2010)
Gao, K., Khoshgoftaar, T.M., Napolitano, A.: Exploring software quality classification with a wrapper-based feature ranking technique. In: Proceedings of 21st IEEE International Conference on Tools with Artificial Intelligence, pp. 67–74, Newark, NJ (2009)
Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
Hall, M.A., Holmes, G.: Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans. Knowl. Data Eng. 15(6), 1437–1447 (2003)
Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice-Hall, NJ, USA (1998)
Ilczuk, G., Mlynarski, R., Kargul, W., Wakulicz-Deja, A.: New feature selection methods for qualification of the patients for cardiac pacemaker implantation. Comput. Cardiol. 34(2-3), 423–426 (2007)
Jiang, Y., Lin, J., Cukic, B., Menzies, T.: Variance analysis in software fault prediction models. In: Proceedings of the 20th IEEE International Symposium on Software Reliability Engineering, pp. 99–108, Bangalore-Mysore, India (2009)
Jong, K., Marchiori, E., Sebag, M., van der Vaart, A.: Feature selection in proteomic pattern data with support vector machines. In: Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (2004)
Kamal, A.H., Zhu, X., Pandya, A.S., Hsu, S., Shoaib, M.: The impact of gene selection on imbalanced microarray expression data. In: Proceedings of the 1st International Conference on Bioinformatics and Computational Biology; Lecture Notes in Bioinformatics; Vol. 5462, pp. 259–269, New Orleans, LA (2009)
Khoshgoftaar, T.M., Bullard, L.A., Gao, K.: Attribute selection using rough sets in software quality classification. Int. J. Reliab. Qual. Saf. Eng. 16(1), 73–89 (2009)
Khoshgoftaar, T.M., Gao, K.: A novel software metric selection technique using the area under ROC curves. In: Proceedings of the 22nd International Conference on Software Engineering and Knowledge Engineering, pp. 203–208, San Francisco, CA (2010)
Khoshgoftaar, T.M., Golawala, M., Van Hulse, J.: An empirical study of learning from imbalanced data using random forest. In: Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence, Vol. 2, pp. 310–317, Washington, DC, USA (2007)
Kira, K., Rendell, L.A.: A practical approach to feature selection. In: Proceedings of 9th International Workshop on Machine Learning, pp. 249–256 (1992)
Lessmann, S., Baesens, B., Mues, C., Pietsch, S.: Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Trans. Software Eng. 34(4), 485–496 (2008)
Liu, H., Yu, L.: Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng. 17(4), 491–502 (2005)
Liu, H., Motoda, H., Yu, L.: A selective sampling approach to active feature selection. Artif. Intell. 159(1-2), 49–74 (2004)
Menzies, T., Greenwald, J., Frank, A.: Data mining static code attributes to learn defect predictors. IEEE Trans. Software Eng. 33(1), 2–13 (2007)
Plackett, R.L.: Karl Pearson and the chi-squared test. Int. Stat. Rev. 51(1), 59–72 (1983)
Rodriguez, D., Ruiz, R., Cuadrado-Gallego, J., Aguilar-Ruiz, J.: Detecting fault modules applying feature selection to classifiers. In: Proceedings of 8th IEEE International Conference on Information Reuse and Integration, pp. 667–672, Las Vegas, Nevada (2007)
Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J.: Improving software-quality predictions with data sampling and boosting. IEEE Trans. Syst. Man Cybern. Part A: Syst. Humans 39(6), 1283–1294 (2009)
Shawe-Taylor, J., Cristianini, N.: Support Vector Machines, 2nd edn. Cambridge University Press (2000)
Van Hulse, J., Khoshgoftaar, T.M., Napolitano, A.: Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th International Conference on Machine Learning, pp. 935–942, Corvallis, OR, USA (2007)
Wang, H., Khoshgoftaar, T.M., Gao, K., Seliya, N.: Mining data from multiple software development projects. In: Proceedings of the 3rd IEEE International Workshop Mining Multiple Information Sources, pp. 551–557, Miami, FL (2009)
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann (2005)
Wohlin, C., Runeson, P., Host, M., Ohlsson, M.C., Regnell, B., Wesslen, A.: Experimentation in Software Engineering: An Introduction. Kluwer International Series in Software Engineering. Kluwer Academic Publishers, Boston, MA (2000)
Zhao, Z.M., Li, X., Chen, L., Aihara, K.: Protein classification with imbalanced data. Proteins: Structure, Function, and Bioinformatics, 70(4), 1125–1132 (2007)
Zimmermann, T., Premraj, R., Zeller, A.: Predicting defects for eclipse. In: Proceedings of the 29th International Conference on Software Engineering Workshops, p. 76, Washington, DC, USA, IEEE Computer Society (2007)
Copyright information
© 2012 Springer Vienna
About this chapter
Cite this chapter
Khoshgoftaar, T.M., Gao, K., Van Hulse, J. (2012). Feature Selection for Highly Imbalanced Software Measurement Data. In: Özyer, T., Kianmehr, K., Tan, M. (eds) Recent Trends in Information Reuse and Integration. Springer, Vienna. https://doi.org/10.1007/978-3-7091-0738-6_8
Print ISBN: 978-3-7091-0737-9
Online ISBN: 978-3-7091-0738-6