Abstract
Many data sets include an overabundance of features (i.e., attributes). However, not all features contribute equally to the class. Selecting the subset of features most relevant to the class attribute is a necessary step toward a more meaningful and usable model [11]. In addition, two-group classification problems frequently exhibit class imbalance [7, 34]: the examples of one class significantly outnumber the examples of the other. For instance, in software quality classification, fault-prone (fp) modules are typically much less common than not-fault-prone (nfp) modules. Traditional classification algorithms attempt to maximize overall accuracy without considering the relative significance of the different classes, and consequently misclassify many minority-class examples (e.g., fp) as the majority class (e.g., nfp). Such misclassifications are particularly costly in domains such as software quality assurance, where each one represents a lost opportunity to correct a faulty module prior to deployment and operation. A variety of techniques have been proposed to counter the problems associated with class imbalance [30].
Notes
1. WEKA (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato. WEKA is free software available under the GNU General Public License. In this study, all experiments and algorithms were implemented in the WEKA tool.
2. Full data sets can be found at http://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/
References
Berenson, M.L., Goldstein, M., Levine, D.: Intermediate Statistical Methods and Applications: A Computer Package Approach, 2nd edn. Prentice-Hall, Englewood Cliffs, NJ (1983)
Boetticher, G., Menzies, T., Ostrand, T.: Promise repository of empirical software engineering data (2007)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, P.W.: SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Chen, Z., Menzies, T., Port, D., Boehm, B.: Finding the right data for software cost modeling. IEEE Software. 22(6), 38–46 (2005)
Cieslak, D.A., Chawla, N.V., Striegel, A.: Combating imbalance in network intrusion datasets. In: Proceedings of 2006 IEEE International Conference on Granular Computing, pp. 732–737, Athens, Georgia (2006)
Elkan, C.: The foundations of cost-sensitive learning. In: Proceedings of the Seventeenth International Conference on Machine Learning, pp. 239–246 (2001)
Engen, V., Vincent, J., Phalp, K.: Enhancing network based intrusion detection for imbalanced data. Int. J. Knowl. Base. Intell. Eng. Syst. 12(5-6), 357–367 (2008)
Forman, G.: An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 3, 1289–1305 (2003)
Gao, K., Khoshgoftaar, T.M., Van Hulse, J.: An evaluation of sampling on filter-based feature selection methods. In: Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, pp. 416–421, Daytona Beach, FL, USA (2010)
Gao, K., Khoshgoftaar, T.M., Napolitano, A.: Exploring software quality classification with a wrapper-based feature ranking technique. In: Proceedings of 21st IEEE International Conference on Tools with Artificial Intelligence, pp. 67–74, Newark, NJ (2009)
Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
Hall, M.A., Holmes, G.: Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans. Knowl. Data Eng. 15(6), 1437–1447 (2003)
Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice-Hall, NJ, USA (1998)
Ilczuk, G., Mlynarski, R., Kargul, W., Wakulicz-Deja, A.: New feature selection methods for qualification of the patients for cardiac pacemaker implantation. Comput. Cardiol. 34(2-3), 423–426 (2007)
Jiang, Y., Lin, J., Cukic, B., Menzies, T.: Variance analysis in software fault prediction models. In: Proceedings of the 20th IEEE International Symposium on Software Reliability Engineering, pp. 99–108, Bangalore-Mysore, India (2009)
Jong, K., Marchiori, E., Sebag, M., van der Vaart, A.: Feature selection in proteomic pattern data with support vector machines. In: Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (2004)
Kamal, A.H., Zhu, X., Pandya, A.S., Hsu, S., Shoaib, M.: The impact of gene selection on imbalanced microarray expression data. In: Proceedings of the 1st International Conference on Bioinformatics and Computational Biology; Lecture Notes in Bioinformatics; Vol. 5462, pp. 259–269, New Orleans, LA (2009)
Khoshgoftaar, T.M., Bullard, L.A., Gao, K.: Attribute selection using rough sets in software quality classification. Int. J. Reliab. Qual. Saf. Eng. 16(1), 73–89 (2009)
Khoshgoftaar, T.M., Gao, K.: A novel software metric selection technique using the area under ROC curves. In: Proceedings of the 22nd International Conference on Software Engineering and Knowledge Engineering, pp. 203–208, San Francisco, CA (2010)
Khoshgoftaar, T.M., Golawala, M., Van Hulse, J.: An empirical study of learning from imbalanced data using random forest. In: Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence, Vol. 2, pp. 310–317, Washington, DC, USA (2007)
Kira, K., Rendell, L.A.: A practical approach to feature selection. In: Proceedings of 9th International Workshop on Machine Learning, pp. 249–256 (1992)
Lessmann, S., Baesens, B., Mues, C., Pietsch, S.: Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Trans. Software Eng. 34(4), 485–496 (2008)
Liu, H., Yu, L.: Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng. 17(4), 491–502 (2005)
Liu, H., Motoda, H., Yu, L.: A selective sampling approach to active feature selection. Artif. Intell. 159(1-2), 49–74 (2004)
Menzies, T., Greenwald, J., Frank, A.: Data mining static code attributes to learn defect predictors. IEEE Trans. Software Eng. 33(1), 2–13 (2007)
Plackett, R.L.: Karl Pearson and the chi-squared test. Int. Stat. Rev. 51(1), 59–72 (1983)
Rodriguez, D., Ruiz, R., Cuadrado-Gallego, J., Aguilar-Ruiz, J.: Detecting fault modules applying feature selection to classifiers. In: Proceedings of 8th IEEE International Conference on Information Reuse and Integration, pp. 667–672, Las Vegas, Nevada (2007)
Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J.: Improving software-quality predictions with data sampling and boosting. IEEE Trans. Syst. Man Cybern. Part A: Syst. Humans 39(6), 1283–1294 (2009)
Shawe-Taylor, J., Cristianini, N.: Support Vector Machines, 2nd edn. Cambridge University Press (2000)
Van Hulse, J., Khoshgoftaar, T.M., Napolitano, A.: Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th International Conference on Machine Learning, pp. 935–942, Corvallis, OR, USA (2007)
Wang, H., Khoshgoftaar, T.M., Gao, K., Seliya, N.: Mining data from multiple software development projects. In: Proceedings of the 3rd IEEE International Workshop Mining Multiple Information Sources, pp. 551–557, Miami, FL (2009)
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann (2005)
Wohlin, C., Runeson, P., Host, M., Ohlsson, M.C., Regnell, B., Wesslen, A.: Experimentation in Software Engineering: An Introduction. Kluwer International Series in Software Engineering. Kluwer Academic Publishers, Boston, MA (2000)
Zhao, Z.M., Li, X., Chen, L., Aihara, K.: Protein classification with imbalanced data. Proteins: Structure, Function, and Bioinformatics, 70(4), 1125–1132 (2007)
Zimmermann, T., Premraj, R., Zeller, A.: Predicting defects for eclipse. In: Proceedings of the 29th International Conference on Software Engineering Workshops, p. 76, Washington, DC, USA, IEEE Computer Society (2007)
Copyright information
© 2012 Springer Vienna
About this chapter
Cite this chapter
Khoshgoftaar, T.M., Gao, K., Van Hulse, J. (2012). Feature Selection for Highly Imbalanced Software Measurement Data. In: Özyer, T., Kianmehr, K., Tan, M. (eds) Recent Trends in Information Reuse and Integration. Springer, Vienna. https://doi.org/10.1007/978-3-7091-0738-6_8
Print ISBN: 978-3-7091-0737-9
Online ISBN: 978-3-7091-0738-6