Feature Selection for Highly Imbalanced Software Measurement Data

  • Taghi M. Khoshgoftaar
  • Kehan Gao
  • Jason Van Hulse


Many data sets include an overabundance of features (i.e., attributes), but not all features contribute equally to predicting the class. Selecting the subset of features most relevant to the class attribute is therefore a necessary step toward a more meaningful and usable model [11]. In addition, two-group classification problems frequently suffer from class imbalance [7, 34]: the examples of one class significantly outnumber the examples of the other. For instance, in software quality classification, fault-prone (fp) modules are typically much less common than not-fault-prone (nfp) modules. Traditional classification algorithms attempt to maximize overall accuracy without considering the relative importance of the different classes, so many minority-class instances (e.g., fp) are misclassified as the majority class (e.g., nfp). This type of misclassification is especially costly in domains such as software quality assurance, where it represents a lost opportunity to correct a faulty module prior to deployment and operation. A variety of techniques have been proposed to counter the problems associated with class imbalance [30].
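Filter-based feature ranking, one family of feature selection techniques discussed in this chapter, scores each feature independently against the class attribute and retains only the top-ranked subset. The sketch below is a minimal illustration (not the chapter's actual method), assuming binary features for simplicity and scoring each one with the chi-squared statistic [26]; the function names, toy data, and the 8:2 nfp/fp split are all invented for illustration.

```python
# Minimal sketch of filter-based feature ranking on an imbalanced
# binary data set. Each feature is scored against the class label with
# the chi-squared statistic; the class distribution itself is untouched.

from collections import Counter

def chi_squared_score(feature_values, labels):
    """Chi-squared statistic of one binary feature vs. a binary class."""
    n = len(labels)
    # observed counts for each (feature value, class) contingency cell
    obs = Counter(zip(feature_values, labels))
    score = 0.0
    for f in (0, 1):
        for c in (0, 1):
            row = sum(obs[(f, cc)] for cc in (0, 1))   # row marginal
            col = sum(obs[(ff, c)] for ff in (0, 1))   # column marginal
            expected = row * col / n
            if expected > 0:
                score += (obs[(f, c)] - expected) ** 2 / expected
    return score

def select_top_k(X, y, k):
    """Rank features by chi-squared score and return the top-k indices."""
    n_features = len(X[0])
    scores = [chi_squared_score([row[j] for row in X], y)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])[:k]

# Toy imbalanced data: 8 nfp (0) vs. 2 fp (1) modules, 3 binary features.
# Feature 0 tracks the class perfectly; features 1 and 2 are noise.
X = [[0, 1, 0], [0, 0, 1], [0, 1, 1], [0, 0, 0], [0, 1, 0],
     [0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

print(select_top_k(X, y, 1))  # → [0]
```

Note that the ranking is computed on the raw imbalanced data here; whether to apply data sampling before or after the ranking step is precisely one of the design questions this line of work examines.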


Keywords: Feature Selection · Minority Class · Class Imbalance · Defect Prediction · Feature Ranking


References

  1. Berenson, M.L., Goldstein, M., Levine, D.: Intermediate Statistical Methods and Applications: A Computer Package Approach, 2nd edn. Prentice-Hall, Englewood Cliffs, NJ (1983)
  2. Boetticher, G., Menzies, T., Ostrand, T.: PROMISE repository of empirical software engineering data (2007)
  3. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, P.W.: SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
  4. Chen, Z., Menzies, T., Port, D., Boehm, B.: Finding the right data for software cost modeling. IEEE Software 22(6), 38–46 (2005)
  5. Cieslak, D.A., Chawla, N.V., Striegel, A.: Combating imbalance in network intrusion datasets. In: Proceedings of the 2006 IEEE International Conference on Granular Computing, pp. 732–737, Athens, Georgia (2006)
  6. Elkan, C.: The foundations of cost-sensitive learning. In: Proceedings of the Seventeenth International Conference on Machine Learning, pp. 239–246 (2001)
  7. Engen, V., Vincent, J., Phalp, K.: Enhancing network based intrusion detection for imbalanced data. Int. J. Knowl. Based Intell. Eng. Syst. 12(5–6), 357–367 (2008)
  8. Forman, G.: An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 3, 1289–1305 (2003)
  9. Gao, K., Khoshgoftaar, T.M., Van Hulse, J.: An evaluation of sampling on filter-based feature selection methods. In: Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, pp. 416–421, Daytona Beach, FL, USA (2010)
  10. Gao, K., Khoshgoftaar, T.M., Napolitano, A.: Exploring software quality classification with a wrapper-based feature ranking technique. In: Proceedings of the 21st IEEE International Conference on Tools with Artificial Intelligence, pp. 67–74, Newark, NJ (2009)
  11. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
  12. Hall, M.A., Holmes, G.: Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans. Knowl. Data Eng. 15(6), 1437–1447 (2003)
  13. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice-Hall, NJ, USA (1998)
  14. Ilczuk, G., Mlynarski, R., Kargul, W., Wakulicz-Deja, A.: New feature selection methods for qualification of the patients for cardiac pacemaker implantation. Comput. Cardiol. 34(2–3), 423–426 (2007)
  15. Jiang, Y., Lin, J., Cukic, B., Menzies, T.: Variance analysis in software fault prediction models. In: Proceedings of the 20th IEEE International Symposium on Software Reliability Engineering, pp. 99–108, Bangalore-Mysore, India (2009)
  16. Jong, K., Marchiori, E., Sebag, M., van der Vaart, A.: Feature selection in proteomic pattern data with support vector machines. In: Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (2004)
  17. Kamal, A.H., Zhu, X., Pandya, A.S., Hsu, S., Shoaib, M.: The impact of gene selection on imbalanced microarray expression data. In: Proceedings of the 1st International Conference on Bioinformatics and Computational Biology, Lecture Notes in Bioinformatics, vol. 5462, pp. 259–269, New Orleans, LA (2009)
  18. Khoshgoftaar, T.M., Bullard, L.A., Gao, K.: Attribute selection using rough sets in software quality classification. Int. J. Reliab. Qual. Saf. Eng. 16(1), 73–89 (2009)
  19. Khoshgoftaar, T.M., Gao, K.: A novel software metric selection technique using the area under ROC curves. In: Proceedings of the 22nd International Conference on Software Engineering and Knowledge Engineering, pp. 203–208, San Francisco, CA (2010)
  20. Khoshgoftaar, T.M., Golawala, M., Van Hulse, J.: An empirical study of learning from imbalanced data using random forest. In: Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence, vol. 2, pp. 310–317, Washington, DC, USA (2007)
  21. Kira, K., Rendell, L.A.: A practical approach to feature selection. In: Proceedings of the 9th International Workshop on Machine Learning, pp. 249–256 (1992)
  22. Lessmann, S., Baesens, B., Mues, C., Pietsch, S.: Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Trans. Software Eng. 34(4), 485–496 (2008)
  23. Liu, H., Yu, L.: Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng. 17(4), 491–502 (2005)
  24. Liu, H., Motoda, H., Yu, L.: A selective sampling approach to active feature selection. Artif. Intell. 159(1–2), 49–74 (2004)
  25. Menzies, T., Greenwald, J., Frank, A.: Data mining static code attributes to learn defect predictors. IEEE Trans. Software Eng. 33(1), 2–13 (2007)
  26. Plackett, R.L.: Karl Pearson and the chi-squared test. Int. Stat. Rev. 51(1), 59–72 (1983)
  27. Rodriguez, D., Ruiz, R., Cuadrado-Gallego, J., Aguilar-Ruiz, J.: Detecting fault modules applying feature selection to classifiers. In: Proceedings of the 8th IEEE International Conference on Information Reuse and Integration, pp. 667–672, Las Vegas, Nevada (2007)
  28. Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J.: Improving software-quality predictions with data sampling and boosting. IEEE Trans. Syst. Man Cybern. Part A: Syst. Humans 39(6), 1283–1294 (2009)
  29. Shawe-Taylor, J., Cristianini, N.: Support Vector Machines, 2nd edn. Cambridge University Press (2000)
  30. Van Hulse, J., Khoshgoftaar, T.M., Napolitano, A.: Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th International Conference on Machine Learning, pp. 935–942, Corvallis, OR, USA (2007)
  31. Wang, H., Khoshgoftaar, T.M., Gao, K., Seliya, N.: Mining data from multiple software development projects. In: Proceedings of the 3rd IEEE International Workshop on Mining Multiple Information Sources, pp. 551–557, Miami, FL (2009)
  32. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann (2005)
  33. Wohlin, C., Runeson, P., Host, M., Ohlsson, M.C., Regnell, B., Wesslen, A.: Experimentation in Software Engineering: An Introduction. Kluwer International Series in Software Engineering. Kluwer Academic Publishers, Boston, MA (2000)
  34. Zhao, Z.M., Li, X., Chen, L., Aihara, K.: Protein classification with imbalanced data. Proteins: Structure, Function, and Bioinformatics 70(4), 1125–1132 (2007)
  35. Zimmermann, T., Premraj, R., Zeller, A.: Predicting defects for Eclipse. In: Proceedings of the 29th International Conference on Software Engineering Workshops, p. 76, Washington, DC, USA. IEEE Computer Society (2007)

Copyright information

© Springer Vienna 2012

Authors and Affiliations

  • Taghi M. Khoshgoftaar (1)
  • Kehan Gao (2)
  • Jason Van Hulse (1)
  1. Florida Atlantic University, Boca Raton, USA
  2. Eastern Connecticut State University, Willimantic, USA
