Feature Selection for Highly Imbalanced Software Measurement Data

Chapter in: Recent Trends in Information Reuse and Integration

Abstract

Many data sets contain an overabundance of features (i.e., attributes), and not all features contribute equally to predicting the class. Selecting the subset of features most relevant to the class attribute is a necessary step toward a more meaningful and usable model [11]. In addition, two-group classification problems frequently exhibit class imbalance [7, 34], where examples of one class are significantly outnumbered by examples of the other class. For instance, in software quality classification, fault-prone (fp) modules are typically much less common than not-fault-prone (nfp) modules. Traditional classification algorithms attempt to maximize overall accuracy without considering the relative importance of the different classes, producing a large number of misclassifications from the minority class (e.g., fp) to the majority class (e.g., nfp). Such misclassifications are particularly costly in domains such as software quality assurance, where each one represents a lost opportunity to correct a faulty module prior to deployment and operation. A variety of techniques have been proposed to counter the problems associated with class imbalance [30].
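
To make the setting concrete, the following is a minimal sketch of one common way to pair data sampling with filter-based feature ranking in weka, the tool used in this study (see the Notes below): the training data are balanced with weka's supervised Resample filter and the software metrics are then ranked with a chi-squared evaluator. The ARFF file name, the number of retained features, and the choice of chi-squared as the ranker are illustrative assumptions, not necessarily the configuration evaluated in this chapter.

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.ChiSquaredAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.instance.Resample;

    public class ImbalancedRankingSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical software measurement data set in ARFF format;
            // the class attribute (fp / nfp) is assumed to be the last one.
            Instances data = DataSource.read("software-metrics.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Counter the class imbalance: draw a sample of the same size
            // whose class distribution is biased toward uniform.
            Resample resample = new Resample();
            resample.setBiasToUniformClass(1.0);   // 1.0 = fully balanced
            resample.setSampleSizePercent(100.0);
            resample.setInputFormat(data);
            Instances balanced = Filter.useFilter(data, resample);

            // Filter-based feature ranking: score every metric with the
            // chi-squared statistic and keep the six highest-ranked ones.
            AttributeSelection selector = new AttributeSelection();
            Ranker ranker = new Ranker();
            ranker.setNumToSelect(6);
            selector.setEvaluator(new ChiSquaredAttributeEval());
            selector.setSearch(ranker);
            selector.SelectAttributes(balanced);

            // Print the selected metrics (weka appends the class index
            // to the returned array, so it appears last).
            for (int index : selector.selectedAttributes()) {
                System.out.println(balanced.attribute(index).name());
            }
        }
    }

Other filter-based rankers and other sampling strategies plug into the same AttributeSelection interface, so alternative combinations can be compared with minimal changes to such a setup.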


Notes

  1. weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato. weka is free software available under the GNU General Public License. In this study, all experiments and algorithms were implemented in the weka tool.

  2. Full data sets can be found at http://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/

References

  1. Berenson, M.L., Goldstein, M., Levine, D.: Intermediate Statistical Methods and Applications: A Computer Package Approach, 2nd edn. Prentice-Hall, Englewood Cliffs, NJ (1983)

  2. Boetticher, G., Menzies, T., Ostrand, T.: Promise repository of empirical software engineering data (2007)

  3. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, P.W.: SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)

  4. Chen, Z., Menzies, T., Port, D., Boehm, B.: Finding the right data for software cost modeling. IEEE Software. 22(6), 38–46 (2005)

  5. Cieslak, D.A., Chawla, N.V., Striegel, A.: Combating imbalance in network intrusion datasets. In: Proceedings of 2006 IEEE International Conference on Granular Computing, pp. 732–737, Athens, Georgia (2006)

  6. Elkan, C.: The foundations of cost-sensitive learning. In: Proceedings of the Seventeenth International Conference on Machine Learning, pp. 239–246 (2001)

  7. Engen, V., Vincent, J., Phalp, K.: Enhancing network based intrusion detection for imbalanced data. Int. J. Knowl. Base. Intell. Eng. Syst. 12(5-6), 357–367 (2008)

  8. Forman, G.: An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 3, 1289–1305 (2003)

  9. Gao, K., Khoshgoftaar, T.M., Van Hulse, J.: An evaluation of sampling on filter-based feature selection methods. In: Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, pp. 416–421, Daytona Beach, FL, USA (2010)

  10. Gao, K., Khoshgoftaar, T.M., Napolitano, A.: Exploring software quality classification with a wrapper-based feature ranking technique. In: Proceedings of the 21st IEEE International Conference on Tools with Artificial Intelligence, pp. 67–74, Newark, NJ (2009)

  11. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)

  12. Hall, M.A., Holmes, G.: Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans. Knowl. Data Eng. 15(6), 1437–1447 (2003)

  13. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice-Hall, NJ, USA (1998)

  14. Ilczuk, G., Mlynarski, R., Kargul, W., Wakulicz-Deja, A.: New feature selection methods for qualification of the patients for cardiac pacemaker implantation. Comput. Cardiol. 34(2-3), 423–426 (2007)

  15. Jiang, Y., Lin, J., Cukic, B., Menzies, T.: Variance analysis in software fault prediction models. In: Proceedings of the 20th IEEE International Symposium on Software Reliability Engineering, pp. 99–108, Bangalore-Mysore, India (2009)

  16. Jong, K., Marchiori, E., Sebag, M., van der Vaart, A.: Feature selection in proteomic pattern data with support vector machines. In: Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (2004)

  17. Kamal, A.H., Zhu, X., Pandya, A.S., Hsu, S., Shoaib, M.: The impact of gene selection on imbalanced microarray expression data. In: Proceedings of the 1st International Conference on Bioinformatics and Computational Biology; Lecture Notes in Bioinformatics; Vol. 5462, pp. 259–269, New Orleans, LA (2009)

  18. Khoshgoftaar, T.M., Bullard, L.A., Gao, K.: Attribute selection using rough sets in software quality classification. Int. J. Reliab. Qual. Saf. Eng. 16(1), 73–89 (2009)

  19. Khoshgoftaar, T.M., Gao, K.: A novel software metric selection technique using the area under ROC curves. In: Proceedings of the 22nd International Conference on Software Engineering and Knowledge Engineering, pp. 203–208, San Francisco, CA (2010)

  20. Khoshgoftaar, T.M., Golawala, M., Van Hulse, J.: An empirical study of learning from imbalanced data using random forest. In: Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence, Vol. 2, pp. 310–317, Washington, DC, USA (2007)

  21. Kira, K., Rendell, L.A.: A practical approach to feature selection. In: Proceedings of 9th International Workshop on Machine Learning, pp. 249–256 (1992)

  22. Lessmann, S., Baesens, B., Mues, C., Pietsch, S.: Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Trans. Software Eng. 34(4), 485–496 (2008)

  23. Liu, H., Yu, L.: Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng. 17(4), 491–502 (2005)

  24. Liu, H., Motoda, H., Yu, L.: A selective sampling approach to active feature selection. Artif. Intell. 159(1-2), 49–74 (2004)

  25. Menzies, T., Greenwald, J., Frank, A.: Data mining static code attributes to learn defect predictors. IEEE Trans. Software Eng. 33(1), 2–13 (2007)

  26. Plackett, R.L.: Karl Pearson and the chi-squared test. Int. Stat. Rev. 51(1), 59–72 (1983)

  27. Rodriguez, D., Ruiz, R., Cuadrado-Gallego, J., Aguilar-Ruiz, J.: Detecting fault modules applying feature selection to classifiers. In: Proceedings of 8th IEEE International Conference on Information Reuse and Integration, pp. 667–672, Las Vegas, Nevada (2007)

  28. Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J.: Improving software-quality predictions with data sampling and boosting. IEEE Trans. Syst. Man Cybern. Part A: Syst. Humans 39(6), 1283–1294 (2009)

  29. Shawe-Taylor, J., Cristianini, N.: Support Vector Machines, 2nd edn. Cambridge University Press (2000)

  30. Van Hulse, J., Khoshgoftaar, T.M., Napolitano, A.: Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th International Conference on Machine Learning, pp. 935–942, Corvallis, OR, USA (2007)

  31. Wang, H., Khoshgoftaar, T.M., Gao, K., Seliya, N.: Mining data from multiple software development projects. In: Proceedings of the 3rd IEEE International Workshop Mining Multiple Information Sources, pp. 551–557, Miami, FL (2009)

  32. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann (2005)

  33. Wohlin, C., Runeson, P., Host, M., Ohlsson, M.C., Regnell, B., Wesslen, A.: Experimentation in Software Engineering: An Introduction. Kluwer International Series in Software Engineering. Kluwer Academic Publishers, Boston, MA (2000)

  34. Zhao, Z.M., Li, X., Chen, L., Aihara, K.: Protein classification with imbalanced data. Proteins: Structure, Function, and Bioinformatics, 70(4), 1125–1132 (2007)

  35. Zimmermann, T., Premraj, R., Zeller, A.: Predicting defects for Eclipse. In: Proceedings of the 29th International Conference on Software Engineering Workshops, p. 76, Washington, DC, USA. IEEE Computer Society (2007)

Author information

Correspondence to Taghi M. Khoshgoftaar.

Copyright information

© 2012 Springer Vienna

About this chapter

Cite this chapter

Khoshgoftaar, T.M., Gao, K., Van Hulse, J. (2012). Feature Selection for Highly Imbalanced Software Measurement Data. In: Özyer, T., Kianmehr, K., Tan, M. (eds) Recent Trends in Information Reuse and Integration. Springer, Vienna. https://doi.org/10.1007/978-3-7091-0738-6_8

  • DOI: https://doi.org/10.1007/978-3-7091-0738-6_8

  • Publisher Name: Springer, Vienna

  • Print ISBN: 978-3-7091-0737-9

  • Online ISBN: 978-3-7091-0738-6

  • eBook Packages: Computer Science (R0)
