
The impact of feature reduction techniques on defect prediction models

  • Masanari Kondo
  • Cor-Paul Bezemer
  • Yasutaka Kamei
  • Ahmed E. Hassan
  • Osamu Mizuno

Abstract

Defect prediction is an important task for preserving software quality. Most prior work on defect prediction uses software features, such as the number of lines of code, to predict whether a file or commit will be defective in the future. There are several reasons to keep the number of features that are used in a defect prediction model small. For example, using a small number of features avoids the problem of multicollinearity and the so-called ‘curse of dimensionality’. Feature selection and reduction techniques can help to reduce the number of features in a model. Feature selection techniques reduce the number of features in a model by selecting the most important ones, while feature reduction techniques reduce the number of features by creating new, combined features from the original features. Several recent studies have investigated the impact of feature selection techniques on defect prediction. However, no large-scale study has investigated the impact of multiple feature reduction techniques on defect prediction. In this paper, we study the impact of eight feature reduction techniques on the performance, and the variance in performance, of five supervised and five unsupervised defect prediction models. In addition, we compare the impact of the studied feature reduction techniques with that of the two best-performing feature selection techniques (according to prior work). The following findings are the highlights of our study: (1) The studied correlation-based and consistency-based feature selection techniques result in the best-performing supervised defect prediction models, while neural network-based feature reduction techniques (restricted Boltzmann machine and autoencoder) result in the best-performing unsupervised defect prediction models. In both cases, the defect prediction models that use the selected or generated features outperform those that use the original features (in terms of AUC and performance variance). (2) Neural network-based feature reduction techniques generate features that have a small variance in performance across both supervised and unsupervised defect prediction models. Hence, we recommend that practitioners who do not wish to choose a best-performing defect prediction model for their data use a neural network-based feature reduction technique.
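
To make the selection/reduction distinction concrete, the sketch below contrasts the two on a synthetic defect dataset. It is a minimal illustration, not the authors' experimental pipeline: it assumes Python with scikit-learn, uses make_classification as a stand-in for a real defect dataset, uses BernoulliRBM as a stand-in for the restricted Boltzmann machine technique mentioned above, and evaluates a logistic regression predictor by AUC.

# Minimal sketch (assumptions: scikit-learn, synthetic data); not the authors' pipeline.
# Feature selection keeps a subset of the original metrics; feature reduction
# derives new, combined features (here: RBM hidden-unit activations).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a defect dataset: rows are files/commits,
# columns are software metrics, y marks defective instances.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature selection: keep the 5 original metrics most associated with defectiveness.
selector = SelectKBest(f_classif, k=5).fit(X_train, y_train)

# Feature reduction: learn 5 new features as RBM hidden-unit activations
# (unsupervised; labels are not used to build the features).
scaler = MinMaxScaler().fit(X_train)  # the RBM expects inputs in [0, 1]
rbm = BernoulliRBM(n_components=5, random_state=0).fit(scaler.transform(X_train))

for name, tr, te in [
    ("selection", selector.transform(X_train), selector.transform(X_test)),
    ("reduction", rbm.transform(scaler.transform(X_train)),
                  rbm.transform(scaler.transform(X_test))),
]:
    model = LogisticRegression(max_iter=1000).fit(tr, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(te)[:, 1])
    print(f"{name}: AUC = {auc:.2f}")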

Keywords

Feature reduction · Feature selection · Defect prediction · Restricted Boltzmann machine · Neural network

Notes

Acknowledgment

This work was partially supported by NSERC as well as JSPS KAKENHI (Grant Numbers: JP16K12415 and JP18H03222).


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Software Engineering Laboratory (SEL), Kyoto Institute of Technology, Kyoto, Japan
  2. Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada
  3. Principles of Software Languages group (POSL), Kyushu University, Fukuoka, Japan
  4. Software Analysis and Intelligence Lab (SAIL), School of Computing, Queen’s University, Kingston, Canada
