The impact of feature reduction techniques on defect prediction models
Abstract
Defect prediction is an important task for preserving software quality. Most prior work on defect prediction uses software features, such as the number of lines of code, to predict whether a file or commit will be defective in the future. There are several reasons to keep the number of features that are used in a defect prediction model small. For example, using a small number of features avoids the problem of multicollinearity and the so-called ‘curse of dimensionality’. Feature selection and reduction techniques can help to reduce the number of features in a model. Feature selection techniques reduce the number of features in a model by selecting the most important ones, while feature reduction techniques reduce the number of features by creating new, combined features from the original features. Several recent studies have investigated the impact of feature selection techniques on defect prediction. However, no large-scale study has investigated the impact of multiple feature reduction techniques on defect prediction. In this paper, we study the impact of eight feature reduction techniques on the performance and the variance in performance of five supervised and five unsupervised defect prediction models. In addition, we compare the impact of the studied feature reduction techniques with the impact of the two best-performing feature selection techniques (according to prior work). The following findings are the highlights of our study: (1) The studied correlation-based and consistency-based feature selection techniques result in the best-performing supervised defect prediction models, while the neural network-based feature reduction techniques (restricted Boltzmann machine and autoencoder) result in the best-performing unsupervised defect prediction models. In both cases, the defect prediction models that use the selected/generated features perform better than those that use the original features (in terms of AUC and performance variance). (2) Neural network-based feature reduction techniques generate features that yield a small performance variance across both supervised and unsupervised defect prediction models. Hence, we recommend that practitioners who do not wish to select a best-performing defect prediction model for their data use a neural network-based feature reduction technique.
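To make the distinction between the two families of techniques concrete, the following is a minimal sketch in Python using scikit-learn. It is an illustration rather than the paper's actual pipeline: the synthetic dataset, the choice of logistic regression as the classifier, the number of retained features (k = 5), and the use of scikit-learn's BernoulliRBM as the restricted Boltzmann machine are all illustrative assumptions.

```python
# Sketch: feature selection (keep original features) vs. feature
# reduction (create new, combined features) before defect prediction.
# Dataset, classifier, and k are illustrative assumptions only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for defect data: 20 software features, binary label.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

def auc_of(X_tr, X_te):
    """Train a classifier on the given feature sets and return its AUC."""
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_train)
    return roc_auc_score(y_test, clf.predict_proba(X_te)[:, 1])

# Feature selection: keep the 5 most important original features.
selector = SelectKBest(f_classif, k=5).fit(X_train, y_train)
auc_sel = auc_of(selector.transform(X_train), selector.transform(X_test))

# Feature reduction: generate 5 new, combined features with an RBM
# (one of the neural network-based techniques studied in the paper).
# BernoulliRBM expects inputs in [0, 1], hence the scaling step.
scaler = MinMaxScaler().fit(X_train)
rbm = BernoulliRBM(n_components=5, random_state=0).fit(
    scaler.transform(X_train))
auc_red = auc_of(rbm.transform(scaler.transform(X_train)),
                 rbm.transform(scaler.transform(X_test)))

print(f"AUC with selected features: {auc_sel:.3f}")
print(f"AUC with reduced features:  {auc_red:.3f}")
```

Note that both paths hand the downstream model the same number of features; the difference is that selection passes through a subset of the original features, while reduction replaces them with learned combinations.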
Keywords
Feature reduction, Feature selection, Defect prediction, Restricted Boltzmann machine, Neural network
Acknowledgment
This work was partially supported by NSERC as well as JSPS KAKENHI (Grant Numbers: JP16K12415 and JP18H03222).