A Hybrid Instance Selection Using Nearest-Neighbor for Cross-Project Defect Prediction

  • Regular Paper
  • Journal of Computer Science and Technology

Abstract

Software defect prediction (SDP) is an active research field in software engineering that aims to identify defect-prone modules. Thanks to SDP, limited testing resources can be allocated effectively to defect-prone modules. Although SDP requires sufficient local data within a company, there are cases where local data are not available, e.g., pilot projects. Companies without local data can employ cross-project defect prediction (CPDP), which uses external data to build classifiers. The major challenge of CPDP is the different distributions of training and test data. To tackle this, instances of source data that are similar to the target data are selected to build classifiers. In addition, software datasets suffer from a class imbalance problem, meaning the ratio of the defective class to the clean class is very low, which usually lowers the performance of classifiers. We propose a Hybrid Instance Selection Using Nearest-Neighbor (HISNN) method that performs a hybrid classification, selectively learning local knowledge (via k-nearest neighbor) and global knowledge (via naïve Bayes). Instances having strong local knowledge are identified via nearest neighbors with the same class label. Previous studies showed low PD (probability of detection) or high PF (probability of false alarm), which makes such predictors impractical to use. The experimental results show that HISNN produces high overall performance as well as high PD and low PF.
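To make the hybrid idea concrete, below is a minimal, hypothetical sketch (Python with scikit-learn), not the authors' reference implementation: each target instance is classified by its nearest source neighbors when those neighbors unanimously agree (strong local knowledge), and by a global naïve Bayes model otherwise. The function name hybrid_predict, the unanimity rule, k = 5, and the Gaussian naïve Bayes choice are illustrative assumptions; the paper's full method also selects source instances similar to the target and addresses class imbalance, which this sketch omits.

    # Hypothetical sketch of hybrid local/global classification for CPDP.
    # Assumptions: X_src and X_tgt are 2-D NumPy feature arrays, y_src is a
    # 1-D label array; parameters are illustrative, not taken from the paper.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import NearestNeighbors

    def hybrid_predict(X_src, y_src, X_tgt, k=5):
        # Global knowledge: naive Bayes trained on all source instances.
        nb = GaussianNB().fit(X_src, y_src)

        # Local knowledge: k nearest source neighbors of each target instance.
        nn = NearestNeighbors(n_neighbors=k).fit(X_src)
        _, idx = nn.kneighbors(X_tgt)

        preds = np.empty(len(X_tgt), dtype=y_src.dtype)
        for i, neighbors in enumerate(idx):
            labels = y_src[neighbors]
            if np.all(labels == labels[0]):
                # Strong local knowledge: neighbors share one label, use it.
                preds[i] = labels[0]
            else:
                # Weak local knowledge: fall back to the global naive Bayes.
                preds[i] = nb.predict(X_tgt[i:i + 1])[0]
        return preds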



Author information

Corresponding author

Correspondence to Duksan Ryu.

Additional information

This work was partly supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science, ICT and Future Planning, MSIP) under Grant No. NRF-2013R1A1A2006985, and by an Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) under Grant No. R0101-15-0144, Development of Autonomous Intelligent Collaboration Framework for Knowledge Bases and Smart Devices.

About this article

Cite this article

Ryu, D., Jang, JI. & Baik, J. A Hybrid Instance Selection Using Nearest-Neighbor for Cross-Project Defect Prediction. J. Comput. Sci. Technol. 30, 969–980 (2015). https://doi.org/10.1007/s11390-015-1575-5
