InECCE2019 pp 541-553

Hybrid Sampling and Random Forest Based Machine Learning Approach for Software Defect Prediction

  • Md Anwar Hossen (Email author)
  • Md. Shariful Islam
  • Nurhafizah Abu Talip Yusof
  • Md. Sakib Rahman
  • Fatema Siddika
  • Mostafijur Rahman
  • Sabira Khatun
  • Mohamad Shaiful Abdul Karim
  • S. M. Hasan Mahmud
Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 632)

Abstract

Software has become an essential part of human life. In the current computing era, many large-scale complex network systems and millions of modern technological devices produce a huge amount of data every second. A relatively large share of this data is imbalanced, and imbalanced data can mislead machine learning models. Software Defect Prediction (SDP) is one of the most helpful activities during the testing phase. The cost of finding and fixing defects is estimated at billions of pounds per year. Software defect prediction has emerged to reduce this cost, but it needs fine-tuning to reach the expected efficiency. In this chapter, we propose a new machine learning based model to predict software defects and to identify the key factors that may help software engineers locate the most defect-prone parts of a system. The proposed model works as follows. First, highly correlated features are removed and all features are brought to the same scale using feature scaling. Second, the Synthetic Minority Over-Sampling Technique (SMOTE), Adaptive Synthetic (ADASYN) sampling and a hybrid sampling method are used to balance highly imbalanced datasets. Third, the Random Forest Importance and Chi-square algorithms are used to find the factors that have a high effect on software defects. Cross-validation is used to counter the overfitting problem. The scikit-learn library is used for the machine learning algorithms, and the pandas library for data processing. Matplotlib and PyPlot are used for graphs and data visualization, respectively. The combination of the hybrid sampling method and the Random Forest (RF) algorithm achieved the highest prediction accuracy, about 93.26%, demonstrating its superiority.
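The first two steps of the pipeline described in the abstract (correlation filtering with feature scaling, then class balancing) can be illustrated with a small sketch. This is not the authors' scikit-learn implementation: it uses only the Python standard library, and the function names, the 0.9 correlation threshold, the neighbour count `k=3`, and the reading of "hybrid sampling" as undersampling the majority class while adding SMOTE-style interpolated minority samples until both classes meet at the midpoint size are all assumptions made for illustration. It also assumes binary labels and at least two minority samples.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(rows, threshold=0.9):
    """Return indices of columns whose |r| with every already kept column is <= threshold."""
    cols = list(zip(*rows))
    kept = []
    for j, col in enumerate(cols):
        if all(abs(pearson(col, cols[i])) <= threshold for i in kept):
            kept.append(j)
    return kept


def min_max_scale(rows):
    """Rescale each column to [0, 1]; constant columns become 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [[(v - lo) / sp if sp else 0.0 for v, lo, sp in zip(row, lows, spans)]
            for row in rows]


def smote_like(minority, n_new, k=3, rng=random):
    """Interpolate n_new points between minority samples and their k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def hybrid_sample(X, y, k=3, rng=random):
    """Assumed hybrid scheme: undersample the majority and oversample the minority to the midpoint size."""
    by_count = sorted(set(y), key=y.count)
    minority_label, majority_label = by_count[0], by_count[-1]
    minority = [x for x, t in zip(X, y) if t == minority_label]
    majority = [x for x, t in zip(X, y) if t == majority_label]
    target = (len(minority) + len(majority)) // 2
    majority = rng.sample(majority, target)
    minority = minority + smote_like(minority, target - len(minority), k, rng)
    return majority + minority, [majority_label] * target + [minority_label] * target


if __name__ == "__main__":
    # Toy data: the second column duplicates the first up to a factor of two.
    rows = [[i, 2 * i, (2 * i) % 5] for i in range(16)]
    labels = [1 if i % 4 == 0 else 0 for i in range(16)]
    kept = drop_correlated(rows)
    X = min_max_scale([[r[j] for j in kept] for r in rows])
    X_bal, y_bal = hybrid_sample(X, labels, rng=random.Random(0))
    print(kept, y_bal.count(0), y_bal.count(1))
```

In practice the balanced data would then be passed to a classifier such as a random forest and evaluated with cross-validation, as the abstract describes. Resampling is normally applied to the training folds only, so that synthetic samples do not leak into the evaluation data.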

Keywords

Software defect prediction; Machine learning; Imbalanced dataset; Chi square; Random forest importance

Notes

Acknowledgements

This research work is supported by research grant RDU1703236 funded by Universiti Malaysia Pahang, https://www.ump.edu.my/. The authors would also like to thank the Faculty of Electrical & Electronics Engineering, Universiti Malaysia Pahang for financial support.

Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  • Md Anwar Hossen (1, Email author)
  • Md. Shariful Islam (1)
  • Nurhafizah Abu Talip Yusof (2)
  • Md. Sakib Rahman (1)
  • Fatema Siddika (3)
  • Mostafijur Rahman (1)
  • Sabira Khatun (2)
  • Mohamad Shaiful Abdul Karim (2)
  • S. M. Hasan Mahmud (1)

  1. Department of Software Engineering, Daffodil International University, Dhaka, Bangladesh
  2. Faculty of Electrical and Electronics Engineering, Universiti Malaysia Pahang, Pekan, Malaysia
  3. Department of Computer Science and Engineering, Jagannath University, Dhaka, Bangladesh
