Is Bigger Data Better for Defect Prediction: Examining the Impact of Data Size on Supervised and Unsupervised Defect Prediction

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11817)

Abstract

Defect prediction helps software practitioners anticipate the future occurrence of bugs in particular regions of software code. To improve prediction accuracy, dozens of supervised and unsupervised methods have been proposed and have achieved good results in this field. One limiting factor is that available defect datasets are small, which restricts the scope in which defect prediction models can be applied. In this study, we construct bigger defect datasets by merging available datasets that share the same measurement dimensions, and we examine whether bigger data yields better defect prediction performance for supervised and unsupervised models. Our experimental results reveal that larger datasets do not improve the performance of either supervised or unsupervised classifiers.
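To make the experimental setup concrete, the following is a minimal sketch of the merge-and-compare procedure described above, assuming two CSV defect datasets that share the same metric columns and a binary "bug" label. The file names, the "loc" column, the random-forest classifier, the single-metric unsupervised ranking, and AUC as the evaluation measure are illustrative assumptions, not the paper's exact protocol.

    # Minimal sketch of the merge-and-compare setup (assumptions noted above).
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Merge datasets that share the same measurement dimensions (metric set).
    small = pd.read_csv("project_a.csv")   # hypothetical dataset
    other = pd.read_csv("project_b.csv")   # hypothetical dataset
    assert list(small.columns) == list(other.columns)
    big = pd.concat([small, other], ignore_index=True)

    def evaluate(df: pd.DataFrame) -> tuple[float, float]:
        """Return (supervised AUC, unsupervised AUC) on a held-out split."""
        X, y = df.drop(columns=["bug"]), df["bug"]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=0)

        # Supervised model: a random forest trained on labeled data.
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_tr, y_tr)
        sup_auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

        # Unsupervised model: rank modules by a single size metric,
        # without using any labels (larger modules ranked as riskier).
        unsup_auc = roc_auc_score(y_te, X_te["loc"])
        return sup_auc, unsup_auc

    for name, data in [("small", small), ("bigger (merged)", big)]:
        sup, unsup = evaluate(data)
        print(f"{name}: supervised AUC={sup:.3f}, unsupervised AUC={unsup:.3f}")

Comparing the two printed rows is the essence of the research question: if bigger data helped, the merged dataset's scores would consistently exceed the single-project scores.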

Acknowledgement

This work is supported by the National Key R&D Program of China (2018YFB1003901) and the National Natural Science Foundation of China (Grant No. 61872177).

Author information

Corresponding author

Correspondence to Yanhui Li.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, X., Li, Y. (2019). Is Bigger Data Better for Defect Prediction: Examining the Impact of Data Size on Supervised and Unsupervised Defect Prediction. In: Ni, W., Wang, X., Song, W., Li, Y. (eds) Web Information Systems and Applications. WISA 2019. Lecture Notes in Computer Science, vol 11817. Springer, Cham. https://doi.org/10.1007/978-3-030-30952-7_16

  • DOI: https://doi.org/10.1007/978-3-030-30952-7_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30951-0

  • Online ISBN: 978-3-030-30952-7

  • eBook Packages: Computer Science (R0)
