Is Bigger Data Better for Defect Prediction: Examining the Impact of Data Size on Supervised and Unsupervised Defect Prediction

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11817)

Abstract

Defect prediction helps software practitioners anticipate the future occurrence of bugs in particular regions of software code. To improve prediction accuracy, dozens of supervised and unsupervised methods have been proposed and have achieved good results in this field. One limiting factor is that available defect datasets are small, which restricts the scope in which defect prediction models can be applied. In this study, we construct bigger defect datasets by merging available datasets that share the same measurement dimensions, and we examine whether bigger data yields better defect prediction performance for supervised and unsupervised models. Our experimental results reveal that larger datasets do not improve the performance of either supervised or unsupervised classifiers.
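To make the experimental setup concrete, the following is a minimal sketch of the merge-and-compare procedure described above, assuming two CSV defect datasets that share the same metric columns and a binary "bug" label. The file names, the "loc" column, the random-forest classifier, the single-metric unsupervised ranking, and AUC as the evaluation measure are illustrative assumptions, not the paper's exact protocol.

    # Minimal sketch of the merge-and-compare setup (assumptions noted above).
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Merge datasets that share the same measurement dimensions (metric set).
    small = pd.read_csv("project_a.csv")   # hypothetical dataset
    other = pd.read_csv("project_b.csv")   # hypothetical dataset
    assert list(small.columns) == list(other.columns)
    big = pd.concat([small, other], ignore_index=True)

    def evaluate(df: pd.DataFrame) -> tuple[float, float]:
        """Return (supervised AUC, unsupervised AUC) on a held-out split."""
        X, y = df.drop(columns=["bug"]), df["bug"]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=0)

        # Supervised model: a random forest trained on labeled data.
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_tr, y_tr)
        sup_auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

        # Unsupervised model: rank modules by a single size metric,
        # without using any labels (larger modules ranked as riskier).
        unsup_auc = roc_auc_score(y_te, X_te["loc"])
        return sup_auc, unsup_auc

    for name, data in [("small", small), ("bigger (merged)", big)]:
        sup, unsup = evaluate(data)
        print(f"{name}: supervised AUC={sup:.3f}, unsupervised AUC={unsup:.3f}")

Comparing the two printed rows is the essence of the research question: if bigger data helped, the merged dataset's scores would consistently exceed the single-project scores.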

Acknowledgement

This work is supported by the National Key R&D Program of China (2018YFB1003901) and the National Natural Science Foundation of China (Grant No. 61872177).

Author information

Corresponding author

Correspondence to Yanhui Li.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, X., Li, Y. (2019). Is Bigger Data Better for Defect Prediction: Examining the Impact of Data Size on Supervised and Unsupervised Defect Prediction. In: Ni, W., Wang, X., Song, W., Li, Y. (eds) Web Information Systems and Applications. WISA 2019. Lecture Notes in Computer Science, vol 11817. Springer, Cham. https://doi.org/10.1007/978-3-030-30952-7_16

  • DOI: https://doi.org/10.1007/978-3-030-30952-7_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30951-0

  • Online ISBN: 978-3-030-30952-7

  • eBook Packages: Computer Science (R0)
