Addressing Class Imbalance and Cost Sensitivity in Software Defect Prediction by Combining Domain Costs and Balancing Costs

Siers, Michael J.; Islam, Md Zahidul

doi:10.1007/978-3-319-49586-6_11

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10086)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

Effective methods for identifying software defects help minimize the business costs of software development. Classification methods can be used to perform software defect prediction. When cost-sensitive methods are used, the predictions are optimized for business cost. The data sets used as input for these methods typically suffer from the class imbalance problem: there are many more defect-free code examples than defective code examples to learn from. This negatively impacts the classifier’s ability to correctly predict defective code examples. Cost-sensitive classification can also be used to mitigate the effects of the class imbalance problem by setting the costs to reflect the level of imbalance in the training data set. Through an experimental process, we have developed a method for combining these two different types of costs. We demonstrate that by using our proposed approach, we can produce more cost-effective predictions than several recent cost-sensitive methods used for software defect prediction. Furthermore, we examine the software defect prediction models built by our method and present the discovered insights.
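The two kinds of cost mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not say how the two costs are combined, so the sketch keeps them separate. The function names, the `fn`/`fp` cost dictionary, the use of the imbalance ratio as a balancing cost, and the numeric domain costs are all assumptions made for illustration.

```python
from collections import Counter


def balancing_costs(labels, minority="defective", majority="defect-free"):
    """Illustrative costs that mirror the class imbalance (an assumption)."""
    counts = Counter(labels)
    # Missing a minority example is charged the imbalance ratio;
    # a false alarm on a majority example is charged 1.
    ratio = counts[majority] / counts[minority]
    return {"fn": ratio, "fp": 1.0}


def min_cost_prediction(p_defective, costs):
    """Return the label with the lower expected misclassification cost."""
    # Predicting defect-free costs `fn` whenever the code is really defective.
    cost_predict_clean = p_defective * costs["fn"]
    # Predicting defective costs `fp` whenever the code is really defect-free.
    cost_predict_defective = (1.0 - p_defective) * costs["fp"]
    if cost_predict_defective < cost_predict_clean:
        return "defective"
    return "defect-free"


# Toy training labels: 90 defect-free examples and 10 defective ones.
labels = ["defect-free"] * 90 + ["defective"] * 10
bal = balancing_costs(labels)        # costs derived from the imbalance
domain = {"fn": 3.0, "fp": 1.0}      # hypothetical business costs

print(min_cost_prediction(0.2, bal))     # decision under balancing costs
print(min_cost_prediction(0.2, domain))  # decision under domain costs
```

With the same estimated defect probability, the two cost settings can lead to different predictions, which is why the choice of costs matters.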

Author information

Authors and Affiliations

School of Computing and Mathematics, Charles Sturt University, Bathurst, Australia
Michael J. Siers & Md Zahidul Islam

Corresponding author

Correspondence to Michael J. Siers.

Editor information

Editors and Affiliations

University of Technology, Sydney, New South Wales, Australia
Jinyan Li
University of Queensland, Brisbane, Australia
Xue Li
Beijing Institute of Technology, Beijing, China
Shuliang Wang
University of Western Australia, Crawley, West Australia, Australia
Jianxin Li
University of Adelaide, Adelaide, South Australia, Australia
Quan Z. Sheng

About this paper

Cite this paper

Siers, M.J., Islam, M.Z. (2016). Addressing Class Imbalance and Cost Sensitivity in Software Defect Prediction by Combining Domain Costs and Balancing Costs. In: Li, J., Li, X., Wang, S., Li, J., Sheng, Q. (eds) Advanced Data Mining and Applications. ADMA 2016. Lecture Notes in Computer Science, vol 10086. Springer, Cham. https://doi.org/10.1007/978-3-319-49586-6_11

DOI: https://doi.org/10.1007/978-3-319-49586-6_11
Published: 13 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49585-9
Online ISBN: 978-3-319-49586-6
eBook Packages: Computer Science, Computer Science (R0)
