A method of credit evaluation modeling based on block-wise missing data

Abstract

Missing data are a common problem in credit evaluation practice and can obstruct the development and application of an evaluation model. Block-wise missing data are particularly troublesome. Based on a multi-task feature selection approach, this paper proposes a method called MMPFS for building a credit evaluation model. The method has two main steps: (1) dividing the dataset into several nonoverlapping subsets according to their missing patterns, and (2) applying multi-task feature selection with logistic regression to perform joint feature learning across all subsets. The proposed method has the following advantages: (1) missing data do not need to be handled in advance, (2) all available data can be used for model learning, (3) the information loss or bias introduced by general missing data processing methods can be avoided, and (4) the overfitting risk caused by redundant features can be reduced. The implementation framework and algorithmic principle of the proposed method are described, and three credit datasets from the UCI repository are used to compare the proposed method with other commonly used missing data treatments. The results show that MMPFS produces a better credit evaluation model than data preprocessing methods such as sample deletion and data imputation.
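Step (1) of the abstract, splitting samples by missing pattern, can be illustrated with a short sketch. This is not the authors' implementation: the function names `is_missing` and `split_by_missing_pattern` are hypothetical, the toy data are invented for illustration, and the sketch assumes missing values are encoded as `None` or NaN.

```python
from collections import defaultdict
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def split_by_missing_pattern(rows):
    """Group row indices by which features are observed.

    Returns a dict mapping a pattern (tuple of booleans, True = observed)
    to the list of row indices sharing that pattern. The groups are
    disjoint and together cover every row.
    """
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        pattern = tuple(not is_missing(value) for value in row)
        groups[pattern].append(index)
    return dict(groups)


# Toy data (invented): columns 0-1 come from one source, columns 2-3 from
# another, so values go missing in whole blocks rather than at random cells.
nan = float("nan")
rows = [
    [35.0, 1.0, 0.42, 7.0],
    [52.0, 0.0, nan, nan],
    [41.0, 1.0, 0.13, 2.0],
    [nan, nan, 0.88, 5.0],
    [29.0, 0.0, nan, nan],
]

subsets = split_by_missing_pattern(rows)
for pattern, indices in subsets.items():
    print(pattern, indices)
```

Each resulting subset is complete on its own observed features, so under the scheme the abstract describes it could serve as one task in the joint feature learning of step (2). That step is not sketched here because the abstract does not give its objective function.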

Acknowledgements

This work was supported by the Science Foundation of the Ministry of Education of China (No. 18YJAZH038); the Hunan Provincial Science & Technology Major Project (No. 2018GK1020); and the National Natural Science Foundation of China (Nos. 71871090, 71850012).

Author information

Corresponding author

Correspondence to Qiujun Lan.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Lan, Q., Jiang, S. A method of credit evaluation modeling based on block-wise missing data. Appl Intell (2021). https://doi.org/10.1007/s10489-021-02225-5

Keywords

  • Missing data
  • Credit evaluation
  • Data mining
  • Multi-task learning