1 Introduction

Customer acquisition is crucial for businesses. In addition to acquiring new customers, companies also carry out a variety of customer retention activities; studies on retention are referred to as customer churn analysis. The purpose of this study is new customer acquisition. Marketing is one of the common application areas of data mining. The dataset used in this study was collected by the telemarketing method [1]. Data mining is mainly used for the following marketing activities [2]:

  • Determining the customer profile

  • Forecasting potential customers

  • Basket analysis

  • CRM (Customer Relationship Management)

  • Sales forecast

Many classification applications have been carried out on bank datasets. Bach and colleagues built classifiers using artificial neural networks [3]. Sumathi and Sivanandam used a variety of data mining methods to explore the relationships in the data [4]. Keramati et al. applied classification algorithms, namely decision trees, artificial neural networks, nearest neighbors, and support vector machines, to the data of a telecommunications company in Iran [5]. Nachev's artificial neural networks achieved 90% according to the accuracy performance criterion [6]. Kumari and colleagues studied decision trees and found that the performance value of the classifier model was 88.67% [7]. Elsalamony compared Artificial Neural Networks, the C4.5 Decision Tree, and Naive Bayes [8]. These studies focus on a single performance criterion.

Classification algorithms are preferred in studies conducted to predict potential customers. In this study, models were established with classification algorithms for the estimation of customer acquisition. Data mining is used for obtaining usable information from large datasets. A standard procedure should be followed against the problems that may arise because of the data size. For this purpose, processes such as KDD (Knowledge Discovery in Databases), SEMMA (Sample, Explore, Modify, Model, Assess), and CRISP-DM (Cross-Industry Standard Process for Data Mining) are used [9].

The SEMMA process was developed by a proprietary data analysis vendor [10]. CRISP-DM, in contrast, was introduced by an industry consortium and is preferred over the KDD and SEMMA methodologies [11]. In this study, the Cross-Industry Standard Process for Data Mining (CRISP-DM) model, which is widely used in data mining studies, was used. The CRISP-DM model consists of six steps:

  • Problem Definition

  • Data Understanding

  • Data Preparation

  • Model Building

  • Model Evaluation and Selection

  • Application of the Model

These six steps start with the problem definition step, which must be well defined for the analysis to be successful. In the next step, data understanding, the relationships between descriptive values and variables are examined graphically and with various methods, using summary information such as averages and frequencies of the variables in the dataset. In the data preparation step, operations such as handling missing data, treating outliers, and normalization are performed on the dataset. After these steps, the classification models are built in the model building step. The generated model is evaluated in the model evaluation and selection step. Figure 1 shows these six stages [12].

Fig. 1. CRISP Model.

In Sect. 2, data mining, the classification algorithms used in the study, and the performance evaluation criteria of these algorithms are given. In Sect. 3, the application is carried out using the CRISP-DM process. Finally, the performances of the classification models established in the study are presented in the conclusion and discussion in Sect. 4.

2 Methodology

In this section, data mining, classification algorithms and performance evaluation criteria of these algorithms will be given.

2.1 Data Mining

In its most general definition, data mining is obtaining usable information from complex data [13]; it is the process of extracting useful information from large amounts of data. The knowledge discovery performed on data by combining areas such as statistics, machine learning, and database management systems is called data mining [14,15,16].

2.2 Used Classification Algorithms

The classification algorithms used in this study are Adaptive Boosting (AdaBoostM1), Logistic Regression, k nearest neighbors (k-nn), Naive Bayes, Random Forest, and C4.5 Decision Tree. These algorithms are briefly described in this section.

Adaptive Boosting (AdaBoostM1) Classifier Algorithm.

The AdaBoost.M1 algorithm is used in binary classification [17]. Its aim is to create a more efficient classifier by combining weak learning algorithms. Previously misclassified samples are assigned larger weights so that subsequent weak learners focus on classifying them more accurately. This algorithm is a widely used boosting algorithm [18].
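As an illustration of this weight-update idea, the following is a minimal pure-Python sketch of AdaBoost.M1 with decision stumps as the weak learners. The stump learner and the one-dimensional toy labels (+1/−1) are illustrative assumptions; the study does not specify an implementation.

```python
import math

def stump_predict(threshold, polarity, x):
    # Weak learner: a decision stump on a single numeric feature.
    return polarity if x >= threshold else -polarity

def train_stump(X, y, w):
    # Choose the threshold/polarity pair with the lowest weighted error.
    best = (None, None, float("inf"))
    for threshold in X:
        for polarity in (1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if stump_predict(threshold, polarity, xi) != yi)
            if err < best[2]:
                best = (threshold, polarity, err)
    return best

def adaboost_m1(X, y, rounds=5):
    n = len(X)
    w = [1.0 / n] * n                       # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        threshold, polarity, err = train_stump(X, y, w)
        err = max(err, 1e-10)               # guard against division by zero
        if err >= 0.5:
            break                           # weak learner no better than chance
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Raise the weights of misclassified samples so the next weak
        # learner concentrates on them; lower the rest, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(threshold, polarity, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    # Final decision: weighted vote of the weak learners.
    score = sum(alpha * stump_predict(t, p, x) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1
```

Each round fits one stump to the reweighted data, so the ensemble is built sequentially rather than independently as in bagging.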

Logistic Regression Classifier Algorithm.

Logistic regression, developed by David Cox, is an algorithm used to analyze datasets in which one or more independent variables determine an outcome [19].
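A minimal sketch of fitting such a model, assuming batch gradient descent on the log-loss; the learning rate, epoch count, and toy data are illustrative choices, not taken from the study.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logreg(X, y, lr=0.5, epochs=2000):
    # Batch gradient descent on the log-loss; w[0] is the intercept.
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(w)
        for x, target in zip(X, y):
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            error = p - target              # gradient of log-loss w.r.t. logit
            grads[0] += error
            for j, xj in enumerate(x, start=1):
                grads[j] += error * xj
        w = [wi - lr * g / len(X) for wi, g in zip(w, grads)]
    return w

def predict_logreg(w, x):
    # Class 1 when the estimated probability reaches 0.5.
    return 1 if sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) >= 0.5 else 0
```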

k-Nearest Neighbor Classifier Algorithm.

The nearest neighbor algorithm is one of the instance-based learning algorithms, in which learning is carried out through the training data. To find out which class a new sample belongs to, its k closest training samples are determined using distance calculation methods, and the new sample is assigned to the class most common among these nearest neighbors [20].
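The procedure can be sketched in a few lines, assuming Euclidean distance as the distance calculation method (the study does not fix a metric):

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two feature vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train_X, train_y, query, k=3):
    # Find the k training samples closest to the query point...
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: euclidean(pair[0], query))[:k]
    # ...and assign the majority class among them.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

An odd k avoids ties in binary problems such as the bank dataset's yes/no target.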

Naive Bayes Classifier Algorithm.

The Naive Bayes algorithm is one of the statistical classification methods and is based on Bayes' theorem. It models the relationship between the target variable and the independent variables [21].
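For numeric attributes such as those in the bank dataset, a common variant assumes Gaussian class-conditional densities; the sketch below makes that assumption (the study does not state which variant it used).

```python
import math

def fit_gaussian_nb(X, y):
    # Estimate per-class priors and per-feature Gaussian parameters.
    model = {}
    for cls in set(y):
        rows = [x for x, label in zip(X, y) if label == cls]
        prior = len(rows) / len(X)
        params = []
        for feature in zip(*rows):
            mean = sum(feature) / len(feature)
            var = sum((v - mean) ** 2 for v in feature) / len(feature) + 1e-9
            params.append((mean, var))
        model[cls] = (prior, params)
    return model

def predict_nb(model, x):
    # Pick the class maximizing log prior + sum of log likelihoods
    # (the "naive" part: features are treated as independent).
    def log_posterior(cls):
        prior, params = model[cls]
        ll = math.log(prior)
        for v, (mean, var) in zip(x, params):
            ll += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        return ll
    return max(model, key=log_posterior)
```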

Random Forest Classifier Algorithm.

The Random Forest algorithm was developed by Leo Breiman. Instead of creating a single decision tree, it combines the decisions of many trees grown on different training sets. Each tree is built from a different training sample, and candidate attributes are evaluated at every level; the most frequently selected attribute is added to the tree, and this is repeated at all levels of the tree [14].
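A simplified sketch of the bagging-plus-random-attribute idea, using single-feature decision stumps in place of full trees to keep it short (the stump learner, tree count, and toy data are illustrative assumptions):

```python
import random
from collections import Counter

def train_stump(X, y, feature):
    # Best single-feature threshold split, scored by training accuracy.
    best = None
    for threshold in sorted({row[feature] for row in X}):
        for left, right in ((0, 1), (1, 0)):
            preds = [left if row[feature] < threshold else right for row in X]
            acc = sum(p == t for p, t in zip(preds, y))
            if best is None or acc > best[0]:
                best = (acc, feature, threshold, left, right)
    return best[1:]

def random_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample of the training data (sampling with replacement)...
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        # ...and a randomly chosen attribute for this tree.
        feature = rng.randrange(len(X[0]))
        forest.append(train_stump(Xb, yb, feature))
    return forest

def forest_predict(forest, x):
    # Majority vote over all trees.
    votes = Counter(left if x[feature] < threshold else right
                    for feature, threshold, left, right in forest)
    return votes.most_common(1)[0][0]
```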

C4.5 Decision Tree Classifier Algorithm.

The C4.5 Decision Tree algorithm was developed by Ross Quinlan and can be used with categorical or numerical datasets. It uses the gain ratio as its splitting criterion [22].
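The gain ratio normalizes information gain by the split information, penalizing attributes with many values. A small sketch for a categorical attribute (the helper names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a label sequence, in bits.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(values, labels):
    # C4.5 criterion: information gain of splitting `labels` by the
    # categorical `values`, divided by the split information.
    total = len(labels)
    subsets = {}
    for v, label in zip(values, labels):
        subsets.setdefault(v, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0
```

A perfectly predictive attribute scores 1.0; an uninformative one scores 0.0.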

2.3 Performance Criteria of Classification Models

In the study, the models built with classification algorithms were evaluated with the confusion matrix [24]. The confusion matrix is shown in Table 1. The performance criteria obtained from the confusion matrix are given below [23,24,25].

Table 1. Confusion matrix.

The Accuracy (ACC) and Error values of the classifier models are shown in Eqs. (1) and (2), respectively [23].

$$ \text{Accuracy}\;(\text{ACC}) = \frac{\text{TP} + \text{TN}}{\text{M}} $$
(1)
$$ \text{Error} = 1 - \text{Accuracy} $$
(2)

The Precision (Positive Predictive Value, PPV) and Sensitivity (True Positive Rate, TPR) values of the classifier models are shown in Eqs. (3) and (4), respectively [23].

$$ \text{Precision}\;(\text{PPV}) = \frac{\text{TP}}{\text{TP} + \text{FP}} $$
(3)
$$ \text{Sensitivity}\;(\text{TPR}) = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
(4)

The Specificity and F-measure values of the classifier models are given in Eqs. (5) and (6), respectively [23].

$$ \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} $$
(5)
$$ \text{F-measure} = \frac{2 \times \text{TPR} \times \text{PPV}}{\text{TPR} + \text{PPV}} $$
(6)
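All of these criteria follow directly from the four cells of the confusion matrix, as the short sketch below shows (M is the total number of classified samples; the function name is illustrative):

```python
def classification_metrics(tp, tn, fp, fn):
    # Performance criteria derived from the confusion matrix.
    m = tp + tn + fp + fn                  # M: total classified samples
    accuracy = (tp + tn) / m               # Eq. (1)
    precision = tp / (tp + fp)             # Eq. (3), PPV
    sensitivity = tp / (tp + fn)           # Eq. (4), TPR
    specificity = tn / (tn + fp)           # Eq. (5)
    f_measure = 2 * sensitivity * precision / (sensitivity + precision)  # Eq. (6)
    return {"accuracy": accuracy, "error": 1 - accuracy,
            "precision": precision, "sensitivity": sensitivity,
            "specificity": specificity, "f_measure": f_measure}
```

Because each criterion weights the four cells differently, different classifiers can rank best under different criteria, which is exactly what the study observes.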

The classifier models created in the study make estimates on the dataset. To measure their performance, the dataset is divided into training and test data. The holdout and k-fold cross validation methods are used for this purpose in the literature. Figure 2 shows holdout partitions with 60%–40%, 75%–25%, and 80%–20% training–test splits [26]. Figure 3 shows the 5-fold cross validation partitions [27].

Fig. 2. Holdout with test and training set partition.

Fig. 3. Cross validation with test and training set partition.
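The two partitioning schemes can be sketched as index splits (the ratios match those used in the study; the shuffling seed is an illustrative assumption):

```python
import random

def holdout_split(n, test_ratio, seed=0):
    # Shuffle the sample indices and cut once into train and test parts.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * (1 - test_ratio))
    return idx[:cut], idx[cut:]

def k_fold_splits(n, k, seed=0):
    # Yield (train_indices, test_indices) for each of the k folds;
    # every sample appears in exactly one test fold.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

With holdout a model is scored once on the held-out part; with k-fold cross validation it is scored k times and the scores are averaged, which is why the study compares the two.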

3 Application

In this study, models are established with classification algorithms on a dataset recording whether customers called during telemarketing became bank customers, in order to determine the classification algorithm that gives the best estimate for bank customer acquisition. Within the scope of this study, the CRISP-DM model is used to follow a systematic approach when building these models. The steps of the CRISP-DM process used for preparing the classification models are problem definition, data understanding, data preparation, and model building. The performances of the established classification models are presented in the conclusion section.

3.1 Problem Definition

Among the data mining methods, classification algorithms are used in customer acquisition studies to predict the potential customers of the company in question in the related industry. This study was carried out to determine the best estimator for bank customer acquisition by establishing models with classification algorithms on the bank customer dataset, obtained by the telemarketing method, from the UCI Machine Learning Repository [28].

3.2 Data Analysis

There are 17 attributes and 45211 customer records in the bank dataset. Table 2 shows the data types and characteristics of the dataset from the UCI.

Table 2. All variables, formats, and types related to the bank marketing dataset

The statistical analysis results of the dataset obtained with the R program are shown in Fig. 4. The figure shows the minimum, maximum, median, mean, and first and third quartile values of the numerical attributes, as well as the frequencies of the values of the categorical attributes. Figure 4 shows that categorical attributes such as education, contact, and outcome take unknown values in the dataset. The presence of such values is believed to affect the data analysis.

Fig. 4. Statistical results of the dataset.

3.3 Data Pre-processing

According to the statistical analysis, there was no missing data in the bank dataset (Fig. 4). For this reason, no pre-processing was performed to complete the missing data.

3.4 Application of Classification Models

In this study, models are established with various classification algorithms using the bank dataset to estimate potential bank customers. The classification algorithms used in the models are Adaptive Boosting (AdaBoostM1), k nearest neighbors (k-nn), Logistic Regression (LogReg), Naive Bayes (NB), Random Forest (RanFor), and C4.5 Decision Tree. To evaluate the performances of these models, four performance measures are used: Accuracy, Precision, Sensitivity, and F-measure.

4 Conclusion and Discussion

Estimation models are established using the Adaptive Boosting (AdaBoostM1), k nearest neighbors, Logistic Regression, Naive Bayes, Random Forest, and C4.5 Decision Tree classification algorithms with the bank marketing dataset. The performance of all models implemented within the scope of the study is tested using training and test datasets separated with the holdout and k-fold cross validation methods, with the aim of checking whether the two different evaluation approaches give similar results. The dataset was divided into training and test datasets using the holdout method with 60%–40%, 75%–25%, and 80%–20% separation ratios, as shown in Table 3.

Table 3. Accuracy, Precision, F measure values for holdout separation of the bank marketing dataset

In Table 3, the dataset is divided into training and test datasets using the holdout separation method with the 60%–40%, 75%–25%, and 80%–20% separation ratios to evaluate the performances of the classification algorithms. In these separations, the highest performance is achieved by the Random Forest algorithm according to the accuracy criterion. According to the precision criterion, however, the Naive Bayes algorithm yields the best results, whereas AdaBoostM1 is highest in sensitivity and Random Forest is highest in the F-measure criterion. No single classification model is superior in all four performance criteria because each criterion weights the cells of the confusion matrix differently.

In the k-fold cross validation, training and test datasets are separated with 5- and 10-fold cross validation, as shown in Table 4. The results obtained with 5-fold and 10-fold cross validation are similar to all holdout separation results. In both the 5-fold and 10-fold cross validation, the Random Forest algorithm had the best performance according to the accuracy criterion, similar to the holdout method. According to the precision criterion, the Naive Bayes algorithm gave the best results; AdaBoostM1 was best in sensitivity, and Random Forest was best in the F-measure. It is clearly observed that accuracy, the most commonly used performance criterion, is the most discriminative among the algorithms used.

Table 4. Accuracy, Precision, F measure values for 5- and 10-fold cross validation separation of the bank marketing dataset

As a result, it can be deduced that the established classifier models are consistent, since similar results are obtained with the two different separation methods. Since the performances are derived from the confusion matrix, the classifier models established within the scope of the study give different results for each performance criterion. As future work, various datasets could be tested under the same conditions to see whether the similar results obtained with the different training–test separations are unique to this dataset.

In the literature review conducted within the scope of the study, it was observed that the established models are generally evaluated according to a single performance criterion (accuracy). In this study, the models are evaluated with four different criteria: Accuracy (ACC), Precision (PPV), Sensitivity (TPR), and F-measure (F). The models generated by the C4.5 Decision Tree, Naive Bayes, k nearest neighbors, Logistic Regression, Random Forest, and Adaptive Boosting algorithms are evaluated on training and test data obtained by randomly dividing the dataset with the holdout method into 60%–40%, 75%–25%, and 80%–20% training–test splits. In addition to this random separation, k-fold cross validation was performed, and the results on the performance criteria are compared with each other.