1 Introduction

Every day, we face situations in which decisions must be taken in many areas. Some decisions are easy to make, while others are complex and difficult to address. Decision making is a key to success in any discipline, and for this reason we need mechanisms that guarantee sound and swift decisions. Multi-criteria decision analysis (MCDA), also called multi-criteria decision making (MCDM), has emerged as a branch of operations research aimed at facilitating the resolution of such problems. Multi-criteria methods can evaluate, rank, choose or reject a set of actions, and they can be exercised over many applications. MCDA is particularly based on the evaluation of a set of criteria using scores, values and intensities of preference. From the operations research point of view, there are two main schools of thought in multi-criteria decision support, known as the American and the European approach. The methods of the first family (the American school) are based on complete aggregation: the criteria are considered comparable and are combined into a mathematical form called a utility or aggregation function. Complete aggregation methods compute a score for each alternative; the decision maker uses these scores to assign confidence to actions and to rank the alternatives by importance. For example, given two actions, we can compare them by specifying the relative importance of one over the other. With complete aggregation, the ranking is clear and the parameters are easy to interpret. The main methods of this approach are MAUT (Multi-Attribute Utility Theory), AHP (Analytic Hierarchy Process), GP (Goal Programming), etc. The methods of the second family (the European, or Francophone, school) correspond to a constructive approach that builds binary outranking relations (generally neither complete nor transitive) from the preferences of the decision maker. Once all the actions have been compared in this way, a synthesis of the binary relations is developed to support the decision-making process. The main methods of this approach are ELECTRE, PROMETHEE, ORESTE, QUALIFLEX, etc.

Like all other methods, the MCDA model has limitations at several levels. On the one hand, the duration of the analysis is often one of the most limiting factors in reaching sound decisions in a multi-criteria assessment: MCDA methods rely on slow, iterative processes that may require long-term evaluation plans. On the other hand, the complexity of the mathematical aggregation increases the likelihood of reaching erroneous conclusions or leading the analysis into confusion. In this research, we developed an intelligent and creative approach to making the analysis analytic and automatic. We propose a practical and conceptual method to improve the performance of the MCDA model and to develop automatic multi-criteria decision analysis. We also tested our proposal on a practical application to which it is well suited, namely the classification of urban public transport operators in Tunisia.

2 Research Methodology

MCDA methods are used for relative comparisons between individuals. The comparison is done through a relational model of preference and aggregation (complete or partial). Aggregation is used to obtain a score for each individual, according to the profile of the decision makers, in order to produce a ranking. To improve the classical decision-making models, we introduced machine learning concepts to make decision making more sophisticated. The automatic classification technique we developed extracts relevant synthetic information and provides a solution that actually improves the performance of MCDA methods. For this purpose, we used two categories of statistical analysis: descriptive and predictive approaches. Descriptive methods focus on the analysis of a large set of data, while predictive methods aim at extracting information from a set of labeled data. Concretely, we used two classification techniques: unsupervised and supervised classification. Unsupervised classification is used to find two compact and well-separated groups in the data, thereby assigning a class label to each observation. Supervised classification is a machine learning task consisting of learning a prediction function from annotated examples. Our methodology comprises several steps. The first step extracts the scores through the MCDA method. The labeling step is handled by an unsupervised learning method, more specifically the ascending hierarchical classification (CAH) method. Supervised learning is then used to automate decision making whenever new individuals are added or existing data are updated. Automatic classification in this case predicts the label of a new observation from the previously learned model, as indicated in Fig. 1.

Fig. 1. The automatic decision method is applied to classify the alternatives; we used two machine learning techniques (unsupervised and supervised classification) to ensure this objective
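To make the pipeline of Fig. 1 concrete, the following minimal sketch chains the two learning steps under stated assumptions: the MCDA performance scores are replaced by a hypothetical random array, the CAH step is approximated by SciPy's Ward-linkage hierarchical clustering, and a naive Bayes classifier stands in for the supervised step. It is an illustration of the idea, not the exact implementation used in our application.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.naive_bayes import GaussianNB

# Hypothetical MCDA performance scores (117 observations, as in our application).
rng = np.random.default_rng(0)
scores = rng.random((117, 1))

# Step 1: the unsupervised step (CAH) groups the scores and produces class labels.
tree = linkage(scores, method="ward")
labels = fcluster(tree, t=2, criterion="maxclust")   # two classes: performing / not

# Step 2: the supervised step learns to reproduce the labels automatically.
model = GaussianNB().fit(scores, labels)

# A new observation is classified without re-running the whole MCDA procedure.
print(model.predict([[0.15], [0.85]]))
```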

Our proposal was constructed around a practical case used to test and enrich the methodology. We used the application developed by [1], which ranks the public transport operators in Tunisia on the basis of an MCDA method (see Fig. 2). The goal is not only to identify the operators that are not performing, but also to facilitate decision making. Indeed, the processing of information and the analysis of data help the public authority make good decisions and, more specifically, ease administrative tasks. When public transport decision makers in Tunisia decide to integrate a new attribute or to update the database, the managers do not need to repeat the MCDA procedure or redo the work. The role of classification is therefore to reproduce the decision automatically and to exclude any human intervention.

Fig. 2. The automatic decision consists of transforming performance scores obtained by the MCDA method into a classification

3 Unsupervised Classification: CAH

Unsupervised classification is a mathematical method of data analysis that groups observations into several distributions. Individuals grouped within the same class share a similar profile (intra-class homogeneity), while different classes have dissimilar profiles (inter-class heterogeneity). Clusters can be considered as classes or groups of similar entities, separated from other clusters with which they share no common features. In our case, we classified the public transport operators in Tunisia, so that each operator must belong to one of the two classes generated by the classification. We have a set of operators denoted X = {x1, x2, …, xN}, characterized by a set of descriptors (D). The objective of unsupervised classification is to find the groups K = {C1, C2, …, Ck} and to determine which cluster each elementary operator (x) belongs to. This amounts to determining a function (Y) that associates each element of (X) with one or more elements of (C). Several unsupervised classification algorithms address this data classification problem. Below, we consider two categories of unsupervised classification: ascending hierarchical classification (CAH) and non-hierarchical classification (K-means). We chose the CAH method for the following reasons: it offers a clear and simple approach that facilitates the structuring of information and gives high visibility in the area of multi-criteria analysis. Hierarchical classification is based on three principles:

The dendrogram:

the partitions of (X) made at each stage of the CAH algorithm can be visualized via a tree called a dendrogram. On one axis appear the individuals to be grouped, and on the other axis are indicated the dissimilarities corresponding to the different levels of grouping; this is represented graphically by means of branches and nodes. The dendrogram, or hierarchical tree, shows not only the links between classes but also the height of the branches, which indicates the level of proximity. This technique is based on the measurement of a distance between clusters, whose definition depends on the options selected and on the aggregation (linkage) method chosen.

The cut-off point of a dendrogram:

given the configuration of the dendrogram, a predefined number of clusters makes it possible to trace a break at a certain level of aggregation. This determines the number of classes retained for the subsequent steps. To select a partition of the population, one simply cuts the dendrogram at a certain height. An analysis of the shape of the dendrogram can give an indication of the number of classes to select [2].

Estimating the number of clusters (k):

in the CAH method, the number of classes is not necessarily known a priori. Different techniques exist; some of the most common are based on information criteria such as BIC (Bayesian Information Criterion) or AIC (Akaike Information Criterion). Homogeneity (intra-class distance) and separation (inter-class distance) are the quantities most commonly used for estimating the number of classes. The silhouette criterion is considered a relevant measure for assessing the quality of a partition, and we chose it as a concrete measure that accounts for both the homogeneity and the heterogeneity of classes. Let a(i) be the average of the dissimilarities (or distances) of observation (i) to all other observations within the same class: the smaller a(i), the better the assignment of (i). Let b(i) be the lowest average dissimilarity of observation (i) to the observations of any other class. The silhouette of the ith observation is then given by:

$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} = \left\{ {\begin{array}{*{20}l} {1 - a(i)/b(i),} \hfill & {\text{if } a(i) < b(i)} \hfill \\ {0,} \hfill & {\text{if } a(i) = b(i)} \hfill \\ {b(i)/a(i) - 1,} \hfill & {\text{if } a(i) > b(i)} \hfill \\ \end{array} } \right. $$
(1)

The silhouette of an element (i) lies in (−1, 1). If s(i) is close to 1, the observation is correctly grouped, with strong inter-class variability and low intra-class variability; if s(i) is close to −1, by the same logic, the observation would be better grouped with the neighboring class; if s(i) is close to 0, the observation lies on the boundary between two classes.
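As an illustration of Eq. (1), the sketch below computes the silhouette of a single observation directly from its definition; the data matrix X and the label vector are hypothetical NumPy arrays, and the Euclidean distance is used as the dissimilarity.

```python
import numpy as np

def silhouette(i, X, labels):
    """s(i) of Eq. (1) for observation i."""
    d = np.linalg.norm(X - X[i], axis=1)      # distances from i to every observation
    same = labels == labels[i]
    same[i] = False                           # exclude i from its own class
    a = d[same].mean()                        # mean intra-class dissimilarity a(i)
    b = min(d[labels == c].mean()             # lowest mean distance to another class b(i)
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

X = np.array([[0.10], [0.20], [0.15], [0.80], [0.90]])
labels = np.array([1, 1, 1, 2, 2])
print(silhouette(0, X, labels))               # 0.9: well grouped
```

In practice, sklearn.metrics.silhouette_score returns the average of s(i) over all observations, which is the kind of average value reported in the next paragraph.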

When running the ascending hierarchical classification, the dendrogram is displayed in the form of a tree: the level of similarity is measured along the vertical axis, and the various public transport operators are listed along the horizontal axis. The graph of the CAH method illustrates the dispersion (intra- and inter-class) between the observations and makes it possible to verify that the classes are sufficiently individualized. The merging procedure is repeated until all observations are fully merged. The dendrogram (Fig. 3) grouped the public transport operators in Tunisia into two classes. The first class, in red, includes the operators of Sfax, Kairouan, Beja, Tunis, Jendouba, Nabeul, Kef and Kasserine. The second class, in blue, includes the operators of Medenine, Sahel, Bizerte, Gabes and Gafsa. The reliability of the classification was studied through silhouette analysis. The silhouette analysis showed that the average value for a partition into two classes (K = 2, with mean s(i) = 0.6) best illustrates the dispersion of the data, which implies that the two-class classification provides group-level reliability. Individuals with large silhouette indices are well grouped within a strong distribution structure, and all silhouette indices are positive, as indicated in Fig. 4.

Fig. 3. The results of the CAH classification show two clusters: in blue, the operators with high performance; in red, the operators that are not performing

Fig. 4. The silhouette analysis indicates that the classification of public operators is well grouped, with a positive distribution.

4 Supervised Classification

In the supervised context, we already have examples whose data are associated with class labels, denoted K = {Cyes, Cno}. Supervised classification is used to assign a new observation to one of the available classes. Among the supervised methods, we consider k-nearest neighbors, decision trees and naive Bayes classifiers. In the rest of this section, we present these three supervised classification methods with detailed examples.

Decision tree:

The decision tree is a widely studied data mining method applied in the supervised classification domain for predicting a qualitative decision from variables of any type (qualitative and/or quantitative). The decision tree is based on a hierarchical representation that manages a sequence of tests to predict the resulting class. The possible classification decisions are located at the ends of the branches (the leaves of the tree). In some application areas, it is important to produce classification procedures that are understandable by the user; decision trees satisfy this constraint and graphically represent a set of well-designed, clearly interpretable rules. The operating principle is as follows: a decision tree is a graphical, tree-shaped representation of a classification procedure. At each node, we choose the variable that best separates the individuals according to the categories of the other variables; the evaluation criterion for this choice is the maximum information gain. Let (S) be a sample, and {S1, …, Sk} the partition of (S) according to the classes of the target attribute. Entropy is defined as follows:

$$ Ent(S) = - \sum\limits_{i = 1}^{k} {\frac{{\left| {S_{i} } \right|}}{\left| S \right|}} \times \log \left( {\frac{{\left| {S_{i} } \right|}}{\left| S \right|}} \right) $$
(2)

The information gain makes it possible to locally evaluate which attribute brings the most information about the outcome to be predicted. This function is expressed as follows:

$$ Gain_{Ent} (p,T) = Ent(S_{p} ) - \sum\limits_{j = 1}^{2} {P_{j} \times Ent(S_{pj} )} $$
(3)

These criteria compute a value for every attribute. The values are sorted, and the attributes are placed in the tree following this order, i.e., the attribute with the highest value (in the case of information gain) is placed at the root. The process stops automatically when the elements of a node all share the same value of the target variable.
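As a worked illustration of Eqs. (2) and (3), the sketch below computes the entropy of a sample of class labels and the information gain of one candidate binary split; the labels are hypothetical, and a base-2 logarithm is assumed (the text leaves the base of the logarithm unspecified).

```python
import numpy as np

def entropy(y):
    """Ent(S) of Eq. (2): entropy of the class labels of a sample."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, split):
    """Gain of Eq. (3) for a binary split given as a boolean mask."""
    n = len(y)
    left, right = y[split], y[~split]
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(y) - remainder

y = np.array(["Yes", "Yes", "No", "No", "Yes"])
split = np.array([True, True, False, False, True])  # a candidate test on one attribute
print(information_gain(y, split))                   # 0.971: this split separates the classes perfectly
```

The attribute whose test yields the highest gain would be placed at the root of the tree, and the procedure is repeated recursively on each branch.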

Naive Bayesian Classifier:

Naive Bayesian classification is a simple type of probabilistic classification based on Bayes' theorem with strong (so-called naive) independence assumptions. Depending on the nature of the probabilistic model, naive Bayesian classifiers can be trained effectively in a supervised learning context to classify a set of observations [3]. The algorithm first goes through a learning stage on training data before performing the classification. During this learning phase, the algorithm builds classification rules on the data set, which are then used for testing and prediction. Given a set of variables X = {x1, x2, …, xd}, we want to compute the posterior probability of the variable (Y) over a set of possible classes {c1, c2, …, cK}. In the usual terminology, (X) represents the predictors and (Y) the variable to predict (the attribute with K modalities). The Bayes rule is defined as follows:

$$ P(Y = c \mid X = x) = \frac{P(X = x \mid Y = c)\,P(Y = c)}{P(X = x)} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} $$
(4)

Using the Bayes rule above, we assign a new observation (x) to the class that has the highest posterior probability.

$$ \hat{y}(x) = c_{{j^{*} }} \Leftrightarrow j^{*} = \arg \mathop {\max }\limits_{j} P(Y = c_{j} ) \times P\left( {X = x \mid Y = c_{j} } \right) $$
(5)
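A minimal sketch of this decision rule, using scikit-learn's GaussianNB on hypothetical score data; assuming a Gaussian form for the class-conditional likelihood is our choice here, as the text does not fix the form of P(X | Y).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical training data: MCDA scores with the labels from the clustering step.
X = np.array([[0.82], [0.75], [0.91], [0.20], [0.35], [0.28]])
y = np.array(["Yes", "Yes", "Yes", "No", "No", "No"])

clf = GaussianNB().fit(X, y)

# Eq. (5): the predicted class maximizes the posterior probability.
print(clf.predict([[0.30]]))           # ['No']
print(clf.predict_proba([[0.30]]))     # posterior P(Y = c | X = x) for each class
```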

K nearest neighbor method (k-NN or k-ppv):

The k-nearest neighbor method (k-NN) is used to classify target points based on their distance from points in a learning sample. The k-ppv method is an automatic classification approach and a generalization of inductive classification methods. Its general principle is as follows: given a properly labeled learning base, the k-NN classifier determines the class of a new object (x) by assigning it the majority class among its nearest objects in the database. In this context, we have a learning base consisting of (N) input-output pairs. To estimate the output associated with a new input (x), the k-nearest neighbor method takes into account the (k) training samples whose inputs are closest to the new input (x) according to a predetermined distance; the similarity is generally based on the Euclidean distance. The algorithm thus makes a decision based on the search for one or more similar cases: it looks for the k nearest neighbors of the new case and predicts the most frequent answer, classifying the target points according to their distance from points in the learning base [4].
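A corresponding k-NN sketch on the same hypothetical data, using scikit-learn's KNeighborsClassifier, whose default metric is the Euclidean distance mentioned in the text:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.82], [0.75], [0.91], [0.20], [0.35], [0.28]])
y = np.array(["Yes", "Yes", "Yes", "No", "No", "No"])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[0.30]]))   # 'No': the majority class among the 3 nearest neighbors
```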

5 Classification Performance Indicators

The evaluation of classifiers is an unavoidable step in testing classification performance, and the results of a supervised classification must always be validated. Not only does this step verify that the model generalizes well, it also makes it possible to compare the results of several techniques and to favor the methods best suited to the MCDA application. Many statistical techniques exist for evaluating classifier performance; we present the confusion matrix because it is the most effective method applied in the field of data analysis [5]. In this regard, we constructed a prediction model based on three learning methods (decision tree, k-nearest neighbors and the naive Bayes approach) to classify the performance of public transport operators. From the three supervised classification methods, several ratios can measure classifier performance, such as recall, precision, error rate and F − 1. In our application, the MCDA method computed a performance score for each operator and each year from 2007 to 2015, yielding 117 observations.

We applied the automatic multi-criteria decision analysis method illustrated in Fig. 1; using the confusion matrix, we can identify the most suitable supervised learning method to apply directly in the MCDA procedure.

Confusion matrix:

it allows visualization of the performance of an algorithm. Each row of the matrix represents the instances of a predicted class, while each column represents the instances of an actual class (or vice versa). It is a special kind of contingency table with two dimensions (“actual” and “predicted”) and identical sets of classes in both dimensions (each combination of dimension and class is a variable in the contingency table). In this article, we used a confusion matrix to predict two classes: a “Yes” class indicating that a public transport operator performs well, and a “No” class indicating that it does not, as shown in Table 1. To measure reliability, it is customary to distinguish four types of classified elements:

Table 1. Confusion matrix
  • True positive TP: an element of the class “Yes” correctly predicted as “Yes”.

  • True negative TN: an element of the class “No” correctly predicted as “No”.

  • False positive FP: an element of the class “No” incorrectly predicted as “Yes”.

  • False negative FN: an element of the class “Yes” incorrectly predicted as “No”.

From the confusion matrix, we can derive four measures:

Precision: it expresses the proportion of the observations predicted as positive that are actually positive.

$$ {\text{Precision}} = \frac{TP}{TP + FP} $$
(6)

Recall: it measures the ability of the model to find all the relevant (positive) cases in the dataset.

$$ {\text{Recall}} = \frac{TP}{TP + FN} $$
(7)

The error rate: it estimates the probability of misclassification.

$$ {\text{Error rate}} = \frac{FN + FP}{\text{number of observations}} $$
(8)

F − 1: it is a measure of a test's accuracy that represents the harmonic mean of precision and recall; an F − 1 score reaches its best value at 1 (perfect precision and recall) and its worst at 0.

$$ F{-}1 = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}} $$
(9)
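As a sketch of how the four indicators of Eqs. (6)–(9) follow from the confusion matrix, using scikit-learn's confusion_matrix on hypothetical predictions (the actual operator results are summarized in Table 2, not reproduced here):

```python
from sklearn.metrics import confusion_matrix

y_true = ["Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No"]
y_pred = ["Yes", "Yes", "No",  "No", "Yes", "Yes", "No", "No"]

# Rows are actual classes, columns predicted classes; "No" is listed first.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["No", "Yes"]).ravel()

precision  = tp / (tp + fp)                            # Eq. (6)
recall     = tp / (tp + fn)                            # Eq. (7)
error_rate = (fn + fp) / len(y_true)                   # Eq. (8)
f1 = 2 * precision * recall / (precision + recall)     # Eq. (9)
print(precision, recall, error_rate, f1)               # 0.75 0.75 0.25 0.75
```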

The results demonstrated that the Bayesian classification provided the best predictive performance compared with the other techniques, whose effectiveness is nevertheless recognized. This method can classify and predict the performance of public transport operators with an error rate of 5%. The naive Bayesian classifier also showed a high precision of 97% compared with the other classifiers. A perfect classification system for public transport operators would provide 100% precision and recall. The k-NN classification is far from performant when compared with the other classification methods, with a low precision of 74% and a limited recall of 77%, while the Bayesian classification algorithm achieves a very high precision score and better performance in terms of recall (the Bayesian classification found 95% of the possible answers, compared with 94% for decision trees), as indicated in Table 2.

Table 2. Classification performance indicators

As we have seen, the naive Bayes classifier provides a useful assessment of several crucial problems and can be applied successfully in the MCDA application. Despite its relatively simplistic independence assumptions, the naive Bayesian classifier has several properties that make it very practical in real cases. In particular, the decoupling of the class-conditional probabilities of the different features means that each probability distribution can be estimated independently as a one-dimensional distribution. This avoids many problems stemming from the curse of dimensionality, which gives an immediate advantage in terms of computability [6]. The naive Bayes classifier works like the MCDA model: both methods are based on the independence of variables, which gives it high compatibility at the level of the automatic classification.

6 Conclusion

The processing of information and the analysis of data help the public authority make good decisions and, more specifically, facilitate administrative tasks. When public transport decision makers in Tunisia decide to integrate a new operator, the managers do not need to repeat the MCDA procedure or duplicate the work. The role of machine learning is therefore to reproduce an automatic classification decision on the database stored in the system and to exclude any human intervention. This is an interesting procedure that allows us to monitor the traceability of public transport operators and to identify failures via a permanent crisis management mechanism. This strategy makes it possible to distinguish between performing and non-performing operators and to find the operators that are closely grouped; it is also an intuitive way to compare performance between different operators. The use of machine learning and predictive analysis involves computing future trends and opportunities in order to make recommendations. In this context, the Bayes classifier proves in practice to be well suited to automatic decision making and has the advantage of being extremely efficient in terms of decision making [7]. Despite its simplistic design, this model shows, in many real situations, a predictive ability that is surprisingly superior to competing models such as decision trees and k-NN. Bayesian classification is also very easy to program; its implementation is straightforward, and the estimation of its parameters and the construction of the model are very fast on databases of small or medium size, whether in the number of variables or in the number of observations. The limit of our proposal is that the sample size is very small: 117 observations are not enough to produce a fully effective automated model [8]. The next step is to enrich the database, so that we are able to integrate other methods, such as Boosting, Random Forests, Artificial Neural Networks and Support Vector Machines, to operate an automatic decision.