1 Introduction

Missing data is a common problem in machine learning and data mining. Let \(\mathbf {x}_k\) be the \(k^{th}\) data point (datum) with p features, \(\mathbf {x}_k \in \mathbf {X} \subseteq {\mathbb {R}^p}\), where \(\mathbf {X}\) is the data set. If \(\mathbf {x}_k\) for some k has \(q \in \{ 1,2,\cdots ,p\}\) missing values, then \(\mathbf {X}\) is called an incomplete dataset. Many real-life systems [3, 9, 11, 13, 15] are affected by missing data [6]. Missing data are typically categorized into three types [12]: (1) MCAR (missing completely at random), (2) MAR (missing at random), and (3) NMAR (not missing at random). Most missing value prediction methods target the MCAR and MAR types of missing data.

The simplest way to deal with MCAR and MAR types of missing data is to analyze [12, 20] only the data points without any missing value. Here, each data point with one or more missing values is deleted from the dataset and the rest are analyzed. This procedure, however, is effective only when a small number of instances have missing values.

The second family of methods for handling missing data first predicts (imputes) the missing values and then analyzes the entire data set. In [6], the authors categorized imputation techniques into two groups: statistical imputation methods and machine learning based imputation methods. In statistical imputation methods, the missing values are replaced using standard statistical techniques [1, 12, 20]. For example, the missing value of a particular feature can be replaced by the average of that feature over the data points without any missing value (mean imputation). Sometimes, missing values are predicted by a regression model; if k features have missing values, then k regression models are needed. Cold and hot deck imputation [10] is another statistical technique where a missing value is imputed using the feature value of the closest complete data point, determined based on the existing features from the same (sometimes a different) data source. In multiple imputation, a missing value is replaced by a set of possible values. Thus, the missing values are imputed multiple times and multiple datasets are created. These imputed datasets are then analyzed using standard procedures and the results are combined to produce the final imputation.

Machine learning based procedures are also used to impute missing data. For example, k-nearest neighbor imputation (k-NNI) [4] is a common procedure where a missing value is replaced by the corresponding value of the nearest neighbor; here, distances are computed in the observed subspace. In [16], the k-NN rule is modified so that the distance-weighted k-nearest neighbor rule is used to classify data with missing values. In singular value decomposition imputation (SVDI) [4], missing values are imputed using the k most significant eigenvectors. Missing values are also imputed using other machine learning techniques such as self-organizing maps (SOMs) and multilayer perceptrons (MLPs) [5, 7, 8, 17, 19, 23,24,25].

Evidential reasoning is an effective approach to dealing with different types of uncertainty. It has been used in many areas including classification, clustering, and decision-making. Recently, evidential reasoning has been used in [14] to classify incomplete data. There, a prototype is first formed for each class. Then the incomplete data are imputed using all prototypes, so if r is the total number of classes, r complete data points are generated for each incomplete data point. Each new data point is classified using a classifier, and the classification results are fused by a new prototype-based credal classification (PCC) method to find the right class label for the incomplete data point. After imputing missing values, in many cases a test data point may provide evidence of belonging to more than one class. Considering this, the authors of [14] defined two types of errors: the misclassification error, and the assignment of a test point to a meta-class, i.e., a set of classes that includes the actual class of the point.

In this article, we propose an evidential reasoning-based classification technique for the MCAR type of missing data. First, an autoencoder is trained using the complete data set and then retrained with an augmented complete data set. If there are r classes, r classifiers are trained using the latent space representation of the complete data (we use SVMs, as they are effective classifiers). When a test point with missing values appears, we first impute it using our proposed method. We then take the latent space representation of the test data point and classify it using the r trained classifiers. Thus, from each of the r classifiers, we get a probabilistic output for the test point. We combine these probabilistic outputs using the proposed evidential framework. We compute classification errors as in [14]. We have conducted three experiments to check the performance of our algorithm, comparing the results of the proposed algorithm with four state-of-the-art techniques. Our results reveal that the proposed method performs better than the other methods in most cases.

The main contributions of the proposed method are threefold. First, the traditional 1-nearest-neighbor scheme for imputing missing data is modified. Second, the entire dataset is projected into a suitable latent space. Third, we propose a new mass function to combine the r probabilistic outputs from the r SVMs.

The remaining part of the article is organized as follows. Basic evidential reasoning is described in Sect. 2. Section 3 describes the proposed method. Experimental results and analysis are demonstrated in Sect. 4. Section 5 concludes the paper.

2 Basics of Evidential Reasoning

Consider an r class classification problem where any input \(\mathbf {x} \in {\mathbb {R}^p}\) belongs to one of the r classes. Let \(\varOmega \) be the set of classes; \(\varOmega = \{\omega _1, \omega _2,\cdots , \omega _r\}\). Due to imprecision or other uncertainty associated with the input, the available evidence may suggest that a data point belongs to more than one class. So, a test point may be assigned to any element of the power set \(2^{\varOmega }\). For example, let there be three classes, \(\varOmega = \{\omega _1,\omega _2,\omega _3 \}\). A test point may then belong to any one of the \(2^\varOmega \) possibilities: \(2^\varOmega = \{ \emptyset , \{\omega _1\}, \{\omega _2\}, \{\omega _3\}, \{\omega _1, \omega _2\}, \{\omega _1, \omega _3\}, \{\omega _2, \omega _3\}, \{\omega _1, \omega _2, \omega _3\}\}\). Thus, a test point may belong to a meta-class (a set of more than one class) or to no class.
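To make the space of candidate assignments concrete, the following minimal Python sketch (an illustration, not part of the proposed method; the helper name is ours) enumerates \(2^\varOmega \) for a three-class frame:

```python
from itertools import chain, combinations

def power_set(frame):
    """Enumerate all subsets of the frame of discernment, from the
    empty set up to the full frame (2^|frame| subsets in total)."""
    elems = sorted(frame)
    return [frozenset(s) for s in
            chain.from_iterable(combinations(elems, n)
                                for n in range(len(elems) + 1))]

# Three-class example: 2^3 = 8 candidate (meta-)classes.
print(power_set({"w1", "w2", "w3"}))
```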

In evidential reasoning, a basic belief assignment (BBA) is a function \(m(\cdot ):\) \(2^\varOmega \rightarrow [0,1]\) satisfying two properties: \(\sum _{A\in 2^\varOmega } m(A)=1\) and \(m(\emptyset )=0\). The belief function \(Bel(\cdot )\) and the plausibility function \(Pl(\cdot )\) are the lower and upper bounds of the imprecise probability associated with a BBA, \(\forall A\in 2^\varOmega \), and are defined as

$$\begin{aligned} \textstyle Bel(A) = \sum _{B\subseteq A} m(B); \textstyle Pl(A) = \sum _{B \cap A \ne \emptyset } m(B). \end{aligned}$$
(1)

\(Bel(\cdot )\) and \(Pl(\cdot )\) may be used for decision-making when necessary.
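As an illustration of (1), the following sketch (our own; the BBA is stored as a dict mapping frozensets to masses, and the helper names are hypothetical) computes Bel and Pl:

```python
def bel(m, A):
    """Belief of A: total mass committed to subsets of A (Eq. 1)."""
    return sum(mass for B, mass in m.items() if B <= A)

def pl(m, A):
    """Plausibility of A: total mass of sets intersecting A (Eq. 1)."""
    return sum(mass for B, mass in m.items() if B & A)

# BBA over a three-class frame; masses of focal elements sum to 1.
m = {frozenset({"w1"}): 0.5,
     frozenset({"w1", "w2"}): 0.3,
     frozenset({"w3"}): 0.2}
A = frozenset({"w1"})
print(bel(m, A), pl(m, A))   # 0.5 and 0.8
```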

Dempster [22] proposed a rule to combine multiple pieces of evidence represented by BBAs. It is usually known as the Dempster–Shafer (DS) rule and is denoted by the \(\oplus \) symbol. Let \(m_1(\cdot )\) and \(m_2(\cdot )\) be two BBAs over \(2^\varOmega \); then the combined basic probability assignment (BPA) \(m=m_1 \oplus m_2\) is defined by the DS rule as follows:

$$\begin{aligned} \textstyle m(A)=[m_1 \oplus m_2] (A) = \frac{\sum _{B \cap C =A} m_1(B)m_2(C)}{1-\sum _{B \cap C =\emptyset } m_1(B)m_2(C)} \end{aligned}$$
(2)

The denominator of (2) is used for normalization, where \(\displaystyle \sum _{B \cap C =\emptyset } m_1(B)m_2(C)\) measures the total conflicting belief mass. This combination rule does not produce intuitively desirable results in highly conflicting cases. To overcome this, many other combination rules have been introduced [21], which we do not consider in our investigation.
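A direct transcription of (2) might look like the following Python sketch (the function name is ours); it fuses two BBAs and normalizes by one minus the total conflict:

```python
def ds_combine(m1, m2):
    """Dempster-Shafer combination of two BBAs (Eq. 2). BBAs are dicts
    mapping frozenset focal elements to masses."""
    combined, conflict = {}, 0.0
    for B, mb in m1.items():
        for C, mc in m2.items():
            inter = B & C
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mb * mc
            else:
                conflict += mb * mc      # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("Totally conflicting evidence; DS rule undefined.")
    return {A: mass / (1.0 - conflict) for A, mass in combined.items()}
```

Since \(\oplus \) is associative, r BPAs can be fused by folding `ds_combine` over them, which is how the rule is applied in Sect. 3.3.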

3 Proposed Algorithm

Here, we use an autoencoder for feature extraction. First, we use the original and the modified data sets (described later) to train the autoencoder. Then, we use the extracted features from the autoencoder to train class-wise (one-versus-all) SVMs. We use the trained class-wise SVMs to obtain probabilistic outputs for a test point. From these probabilistic outputs we construct mass functions for the test point, which we then use to find its class label. Below, we discuss the steps in further detail.

3.1 The Architecture of the Autoencoder

We consider linear nodes in the input layer, described as follows [26]:

$$\begin{aligned} \textstyle \mathcal {S}\left( x_{ki}\right)&=x_{ki}; i=1,2,\cdots ,p;\end{aligned}$$
(3)
$$\begin{aligned} \textstyle \mathcal {S}\left( x_{k0}\right)&=1;\forall k. \end{aligned}$$
(4)

Here, \(\mathcal {S}(\cdot )\) denotes the activation function of a node; and \(\mathbf {x}_k\) and p are as defined earlier. We consider a single hidden layer with sigmoidal nodes as follows:

$$\begin{aligned} \textstyle z_{kh}&=\sum _{i=0}^{p}w_{ih}^{I}\mathcal {S}\left( x_{ki}\right) ;h=1,2,\cdots ,q;\end{aligned}$$
(5)
$$\begin{aligned} \textstyle \widetilde{z}_{kh}=\mathcal {S}\left( z_{kh}\right)&=\frac{1}{1+e^{-z_{kh}}};h=1,2,\cdots ,q; \end{aligned}$$
(6)
$$\begin{aligned} \mathcal {S}\left( z_{k0}\right)&=1, \forall k. \end{aligned}$$
(7)

Here, for \(\mathbf {x}_k\), \(z_{kh}\) is the net input to the \(h^{th}\) hidden node and \(\mathcal {S}\left( z_{kh}\right) \) is the output from the \(h^{th}\) hidden node. Moreover, \(w_{ih}^{I}\) is the weight connecting the \(i^{th}\) input node to the \(h^{th}\) hidden node and \(w_{0h}^{I}, \forall h,\) is a bias. We consider an output layer with sigmoidal nodes as follows:

$$\begin{aligned} \textstyle y_{kj}&=\sum _{h=0}^{q}w_{hj}^{H}\widetilde{z}_{kh};j=1,2,\cdots ,p;\end{aligned}$$
(8)
$$\begin{aligned} \textstyle \mathcal {S}\left( y_{kj}\right)&=\frac{1}{1+e^{-y_{kj}}};j=1,2,\cdots ,p. \end{aligned}$$
(9)

Here, for \(\mathbf {x}_k\), \(y_{kj}\) is the net input to the \(j^{th}\) output node and \(\mathcal {S}\left( y_{kj}\right) \) is the output from the \(j^{th}\) output node. Furthermore, \(w_{hj}^{H}\) is the weight connecting the \(h^{th}\) hidden node to the \(j^{th}\) output node and \(w_{0j}^{H}, \forall j,\) is a bias. The system error for the \(k^{th}\) training pattern is as follows:

$$\begin{aligned} \textstyle E^k=\frac{1}{2} \sum _{j=1}^{p} \left( x_{kj}-\mathcal {S}\left( y_{kj}\right) \right) ^2. \end{aligned}$$
(10)
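To make Eqs. (3)-(10) concrete, the following NumPy sketch (our own; the variable names, shapes, and random initialization are illustrative assumptions) performs one forward pass and computes the system error for a single pattern:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W_I, W_H):
    """One forward pass of the autoencoder for a single pattern x.
    W_I is (p+1) x q and W_H is (q+1) x p; row 0 of each weight matrix
    holds the bias, matching the S(x_k0) = S(z_k0) = 1 terms."""
    s_x = np.concatenate(([1.0], x))         # linear input layer, Eqs. (3)-(4)
    z_tilde = sigmoid(s_x @ W_I)             # hidden activations, Eqs. (5)-(6)
    s_z = np.concatenate(([1.0], z_tilde))   # hidden bias node, Eq. (7)
    y = sigmoid(s_z @ W_H)                   # reconstruction, Eqs. (8)-(9)
    error = 0.5 * np.sum((x - y) ** 2)       # system error, Eq. (10)
    return z_tilde, y, error

# Shapes only: p = 4 inputs, q = 40 hidden nodes (Sect. 4.1 uses q = 10p).
rng = np.random.default_rng(0)
W_I = rng.normal(size=(5, 40)) * 0.1
W_H = rng.normal(size=(41, 4)) * 0.1
z, y, e = forward(rng.normal(size=4), W_I, W_H)
```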

3.2 The Learning

Since we are dealing with a classification problem, for every \(\mathbf {x} \in \mathbb {R}^p\), there is an associated class label \(l\in \{1,2,\cdots ,r\}\).

Let \(X_{TR}\) be the training data set with n instances and no missing values. \(X_{TR}\) contains data points from all r classes; \(X_{TR}= \cup _{i = 1} ^ r \hat{X}_i\), where \(\hat{X}_i = \{ {\mathbf {x}} | {\mathbf {x}} \in {X}_{TR} \text { and }{\mathbf {x}} \text { belongs to the } i^{th} \text { class} \}\). We find the \(i^{th}\) class center \(\mathbf {\widetilde{v}}_i\) as

$$\begin{aligned} \begin{array}{l@{}l} \textstyle \mathbf {\widetilde{v}}_i = \frac{1}{|\hat{X}_i|}\sum _{\mathbf {x} \in \hat{{X}_i}} \mathbf {x}. \end{array} \end{aligned}$$
(11)

In this way, we produce r class centers \(\widetilde{V}=\left\{ \mathbf {\widetilde{v}}_1,\mathbf {\widetilde{v}}_2,\cdots ,\mathbf {\widetilde{v}}_{r}\right\} ; \mathbf {\widetilde{v}}_i \in \mathbb {R}^p\). We also cluster each \(\hat{X}_i; i=1,2,\cdots ,r\); into \(n_c\) clusters using the k-means algorithm, producing \((n_c \times r)\) cluster centers \(\hat{V}=\left\{ \mathbf {\hat{v}}_1,\mathbf {\hat{v}}_2,\cdots ,\mathbf {\hat{v}}_{(n_c \times r)}\right\} ; \mathbf {\hat{v}}_i\in \mathbb {R}^p\). Then, by combining \(\hat{V}\) and \(\widetilde{V}\), we obtain \(V= \hat{V} \cup \widetilde{V}\), a set of \((r + n_c \times r)\) centroids \(V=\left\{ \mathbf {v}_1,\mathbf {v}_2,\cdots ,\mathbf {v}_{(r + n_c \times r)}\right\} ; \mathbf {v}_i\in \mathbb {R}^p\). Now from \({X}_{TR}\) we generate p modified data sets \({X}^k, k\in \{1,2,\cdots ,p\}\), as follows. To generate \({X}^k\), for each \(\mathbf {x}\in {X}_{TR}\), we replace \(x_{k}\) by \(v_{lk}\), where \(l = argmin_{j} \left\{ || \mathbf {x}-\mathbf {v}_j ||^2_{*} \right\} \). Here, we pretend that the \(k^{th}\) feature is missing and impute it by the \(k^{th}\) feature value of the centroid closest to \(\mathbf {x}\) in terms of the distance measure \(||.||_{*}\), which is computed using all but the \(k^{th}\) feature. Now, we train an autoencoder using \({X}_{TR}\) for \(N_1\) epochs. Next, we retrain the same network for \(N_2\) epochs with the data set \(X^{Total}={X_{TR}}\cup \{{X}^1 \cup {X}^2 \cup \cdots \cup {X}^p\}\). For any \(\mathbf {x}\in {X}_{TR}\), as well as for any modified point in \({X}^k\), the target vector is the corresponding original \(\mathbf {x} \in {X}_{TR}\). Note that the training of the autoencoder does not use the class labels. We explain this method of training in Algorithm 1.

[Algorithm 1: training of the autoencoder with the original and modified data sets]
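A sketch of the centroid construction and the augmentation step is given below (our own, assuming scikit-learn's KMeans; the function names are illustrative). It builds the centroid set V and generates the p modified data sets using the masked distance \(||.||_{*}\):

```python
import numpy as np
from sklearn.cluster import KMeans

def make_centers(X, labels, n_c):
    """Build V: one mean per class (Eq. 11) plus n_c k-means centers
    per class, giving r + n_c*r centroids in total."""
    centers = []
    for c in np.unique(labels):
        Xc = X[labels == c]
        centers.append(Xc.mean(axis=0))                  # class center
        km = KMeans(n_clusters=n_c, n_init=10).fit(Xc)   # cluster centers
        centers.extend(km.cluster_centers_)
    return np.vstack(centers)

def nearest_center(x, V, missing):
    """Index of the centroid closest to x in the observed subspace;
    ||.||_* ignores the coordinates listed in `missing`."""
    obs = np.setdiff1d(np.arange(len(x)), missing)
    d = np.sum((V[:, obs] - x[obs]) ** 2, axis=1)
    return int(np.argmin(d))

def augmented_sets(X, V):
    """Generate the p modified data sets X^1,...,X^p: in X^k the k-th
    feature of every point is replaced by that of its nearest centroid."""
    sets = []
    for k in range(X.shape[1]):
        Xk = X.copy()
        for i, x in enumerate(X):
            Xk[i, k] = V[nearest_center(x, V, [k]), k]
        sets.append(Xk)
    return sets
```

The same `nearest_center` rule is reused in Sect. 3.3 to impute the missing features of a test point.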

After training the autoencoder, we pass each \(\mathbf {x}_k\in X_{TR}\), \(k \in \{ 1,2,\cdots ,n\}\), through the network to find the latent space representation \(\mathbf {\widetilde{z}}_{k} = ( \widetilde{z}_{k1},\widetilde{z}_{k2},\cdots ,\widetilde{z}_{kq})^T\) of \(\mathbf {x}_k\) using (6). This is simply the output of the hidden layer of the trained autoencoder. Thus, we obtain the latent space representation of the entire training data set \(X_{TR}\) as \(\widetilde{Z}=\{ \mathbf {\widetilde{z}}_{1},\mathbf {\widetilde{z}}_{2},\cdots ,\mathbf {\widetilde{z}}_{n}\}\). We use \(\widetilde{Z}\) to train r SVMs using the one-versus-all strategy.
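The class-wise SVMs can then be trained on \(\widetilde{Z}\) as sketched below (using scikit-learn's SVC; the default RBF kernel and the helper name are our assumptions, and probability=True yields the Platt-scaled probabilistic outputs of [18]):

```python
from sklearn.svm import SVC

def train_class_wise_svms(Z, labels, r):
    """Train r one-versus-all SVMs on the latent representations Z.
    Class labels are assumed to be integers 0..r-1."""
    svms = []
    for c in range(r):
        y_bin = (labels == c).astype(int)   # class c vs. the rest
        svms.append(SVC(probability=True).fit(Z, y_bin))
    return svms
```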

3.3 Decision Making for a Test Point

Let the set of test points be \(X_{TE}\). Each test data point \(\mathbf {x}_i\in X_{TE}\), \(i \in \{ 1,2,\cdots ,N\}; N=|{X_{TE}}|\), may have at most \((p-1)\) missing features. We impute the missing values of \(\mathbf {x}_i\) using \(\mathbf {v}_l\), (\(l=1,2,\cdots ,{(r + n_c \times r)}\)) as follows. The \(k^{th}\) missing value of \(\mathbf {x}_i\), \(x_{ik}\), is imputed by \({v}_{lk}\), where \(l = argmin_{j} \left\{ || \mathbf {x}_i-\mathbf {v}_j ||^2_{*} \right\} \). Here \(||.||_{*}\) is computed using all but the missing feature values of \(\mathbf {x}_i\). We pass \(\mathbf {x}_i\) with the imputed values through the trained autoencoder and take the output of the hidden layer, \(\mathbf {\widetilde{z}}_{i} = (\widetilde{z}_{i1},\widetilde{z}_{i2},\cdots ,\widetilde{z}_{iq})^T\), as the latent space representation using (6). Then we feed \(\mathbf {\widetilde{z}}_{i}\) to the r trained SVMs. Let the probabilistic outputs [18] from the \(k^{th}\) SVM for the \(k^{th}\) class and for the remaining \((r-1)\) classes be \(P^1_{ik}\) and \(P^0_{ik}\), respectively. Then, we define the \(k^{th}\) BPA as follows:

$$m_k(\{ k\})=P^1_{ik}~\mathrm {and}~m_k(\varOmega \setminus \{ k\})=P^0_{ik}; \quad k=1,2,\cdots ,r.$$

We now use Dempster's rule [22] to combine these r BPAs, as discussed in Sect. 2, to obtain the composite BPA \(m(\cdot )\).
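Putting the pieces together, a sketch of the BPA construction and fusion for one test point might look as follows (reusing `ds_combine` from Sect. 2; the names are ours, and classes are indexed 1..r):

```python
def combine_svm_outputs(svms, z):
    """Turn the r probabilistic SVM outputs for a latent test point z
    into r BPAs and fuse them with Dempster's rule."""
    r = len(svms)
    frame = frozenset(range(1, r + 1))
    m = None
    for k, svm in enumerate(svms, start=1):
        # predict_proba of a 0/1-trained SVC returns [P^0_ik, P^1_ik].
        p0, p1 = svm.predict_proba(z.reshape(1, -1))[0]
        m_k = {frozenset({k}): p1, frame - {k}: p0}   # BPA of Sect. 3.3
        m = m_k if m is None else ds_combine(m, m_k)
    return m
```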

Now for class k, we compute the Pignistic probability [2] as follows:

$$\begin{aligned} \textstyle P_{m,\mathbf {x}} \{k\} =\sum _{A \subseteq \varOmega , k\in A } {m(A)}/{|A|} \end{aligned}$$
(12)

Thus, we have a set of Pignistic probabilities \(\mathcal {P}=\{P_{m,\mathbf {x}}\{1\},P_{m,\mathbf {x}}\{2\},\cdots , P_{m,\mathbf {x}}\{r\}\}\). Let \(l=argmax_i\{P_{m,\mathbf {x}} \{i\}; i=1,2,\cdots ,r\}\) and \(d=argmax_i\{P_{m,\mathbf {x}} \{i\}; i= 1,2,\cdots ,r; i \ne l\}\). Now, if \((P_{m,\mathbf {x}}\{l\}-P_{m,\mathbf {x}}\{d\})>\epsilon \), we decide that \(\mathbf {x}\) belongs to class l. Otherwise, we say that \(\mathbf {x}\) belongs to both classes l and d. Here, \(\epsilon \) is a user-defined threshold.
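The final decision step can be sketched as follows (helper names ours), computing (12) and applying the \(\epsilon \)-threshold rule:

```python
def pignistic(m, k):
    """Pignistic probability of class k (Eq. 12): each focal element
    containing k contributes its mass divided by its cardinality."""
    return sum(mass / len(A) for A, mass in m.items() if k in A)

def decide(m, r, eps):
    """Return the single top class if the two largest pignistic
    probabilities differ by more than eps; otherwise the meta-class
    formed by the top two classes."""
    P = sorted(((pignistic(m, k), k) for k in range(1, r + 1)), reverse=True)
    (p_l, l), (p_d, d) = P[0], P[1]
    return {l} if p_l - p_d > eps else {l, d}
```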

4 Experiments

As in [14], to test the effectiveness of the proposed method, we divide our experiments into three parts and compare our method with four methods. For this comparison, we use the misclassification error as defined in [14] and the results reported in [14].

4.1 Experimental Setup

Based on a few trial experiments, for all datasets, we use the following architecture for the autoencoder: \((p)-(10\times p)-(p)\), where p is the number of features in the dataset. We use the learning rate \(\eta =0.9\) and the number of clusters in each class \(n_c=\left( r\times 5\right) \). Based on a few trial-and-error experiments over all datasets, we choose \(N_1=N_2=10000\). We repeat the process 10 times with 10 different weight initializations and report the average results. We compare the proposed method with four missing-data handling methods, each paired with two classifiers: prototype-based credal classification (PCC) with evidential k-nearest neighbors (EK-NN), PCC with evidential neural network (ENN), k-NN imputation (KNNI) with EK-NN, KNNI with ENN, FCM imputation (FCMI) with EK-NN, FCMI with ENN, mean imputation (MI) with EK-NN, and MI with ENN. A detailed description of these methods can be found in [14].

4.2 Experiment 1

As in experiment 1 of [14], we consider a three-class data set. Each class contains 305 training and 305 test samples drawn from inside a circular disk. The radius of each circle is 3 units, and the centers of the three circles are \(c_1 = (3, 3)^T\), \(c_2 = (13, 3)^T\) and \(c_3 = (8, 8)^T\). As in [14], the values in the second dimension (the y-coordinate) of the test samples are all missing. Thus, there is only one known value for each test sample, namely the first dimension (the x-coordinate). As in [14], we use two meta-class selection thresholds, \(\epsilon = 0.30\) and \(\epsilon = 0.45\), to show their influence on the results. Figure 1 shows the classification results of our method with \(\epsilon = 0.45\). In Table 1, we report the results with \(\epsilon = 0.30\) and \(\epsilon = 0.45\) for our method and PCC. These results show that both our method and PCC obtain their best results with \(\epsilon = 0.45\), which establishes the importance of choosing the threshold. The table also shows the superior performance of the proposed method compared to FCMI with EK-NN, KNNI with EK-NN, and MI with EK-NN.

Table 1. Misclassification error (%) for different methods on the three-class synthetic dataset associated with Experiment 1
Fig. 1. Misclassification results of our method on the three-class data set

4.3 Experiment 2

In this experiment, we use a four-dimensional synthetic dataset generated from three Gaussian distributions with the following characteristics. The means of the three classes are \( (1, 5, 10, 10)^T\), \((10, 3, 2, 1)^T\), and \((15, 15, 1, 15)^T\). The covariance matrices for the three classes are \(6 \cdot I\), \(5 \cdot I\), and \(7 \cdot I\), respectively, where I is the \(4 \times 4\) identity matrix. Similar to experiment 3 in [14], we use two datasets: the first has 100 points from each class and the second has 200 points from each class. We consider three cases of missing values, in which exactly 1, 2, and 3 values are missing at random. Thus, we have six cases: (100, 1), (100, 2), (100, 3), (200, 1), (200, 2), and (200, 3), where the first integer of each tuple indicates the number of points in each class and the second the number of missing values in each sample. For the proposed method, we use \(\epsilon = 0.30\). Following this process, we generate ten datasets for each of the six cases and repeat the learning process ten times for each of them. We report the average misclassification errors in Table 2, which shows that the proposed method performs better than the other methods.
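For reproducibility, one way such data can be generated is sketched below (our own construction following the stated parameters; the seed and variable names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
means = np.array([[1, 5, 10, 10], [10, 3, 2, 1], [15, 15, 1, 15]], float)
scales = [6.0, 5.0, 7.0]           # covariances 6I, 5I and 7I
n_per_class = 100                   # 200 for the second data set

X = np.vstack([rng.multivariate_normal(mu, s * np.eye(4), n_per_class)
               for mu, s in zip(means, scales)])
y = np.repeat(np.arange(3), n_per_class)

# Delete q in {1, 2, 3} values at random positions of each sample (MCAR).
q = 1
mask = np.zeros_like(X, dtype=bool)
for i in range(len(X)):
    mask[i, rng.choice(4, size=q, replace=False)] = True
X_missing = np.where(mask, np.nan, X)
```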

Table 2. Misclassification error (%) for different methods on the three-class synthetic datasets associated with Experiment 2
Table 3. Data sets description
Table 4. Misclassification error (%) considering different number of missing values for different methods on the four real-world datasets associated with Experiment 3

4.4 Experiment 3

Following experiment 4 in [14], in this experiment we use the same four real data sets, summarized in Table 3. Moreover, following [14], we perform two-fold cross-validation and consider the same numbers of missing values for the different datasets; the last column of Table 3 lists these numbers. As in Experiment 2, for the proposed method we use \(\epsilon = 0.30\). We repeat the experiment ten times and report the average classification errors in Table 4. Table 4 reveals that our method performs the best in 10 out of 12 cases. The proposed method performs worst on the Seed dataset when five and six values are missing; in these two cases, it ranks third and fifth, respectively, among the nine competing algorithms.

5 Conclusion

Here, we have presented a new method using the evidential framework to deal with missing values in classification problems. For an r-class problem, we train r classifiers using the latent space representations of the training data. To obtain the latent space representation, we first train an autoencoder with the complete data. Then, we augment the complete data by deleting each feature once and imputing it using the nearest neighbor from a set of predefined points obtained through a clustering-based scheme. We then retrain the network using the modified dataset to obtain a better latent space representation.

If a test point suffers from missing values, we make an initial guess of the missing values using the nearest neighbor rule and take the latent space representation of that test point from the trained autoencoder. That representation is then classified using all r trained classifiers. Each classifier gives some confidence that the given point belongs to the associated class. This confidence is translated into a BPA, and the BPAs are aggregated using Dempster's rule. Finally, the Pignistic probability is used to make the final decision. To check the performance of the proposed method, we compared it with four state-of-the-art techniques using three sets of experiments. From these experiments, we find that overall our method performs better than the compared methods. In the future, we intend to propose new methods to choose the meta-class selection threshold dynamically. We would also like to propose different methods for assigning BPAs from the SVM outputs.