Introduction

Diabetes mellitus (DM), commonly known as diabetes, is a disease in which blood glucose levels are too high [1]. As a result, the disease increases the risk of cardiovascular diseases such as heart attack and stroke [2]. In 2012, about 1.5 million deaths were directly due to diabetes and 2.2 million deaths were due to cardiovascular diseases, chronic kidney disease, and tuberculosis [3]. Unfortunately, the disease cannot be cured, but it can be managed by controlling glucose. About 8.8% of adults worldwide were diabetic in 2017, and this number is projected to be 9.9% in 2045 [4]. There are three kinds of diabetes: (i) juvenile diabetes (type I diabetes), (ii) type II diabetes, and (iii) type III diabetes (gestational diabetes) [5]. In type I diabetes, the body does not produce insulin properly. It is usually diagnosed in children and young adults [6]. Type II diabetes usually develops in adults over 45 years, but it also occurs in children, adolescents, and young adults. With type II diabetes, the pancreas does not produce enough insulin. Almost 90% of all diabetes is type II [7]. The third type of diabetes is gestational diabetes. Pregnant women who never had diabetes before but have high blood glucose levels during pregnancy are diagnosed with gestational diabetes.

Diabetic classification is an important and challenging issue for the diagnosis and the interpretation of diabetic data [8]. This is because medical data are nonlinear, non-normal, correlation structured, and complex in nature [9]. Further, the data often contain missing values or outliers, which degrade the performance of machine learning systems for risk stratification. A variety of machine learning techniques have been developed for the prediction and diagnosis of diabetes, such as linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), naïve Bayes (NB), support vector machine (SVM), artificial neural network (ANN), feed-forward neural network (FFNN), decision tree (DT), J48, random forest (RF), Gaussian process classification (GPC), logistic regression (LR), and k-nearest neighbors (KNN) [9, 10]. These classifiers cannot correctly classify diabetic patients when the data contain missing values or outliers; therefore, machine learning-based classifiers applied to such data for risk stratification do not yield high accuracy [10,11,12,13,14,15,16].

In statistics, outlier removal and the handling of missing values are important issues and should never be ignored. Previous machine learning techniques [10] have been unsuccessful mainly because their classifications were performed (a) directly on the raw data without feature extraction, (b) on raw data without outlier removal, (c) without replacing missing values, or (d) with missing values filled simply by the mean value. Moreover, replacing outliers with the computed mean is very sensitive, because the mean is itself affected by the outliers [11]. As a result, their classification accuracy is low. Several authors tried outlier removal or the filling of missing values, but in a non-classification framework [12,13,14,15,16]. Our techniques were motivated by the spirit of these statistical measures, embedded in a classification framework. To improve the classification accuracy, we adopted a missing value approach based on the group median, replaced outliers using medians, and further optimized the system by choosing the best combination of feature selection technique and classification model among a set of six feature selection techniques and ten classification models.

The hypothesis is laid out in Fig. 1, where the input diabetic data undergo a two-stage data preparation process: (i) replacement of the missing values by the group median and (ii) replacement of the outliers by the median values. The filtered data then undergo the machine learning risk stratification paradigm, given the set of classifiers. The comparator compares the classification accuracy when the data have (a) no missing values but contain outliers against the classification accuracy when the data have (b) no missing values and no outliers.

Fig. 1 Preparation of diabetic data by missing value replacement and outlier removal

Among the set of classifiers, we adopted RF [17] to extract and select significant features and also to predict diabetes using the RF-based classifier. The RF-based classifier is one of the most powerful machine learning techniques in both classification and regression [18]. Some key strengths of RF are that it: (i) suits nonlinear and non-normal data; (ii) avoids overfitting of the data; (iii) provides robustness to noise; (iv) possesses an internal mechanism to estimate error rates; (v) provides the rank of variable importance; (vi) is adaptable to both continuous and categorical variables; and (vii) fits well for data imputation and cluster analysis. In our current study, we hypothesize that (a) replacing missing values with the group median and outliers with the median, and (b) using feature extraction by RF combined with the RF-based classifier will lead to the highest accuracy and sensitivity compared to conventional techniques such as LDA, QDA, NB, GPC, SVM, ANN, Adaboost, LR, and DT. The performance of these classifiers has been evaluated using accuracy (ACC), sensitivity (SE), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), and area under the curve (AUC).

Thus, the novelties of our current study compared to previous studies are as follows:

  1. Design of an ML system in which one can replace missing values using the group median, check for outliers using the inter-quartile range (IQR), and, if outliers exist, replace them with the median values.

  2. Optimization of the ML system by selecting the best combination of feature selection technique and classification model among six feature selection techniques (random forest (RF), logistic regression (LR), mutual information (MI), principal component analysis (PCA), analysis of variance (ANOVA), and Fisher discriminant ratio (FDR)) and ten classification models (RF, LDA, QDA, NB, GPC, SVM, ANN, AB, LR, and DT).

  3. Understanding of the different cross-validation protocols (K2, K4, K5, K10, and JK) for determining the generalization of the ML system and computing performance parameters such as ACC, SE, SP, PPV, NPV, and AUC.

  4. Demonstration of an automated reliability index (RI) and stability index, which are used to check the validity of our study, and benchmarking of our ML system against the existing literature.

  5. Demonstration of an improvement in classification accuracy over current techniques available in the literature of 10% using the K10 protocol and 18% using the JK protocol under the current framework.

The overall layout of this paper is as follows. Section 2 presents the patient demographics. Section 3 presents the methodology, including the feature selection methods and classification methods. Experimental protocols are given in Section 4, and results are discussed in Section 5. Section 6 presents the hypothesis validation and performance evaluation. Section 7 presents the discussion in detail, and finally the conclusion is presented in Section 8.

Patient demographics

The diabetic dataset has been taken from the University of California, Irvine (UCI) Repository. This dataset consists of 768 female patients of Pima Indian heritage, at least 21 years old, of whom 268 are diabetic patients and 500 are controls. In this dataset, five patients have zero glucose level, 35 patients have zero diastolic blood pressure, 27 patients have zero body mass index, 227 patients have zero skin fold thickness, and 374 patients have zero serum insulin level. These zero values have no physical meaning and are treated as missing values. As a preprocessing step, we divide the dataset into two groups, diabetic and control, and then replace the missing values by the median of each group. We also check for outliers using the inter-quartile range (IQR). If outliers exist, we replace them by the median. The flow chart of the data preparation is described in Fig. 1. The descriptions of the attributes and a brief statistical summary are shown in Table 1.
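A minimal sketch of the two preparation steps for a single attribute is given below. It is an illustration under stated assumptions, not the implementation used in the study: the function names are hypothetical, the outlier fences are assumed to be the common 1.5 × IQR rule, and the outlier replacement is assumed to use the median of the whole attribute, since the text does not specify either detail.

```python
from statistics import median, quantiles

def impute_group_median(values, labels, missing=0):
    """Replace `missing` entries by the median of the observed values
    in the same group (diabetic or control)."""
    out = list(values)
    for group in set(labels):
        observed = [v for v, g in zip(values, labels)
                    if g == group and v != missing]
        group_median = median(observed)
        for i, (v, g) in enumerate(zip(values, labels)):
            if g == group and v == missing:
                out[i] = group_median
    return out

def replace_outliers_iqr(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] by the median.
    The factor k = 1.5 is an assumption, not taken from the study."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    overall_median = median(values)
    return [overall_median if (v < low or v > high) else v for v in values]
```

For example, zero glucose readings would first be imputed per group with `impute_group_median`, and the imputed attribute would then be passed to `replace_outliers_iqr` to obtain the O2 version of the data.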

Table 1 Demographics of the diabetic patient cohort

Methodology

The idea of the proposed overall machine learning system is presented in Fig. 2. It follows the conventional model of ML; however, the input data are now preprocessed by handling missing values and outliers. The dotted line divides the system into two segments: the training or offline system (shown on the left) and the testing or online system (shown on the right). The basic difference between the training and testing protocols is that the training system works on the basis of a priori ground truth, whereas the testing protocol performs the prediction of diabetes. The next stage is feature extraction followed by the feature selection block, whose role is to reduce the system complexity while choosing the dominant features. Six types of feature selection techniques have been adopted, i.e., RF, LR, MI, PCA, ANOVA, and FDR. The features are trained based on the binary class framework model. Using the training database and ground truth, the machine learning parameters are estimated for each classifier type: LDA, QDA, NB, GPC, SVM, ANN, Adaboost, LR, DT, and RF. These trained machine learning parameters are then applied to the dominant features extracted from the test dataset to predict whether a patient is diabetic.

Fig. 2 Architecture of the machine learning system

Feature selection methods

Feature selection is important in the field of machine learning. Often in data science, we have hundreds or even millions of features, and we want a way to create a model that includes only the most informative ones. Feature selection has three benefits: (i) the model becomes easier to interpret; (ii) the variance of the model is reduced; and (iii) the computational cost and training time of the model are reduced. Optimal feature selection reduces the complexity of the system and increases its reliability, stability, and classification accuracy. The feature selection methods used here are PCA, ANOVA, FDR, MI, LR, and RF, presented below:

Principal component analysis

A feature selection technique (FST) removes the less dominant features, which improves the classification accuracy and reduces the computational cost and time consumption of the machine learning algorithm. Principal component analysis (PCA) is one of the popular dimension reduction techniques. In this study, we adopted a pooling methodology along with PCA [19] to extract the important features. The PCA algorithm for feature selection is given below:

  1. Calculate the mean vector across each feature dimension as:

$$ {\boldsymbol{\upmu}}_{\left(\mathrm{P}\times 1\right)}=\frac{1}{\mathrm{N}}{\mathbf{X}}^{\mathbf{T}}\mathbf{I} $$
(1)

Here, X is an N × P matrix, where N is the total number of patients, P is the total number of attributes, and I is a vector of ones of size N × 1.

  2. Center the data (i.e., make each feature zero mean) by subtracting the mean vector from every row of the data matrix:

$$ {\mathbf{A}}_{\left(\mathrm{N}\times \mathrm{P}\right)}=\mathbf{X}-\mathbf{I}{\boldsymbol{\upmu}}^{\mathbf{T}} $$
(2)

  3. Compute the covariance matrix of the dataset using the formula:

$$ {\mathbf{S}}_{\left(\mathrm{P}\times \mathrm{P}\right)}=\frac{1}{\mathrm{N}}{\mathbf{A}}^{\mathbf{T}}\mathbf{A} $$
(3)

  4. Compute the eigenvalues (λ1, λ2, …, λP) and eigenvectors (e1, e2, …, eP) of the covariance matrix (S).

  5. Sort the eigenvalues in descending order and arrange the corresponding eigenvectors in the same order.

  6. Choose the number of principal components (m) to be considered using the following criterion:

$$ \frac{\sum_{\mathrm{i}=1}^{\mathrm{m}}{\uplambda}_{\mathrm{i}}}{\sum_{\mathrm{i}=1}^{\mathrm{P}}{\uplambda}_{\mathrm{i}}}>\mathrm{R} $$
(4)

where R is the cutoff point varying from 0.90 to 0.95 and P is the total number of eigenvalues.

  7. Compute the contribution of each feature as the following dominance indices:

$$ {\mathrm{b}}_{\mathrm{n}}=\sum \limits_{\mathrm{z}=1}^{\mathrm{m}}\left|{\mathrm{e}}_{\mathrm{z}\mathrm{n}}\right| $$
(5)

where ezn denotes the nth entry of ez, the zth eigenvector, n = 1, 2, …, P, and |ezn| is the absolute value of ezn.

  8. Sort the indices bn in descending order and select the first m features. This gives the reduced set of m features (without modifying the original feature values), ordered by dominance level from highest to lowest.
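The PCA-based ranking described above can be sketched as follows. This is a minimal illustration assuming NumPy is available. The function name `pca_feature_ranking` and its return values are hypothetical, and the pooling step of [19] is not reproduced.

```python
import numpy as np

def pca_feature_ranking(X, R=0.95):
    """Rank original features by the dominance index b_n of Eq. (5)."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    mu = X.mean(axis=0)                    # Eq. (1): per-feature mean
    A = X - mu                             # Eq. (2): centred data
    S = A.T @ A / N                        # Eq. (3): covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # S is symmetric
    order = np.argsort(eigvals)[::-1]      # descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    # Smallest m whose cumulative variance ratio exceeds R, Eq. (4)
    m = min(int(np.searchsorted(ratio, R, side="right")) + 1, len(eigvals))
    b = np.abs(eigvecs[:, :m]).sum(axis=1)  # Eq. (5): dominance indices
    ranked = np.argsort(b)[::-1]            # most dominant feature first
    return ranked[:m], b, m
```

The function returns the indices of the m selected features, so the original feature values can be kept unchanged, as stated in the algorithm.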

Analysis of variance

The main goal of the one-way analysis of variance (ANOVA) test is to check whether the mean of a predictor X is the same across all classes of the target Y. To perform the ANOVA test, the following notation is used.

Nj:

Number of patients with Y = j.

μ j :

The sample mean of the predictor X for the target class Y = j.

\( {\mathbf{S}}_{\mathrm{j}}^2 \) :

The sample variance of the predictor X for the target class Y = j:

$$ {\mathbf{S}}_{\mathrm{j}}^2=\frac{\sum_{\mathrm{i}=1}^{{\mathrm{N}}_{\mathrm{j}}}{\left({\mathbf{X}}_{\mathrm{ij}}-{\boldsymbol{\upmu}}_{\mathrm{j}}\right)}^2}{{\mathrm{N}}_{\mathrm{j}}-1} $$
(6)

μ: The overall mean of the predictor X: \( \boldsymbol{\upmu} =\frac{\sum_{\mathrm{j}=1}^{\mathrm{J}}{\mathrm{N}}_{\mathrm{j}}{\boldsymbol{\upmu}}_{\mathrm{j}}}{\mathrm{N}} \), where N is the total number of patients and J is the total number of classes. The p-value is calculated from the F-statistic as p-value = Prob.{F (J-1, N-J) > F}, where \( \mathrm{F}=\frac{\frac{\sum_{\mathrm{j}=1}^{\mathrm{J}}{\mathrm{N}}_{\mathrm{j}}{\left({\boldsymbol{\upmu}}_{\mathrm{j}}-\boldsymbol{\upmu} \right)}^2}{\left(\mathrm{J}-1\right)}}{\frac{\sum_{\mathrm{j}=1}^{\mathrm{J}}\left({\mathrm{N}}_{\mathrm{j}}-1\right){\mathrm{S}}_{\mathrm{j}}^2}{\left(\mathrm{N}-\mathrm{J}\right)}} \), which follows an F-distribution with (J-1) and (N-J) degrees of freedom. We select the features whose p-values are less than 0.0001.
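As an illustration, the F-statistic for one predictor can be computed as in the sketch below. The function name `anova_f` is hypothetical. The sketch uses the standard one-way ANOVA within-class denominator of N − J degrees of freedom, and it stops at the F-statistic because converting it to a p-value requires the F-distribution, which is outside the Python standard library.

```python
from statistics import mean, variance

def anova_f(x, y):
    """One-way ANOVA F-statistic of predictor x across the classes in y."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    N, J = len(x), len(groups)
    grand_mean = mean(x)
    # Between-class mean square: sum_j N_j (mu_j - mu)^2 / (J - 1)
    between = sum(len(g) * (mean(g) - grand_mean) ** 2
                  for g in groups.values()) / (J - 1)
    # Within-class mean square: sum_j (N_j - 1) S_j^2 / (N - J)
    within = sum((len(g) - 1) * variance(g)
                 for g in groups.values()) / (N - J)
    return between / within
```

A large F value indicates that the class means of the predictor differ strongly relative to the within-class spread, which corresponds to a small p-value.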

Fisher discriminant ratio

Fisher discriminant ratio (FDR) selects the most informative features in such a way that the distance between data points within a class is as small as possible, while the distance between data points of different classes is as large as possible [20]. The general algorithm of FDR is given in detail below.

  1. Calculate the sample mean vector μj of each class, where Dj denotes the set of samples in class j:

$$ {\boldsymbol{\upmu}}_{\mathrm{j}}=\frac{1}{{\mathrm{N}}_{\mathrm{j}}}\sum \limits_{{\mathbf{X}}_{\mathrm{k}}\in {\mathrm{D}}_{\mathrm{j}}}{\mathbf{X}}_{\mathrm{k}}\kern1em ;\mathrm{j}=1,2. $$
(7)

  2. Compute the scatter matrices (between-class and within-class scatter matrices). The within-class scatter matrix SW is calculated by the following formula, where J is the number of classes (here J = 2):

$$ {\mathbf{S}}_{\mathrm{W}}=\sum \limits_{\mathrm{j}=1}^{\mathrm{J}}{\mathbf{S}}_{\mathrm{j}},\kern0.75em \mathrm{where}\ {\mathbf{S}}_{\mathrm{j}}=\sum \limits_{\mathbf{X}\in {\mathrm{D}}_{\mathrm{j}}}\left(\mathbf{X}-{\boldsymbol{\upmu}}_{\mathrm{j}}\right){\left(\mathbf{X}-{\boldsymbol{\upmu}}_{\mathrm{j}}\right)}^{\mathrm{T}} $$
(8)

  3. The between-class scatter matrix SB is computed by the following:

$$ {\mathbf{S}}_{\mathrm{B}}=\sum \limits_{\mathrm{j}=1}^{\mathrm{J}}{\mathrm{N}}_{\mathrm{j}}\left({\boldsymbol{\upmu}}_{\mathrm{j}}-\boldsymbol{\upmu} \right){\left({\boldsymbol{\upmu}}_{\mathrm{j}}-\boldsymbol{\upmu} \right)}^{\mathrm{T}} $$
(9)

where μ is the overall mean vector, μj is the sample mean vector of class j, and Nj is the number of patients in class j.

  4. Finally, the FDR is computed by comparing the within-class scatter and between-class scatter matrices by the following formula:

$$ \mathrm{FDR}={\mathbf{S}}_{\mathrm{W}}^{-1}{\mathbf{S}}_{\mathrm{B}} $$

  5. Compute the eigenvalues (λ1, λ2, …, λP) and the corresponding eigenvectors (e1, e2, …, eP) of the matrix FDR=\( {\mathbf{S}}_{\mathrm{W}}^{-1}{\mathbf{S}}_{\mathrm{B}} \).

  6. Sort the eigenvectors by decreasing eigenvalues and choose the K eigenvectors with the largest eigenvalues to form a P × K dimensional weight matrix W (where every column represents an eigenvector).

  7. Use this P × K eigenvector matrix to transform the samples into the new subspace. This can be summarized as follows:

$$ \mathbf{Y}=\mathbf{XW} $$
(10)

where X is an N × P dimensional matrix representing the N samples, and Y is the N × K dimensional matrix of the samples in the new space.
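The FDR steps above can be sketched as follows, assuming NumPy is available and that the within-class scatter matrix is non-singular. The function name `fdr_projection` is hypothetical and the code is only an illustration of Eqs. (7) to (10).

```python
import numpy as np

def fdr_projection(X, y, K=1):
    """Project X onto the K leading eigenvectors of S_W^{-1} S_B."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    P = X.shape[1]
    mu = X.mean(axis=0)                       # overall mean vector
    S_w = np.zeros((P, P))
    S_b = np.zeros((P, P))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)                # Eq. (7): class mean
        D = Xc - mu_c
        S_w += D.T @ D                        # Eq. (8): within-class scatter
        d = (mu_c - mu).reshape(-1, 1)
        S_b += Xc.shape[0] * (d @ d.T)        # Eq. (9): between-class scatter
    # Eigen-decomposition of S_W^{-1} S_B (solve avoids an explicit inverse)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1]    # decreasing eigenvalues
    W = eigvecs[:, order[:K]].real            # P x K weight matrix
    return X @ W, W                           # Eq. (10): Y = XW
```

With two classes, the between-class scatter matrix has rank one, so only the first eigenvector carries discriminative information and K = 1 is the natural choice in this sketch.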

Mutual information

Mutual information (MI) is a well-known dependence measure in information theory. It detects a subset of the most informative features [21]. It requires two parameters as input, i.e., the number of most informative features to be selected for classification and the number of quantization levels into which the continuous features are binned. Redundancy in the features leads to over-fitting, and therefore the dominant features are selected via this technique. In our current study, the important features are selected for our classifier by using a t-test, keeping those whose p-values are less than 0.0001. For two discrete variables x and y, the mutual information is denoted by MI (x, y) and is defined as:

$$ \mathrm{MI}\ \left(\mathrm{x},\mathrm{y}\right)=\sum \limits_{\mathrm{i},\mathrm{j}}\mathrm{p}\left({\mathrm{x}}_{\mathrm{i}},{\mathrm{y}}_{\mathrm{j}}\right)\log \frac{\mathrm{p}\left({\mathrm{x}}_{\mathrm{i}},{\mathrm{y}}_{\mathrm{j}}\right)}{\mathrm{p}\left({\mathrm{x}}_{\mathrm{i}}\right)\mathrm{p}\left({\mathrm{y}}_{\mathrm{j}}\right)} $$
(11)

where, p(x, y) is the joint probability distributions of x and y, p(x) and p(y) are the marginal probability distribution of x and y.
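A minimal sketch of Eq. (11) is given below, with the probabilities estimated from empirical frequencies. The helper `quantize`, which bins a continuous feature into equal-width levels, is a hypothetical illustration of the quantization step. The binning scheme actually used in the study is not specified.

```python
from collections import Counter
from math import log

def quantize(values, levels):
    """Bin continuous values into `levels` equal-width bins (assumed scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / levels or 1.0   # guard against a constant feature
    return [min(int((v - lo) / width), levels - 1) for v in values]

def mutual_information(x, y):
    """Empirical MI (natural log) between two discrete sequences, Eq. (11)."""
    n = len(x)
    p_xy = Counter(zip(x, y))
    p_x = Counter(x)
    p_y = Counter(y)
    return sum((c / n) * log((c / n) / ((p_x[a] / n) * (p_y[b] / n)))
               for (a, b), c in p_xy.items())
```

A feature that is independent of the class label has MI equal to zero, while a feature that determines the label completely has MI equal to the entropy of the label.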

Logistic regression

Logistic regression (LR) is used when the dependent variable is categorical. The logistic model is used to estimate the probability of a binary response based on one or more predictor variables. We estimate the coefficients of the logistic regression by applying maximum likelihood estimator (MLE) and test the coefficients by applying the z-test. We select the features corresponding to the coefficients where p-values are less than 0.0001.

Random forest

Random forest (RF) directly performs feature selection while the classification rules are built. There are two methods used for variable importance measurements as (i) Gini importance index (GIM), and (ii) permutation importance index (PIM) [22]. In this study, we have used two steps to select the important features: (i) PIM index is used to order the features and (ii) RF is used to select the best combination of features for classification [17]. These same techniques are used on both types of data: data with outlier O1 and data without outlier O2. These reduced features are used for classification.

Ten classification models

Ten classification techniques have been adopted for risk stratification in the machine learning framework. They were chosen for their simplicity and popularity: LDA, QDA, NB, GPC, SVM, ANN, Adaboost, LR, DT, and RF. We also adopted five cross-validation protocols, K2, K4, K5, K10, and JK, and repeated these protocols for 10 trials (T). The above systems are implemented under two different paradigms: when outliers are present (O1) and when outliers are replaced by the median (O2). Monitoring the outputs of the system yields ACC, SE, SP, PPV, NPV, and AUC of the ROC, as shown in Fig. 3. Brief discussions of the classifiers are presented here:

Fig. 3 Concept showing the hypothesis link between outlier removal and the performance of the ML system

Classifier type 1: Linear discriminant analysis

Ronald Aylmer Fisher introduced linear discriminant analysis (LDA) in 1936. It is an effective classification technique. It projects the n-dimensional feature space onto a lower-dimensional space in which the classes are separated by a hyperplane. The main objective of this classifier is to find the mean function for every class. This function is projected onto the vectors that maximize the between-group variance and minimize the within-group variance [23].

Classifier type 2: Quadratic discriminant analysis

Quadratic discriminant analysis (QDA) is used in machine learning and statistical learning to separate two or more classes by a quadric surface. It is a distance-based classification technique and an extension of LDA. Unlike LDA, it does not assume that the covariance matrix of every class is identical. When the normality assumption is true, the best possible test for the hypothesis that a given measurement is from a given class is the likelihood ratio test [24].

Classifier type 3: Naïve Bayes

The naïve Bayes (NB) classifier is a powerful and straightforward classifier that is particularly useful for large-scale datasets. It is used in both machine learning and medical science (especially in the diagnosis of diabetes). It is a probabilistic classifier based on Bayes' theorem with a strong independence assumption between the features. It is assumed that the presence of a particular feature in a class is unrelated to any other feature [25].

Classifier type 4: Gaussian process classification

In the last decade, the Gaussian process (GP) has become a powerful, nonparametric tool that is used not only in regression but also in classification, in order to handle problems such as the insufficient capacity of classical linear methods, complex data types, and the curse of dimensionality. The main advantages of this method are the ability to provide uncertainty estimates and to learn the noise and smoothness parameters from training data. A GP-based supervised learning technique attempts to take the best of two different schools of techniques: SVM, developed by Vapnik in the early nineties of the last century, and Bayesian methods. A GP is a collection of random variables, any finite number of which have a joint Gaussian distribution. A GP is a Gaussian random function and is fully specified by a mean function and a covariance function [26]. In our current study, we have used the radial basis function (RBF) kernel.

Classifier type 5: Support vector machine

Support vector machine (SVM) is a supervised learning technique that is widely used in medical diagnosis for classification and regression [27]. SVM minimizes the empirical classification error and maximizes the margin, i.e., the distance between the two parallel hyperplanes that bound the classes on either side of the separating hyperplane. The classification of non-linear data is performed using the kernel trick, which maps the input features into a high-dimensional space. In our current study, we have used the radial basis function (RBF) kernel.

Classifier type 6: Artificial neural network

The concept of the artificial neural network (ANN) [28] is inspired by the biological nervous system. The ANN has the following key advantages: (i) it is a data-driven, self-adaptive method, i.e., it can adjust itself to the data, and (ii) it is a non-linear model, which makes it flexible in modeling real-world problems. In our current study, we have used the back propagation algorithm for training the ANN and 10 hidden layers to find better results.

Classifier type 7: Adaboost

Adaboost, short for adaptive boosting, is a machine learning technique. Yoav Freund and Robert Schapire formulated the Adaboost algorithm and won the Gödel Prize in 2003 for their work. It can be used in conjunction with different types of algorithms to improve a classifier's performance. Adaboost is very sensitive to noisy data and outliers. In some problems, it can be less susceptible to overfitting than other learning algorithms. Every learning algorithm tends to suit some problem types better than others, and typically has many different parameters and configurations to adjust before it achieves optimal performance on a dataset. Adaboost is known as the best out-of-the-box classifier [29].

Classifier type 8: Logistic regression

Logistic regression (LR) is basically a linear model for classification rather than regression. It is a basic model which describes a binary (dummy) output variable and can be extended for diabetes classification [30]. The main advantages of LR are that it is robust and it may handle non-linear data. Let us consider N input features X1, X2, …, XN, and let P be the probability that the event occurs and 1-P the probability that the event does not occur. The mathematical expression of the model is as follows:

$$ \log \left(\frac{\mathrm{P}}{1-\mathrm{P}}\right)=\mathrm{logit}\ \left(\mathrm{P}\right)={\upbeta}_0+{\upbeta}_1{\mathrm{X}}_1+\dots +{\upbeta}_{\mathrm{N}}{\mathrm{X}}_{\mathrm{N}} $$
(12)

where β0 is the intercept term and βi (i = 1, 2, 3, …, N) are the regression coefficients.
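To make Eq. (12) concrete, the sketch below inverts the logit to obtain the probability P for one patient. The function name and the coefficient values in the usage note are hypothetical. In practice, the coefficients would be estimated by maximum likelihood.

```python
from math import exp, log

def logistic_probability(x, beta0, beta):
    """Return P from Eq. (12): logit(P) = beta0 + sum_i beta_i * x_i."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + exp(-z))
```

For instance, with the made-up values `beta0 = 0.5`, `beta = [0.5, 0.25]`, and `x = [1.0, 2.0]`, the linear predictor is 1.5, and applying `log(p / (1 - p))` to the returned probability recovers that value.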

Classifier type 9: Decision tree

A decision tree (DT) classifier is a decision support tool that uses a tree structure that is built from the input features. The main objective of this classifier is to build a model that predicts the target variable based on several input features. One can easily extract decision rules for given input data, which makes this classifier suitable for many kinds of applications [31].

Classifier type 10: Random forest

Random forest (RF) is one of the popular supervised techniques in the field of machine learning. It is an ensemble of a multitude of decision trees built at training time, which outputs the class that is the mode of the classes predicted by the individual trees for classification, or the mean of their predictions for regression [18]. The algorithm of RF is given as follows.

  Step 1: For a given training dataset, extract a new sample set by drawing N times with replacement using the bootstrap method. For example, we sample (X1, Y1), …, (XN, YN) from a given training dataset (X1, Y1), …, (Xn, Yn). The samples that are not drawn constitute the out-of-bag (OOB) data.

  Step 2: Build a decision tree based on the sample set from step 1.

  Step 3: Repeat step 1 and step 2 to obtain many trees (here 100 trees are used), which comprise a forest.

  Step 4: Let every tree in the forest vote for Xi.

  Step 5: Count the votes for every class; the class with the highest number of votes is the classification label for Xi.

  Step 6: The percentage of correct classifications is the accuracy of RF.
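The sampling and voting steps of the RF algorithm can be sketched as below. Tree construction (step 2) is deliberately omitted: each tree is assumed to be any callable that maps a sample to a class label, and all function names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Step 1: draw n indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob

def majority_vote(votes):
    """Step 5: return the class with the highest number of votes.
    Ties are broken in favour of the class that was seen first."""
    return Counter(votes).most_common(1)[0][0]

def forest_predict(trees, x):
    """Steps 4 and 5: let every tree vote for x and take the mode."""
    return majority_vote([tree(x) for tree in trees])
```

The out-of-bag indices returned for each tree are the samples that the tree never saw during training, which is what allows RF to estimate its error rate internally.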

Statistical evaluation

The performance of all classifiers is evaluated by different measures: accuracy (ACC), sensitivity (SE), specificity (SP), positive predictive value (PPV), and negative predictive value (NPV). These measures are calculated from the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), and are defined as follows:

  • Accuracy

It is the proportion of correctly classified cases (true positives plus true negatives) among the total population. It can be expressed mathematically as follows:

$$ \mathrm{ACC}\ \left(\%\right)=\left(\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FN}+\mathrm{FP}+\mathrm{TN}}\right)\times 100 $$
(13)

  • Sensitivity

It is the proportion of truly positive cases that are predicted as positive. It can be expressed mathematically as follows:

$$ \mathrm{SE}\ \left(\%\right)=\left(\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}\right)\times 100 $$
(14)

  • Specificity

It is the proportion of truly negative cases that are predicted as negative. It can be expressed mathematically as follows:

$$ \mathrm{SP}\ \left(\%\right)=\left(\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}\right)\times 100 $$
(15)

  • Positive predictive value

It is the proportion of cases predicted as positive that are truly positive. It can be expressed mathematically as follows:

$$ \mathrm{PPV}\ \left(\%\right)=\left(\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}\right)\times 100 $$
(16)

  • Negative predictive value

It is the proportion of cases predicted as negative that are truly negative. It can be expressed mathematically as follows:

$$ \mathrm{NPV}\ \left(\%\right)=\left(\frac{\mathrm{TN}}{\mathrm{FN}+\mathrm{TN}}\right)\times 100 $$
(17)
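The five measures can be computed from the confusion-matrix counts as in the following sketch. The function name is hypothetical, and specificity is computed with the standard definition TN / (TN + FP).

```python
def classification_metrics(tp, tn, fp, fn):
    """Return ACC, SE, SP, PPV, and NPV in percent from confusion counts."""
    return {
        "ACC": 100.0 * (tp + tn) / (tp + tn + fp + fn),
        "SE": 100.0 * tp / (tp + fn),    # true positive rate
        "SP": 100.0 * tn / (tn + fp),    # true negative rate
        "PPV": 100.0 * tp / (tp + fp),
        "NPV": 100.0 * tn / (tn + fn),
    }
```

The sketch assumes that every denominator is non-zero, i.e., that both classes occur in the ground truth and in the predictions.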

Experimental protocols

In this study, we adopted six feature selection techniques (FST), two outlier removal techniques (ORT), five cross-validation (CV) protocols (K2, K4, K5, K10, and JK), and ten different classifiers. We have performed two experimental protocols: (i) selection of the best FST over CV protocols and ORT and (ii) comparison of the classifiers. Since the partitions into K folds are random, we repeated the protocols with T = 10 trials for the K2, K4, K5, and K10 CV protocols.
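A repeated K-fold protocol of this kind can be sketched as follows. The function names and the `evaluate` callback are hypothetical. The sketch also assumes that the JK protocol corresponds to leave-one-out cross-validation, which is obtained here by setting k equal to the number of patients.

```python
import random

def k_fold_indices(n, k, seed):
    """Yield (train, test) index lists for one random K-fold partition."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def repeated_cv_accuracy(n, k, trials, evaluate):
    """Average the accuracy returned by evaluate(train, test)
    over all folds and all trials (T random partitions)."""
    scores = []
    for t in range(trials):
        for train, test in k_fold_indices(n, k, seed=t):
            scores.append(evaluate(train, test))
    return sum(scores) / len(scores)
```

For the K10 protocol on 768 patients with T = 10, `repeated_cv_accuracy(768, 10, 10, evaluate)` would average 100 fold-level accuracies, where `evaluate` trains a classifier on the training indices and returns its accuracy on the test indices.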

Experiment 1: Select best cross-validation over outlier removal technique

The main objective of this experiment is to select the best CV protocol for both O1 and O2. The selection criterion is the mean accuracy of each protocol, given in Eq. (18), where \( \mathcal{A} \) (f, c, p) represents the accuracy obtained when the feature selection technique is “f”, the classifier type is “c”, and the patient index is “p”, and F, C, and P are the total numbers of feature selection techniques, classifier types, and patients, respectively.

$$ \mathcal{A}\left({\mathrm{k}}_{{\mathrm{o}}_{\mathrm{i}}}\right)=\frac{\sum_{\mathrm{f}=1}^{\mathrm{F}=6}{\sum}_{\mathrm{c}=1}^{\mathrm{C}=10}{\sum}_{\mathrm{p}=1}^{\mathrm{P}=768}\mathcal{A}\ \left(\mathrm{f},\mathrm{c},\mathrm{p}\right)}{\mathrm{F}\times \mathrm{C}\times \mathrm{P}},\mathrm{i}=1,2. $$
(18)

Experiment 2: Best feature selection techniques over K-fold CV and ORT

The experiment presented in this section chooses the optimal FST over the CV protocols and ORTs on the basis of classification accuracy. Here, \( \mathcal{A}\left(\mathrm{k},\mathrm{c},\mathrm{p}\right) \) represents the accuracy of the classifier computed when the protocol type is “k”, the classifier type is “c”, and the patient index is “p”, and K, C, and P are the total numbers of protocol types, classifiers, and patients, respectively. The mean accuracy of each FST is then evaluated as:

$$ \mathcal{A}\left({\mathrm{f}}_{{\mathrm{o}}_{\mathrm{i}}}\right)=\frac{\sum_{\mathrm{k}=1}^{\mathrm{K}=5}{\sum}_{\mathrm{c}=1}^{\mathrm{C}=10}{\sum}_{\mathrm{p}=1}^{\mathrm{P}=768}\mathcal{A}\ \left(\mathrm{k},\mathrm{c},\mathrm{p}\right)}{\mathrm{K}\times \mathrm{C}\times \mathrm{P}},\mathrm{i}=1,2. $$
(19)

Experiment 3: Comparison of the classifiers

The main objective of this experiment is to compare the classification techniques based on classification accuracy and then select the best classifier. In this experiment, we applied the ten classifiers to both datasets: (i) data that contain outliers (O1) and (ii) data in which outliers are replaced by the median (O2). For each dataset, the same FSTs and the five CV protocols are used, and the mean accuracy of each classifier over the protocols is computed for both the O1 and O2 datasets:

$$ \mathcal{A}\left({\mathrm{c}}_{{\mathrm{o}}_{\mathrm{i}}}\right)=\frac{\sum_{\mathrm{k}=1}^{\mathrm{K}=5}{\sum}_{\mathrm{f}=1}^{\mathrm{F}=6}{\sum}_{\mathrm{p}=1}^{\mathrm{P}=768}\mathcal{A}\left(\mathrm{k},\mathrm{f},\mathrm{p}\right)}{\mathrm{K}\times \mathrm{F}\times \mathrm{P}},\mathrm{i}=1,2 $$
(20)

where \( \mathcal{A}\left(\mathrm{k},\mathrm{f},\mathrm{p}\right) \) represents the accuracy of the classifier computed when the protocol type is “k”, the feature selection method is “f”, and the patient index is “p”, and K, F, and P are the total numbers of protocol types, feature selection techniques, and patients, respectively.

Results

This section presents the results obtained with the two experimental protocols discussed in section 4.1 (selection of the best FST and protocols over ORT) and section 4.2 (comparison of the classifiers). In the first experiment, the best FST and CV protocols are identified based on the criterion of the highest accuracy. The second experiment examines how the classification accuracy varies with respect to the different CV protocols. The results of these two experiments are shown in section 5.1 and section 5.2, respectively.

Experiment 1: Select best feature selection techniques over K-fold CV and ORT

In this study, we applied six FSTs, namely RF (F1), LR (F2), MI (F3), PCA (F4), ANOVA (F5), and FDR (F6), to both the O1 and O2 datasets. For O1 under the K2 protocol, the F5-based feature selection technique gives the highest accuracy (81.94%), whereas for O2 under the same protocol F2 gives the highest ACC (84.66%). As the value of K increases, ACC also increases for both O1 and O2. Similarly, for K4, F4 and F2 give the highest ACC for O1 (82.73%) and O2 (86.16%), respectively. For O2, RF gives an ACC of 85.86% for K10 and 88.45% for JK. Similar results are obtained for O1. The details are given in Table 2. We therefore conclude that RF is the best FST for both O1 and O2.

Table 2 Comparison of mean accuracy of different protocols between O1 and O2 over FST
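To make the K-fold protocols concrete, the following is a minimal sketch of how 768 samples could be partitioned into K folds. The function names are hypothetical, the folds are contiguous and unshuffled for simplicity, and this is not necessarily how the study generated its partitions. If the JK protocol denotes leave-one-out (an assumption here), it corresponds to setting k equal to the number of samples.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_test_splits(n_samples, k):
    """Yield (train_idx, test_idx) pairs, using each fold once as the test set."""
    folds = k_fold_indices(n_samples, k)
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


# Test-fold sizes for the K2, K4, K5, and K10 protocols on 768 patients.
for k in (2, 4, 5, 10):
    sizes = [len(test) for _, test in train_test_splits(768, k)]
    print(k, sizes)
```

A larger K leaves more samples for training in each split, which is consistent with the reported increase in ACC as K grows.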

Experiment 2: Comparison of the classifiers

For notational simplicity, we denote the ten classifiers as LDA (C1), QDA (C2), NB (C3), GPC (C4), SVM (C5), ANN (C6), Adaboost (C7), LR (C8), DT (C9), and RF (C10). This experiment compares the performance of all classifiers as the K-fold CV protocol changes over ORT. Tables 3 and 4 show that as the value of K increases, the classification accuracy also increases for both the O1 and O2 datasets. From these results, we infer that (i) for the K2 protocol, the F1-C10 combination gives the highest accuracy (89.09% for O1 and 88.98% for O2) among all classifiers, because F1 extracts the most important features, and (ii) as K increases from 2 to 4, the accuracy of C10 also increases. Tables 3 and 4 also show that the F1-C10 combination again gives the highest accuracy (89.79% for O1 and 89.58% for O2). Similarly, for the K10 protocol, F1-C10 gives an accuracy of 90.91% for O1 and 92.26% for O2. For the JK protocol, every FST combined with the RF-based classifier gives 99.99~100.00% accuracy on both the O1 and O2 datasets. We therefore conclude that F1-C10 is the best combination for both the O1 and O2 datasets.

Table 3 Comparisons of all classifiers and FST over protocols in terms of accuracy for O1
Table 4 Comparisons of accuracy of all classifiers and FST over protocols for O2

Hypothesis validation and performance evaluation

Hypothesis validation

As discussed in the introduction, the premise of this study is that replacing missing values by the group median and outliers by the median, while using random forest in the ML framework, should give the highest accuracy compared with the case in which outliers are either not removed or replaced by means. Table 5 presents the results, comparing classification accuracy with outliers (O1) and without outliers (O2). We thus demonstrate that the hypothesis has been validated.

Table 5 Comparison of accuracy of classifiers between O1 and O2 over protocols and FST
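The preprocessing behind the hypothesis, group-median imputation of missing values followed by median replacement of outliers, can be sketched as follows. The function names, the use of `None` to mark a missing value, and the interquartile-range fence used to flag outliers are all assumptions made for this illustration; this passage does not state which outlier detection rule was applied.

```python
from statistics import median, quantiles


def impute_missing_by_group_median(values, labels):
    """Replace None entries with the median of the observed values in the same class."""
    group_medians = {}
    for label in set(labels):
        observed = [v for v, l in zip(values, labels) if l == label and v is not None]
        group_medians[label] = median(observed)
    return [group_medians[l] if v is None else v for v, l in zip(values, labels)]


def replace_outliers_by_median(values, whisker=1.5):
    """Replace values outside the IQR fences with the median of the whole column.

    Assumption: a value is an outlier if it lies below Q1 - whisker*IQR
    or above Q3 + whisker*IQR.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    col_median = median(values)
    return [col_median if (v < low or v > high) else v for v in values]


# Toy feature column with two missing entries and one extreme value (made-up data).
values = [1, 2, None, 4, 5, 6, None, 8, 100]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]   # 0 = control, 1 = diabetic
imputed = impute_missing_by_group_median(values, labels)
cleaned = replace_outliers_by_median(imputed)
print(imputed)
print(cleaned)
```

The median is used in both steps because, unlike the mean, it is not pulled toward extreme values such as the 100 in the toy column.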

Performance evaluation

Reliability

A reliability and stability index of the ML system is required to evaluate its performance (Fig. 4). The reliability index (RI) is calculated from the ratio of the standard deviation of the classification accuracy to the mean of the classification accuracy for data size N. The system reliability index (ξN) is calculated by the following formula:

$$ {\upxi}_{\mathrm{N}}\left(\%\right)=\left(1-\frac{\upsigma_{\mathrm{N}}}{{\boldsymbol{\upmu}}_{\mathrm{N}}}\right)\times 100 $$
(21)
Fig. 4 Performance evaluations of machine learning system

where σN is the standard deviation and μN is the mean of all accuracies over the FSTs and ORTs. The overall system reliability index \( \overline{\upxi} \), obtained by taking the mean over all data sizes, can be expressed as follows:

$$ \overline{\upxi}\left(\%\right)=\left(\frac{\sum_{\mathrm{n}=1}^{\mathrm{N}}{\upxi}_{\mathrm{n}}}{\mathrm{N}}\right) $$
(22)
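Equations (21) and (22) translate directly into a few lines of code. The sketch below is illustrative only: the function names and the sample accuracies are hypothetical, and the population standard deviation is assumed because the text does not say whether the population or the sample form was used.

```python
from statistics import mean, pstdev


def reliability_index(accuracies):
    """Reliability index in percent for one data size: (1 - sigma/mu) * 100 (Eq. 21)."""
    mu = mean(accuracies)
    sigma = pstdev(accuracies)  # assumption: population standard deviation
    return (1 - sigma / mu) * 100


def system_reliability_index(accuracies_by_size):
    """Mean of the per-size reliability indexes over all data sizes (Eq. 22)."""
    return mean([reliability_index(a) for a in accuracies_by_size])


# Made-up accuracies (in %) for two data sizes, not results from the study.
accuracies_by_size = [
    [90.0, 90.0, 90.0],   # no spread, so the reliability index is 100
    [80.0, 90.0],         # mean 85, standard deviation 5
]
print([round(reliability_index(a), 2) for a in accuracies_by_size])
print(round(system_reliability_index(accuracies_by_size), 2))
```

A reliability index close to 100% indicates that the accuracies vary little relative to their mean.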

Figures 5 and 6 show the reliability index (RI) for all 60 Fi-Cj combinations (i = 1, 2, …, 6 and j = 1, 2, …, 10) as the data size increases for the O1 and O2 datasets. Further, the system reliability index has been computed by averaging the reliability indexes over all data sizes, as shown in Table 6 for O1 and Table 7 for O2, which confirms the best performance of the F1-C10 combination for both O1 and O2.

Fig. 5 Comparison of all classifiers over different FST’s based on RI for O1

Fig. 6 Comparison of all classifiers over different FST’s based on RI for O2

Table 6 Comparison of all classifiers over different FST’s based on RI for O1
Table 7 Comparison of all classifiers over different FST’s based on RI for O2

Stability analysis

Stability analysis describes the dynamics of a control system. In our analysis, data size controls the dynamics of the overall system. We observed that the system is stable within a 2% tolerance limit.

Discussion

This paper presents a risk stratification system that accurately classifies diabetes into two classes, diabetic and control, when the input diabetic data contain outliers and when the outliers are replaced by the median. Sixty systems have been designed by cross-combining ten classifiers (LDA, QDA, NB, GPC, SVM, ANN, Adaboost, LR, DT, and RF) with six feature selection techniques (RF, LR, MI, PCA, ANOVA, and FDR), and their performances have been compared. The number of features has been selected using a cutoff point of 0.90 for PCA, the t-test for LR, MI, and FDR, and the F-test for ANOVA. The classification of diabetes has been implemented using the one-against-all approach for all ten classifiers. Furthermore, four sets of cross-validation protocols (K2, K4, K5, and K10) have been applied to generalize the classification, and this process has been repeated T = 10 times to reduce variability. For all sixty combinations, the experiments have been performed in one scenario: comparison of the outlier removal techniques under varying protocols. The performances of all classifiers are compared on the basis of ACC, SE, SP, PPV, NPV, and AUC in experiments with varying FST and CV protocols. The ML system was validated for stability and reliability.
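The cross-combination of feature selection techniques and classifiers can be enumerated with a short sketch. The list names and the tuple layout are illustrative choices; the F and C labels follow the numbering used in the Results section.

```python
from itertools import product

# Six feature selection techniques (F1..F6) and ten classifiers (C1..C10).
FSTS = ["RF", "LR", "MI", "PCA", "ANOVA", "FDR"]
CLASSIFIERS = ["LDA", "QDA", "NB", "GPC", "SVM", "ANN", "Adaboost", "LR", "DT", "RF"]

# Every (FST, classifier) pair is one system, labelled Fi-Cj.
systems = [
    (f"F{i}", fst, f"C{j}", clf)
    for (i, fst), (j, clf) in product(enumerate(FSTS, 1), enumerate(CLASSIFIERS, 1))
]

print(len(systems))   # 6 x 10 = 60 systems
print(systems[9])     # the F1-C10 (RF-RF) combination
```

Each of these systems is then evaluated under every CV protocol and on both the O1 and O2 datasets.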

The main focus of our study is a comprehensive analysis of the RF-based classifier against nine other classifiers (LDA, QDA, NB, GPC, SVM, ANN, Adaboost, LR, and DT) when outliers in the input diabetic data are replaced by the median and features are extracted. Our study shows that classification improves when missing values are replaced by the group median, outliers are replaced by the median, features are extracted by random forest, and diabetes is classified by random forest. There are two reasons for the improved classification accuracy: (i) missing values are imputed by the median, whereas several earlier studies did not use any missing-value imputation technique and some replaced missing values by the mean; and (ii) outliers are replaced by the median, whereas earlier studies did not use any method to check for outliers.

Benchmarking different machine learning systems

There are several papers in the literature on the diagnosis and classification of diabetic patients. Karthikeyani et al. [32] applied SVM with a radial basis kernel to a diabetes dataset consisting of 8 attributes and 768 patients (268 diabetic and 500 controls). They replaced meaningless values with the mean, applied SVM to classify diabetes, and demonstrated a classification accuracy of 74.80%. The same authors (Karthikeyani et al. [33]) extracted three features out of eight using partial least squares (PLS) and applied LDA to classify diabetes, leading to an accuracy of 74.40%. Kumari and Chitra [34] introduced SVM with a radial basis kernel function for classification. After deleting meaningless observations (observations containing zeros), 460 observations remained. Of these, 200 were used for training and the rest for testing, and the algorithm achieved a low accuracy of 75.50%. Parashar et al. [35] applied LDA to select the most important features of diabetes and selected the two best features out of eight. They also applied SVM and FFNN to classify diabetes, and SVM gave an accuracy of 75.65%. Bozkurt et al. [36] introduced two ML techniques, AIS and ANN; ANN obtained a higher accuracy of 76% compared to AIS. Iyer et al. [37] applied NB and DT for the classification of diabetic patients. They replaced missing values with the mean and extracted two features out of eight using the correlation-based feature selection (CFS) algorithm. They showed that DT obtained an accuracy of 74.79%. Kumar Dewangan and Agrawal [38] used MLP and Bayes net classifiers, where MLP gave the highest accuracy of 81.19%. Bashir et al. [10] introduced the Hierarchical Multi-level classifiers bagging with Multi-objective optimized Voting (HM-Bag Moov) technique to classify diabetes and compared it with various classification techniques such as NB, SVM, LR, QDA, KNN, RF, and ANN. They showed that HM-Bag Moov obtained an accuracy of 77.21%. Sivanesan et al. [39] proposed the J48 algorithm to classify diabetic patients and obtained an accuracy of 76.58%. Meraj Nabi et al. [40] applied four different classifiers (NB, LR, J48, and RF) and obtained the best accuracy of 80.43% using LR. Recently, Suri’s team (Maniruzzaman et al. [9]) also applied four different classifiers: LDA, QDA, NB, and GPC. They showed that GPC with a radial basis kernel gave the highest classification accuracy (~82%) among them. Table 8 and Fig. 7 confirm that our proposed F1-C10 method gives a better diagnosis than the others, with an accuracy of 92.26% for K10 and nearly 100% for the JK protocol. Our proposed system can therefore be used to cross-check a diagnosis of diabetes against the doctor’s assessment.

Table 8 Comparative performance of our proposed method against previous studies
Fig. 7 Comparison of our proposed method against the existing methods in the literature. The red arrows show the proposed work

Random forest showed encouraging results: it identified the most significant features and classified diabetes. It works well on both nonlinear and high-dimensional data. Previous ML-based DM research has focused only on the classification and prediction of diabetic patients. Here, the capability of RF to detect relevant patterns in the data produced meaningful results that correlate well with the criteria for diabetes diagnosis and with known risk factors. When missing values and outliers are replaced by the group mean and the mean, respectively, RF yields 89% classification accuracy. This accuracy increases by 3% when missing values and outliers are replaced by the group median and the median, respectively.

Strengths, weakness and extensions

This paper presents a risk stratification system to accurately classify diabetes in a dataset of 768 pregnant patients in two classes: diabetic and control. Our study shows that the RF-based feature selection technique combined with the RF-based classifier and median-based outlier removal gives a classification accuracy of 92.26% for the K10 protocol and 99.99~100% for the JK protocol (see Fig. 7). Nevertheless, the presented system can still be improved. Further preprocessing techniques may be used to replace meaningless values and outliers by the mean or median. There are many other techniques for feature extraction, feature selection, and classification, and the performance of the presented combinations may be compared with that of other systems.

Conclusion

Diabetes mellitus (DM) is a group of metabolic diseases in which blood sugar levels are too high. Our hypothesis was that if missing values and outliers are replaced by the group median and the median, respectively, and the resulting data are used in an ML framework with the RF-RF combination for feature selection and classification, the accuracy should be higher. We demonstrated our hypothesis by showing a 3% improvement and reaching an accuracy of nearly 100% in the JK-based cross-validation protocol. A comprehensive data analysis was conducted consisting of ten classifiers, six feature selection methods, five sets of protocols, and two outlier removal techniques, leading to six hundred (600) experiments. Thorough benchmarking was performed and a clear improvement was demonstrated. In the future, it would be interesting to adapt this framework to the classification of other kinds of medical data, creating a cost-effective and time-saving option for both diabetic patients and doctors.