1 Introduction

In many applications of data analysis, the output is assumed to increase or decrease with one or more input features. Such relations between input and output are said to be isotonic/monotonic. Isotonic conditions arise in many classification problems, including internet content filtering [1], firm bankruptcy prediction [2], breast cancer diagnosis [3], and cancer prognosis [4]. In cancer diagnosis, for instance, the isotonic property can be stated as follows: a tumor with larger dimension values (cell size, clump thickness, etc.) is more likely to be malignant than one with smaller values. Since isotonic constraints are common in applications, many techniques have been developed to handle them, such as isotonic separation [5], monotonic k-NN [6], and monotonic random forests [7]. Isotonic separation [5] is a classification technique that separates d-dimensional data in a domain where the isotonic property holds. It is a good choice for many medical applications such as cancer diagnosis, diabetes, and heart attack prediction. Given a training set with n data points, d features/dimensions, and a known isotonic property, the main objective of isotonic separation is to obtain an isotonic function through which the system accurately classifies the data. This is achieved by converting the given data set into an isotonic data set using a linear programming problem (LPP) or a maximum flow network problem and deriving the classifier from the solution of the LPP.

The major concerns in isotonic separation are computational time complexity, problem size reduction, and feature selection. Problem size reduction aims to reduce the number of decision variables or constraints from the LPP. Feature selection aims to select the relevant isotonic consistent features from the data set.

The computational complexity of isotonic separation depends to a great extent on the number of training instances n, the number of constraints m, the dimensionality d, and the time complexity of solving the LPP or maximum flow network model. Various algorithms have been developed for solving maximum flow network problems [8, 9]. The Ford–Fulkerson method does not guarantee that the maximum flow is obtained in polynomial time. The Edmonds–Karp algorithm solves the maximum flow network problem with a running time of \( O(m^{2} n) \), which is computationally infeasible for large values of n in isotonic separation.

Many techniques, such as the simplex method and the interior point method, can be applied to solve LPPs. The simplex method is efficient in practice but computationally expensive for large-scale LPPs because of its exponential worst-case running time [10]. The interior point method is computationally efficient for large-scale LP problems but can become infeasible in terms of storage requirements and arithmetic operations [11]. A major drawback of traditional LP solvers [12] is that they increase the number of decision variables when converting inequality constraints into equality constraints. Conventional methods are computationally expensive and not suitable for finding the optimum solution of a large-scale LPP as the data set grows. Hence, it is essential to develop techniques that either reduce the number of decision variables/constraints in the LPP or handle this complexity.

Feature selection plays a crucial role in improving the accuracy and reducing the training time of any classifier. For isotonic separation, it is important that the subset of selected features makes the data set isotonic and improves the performance of the classifier.

This paper proposes an efficient hybrid algorithm which deploys particle swarm optimization (PSO) (Kennedy and Eberhart [13]) in the training phase of isotonic separation to address the issue of increasing time complexity. It also proposes a correlation-based feature selection to select isotonic consistent features and thereby improve the performance of the proposed classifier.

The proposed hybrid training algorithm formulates the LPP as a graph and arranges the decision variables of the LPP using topological ordering. It proposes an efficient customized PSO algorithm in which a new threshold value, computed from the data set, is used in the convergence criterion. The key objectives of the proposed hybrid algorithm, MeHeIS–CPSO (meta-heuristic isotonic separation with a new convergence criterion-based PSO), are twofold:

  1. To obtain an optimum or near-optimum solution for large-scale LPPs faster than existing approaches with the help of PSO.

  2. To improve the performance of the classifier by selecting isotonic features with a new correlation-based feature selection technique.

Experiments are conducted on small, medium, and large data sets, and the results are compared with other algorithms. The comparative study reveals that MeHeIS–CPSO converges faster in finding the solution for the LPP and improves the performance of classifier.

The paper is structured as follows. Section 2 highlights the mathematical description and related literature for isotonic separation and PSO and defines the problem statement. Section 3 presents the mathematical model of the proposed MeHeIS–CPSO and demonstrates with a case study. Section 4 gives the properties and correctness of the proposed hybrid algorithm. Section 5 reports the results of MeHeIS–CPSO experiments on small, medium and large data sets. Section 6 concludes with the summary and future scope.

2 Background

This section introduces the terms related to isotonic separation, presents a mathematical model of isotonic separation, and analyzes its time complexity. It covers the most relevant techniques proposed in the specialized literature for isotonic separation and briefly reviews traditional algorithms for isotonic data and particle swarm optimization.

2.1 Preliminaries

Let \( D = \{ ({\mathbf{x}}_{1} ,y_{1} ),({\mathbf{x}}_{2} ,y_{2} ), \ldots ,({\mathbf{x}}_{n} ,y_{n} )\} \) be a data set consisting of n data points, where i indexes a data point with d-dimensional input vector \( {\mathbf{x}}_{i} = (x_{i1} ,x_{i2} , \ldots ,x_{id} ) \in \Re^{d} \) and output label \( y_{i} \). Let A denote the index set of data points in D; these data points fall into one of two disjoint classes \( A_{0} \) and \( A_{1} \) based on their output labels.

$$ \begin{aligned} A & = \{ i|(x_{i} ,y_{i} ) \in D,x_{i} \in \Re^{d} ,y_{i} \in \{ 0,1\} \} \\ A_{1} & = \{ i|i \in A\,\,{\text{and}}\,\,y_{i} = 1\} \\ A_{0} & = \{ i|i \in A\,\,{\text{and}}\,\,y_{i} = 0\} \\ \end{aligned} $$
(1)

\( {\mathbf{x}}_{i} \) is said to dominate \( {\mathbf{x}}_{j} \) if, on every dimension, \( {\mathbf{x}}_{i} \) is greater than or equal to \( {\mathbf{x}}_{j} \).

$$ {\mathbf{x}}_{i} \succ {\mathbf{x}}_{j} \Leftrightarrow \forall_{p = 1,2, \ldots ,d} x_{ip} \ge x_{jp} $$
(2)

The dominance relation \( \succ \) induces a binary relation R on A, which contains the set of ordered pairs satisfying the dominance property.

$$ R = \{ (i,j)|i,j \in A\,\,{\text{and}}\,\,{\mathbf{x}}_{i} \succ {\mathbf{x}}_{j} \} $$
(3)

The dominance relation \( \succ \) is a quasi-ordering relation because of its reflexive and transitive properties. A function \( f:\Re^{d} \to \{ 0,1\} \), which maps any data point in the d-dimensional space to an output label, is called monotonically increasing or isotonic if the following condition holds for any \( {\mathbf{x}}_{i} ,{\mathbf{x}}_{j} \in D \):

$$ {\mathbf{x}}_{i} \succ {\mathbf{x}}_{j} \to f({\mathbf{x}}_{i} ) \ge f({\mathbf{x}}_{j} ) $$
(4)

A data set \( D \) is said to be isotonic, if it satisfies the following isotonic property:

$$ {\mathbf{x}}_{i} \succ {\mathbf{x}}_{j} \to y_{i} \ge y_{j} \quad \forall i,j \in A\,\,{\text{and}}\,\,{\mathbf{x}}_{i} ,{\mathbf{x}}_{j} \in D $$
(5)

A data point \( {\mathbf{x}}_{i} \) is inconsistent if there exists another data point \( {\mathbf{x}}_{j} \) such that \( {\mathbf{x}}_{i} \) and \( {\mathbf{x}}_{j} \) violate the isotonic constraints. A data set is said to be non-isotonic if inconsistent data points exist. An isotonic region is an area of the d-dimensional space where the isotonic property holds; a non-isotonic region is an area where it does not. If a data set is isotonic, a broad isotonic region will be obtained as the model of isotonic separation. Isotonic data sets are a prerequisite for isotonic separation. Since non-isotonic data sets shrink the boundaries of the isotonic regions, classification algorithms which handle isotonic data may misclassify the data.
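To make these definitions concrete, the following minimal Python sketch (function names are illustrative, not from the original work) builds the dominance relation of Eq. (3) and lists the points that violate the isotonic property of Eq. (5) on a toy data set.

```python
import numpy as np

def dominates(xi, xj):
    """x_i dominates x_j when x_i >= x_j on every dimension (Eq. 2)."""
    return bool(np.all(xi >= xj))

def dominance_relation(X):
    """Ordered pairs (i, j) with x_i dominating x_j, reflexive pairs excluded (Eq. 3)."""
    n = len(X)
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and dominates(X[i], X[j])}

def inconsistent_points(X, y):
    """Indices of points involved in a violation of the isotonic property (Eq. 5)."""
    bad = {(i, j) for (i, j) in dominance_relation(X) if y[i] < y[j]}
    return sorted({i for (i, j) in bad} | {j for (i, j) in bad})

# toy example: the second point dominates the first but carries a smaller label
X = np.array([[1.0, 2.0], [3.0, 4.0], [0.5, 0.5]])
y = np.array([1, 0, 0])
print(dominance_relation(X))       # {(1, 0), (0, 2), (1, 2)} (set order may vary)
print(inconsistent_points(X, y))   # [0, 1]
```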

2.2 Isotonic separation

Given a data set A and the assumption of the isotonic property, the main objective of isotonic separation is to assign labels to A with minimum misclassifications, transforming the data set into an isotonic one, and to obtain an isotonic function \( \hat{f}:A \to \{ 0,1\} \) which will correctly classify unknown test data. So, a predictor variable \( \hat{y}_{i} \) is defined for each data point i, as given below.

$$ \hat{y}_{i} = \left\{ {\begin{array}{*{20}l} 1 \hfill &\quad {{\text{if}}\,\,{\text{a}}\,\,{\text{classifier}}\,\,{\text{predicts}}\,\,i\,\,{\text{as}}\,\,1} \hfill \\ 0 \hfill &\quad {{\text{if}}\,\,{\text{a}}\,\,{\text{classifier}}\,\,{\text{predicts}}\,\,i\,\,{\text{as}}\,\,0} \hfill \\ \end{array} } \right. $$
(6)

A penalty \( \omega_{i} \ge 0 \) is assigned for each data point using the following rule:

$$ \omega_{i} = \left\{ {\begin{array}{*{20}l} \alpha \hfill &\quad {{\text{if}}\,\,y_{i} = 1\,\,{\text{and}}\,\,\hat{y}_{i} = 0} \hfill \\ \beta \hfill &\quad {{\text{if}}\,\,y_{i} = 0\,\,{\text{and}}\,\,\hat{y}_{i} = 1} \hfill \\ 0 \hfill &\quad {{\text{if}}\,\,y_{i} = \hat{y}_{i} } \hfill \\ \end{array} } \right. \, $$
(7)

A solution to the following linear programming problem (LPP) will provide the new labels for the data set and transform it into isotonic data set with minimum misclassification penalty.

$$ \begin{aligned} & \hbox{min} \, \alpha \sum\limits_{{i \in A_{1} }} {(1 - \hat{y}_{i} ) + \beta } \sum\limits_{{i \in A_{0} }} {\hat{y}_{i} } \\ & {\text{subject}}\,\,{\text{to}}\,\,{\text{the}}\,\,{\text{constraints}} \\ & \hat{y}_{i} - \hat{y}_{j} \ge 0\quad {\text{for}}\,\,(i,j) \in R\,\, ( {\text{isotonic}}\,\,{\text{constraint)}} \\ & 0 \le \hat{y}_{i} \le 1 \, \quad {\text{for}}\,\,i \in A\,\, ( {\text{boundary}}\,\,{\text{constraint)}} \\ \end{aligned} $$
(8)

In the objective function of the LPP in Eq. (8), \( \sum\nolimits_{{i \in A_{1} }} {(1 - \hat{y}_{i} )} \) and \( \sum\nolimits_{{i \in A_{0} }} {\hat{y}_{i} } \) denote the number of misclassifications in each class. If \( \hat{y}_{i} \) is a binary variable, then Eq. (8) becomes an integer programming problem. The isotonic constraints are obtained from the ordered pairs of R. Since finding a solution to an integer programming problem is computationally expensive and the constraint matrix in Eq. (8) is the dual representation of a maximum flow network model constraint matrix, \( \hat{y}_{i} \) can be relaxed to a real variable in [0,1].
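For reference, the relaxed LPP of Eq. (8) can be handed to any off-the-shelf LP solver. The sketch below is only an illustration of the formulation, assuming SciPy's linprog as a baseline solver (the paper itself argues for replacing such solvers with a meta-heuristic on large data sets); each isotonic constraint \( \hat{y}_{i} - \hat{y}_{j} \ge 0 \) is rewritten as \( -\hat{y}_{i} + \hat{y}_{j} \le 0 \).

```python
import numpy as np
from scipy.optimize import linprog

def solve_isotonic_lpp(y, R, alpha=1.0, beta=1.0):
    """Relaxed LPP of Eq. (8): minimise the misclassification penalty subject to
    y_hat_i - y_hat_j >= 0 for (i, j) in R and 0 <= y_hat_i <= 1."""
    n = len(y)
    # linear part of the objective: -alpha for class-1 points, +beta for class-0 points
    # (the dropped constant alpha*|A1| does not change the argmin)
    c = np.where(np.asarray(y) == 1, -alpha, beta)
    A_ub = np.zeros((len(R), n))
    for row, (i, j) in enumerate(R):     # y_hat_i - y_hat_j >= 0  ->  -y_hat_i + y_hat_j <= 0
        A_ub[row, i] = -1.0
        A_ub[row, j] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(R)),
                  bounds=[(0.0, 1.0)] * n, method="highs")
    return res.x                         # relaxed labels; threshold at 0.5 for 0/1 labels

# tiny example: point 0 dominates point 1 but carries the smaller actual label,
# so any optimum relabels exactly one of the two points (y_hat[0] == y_hat[1])
print(solve_isotonic_lpp([0, 1], [(0, 1)]))
```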

Let \( Y^{*} = \{ \hat{y}_{i} |i \in A_{0} \cup A_{1} \} \) be an optimum solution to the LPP, which is isotonic in nature. Let \( A_{1}^{*} \) and \( A_{0}^{*} \) be the sets of boundary corner points for A1 and A0, respectively. The boundary points obtained from Eq. (9) are called the isotonic separator, a model for the classifier.

$$ \begin{aligned} A_{1}^{*} & = \{ i|\hat{y}_{i} = 1\,\,{\text{and}}\,\,\not\exists \hat{y}_{j} \in Y^{*} ,\,\,i \ne j, \, \hat{y}_{j} = 1\,\,{\text{and}}\,\,(i,j) \, \in R\} \\ A_{0}^{*} & = \{ i|\hat{y}_{i} = 0\,\,{\text{and}}\,\,\not\exists \hat{y}_{j} \in Y^{*} ,\,\,i \ne j, \, \hat{y}_{j} = 0\,\,{\text{and}}\,\,(j,i) \, \in R\} \\ B & = A_{1}^{*} \cup A_{0}^{*} \\ \end{aligned} $$
(9)

where \( A_{1}^{*} \ne \emptyset \,\,{\text{and}}\,\,A_{0}^{*} \ne \emptyset \).

The optimum solution separates the d-dimensional space into three regions as given in Eq. (10). The region Z0, where all points are labeled as 0, and the region Z1, where all points are labeled as 1, are the isotonic regions for class 0 and class 1, respectively. The non-isotonic region Z2 is an unclassified area where no data point exists.

$$ \begin{aligned} Z_{0} & = \{ {\mathbf{x}}_{i} \in \Re^{d} |\,\exists j \in A_{0}^{*} \,\,{\text{such}}\,\,{\text{that}}\,\,(j,i) \in R\} \\ Z_{1} & = \{ {\mathbf{x}}_{i} \in \Re^{d} |\,\exists j \in A_{1}^{*} \,\,{\text{such}}\,\,{\text{that}}\,\,(i,j) \in R\} \\ Z_{2} & = \Re^{d} \backslash (Z_{0} \cup Z_{1} ) \\ \end{aligned} $$
(10)

For a test data point h, \( {\mathbf{x}}_{h} = (x_{h1} ,x_{h2} , \ldots ,x_{hd} ) \), that does not belong to A, classification is done as follows: if h lies in \( Z_{1} \), it belongs to class \( A_{1} \); if h lies in \( Z_{0} \), it belongs to class \( A_{0} \). If h lies in \( Z_{2} \), its label cannot be determined using the isotonic consistency condition, so h is predicted based on the weighted distance between h and the boundary points, as given below.

$$ \begin{aligned} D_{h1} & = \beta \hbox{min} \left\{ {\sum\limits_{k = 1}^{d} {\hbox{max} (x_{ik} - x_{hk} ,0)\;i \in A_{{_{1} }}^{*} } } \right\} \\ D_{h0} & = \alpha \hbox{min} \left\{ {\sum\limits_{k = 1}^{d} {\hbox{max} (x_{hk} - x_{ik} ,0)\;i \in A_{0}^{*} } } \right\} \\ \hat{y}_{h} & = \left\{ {\begin{array}{*{20}l} 1 \hfill & {{\text{if}}\,\,\,D_{h1} < D_{h0} } \hfill \\ 0 \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. \\ \end{aligned} $$
(11)
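The decision rule of Eqs. (10)-(11) can be sketched as follows; the boundary points passed in are hypothetical feature vectors chosen for illustration, not values from Table 1.

```python
import numpy as np

def classify(xh, A1_star, A0_star, alpha=1.0, beta=1.0):
    """Predict the label of test point xh from the boundary points (Eqs. 10-11)."""
    A1_star = np.asarray(A1_star, dtype=float)     # minimal corner points of class 1
    A0_star = np.asarray(A0_star, dtype=float)     # maximal corner points of class 0
    if np.any(np.all(xh >= A1_star, axis=1)):      # xh dominates a class-1 corner -> Z1
        return 1
    if np.any(np.all(xh <= A0_star, axis=1)):      # xh is dominated by a class-0 corner -> Z0
        return 0
    # Z2: weighted distance to the nearest boundary point of each class (Eq. 11)
    d_h1 = beta * np.min(np.sum(np.maximum(A1_star - xh, 0.0), axis=1))
    d_h0 = alpha * np.min(np.sum(np.maximum(xh - A0_star, 0.0), axis=1))
    return 1 if d_h1 < d_h0 else 0

# hypothetical boundary points, for illustration only
A1_star = [[4.0, 6.0], [7.0, 3.0]]
A0_star = [[2.0, 5.0], [5.0, 2.0]]
print(classify(np.array([8.0, 7.0]), A1_star, A0_star))  # in Z1 -> 1
print(classify(np.array([1.0, 1.0]), A1_star, A0_star))  # in Z0 -> 0
```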

2.2.1 Computational time complexity

Given a training set with \( n \) instances in a d-dimensional data space, the time complexity to check the isotonic consistency constraints is \( O(n^{2} d) \). The computing time to represent the relation as a graph (adjacency matrix) is \( \theta (n^{2} ) \), and the time complexity to find and eliminate transitive pairs is \( \theta (n^{3} ) \). The LPP has n decision variables and m constraints, where \( n \le m \le \frac{n(n - 1)}{2} \). The simplex method takes up to \( O((m + n)m^{2} ) \) time to solve the LPP, and the time complexity to extract boundary points from the optimum solution is \( O(n^{2} ) \). So the overall time complexity of isotonic separation is \( O(\hbox{max} ((m + n)m^{2} ,n^{3} )) \cong O((m + n)m^{2} ) \).

2.2.2 Challenges in isotonic separation

The major concern in isotonic separation is the computational complexity of the training phase. The time complexity of isotonic separation is dominated by the time needed to solve the LPP, which makes it highly complex to use isotonic separation on large data sets. Besides this high computational time complexity, Chandrasekaran et al. identified two major concerns in isotonic separation: feature selection and problem size reduction.

Feature selection aims to select relevant isotonic features to construct a model. There is a plethora of metrics available in the literature for feature selection [14]. Information gain measures the amount of information an independent variable carries about the dependent variable, and the chi-square statistic is used to test the dependency between the dependent and an independent variable. Ryu and Yue [2] addressed this issue by using different measures, such as information gain, mutual information, and correlation, for firm bankruptcy prediction. Ryu et al. [3] adopted a backward sequential elimination method to select relevant features in the data set. However, existing standard feature selection metrics describe the relevance or importance of features for prediction or classification; they do not capture the existence of the isotonic property and hence cannot be used to extract features with the isotonic property. Therefore, we use the correlation coefficient to extract isotonic features and improve the performance of isotonic separation.

Problem size reduction focuses on reducing the number of constraints or decision variables in the LPP. For a training set with n data points and d features/dimensions, isotonic separation develops a maximum flow network model with n + 2 nodes and m + 2n arcs, or a linear programming model with n decision variables and m constraints, where \( n \le m \le \frac{n(n - 1)}{2} \). As n increases, the size of the LPP or maximum flow network grows polynomially. Chandrasekaran et al. [5] address this issue by removing reflexive and transitive constraints from the relation R. Ryu and Yue [2] proposed a variant of isotonic separation to reduce the number of constraints/decision variables in the LPP; it uses preprocessing to eliminate duplicate instances from the data set and reformulates the LPP. Ryu et al. [3] proposed an improved version of isotonic separation that finds a maximal subset of data points for each class as a preprocessing step, reducing the decision variables and constraints of the LPP substantially. This method was trained and tested with 699 points and resulted in higher accuracy.

To filter unwanted internet content, isotonic separation was deployed using the platform for internet content selection (PICS) rating scheme [1]. A new approach was proposed to refine the model at runtime and improve the accuracy of the classifier; in other words, this scheme allows test data to become part of the boundary points. The approach compares the test data with the training set, and the boundary points are updated during testing. Its limitation is that updating the model during testing is computationally expensive; it also requires the training set to be available during testing and consumes more memory and resources. For problems with unknown isotonic properties, these must be tested and identified from the training data [15].

Recently, meta-heuristics were proposed to overcome the potential weakness of isotonic separation, namely the high computational complexity of solving a large-scale LPP as the data set grows [16]. Evolutionary isotonic separation (EIS) is one such technique, which deploys a meta-heuristic to address the issue of increasing complexity in solving large-scale LPPs in isotonic separation [17]. This approach deploys a genetic algorithm [18, 19] to find the optimum or near-optimum solution for the LPP. It uses a population of random chromosomes and converts them into feasible chromosomes using a newly introduced slack vector. The approach iterates until the convergence criterion (the number of misclassifications is close to a user-specified threshold) is met. The threshold value used in this convergence criterion is not exact, and two major concerns arise when considering its practical use: model complexity and time complexity. EIS yields a near-optimum solution with more misclassifications, which affects the quality of the model and the classification performance. Moreover, even though EIS gives better results, its evolutionary framework takes more time to find a near-optimum solution for large data sets.

The factors that increase computing time of the evolutionary framework are as follows:

  i. Many chromosomes are needed in the evolutionary framework to search for the optimum solution in the search space.

  ii. The evolutionary framework chooses decision variables in an arbitrary order when converting a random chromosome into a feasible chromosome, which increases the computing time of the conversion for the population.

  iii. The evolutionary framework fails to obtain the exact threshold value in the optimum convergence criterion to terminate the iterative procedure of the GA.

The proposed work addresses the above issues with a hybrid algorithm in the training part of isotonic separation. It resolves the first bottleneck by deploying a PSO-based meta-heuristic to find the optimum solution for the LPP; in general, PSO requires only a small number of particles to obtain the optimum solution. It overcomes the second issue by modeling the LPP as a directed acyclic graph (DAG) and employing topological ordering on the DAG; topological ordering arranges the decision variables based on in-degree and reduces the number of iterations in the conversion of a random chromosome into a feasible chromosome. It addresses the third issue by computing a threshold value from the training set and using it in the convergence criterion to obtain the near-optimum solution faster.

2.2.3 Machine learning approaches for isotonic classification

There are some proposals for adapting existing machine learning algorithms to handle isotonic data. K-nearest neighbor (k-NN) is one such algorithm for learning isotonic functions. In monotonic k-NN [20], the non-isotonic training data is converted into isotonic data by relabeling the data points with minimum misclassifications using a monotonicity violation graph and a relabeling algorithm; test data are then predicted using k-NN. The monotonic iterative prototype selection algorithm (MONIPS) is used to improve the performance of monotonic k-NN [6]. Monotonic nested generalized exemplar learning (MoNGEL) classifies the data using a hybrid of instance-based and rule-based learning [21]. ID3 has been modified to learn isotonic data using an additional impurity measure called the total ambiguity score. Ensemble learning techniques such as bagging decision trees [22] and random forests [7] are also used for learning monotonic data. Daniels and Velikova [23] proposed neural networks for learning isotonic data. Isotonic models can also be used as a kernel operation in clustering.

2.3 Particle Swarm Optimization

Particle swarm optimization (PSO) [13, 24], an algorithm for solving continuous optimization problems, is inspired by the social behavior of birds in a flock. Each particle has its own position, velocity, and best position \( {\mathbf{pbest}}_{i}^{t} \), corresponding to the best personal objective value obtained up to time t. The global best particle is denoted by \( {\mathbf{gbest}}^{t} \), which represents the best particle found in the entire swarm up to time t. Each particle updates its position and velocity based on the global best and personal best particles and the inertia weight. The inertia weight determines the acceleration or deceleration in the current direction [25]. Normally, the inertia weight remains the same in all generations. A larger inertia weight may explore new areas of the search space at the beginning, but it can also yield a premature solution because the particle may overshoot the global best. To improve the performance of PSO, a varying inertia weight can be used in which the inertia weight is kept large (wmax) at the beginning to explore new areas in the search space and is decreased linearly until it reaches the lower bound (wmin). When the inertia weight becomes small, the algorithm fine-tunes the solution. After the velocity has been updated, but before the position update, the velocity is clamped to the range [− vmax, vmax] to control excessive roaming of particles outside the search space [26]. Many variants of PSO, such as BPSO [27], the modified BPSO [28], and the probability BPSO [29], are available in the literature. Typical applications of PSO include combinatorial optimization problems, scheduling, clustering large databases, dynamic clustering, and classification [24, 30]. These works motivate us to use PSO for isotonic separation.

2.4 Problem statement

The LPP in Eq. (8) can be represented as a model \( M = (U,\varOmega ,g) \), where U is a search space defined over a finite set of decision variables \( \hat{y}_{i} ,i = 1,2, \ldots ,n \), \( \varOmega \) is the set of constraints among the variables, \( \hat{y}_{i} - \hat{y}_{j} \ge 0\,\,{\text{for}}\,\,(i,j) \in R \), and \( g \) is the objective function \( g(Y) = \alpha \sum\nolimits_{{i \in A_{1} }} {(1 - \hat{y}_{i} ) + \beta } \sum\nolimits_{{i \in A_{0} }} {\hat{y}_{i} } \). A feasible solution \( Y \in U \) is an assignment of values to variables that satisfies all constraints in \( \varOmega \). A solution \( Y^{*} \in U \) is called optimum iff \( g(Y^{*} ) \le g(Y)\,\forall Y \in U \). This can be represented as follows:

$$ Y^{*} = \mathop {\arg \hbox{min} }\limits_{Y} (g(Y)) $$
(12)

Given the LPP with n decision variables, m constraints, and an objective function, the main objective of the proposed work is to find a solution \( Y^{*} \) faster than EIS, with the help of a newly computed threshold value in the convergence criterion and the ordering of decision variables using topological sort. It also aims to improve the predictive ability of the proposed work using a new correlation-based feature selection technique.

3 Meta-heuristic Isotonic Separation with a Convergence Criterion-based PSO (MeHeIS–CPSO)

MeHeIS–CPSO is a hybrid classification technique in which a correlation-based feature selection algorithm, used as a preprocessing step, selects isotonic features from the data set. It also deploys a meta-heuristic PSO in the training part of isotonic separation to acquire the optimum or near-optimum solution for the large-scale LPP. The architecture of the proposed MeHeIS–CPSO is depicted in Fig. 1. The key steps are as follows:

Fig. 1 Block diagram of training part of MeHeIS–CPSO

  i. The best subset of isotonic features is selected from the data set.

  ii. The given training set is formulated as an LPP.

  iii. The LPP is modeled as a directed acyclic graph, and the decision variables are ordered using topological sort.

  iv. The optimum or near-optimum solution for the LPP is obtained using PSO.

  v. The isotonic separator is constructed from the solution.

The proposed MeHeIS–CPSO creates a swarm of particles in which each particle represents a solution to the LPP. In general, each particle represents a potential feasible solution to the problem; for a large-scale LPP, however, it is difficult to create a particle with a feasible solution. Therefore, the proposed algorithm creates random particles and preprocesses the swarm to transform these random infeasible particles into feasible ones. The fitness function is evaluated and the fitness value of the swarm is checked for convergence. The proposed work introduces a new convergence criterion determined from the training set and the isotonic property. The convergence criterion formulated in this approach is an exact value, and it reduces the convergence time and improves the quality of the solution. The criterion measures the number of instances that violate the isotonic constraints; in other words, it gives the maximum number of misclassifications allowed in a particle. The velocity of each particle is calculated using personal and global experience, and the position of the particle is then updated. The framework forces the swarm to satisfy the boundary constraints after updating; however, particles in the swarm may still violate isotonic constraints, so these particles are again converted into feasible particles to keep the swarm within feasible bounds. This iterative procedure is terminated when the global best particle reaches the convergence criterion. Upon termination, the global best particle contains the solution for the LPP and is passed on to isotonic separation to construct the model. Then, the d-dimensional space is separated into isotonic and non-isotonic regions and the boundary points are obtained. In this way, MeHeIS–CPSO not only achieves a solution for a large-scale LPP but also obtains good generalization capability.

3.1 Feature selection

To improve the performance of the proposed hybrid classifier, feature selection is applied to the data sets to select relevant features based on the correlation coefficient between features within the same class. The proposed feature selection algorithm is supervised, since it considers the class label when selecting isotonic features. Generally, the correlation coefficient between two variables lies in [− 1, + 1], and correlation is calculated between any two continuous features over the whole data set. Here, the main aim of the feature selection algorithm is to choose the best subset of isotonic features. Two features a1, a2 are said to be isotonic if there exists a positive correlation between a1 and a2 in the class-separated data. The following procedure is applied to the data sets to select the best subset of features (a code sketch is given after the list):

  1. The data set is separated into two partitions based on the class label.

  2. For each feature fi in each partition:

     a. Compute the correlation coefficient between fi and every other feature fj in the same partition.

     b. Repeatedly choose the pairs of features (fi, fj) that are positively correlated.
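A minimal sketch of this procedure is given below, assuming numeric features and binary labels; the rule of keeping a feature when it is positively correlated with at least one other feature in every class partition is one plausible reading of step 2b, and the threshold min_corr is an illustrative parameter.

```python
import numpy as np

def select_isotonic_features(X, y, min_corr=0.5):
    """Correlation-based selection sketch: keep a feature if it is positively
    correlated with at least one other feature inside every class partition."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    d = X.shape[1]
    kept_per_class = []
    for label in np.unique(y):
        part = X[y == label]
        corr = np.corrcoef(part, rowvar=False)       # d x d correlation matrix
        np.fill_diagonal(corr, -np.inf)              # ignore self-correlation
        kept_per_class.append({i for i in range(d) if np.max(corr[i]) > min_corr})
    return sorted(set.intersection(*kept_per_class))

# toy data: features 0 and 1 rise together in both classes, feature 2 is noise
rng = np.random.default_rng(0)
base = rng.uniform(size=(40, 1))
X = np.hstack([base,
               base + 0.05 * rng.normal(size=(40, 1)),
               rng.uniform(size=(40, 1))])
y = (base[:, 0] > 0.5).astype(int)
print(select_isotonic_features(X, y))    # expected to contain features 0 and 1
```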

3.2 Generating graph model of LPP

Given a training set with the isotonic features obtained from the above step, the LPP (8) is formed. The LPP is represented as a directed acyclic graph (DAG) in which nodes are decision variables and edges are constraints of the LPP. Then, the nodes are ordered using topological sort [31]. Intuitively, topological sort provides a linear ordering of decision variables \( \hat{y}_{1} \prec \hat{y}_{2} \prec \cdots \prec \hat{y}_{n} \), such that for every ordered pair (i, j) in R, \( \hat{y}_{i} \) comes before \( \hat{y}_{j} \) in the ordering.

Consider a data set with 8 data points and two features x1 and x2, as shown in Table 1. Points 1, 2, 5, and 8 belong to class 0 and the rest belong to class 1. Based on the isotonic property, a relation R is created, and reflexive and transitive pairs are then eliminated. The remaining ordered pairs of R in Eq. (13) become the constraints of the LPP and can be represented as a DAG, as shown in Fig. 2. The topological order of decision variables in the LPP is \( \{ \hat{y}_{8} ,\hat{y}_{6} ,\hat{y}_{4} ,\hat{y}_{7} ,\hat{y}_{5} ,\hat{y}_{1} ,\hat{y}_{3} ,\hat{y}_{2} \} \). With α = β > 0, the LPP is formulated as in Eq. (14).

$$ R = \left\{ \begin{aligned} & (3,2),(4,1),(4,3),(5,2),(6,1),(6,3), \\ & (6,5),(7,2),(8,3),(8,5),(8,7) \\ \end{aligned} \right\} $$
(13)
$$ \begin{aligned} & \hbox{min} \,\,\hat{y}_{1} + \hat{y}_{2} - \hat{y}_{3} - \hat{y}_{4} + \hat{y}_{5} - \hat{y}_{6} - \hat{y}_{7} + \hat{y}_{8} \\ & {\text{subject}}\,\,{\text{to}}\,\,{\text{the}}\,\,{\text{constraints}} \\ & \hat{y}_{3} - \hat{y}_{2} \ge 0\quad \hat{y}_{4} - \hat{y}_{1} \ge 0\quad \hat{y}_{4} - \hat{y}_{3} \ge 0 \\ & \hat{y}_{5} - \hat{y}_{2} \ge 0\quad \hat{y}_{6} - \hat{y}_{1} \ge 0\quad \hat{y}_{6} - \hat{y}_{3} \ge 0 \\ & \hat{y}_{6} - \hat{y}_{5} \ge 0\quad \hat{y}_{7} - \hat{y}_{2} \ge 0\quad \hat{y}_{8} - \hat{y}_{3} \ge 0 \\ & \hat{y}_{8} - \hat{y}_{5} \ge 0\quad \hat{y}_{8} - \hat{y}_{7} \ge 0 \\ & {\text{where}}\,\,0 \le \hat{y}_{i} \le 1,\;1 \le i \le 8 \\ \end{aligned} $$
(14)
Table 1 Data points obtained based on the figure in [5]
Fig. 2 DAG representation of relation R
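The ordering quoted above can be reproduced with a Kahn-style topological sort over the DAG of Fig. 2, with an edge i → j for every (i, j) in R. The sketch below is illustrative; the paper does not specify how ties among zero in-degree nodes are broken, and the tie-breaking used here (largest index first) happens to reproduce the order listed in the text.

```python
from collections import defaultdict, deque

def topological_order(R, nodes):
    """Kahn's algorithm on the constraint DAG: edge i -> j for every (i, j) in R."""
    nodes = list(nodes)
    indeg = {v: 0 for v in nodes}
    succ = defaultdict(list)
    for i, j in R:
        succ[i].append(j)
        indeg[j] += 1
    queue = deque(sorted((v for v in nodes if indeg[v] == 0), reverse=True))
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return order

# relation R of Eq. (13)
R = [(3, 2), (4, 1), (4, 3), (5, 2), (6, 1), (6, 3),
     (6, 5), (7, 2), (8, 3), (8, 5), (8, 7)]
print(topological_order(R, range(1, 9)))   # [8, 6, 4, 7, 5, 1, 3, 2] with this tie-breaking
```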

3.3 Swarm creation

A swarm of N continuous random particles \( {\mathbf{p}}_{1} ,{\mathbf{p}}_{2} , \ldots ,{\mathbf{p}}_{N} \) is created, in which each particle \( {\mathbf{p}}_{i} = (p_{i1} ,p_{i2} ,p_{i3} , \ldots ,p_{in} ) \) is an n-dimensional vector in [0,1]. In the context of the problem, a particle denotes a solution to the LPP (8); \( p_{ik} \) is a value for the decision variable \( \hat{y}_{k} \) in the LPP and satisfies the boundary constraint of (8). To solve the LPP in Eq. (14), three initial random particles are created with eight values each, as shown in Table 2.

Table 2 Swarm of particles

3.4 Parameter initialization

The PSO parameters such as velocity and the maximum and minimum inertia weights are initialized. In addition, MeHeIS–CPSO proposes a convergence criterion based on the maximum number of misclassifications allowed in the optimum solution, which can be measured from the relation R. For each ordered pair \( (i,j) \in R \), let \( \hat{y}_{i} - \hat{y}_{j} \ge 0 \) be the corresponding constraint of the LPP, (yi, yj) the actual labels, and (\( \hat{y}_{i} \), \( \hat{y}_{j} \)) the solution to be obtained from the LPP. If \( \hat{y}_{i} \) is 1, the constraint is satisfied irrespective of the value of \( \hat{y}_{j} \) (1 or 0). If \( \hat{y}_{i} \) is 0, then \( \hat{y}_{j} \) should be 0; the isotonic constraint is violated if any such \( \hat{y}_{j} \) is 1. In this scenario, \( \hat{y}_{i} \) must be set to 1 to make the constraint feasible, and this leads to an error in training. Let S be the set of data points that violate isotonic constraints; the cardinality of S provides the maximum number of misclassifications allowed in the meta-heuristic framework. Based on this, a particle is said to be the best particle if the number of misclassifications in the particle is less than or equal to the cardinality of S.

$$ S = \{ i|i \in A_{0} ,\,\exists j \in A_{1} \,\,{\text{such}}\,\,{\text{that}}\,\,(i,j) \in R\} $$
(15)

For the above case study, the ordered pairs that violate the isotonic property are identified based on Eqs. (13) and (15). The pairs (8, 3) and (8, 7) violate the isotonic constraints, so S = {8} and the cardinality of S is 1. This forms the convergence criterion: the fitness value of the best particle must be less than or equal to 1.
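Following Eq. (15), the threshold can be computed directly from R and the class memberships; the short sketch below reproduces |S| = 1 for the running example.

```python
def convergence_threshold(R, A0, A1):
    """|S| of Eq. (15): class-0 points that dominate at least one class-1 point."""
    S = {i for (i, j) in R if i in A0 and j in A1}
    return len(S)

R = [(3, 2), (4, 1), (4, 3), (5, 2), (6, 1), (6, 3),
     (6, 5), (7, 2), (8, 3), (8, 5), (8, 7)]
A0, A1 = {1, 2, 5, 8}, {3, 4, 6, 7}
print(convergence_threshold(R, A0, A1))   # 1, because only point 8 violates isotonicity
```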

3.5 Feasible particle conversion

MeHeIS–CPSO takes advantage of the uniform structure of the constraints of the LPP in Eq. (8) and transforms a random particle into a feasible particle using a slack vector [17]. In MeHeIS–CPSO, a slack vector \( {{\Delta p}}_{i} = (\Delta p_{i1} ,\Delta p_{i2} , \ldots ,\Delta p_{in} ) \) for a particle \( {\mathbf{p}}_{i} \) holds the minimum values needed to make the particle \( {\mathbf{p}}_{i} \) feasible. The conversion is performed on the decision variables in the order obtained from the topological sort. Let \( \hat{y}_{j} \) be the next decision variable in the topological order. The framework fetches all the constraints associated with \( \hat{y}_{j} \), i.e., \( \hat{y}_{j} - \hat{y}_{k} \ge 0\,\,\forall k\,\,{\text{where}}\,\,k \ne j,\,1 \le k \le n \), checks the particle against these constraints, and updates the slack value \( \Delta p_{ij} \). Let \( d_{jk} \) be the minimum slack value needed to make \( p_{ij} \) feasible with respect to the constraint \( \hat{y}_{j} - \hat{y}_{k} \ge 0 \), and let \( \Delta p_{ij} \) be the minimum value needed to make \( p_{ij} \) feasible with respect to all constraints that include \( \hat{y}_{j} \); \( \Delta p_{ij} \) is therefore the maximum of the \( d_{jk} \).

$$ \begin{aligned} d_{jk} & = \left\{ {\begin{array}{*{20}l} 0 \hfill &\quad {{\text{if}}\,\,p_{ij} - p_{ik} \ge 0} \hfill \\ {p_{ik} - p_{ij} } \hfill &\quad {\text{otherwise}} \hfill \\ \end{array} } \right. \\ \Delta p_{ij} & = \mathop {\hbox{max} }\limits_{1 \le k \le n, \, k \ne j} \{ d_{jk} \} \\ p_{ij} & = p_{ij} + \Delta p_{ij} \\ \end{aligned} $$
(16)

Feasible particle \( {\mathbf{p}}_{i}^{f} \) is obtained as in Eq. (17) by adding slack vector \( \Delta {\mathbf{p}}_{i} \) with random particle \( {\mathbf{p}}_{i} \). These feasible particles are continuous in [0,1].

$$ {\mathbf{p}}_{i}^{f} = {\mathbf{p}}_{i} + \Delta {\mathbf{p}}_{i} $$
(17)

The decision variables and constraints are checked in the order obtained from the topological sort. From Table 2, the particle P1 = {0.8, 0.3, 0.5, 0.5, 0.4, 0.7, 0.9, 0.2} is considered for this illustration. The procedure starts with the decision variable \( \hat{y}_{8} \) and its associated constraints, i.e., \( \hat{y}_{8} - \hat{y}_{3} \ge 0 \), \( \hat{y}_{8} - \hat{y}_{5} \ge 0 \), \( \hat{y}_{8} - \hat{y}_{7} \ge 0 \), and calculates the slack values d83 = 0.3, d85 = 0.2, and d87 = 0.7. Since more than one slack value exists for the variable, the maximum is chosen as the slack value to be added to \( \hat{y}_{8} \). The slack values with respect to the next decision variable \( \hat{y}_{6} \) in the ordering are {0, 0, 0.1}, and the slack values associated with \( \hat{y}_{4} \) are {0.3, 0}. Similarly, the slack values for \( \hat{y}_{3} ,\hat{y}_{5} ,\hat{y}_{7} \) are zero. So, the slack vector is \( \Delta {\mathbf{p}}_{1} = \{ 0,0,0,0.3,0,0.1,0,0.7\} \) and \( {\mathbf{p}}_{1}^{f} = \{ 0.8,0.3,0.5,0.8,0.4,0.8,0.9,0.9\} \). The same process is applied to all the particles in the swarm, and a feasible swarm is obtained.
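The conversion of Eqs. (16)-(17) amounts to walking the topological order and raising each variable just enough to dominate its successors. The sketch below (function names are illustrative) reproduces the slack vector \( \Delta {\mathbf{p}}_{1} \) computed above for particle P1.

```python
import numpy as np

def make_feasible(p, R, order):
    """One pass of Eqs. (16)-(17): raise each variable, in topological order,
    to the largest value among the variables it must dominate."""
    p = np.asarray(p, dtype=float).copy()
    slack = np.zeros_like(p)
    succ = {v: [k for (j, k) in R if j == v] for v in order}
    for j in order:                                  # 1-based indices as in the example
        if succ[j]:
            need = max(p[k - 1] for k in succ[j]) - p[j - 1]
            slack[j - 1] = max(need, 0.0)
            p[j - 1] += slack[j - 1]
    return p, slack

R = [(3, 2), (4, 1), (4, 3), (5, 2), (6, 1), (6, 3),
     (6, 5), (7, 2), (8, 3), (8, 5), (8, 7)]
order = [8, 6, 4, 7, 5, 1, 3, 2]
p1 = [0.8, 0.3, 0.5, 0.5, 0.4, 0.7, 0.9, 0.2]
pf, slack = make_feasible(p1, R, order)
print(np.round(slack, 3))   # [0.  0.  0.  0.3 0.  0.1 0.  0.7]
print(np.round(pf, 3))      # [0.8 0.3 0.5 0.8 0.4 0.8 0.9 0.9]
```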

3.6 Fitness evaluation

Feasible particles \( {\mathbf{p}}_{i}^{f} \) are transformed into binary particles \( {\mathbf{p}}_{i}^{b} \) by applying a threshold \( \theta = 0.5 \). The fitness function g is applied to the binary particle \( {\mathbf{p}}_{i}^{b} \), and the fitness value (the misclassification penalty) is measured. The fitness value \( g({\mathbf{p}}_{i}^{b} ) \) is checked for convergence using \( g({\mathbf{p}}_{i}^{b} ) \le \left| S \right| \). The feasible particle that holds the minimum fitness value is considered the best particle. The meta-heuristic framework terminates when the best particle reaches the convergence criterion; the best particle then contains the optimum solution for the LPP, and it is fed to isotonic separation to construct the model. If the best particle does not satisfy the convergence criterion, the swarm is updated to new positions as described in the next subsection.

$$ p_{ij}^{b} = \left\{ {\begin{array}{*{20}l} 1 \hfill &\quad {p_{ij}^{f} \ge \theta } \hfill \\ 0 \hfill &\quad {\text{otherwise}} \hfill \\ \end{array} } \right. $$
(18)
$$ g\left( {{\mathbf{p}}_{i}^{b} } \right) = \alpha \sum\limits_{{j \in A_{1} }} {\left( {1 - p_{ij}^{b} } \right) + \beta } \sum\limits_{{j \in A_{0} }} {p_{ij}^{b} } $$
(19)

In our example, the binary particles for the feasible particles in the swarm are \( {\mathbf{p}}_{1}^{b} = \{ 1,0,1,1,0,1,1,1\} \), \( {\mathbf{p}}_{2}^{b} = \{ 1,1,1,1,1,1,1,1\} \), \( {\mathbf{p}}_{3}^{b} = \{ 0,0,0,0,0,0,0,0\} \), and the corresponding fitness values are 2, 4, and 4. The global best particle is the first particle in the swarm. Since the fitness value of the global best particle does not satisfy the convergence criterion, the swarm is updated as described in the next subsection by changing its velocity and position.
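Binarization and the fitness of Eqs. (18)-(19) follow directly; the sketch below evaluates the first feasible particle of the running example with α = β = 1.

```python
import numpy as np

def fitness(p_feasible, A0, A1, alpha=1.0, beta=1.0, theta=0.5):
    """Eqs. (18)-(19): threshold the feasible particle and count weighted errors."""
    p_bin = (np.asarray(p_feasible) >= theta).astype(int)
    penalty = (alpha * sum(1 - p_bin[i - 1] for i in A1)   # class-1 points predicted as 0
               + beta * sum(p_bin[i - 1] for i in A0))     # class-0 points predicted as 1
    return p_bin, penalty

A0, A1 = {1, 2, 5, 8}, {3, 4, 6, 7}
pf = [0.8, 0.3, 0.5, 0.8, 0.4, 0.8, 0.9, 0.9]
p_bin, g = fitness(pf, A0, A1)
print(p_bin.tolist(), g)    # [1, 0, 1, 1, 0, 1, 1, 1] 2.0
```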

3.7 Swarm adjustment

The global best PSO is a method in which the position of each particle is influenced by the best particle in the swarm. It uses a star topology in which information is obtained from all particles in the swarm. In this method, each particle in the swarm maintains two n-dimensional vectors, the personal best position and the velocity vector. The personal best particle \( {\mathbf{pbest}}_{i}^{t} \) corresponds to the position in the search space where particle i attained the smallest fitness value over the first t iterations. The velocity vector of particle i in iteration t, \( {\mathbf{v}}_{i}^{t} = (v_{i1}^{t} ,v_{i2}^{t} ,v_{i3}^{t} , \ldots ,v_{in}^{t} ) \), drives the optimization process and reflects both the particle's own experience and the social experience of the swarm. The global best particle \( {\mathbf{gbest}}^{t} \) is the best particle in the swarm over the first t iterations; in the context of solving the LPP, it yields the minimum number of misclassifications. Equations (20) and (21) define how the personal and global best particles are updated, respectively.

$$ \begin{aligned} {\mathbf{best}}^{t} & = \mathop {\arg min}\limits_{{{\mathbf{p}}_{i}^{f} }} (g({\mathbf{p}}_{i}^{b} )) \\ {\mathbf{pbest}}_{i}^{t} & = \left\{ {\begin{array}{*{20}l} {{\mathbf{p}}_{i}^{f} } \hfill & {t = 1} \hfill \\ {\mathop {\arg \hbox{min} }\limits_{j} (g(j))\,\,{\text{where}}\,\,j \in \left\{ {{\mathbf{p}}_{i}^{f} ,{\mathbf{pbest}}_{i}^{t - 1} } \right\}} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. \\ \end{aligned} $$
(20)
$$ {\mathbf{gbest}}^{t} = \left\{ {\begin{array}{*{20}l} {{\mathbf{best}}^{t} } \hfill & {t = 1} \hfill \\ {\mathop {\arg \hbox{min} }\limits_{j} (g(j))\,\,{\text{where}}\,\,j \in \left\{ {{\mathbf{best}}^{t} ,{\mathbf{gbest}}^{t - 1} } \right\}} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. $$
(21)

The velocity vector for the ith particle at (t + 1)th iteration \( {\mathbf{v}}_{i}^{t + 1} \) is calculated using personal best and global best particle.

$$ {\mathbf{v}}_{i}^{t + 1} = w_{t} {\mathbf{v}}_{i}^{t} + c_{p} {\mathbf{u}}({\mathbf{pbest}}_{i}^{t} - {\mathbf{p}}_{i}^{f} ) + c_{g} {\mathbf{u}}({\mathbf{gbest}}^{t} - {\mathbf{p}}_{i}^{f} ) $$
(22)

where u is an n-dimensional random vector whose elements are uniformly distributed in [0,1]. The personal best acceleration constant, \( c_{p} \), expresses how much confidence a particle has in itself, and the global best acceleration constant, \( c_{g} \), expresses how much confidence a particle has in its neighbors. MeHeIS–CPSO deploys a varying inertia weight scheme to update the particle's velocity. The inertia weight \( w_{t} \) at iteration t is calculated as given below.

$$ w_{t} = w_{\hbox{max} } + \frac{{t - t_{\hbox{max} } }}{{t_{\hbox{max} } }}(w_{\hbox{max} } - w_{\hbox{min} } ) $$
(23)

where \( w_{\hbox{min} } \) is the initial inertia weight, \( w_{\hbox{max} } \) is the final inertia weight, and \( t_{\hbox{max} } \) is the maximum number of iterations. The position of the particle is updated using Eq. (24).

$$ {\mathbf{p}}_{i} = ({\mathbf{p}}_{i}^{f} + {\mathbf{v}}_{i}^{t} )\,\bmod \,p_{\hbox{max} } $$
(24)

To satisfy the boundary constraints, the particle's new position is restricted by a maximum value \( p_{\hbox{max} } \), which is obtained from the boundary constraint of (8). Once the swarm is updated, the process from Sect. 3.5 onward is repeated until the convergence criterion is reached. At the end of the meta-heuristic framework, the global best particle contains the optimum solution for the LPP, and it is fed to isotonic separation to construct the model.

$$ \begin{aligned} \hat{y}_{j} & = \left\{ {\begin{array}{*{20}l} 0 \hfill &\quad {g{\text{best}}_{j}^{t} < \theta \,\,{\text{for}}\,\, 1\le j \le n} \hfill \\ 1 \hfill &\quad {\text{otherwise}} \hfill \\ \end{array} } \right. \\ Y^{*} & = \{ \hat{y}_{j} |\hat{y}_{j} \in \{ 0,1\} \forall j = 1\,\,{\text{to}}\,\,n\} \\ \end{aligned} $$
(25)
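One swarm-adjustment step, combining Eqs. (22)-(24) with the inertia weight of Eq. (23), can be sketched as follows; the values of \( c_{p} \), \( c_{g} \), vmax, and pmax are illustrative and not the tuned parameters reported in Table 6.

```python
import numpy as np

rng = np.random.default_rng(42)

def inertia(t, t_max, w_min=0.6, w_max=0.9):
    """Varying inertia weight of Eq. (23)."""
    return w_max + (t - t_max) / t_max * (w_max - w_min)

def update_particle(p_f, v, pbest, gbest, t, t_max,
                    c_p=2.0, c_g=2.0, v_max=0.5, p_max=1.0):
    """Eqs. (22)-(24): new clamped velocity and new position wrapped by p_max."""
    u1 = rng.uniform(size=len(p_f))
    u2 = rng.uniform(size=len(p_f))
    v_new = (inertia(t, t_max) * v
             + c_p * u1 * (pbest - p_f)
             + c_g * u2 * (gbest - p_f))
    v_new = np.clip(v_new, -v_max, v_max)     # velocity clamping
    p_new = np.mod(p_f + v_new, p_max)        # boundary constraint via Eq. (24)
    return p_new, v_new

p_f   = np.array([0.8, 0.3, 0.5, 0.8, 0.4, 0.8, 0.9, 0.9])   # feasible particle p1^f
pbest = p_f.copy()
gbest = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0])   # e.g. the eventual optimum Y*
v     = np.zeros(8)
p_new, v_new = update_particle(p_f, v, pbest, gbest, t=1, t_max=50)
print(np.round(p_new, 3))
```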

Table 3 demonstrates the meta-heuristic framework with a swarm of three particles for two iterations. As explained above, the best particle is retained for the next iteration without any modification; the remaining particles undergo the adjustment process. The framework terminates when the fitness value is at most one, and the corresponding global best particle is taken as the optimum solution. Assume that the global best particle obtained is \( Y^{*} = \{ 0,0,1,1,0,1,1,1\} \). Data point P8 is misclassified when compared with the actual class labels \( Y = \{ 0,0,1,1,0,1,1,0\} \). Boundary points for each class are obtained from the optimum solution of the LPP as follows: \( A_{1}^{*} = \{ 3,7\} \), \( A_{0}^{*} = \{ 1,5\} \).

Table 3 Evolution of particles in meta-heuristic framework

Algorithm 3.1 presents the procedure for the meta-heuristic framework. Algorithm 3.2 explains the procedure for converting random particle into feasible particle. Algorithm 3.3 gives the procedure for obtaining the exact threshold value used in convergence criterion.

Algorithm 3.1 Meta-heuristic framework of MeHeIS–CPSO
Algorithm 3.2 Conversion of a random particle into a feasible particle
Algorithm 3.3 Computation of the threshold value used in the convergence criterion

4 Theoretical analysis

This section analyzes theoretical properties of the proposed MeHeIS–CPSO with proof.

Lemma 1

The convergence criterion always gives the maximum number of misclassifications.

Proof

The objective function of the LPP (8) minimizes the misclassification penalty subject to the isotonic constraints \( \hat{y}_{i} - \hat{y}_{j} \ge 0 \). The isotonic constraints are checked against the feature values of the instances. Let \( \hat{y}_{i} \), a decision variable in the LPP, denote the predicted label for an instance i; it takes either 0 or 1. Let \( \hat{y}_{i} - \hat{y}_{j} \ge 0 \) be the associated constraint of the LPP. If \( \hat{y}_{i} \) is 1, the constraint is satisfied irrespective of the value of \( \hat{y}_{j} \) (1 or 0). If \( \hat{y}_{i} \) is 0, then \( \hat{y}_{j} \) should be 0; the isotonic constraint is violated if any such \( \hat{y}_{j} \) is 1. In this scenario, \( \hat{y}_{i} \) should be set to 1 to make the constraint feasible, which leads to an inevitable misclassification. The convergence criterion is measured by checking the labels over all isotonic constraints. Hence, the convergence criterion gives the maximum number of misclassifications.

Lemma 2

Topological sort in MeHeISCPSO reduces number of iterations.

Proof

Let \( \hat{y}_{1} \succ \hat{y}_{2} \succ \cdots \succ \hat{y}_{n} \) be the topological ordering of decision variables, based on their in-degrees, used for constructing the slack vector during the conversion of random particles into feasible particles. Let \( \hat{y}_{i} \) be a decision variable in the ordering. The procedure takes all the constraints associated with \( \hat{y}_{i} \), generates the slack value, and makes the decision variable feasible using Eq. (16). Let \( \hat{y}_{i + 1} \) be the next decision variable in the sequence obtained from the topological ordering. The set of constraints associated with \( \hat{y}_{i + 1} \) may change the value of \( \hat{y}_{i + 1} \), and the two variables may or may not be connected by an edge. If there is no edge between \( \hat{y}_{i} \) and \( \hat{y}_{i + 1} \), the feasibility of \( \hat{y}_{i} \) is not affected by \( \hat{y}_{i + 1} \). If there is an edge between \( \hat{y}_{i} \) and \( \hat{y}_{i + 1} \), \( \hat{y}_{i} \) may become infeasible due to the change in \( \hat{y}_{i + 1} \), and this infeasibility is rectified in the next iteration. Thus, the number of iterations is reduced in MeHeIS–CPSO.

Lemma 3

Time complexity for an iteration of the meta-heuristic framework is\( \theta (N(m + n)) \).

Proof

Let \( m \) be the number of constraints and \( n \) the number of decision variables in the LPP. The meta-heuristic framework uses \( N \) particles, and each particle has \( n \) real values. The computing time needed to generate a random particle is \( \theta (n) \). The time required to convert a random particle into a feasible particle and then into a binary particle is \( \theta (m) \) and \( \theta (n) \), respectively. Since there are \( N \) particles, the total computing time is \( \theta (N(m + n)) \). Fitness evaluation takes \( Nn \) computations, and the number of comparisons required to find the global best and the personal best for all particles is \( O(N) \). So, the time complexity for an iteration of the meta-heuristic framework is \( \theta (N(m + n)) \).

Lemma 4

Space complexity for an iteration of the meta-heuristic framework is\( O\left( {Nn} \right) \).

Proof

Let \( m \) be the number of constraints and \( n \) be the number of variables in the LPP. The meta-heuristic framework uses \( N \) particles and each particle has \( n \) real values. The space for creating a list using topological sort is \( O(n) \). Total memory required to create a swarm is \( O(Nn) \). During fitness evaluation, the memory required to create a slack vector is \( O(n) \). The space required for obtaining the global best, and personal best is \( O(n) \). The total amount of memory required to create a velocity vector for the swarm is \( O(Nn) \). At the worst case, the space complexity for an iteration of the meta-heuristic framework is \( O(Nn) \). □

5 Experimental analysis

To evaluate the classification performance of MeHeIS–CPSO, experiments are conducted on different data sets and the results are reported.

5.1 Data sets

The prerequisite for isotonic separation is the existence of the isotonic property between the features and class labels in the data set. Table 4 lists the data sets that satisfy the isotonic property. Among these, the first five data sets are available in the UCI machine learning repository; a synthetic data set is created with the objective of conducting experiments on a large data set [1].

Table 4 Properties of data sets taken for study

The synthetic data set has 20,000 instances and nine features; all values are integers in [0, 10], and each value denotes the likelihood of the positive class (0 = low, 10 = high). Approximately 50% of the data lie in [0, 4] and the remaining in [5, 10]. The class label of instance i is calculated using maximal aggregation, in which the maximum of the feature values of the instance is considered [1]. The instance is then labeled using an independent Bernoulli trial whose probability is, as shown empirically in [1], a monotonically non-decreasing function of \( {\mathbf{x}}_{i} \). The probability is calculated as

$$ p_{i} = \left( {\frac{{{\text{sigmoid}}(c(x_{i} - a)) - 0.5}}{{{\text{sigmoid}}(ca) - 0.5}}} \right) \times 0.5 + 0.5 $$
(26)

where \( x_{i} = \mathop {\hbox{max} }\limits_{1 \le j \le d} (x_{ij} ) \), a = 5, c is a positive real number which controls the misclassification error, and

$$ {\text{sigmoid(}}t )= \frac{1}{{1 + e^{ - t} }} $$
(27)

is the sigmoid function. The class label \( y_{i} \) is determined by

$$ y_{i} = \left\{ {\begin{array}{*{20}l} 1 \hfill &\quad {{\text{if}}\,\,p_{i} \ge 0.5} \hfill \\ 0 \hfill &\quad {\text{otherwise}} \hfill \\ \end{array} } \right. $$
(28)
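Under the reconstructed form of Eq. (26) above (with sigmoid(ca) − 0.5 in the denominator), the synthetic labels can be generated as in the sketch below; the exact per-instance sampling of the two value ranges is an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def synthetic_labels(X, a=5.0, c=1.0):
    """Labels via Eqs. (26)-(28), using the reconstructed denominator sigmoid(ca) - 0.5."""
    x_max = X.max(axis=1)                               # maximal aggregation over features
    p = (sigmoid(c * (x_max - a)) - 0.5) / (sigmoid(c * a) - 0.5) * 0.5 + 0.5
    return (p >= 0.5).astype(int)

# nine integer features in [0, 10]; roughly half of the instances are drawn
# from [0, 4] and the rest from [5, 10] (assumed sampling scheme, for illustration)
n, d = 20000, 9
low = rng.integers(0, 5, size=(n, d))
high = rng.integers(5, 11, size=(n, d))
from_low = rng.random(n) < 0.5
X = np.where(from_low[:, None], low, high)
y = synthetic_labels(X)
print(X.shape, round(float(y.mean()), 3))               # about half the labels are positive
```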

5.2 Experimental setup

First, the feature selection algorithm is applied to the whole data set, and a subset of features is obtained; Table 5 shows the selected features in each data set. Experiments are conducted using tenfold cross-validation [34]. The parameters for the proposed work are obtained based on the literature and experiments. The measures analyzed are as follows:

Table 5 Details of selected features after feature selection algorithm

Predictive Measures Accuracy, Precision, Recall, and False-positive rate.

Epoch denotes the number of iterations taken by the algorithm to obtain the optimum or near-optimum solution of the LPP.

Average training time denotes the time taken by the CPU for obtaining boundary points from the training set.

Average classification time denotes the time taken by the CPU for constructing the model from the training set and classifying labels for the test set.

Measures of Monotonicity The existence of monotonicity or isotonicity in the data sets is quantified by three measures, the non-monotonicity index (NMI), NMI-one, and NMI-two, computed using the equations below [35].

Non-monotonicity Index It is the number of ordered pairs that violate monotonicity divided by the number of possible ordered pairs in a data set. NMI should be low for isotonic data sets.

$$ {\text{NMI}}(D) = \frac{{\sum\nolimits_{i = 1}^{n} {\sum\nolimits_{j = 1}^{n} {m_{ij} } } }}{{n^{2} - n}} $$
(29)
$$ {\text{where}}\,\,m_{ij} = \left\{ {\begin{array}{*{20}l} 1 \hfill & {{\text{if}}\,\,{\mathbf{x}}_{i} \succ {\mathbf{x}}_{j} \to y_{i} \not \succ y_{j} } \hfill \\ 0 \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right.. $$
(30)

NMI-one: It is the number of clash pairs divided by the total number of ordered pairs in a data set.

$$ \begin{aligned} & {\text{NMI}}_{1} = \frac{{\sum\nolimits_{i = 1}^{n} {N{\text{clash}}_{i} } }}{{n^{2} - n}} \\ & {\text{where}} \\ & N{\text{clash}}_{i} = \left| {\{ j|\exists j,1 \le j \le n,x_{i} \succ x_{j} \to y_{i} \not \succ y_{j} \} } \right| \\ \end{aligned} $$
(31)

NMI-Two: It is the fraction of data points that violate monotonicity.

$$ \begin{aligned} & {\text{NMI}}_{2} = \frac{{\sum\nolimits_{i = 1}^{n} {{\text{Clash}}(x_{i} )} }}{n} \\ & {\text{where }} \\ & {\text{Clash}}(x_{i} ) = \left\{ {\begin{array}{*{20}l} 0 \hfill & {{\text{if}}\,\,x_{i} \succ x_{j} \to y_{i} \succ y_{j} \forall j,1 \le j \le n} \hfill \\ 1 \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. \\ \end{aligned} $$
(32)
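The three monotonicity measures can be computed directly from the dominance relation. The sketch below follows Eqs. (29)-(32); treating a clash as counted from both endpoints for NMI-one and NMI-two is one plausible reading of the formulas.

```python
import numpy as np

def monotonicity_measures(X, y):
    """NMI, NMI-one and NMI-two of Eqs. (29)-(32) for a labelled data set."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(y)
    viol = np.zeros((n, n), dtype=bool)          # m_ij of Eq. (30)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(X[i] >= X[j]) and y[i] < y[j]:
                viol[i, j] = True                # x_i dominates x_j but has a smaller label
    clash = viol | viol.T                        # a clash is seen from both endpoints
    denom = n * n - n
    nmi = viol.sum() / denom                     # violating ordered pairs / all ordered pairs
    nmi1 = clash.sum(axis=1).sum() / denom       # sum of per-point clash counts / all pairs
    nmi2 = clash.any(axis=1).mean()              # fraction of points involved in a clash
    return nmi, nmi1, nmi2

X = [[1, 1], [2, 2], [3, 3], [4, 4]]
y = [0, 1, 0, 1]    # the third point dominates the second but carries a smaller label
print(monotonicity_measures(X, y))  # (0.0833..., 0.1666..., 0.5)
```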

Statistical analysis is also done to find the significance of the proposed work using the paired t test and one-way ANOVA with post hoc t tests. All the experiments are performed on an Intel quad-core processor, and the results are reported below.

5.2.1 Parameters

The parameters of the meta-heuristic framework are tuned by running MeHeIS–CPSO with different parameter values on the whole data set. The parameter values for which MeHeIS–CPSO provides the highest accuracy are taken for the study and listed in Table 6.

Table 6 Parameters of MeHeIS–CPSO on different data sets

Penalty Parameters The penalty parameters α and β are set based on the literature [1]. In all the data sets, both types of misclassification are treated as equally serious and are hence penalized equally.

Inertia weight In the varying inertia scheme, wmax and wmin are set to 0.9 and 0.6, respectively, for all data sets except the synthetic data set, where they are set to 0.9 and 0.7. Acceleration coefficients are set based on the influence of the personal best and global best particles [36].

Threshold in Convergence Criterion It is measured from the training set using Algorithm 3.3.

Swarm size An important parameter in PSO algorithms is the number of particles in the swarm; the optimal number of particles is problem dependent. The study starts with 10 particles and runs the algorithm; if the algorithm fails to yield a solution, the swarm size is increased to 30. Since the search space is wide for LPPs with few constraints, many particles are required to find the solution. For large-scale LPPs with many variables and constraints, the feasible search space is small, and few particles are sufficient to find the solution.

5.3 Results and discussion

MeHeIS–CPSO has three significant characteristics: topological ordering of decision variables, a correlation-based feature selection algorithm, and a threshold used in the convergence criterion. It is therefore necessary to validate the significance and efficiency of these characteristics through experiments.

5.3.1 Efficiency of correlation-based feature selection algorithm in MeHeIS–CPSO

In order to show the efficiency of the feature selection algorithm on MeHeIS–CPSO, experiments are conducted in two scenarios: MeHeIS–CPSO without feature selection and MeHeIS–CPSO with feature selection. The measures studied in these two scenarios are the monotonicity measures, predictive measures, epoch, average training time, and average classification time.

Since the prerequisite for isotonic separation is isotonicity, the data sets are studied for the isotonic property, and the results are reported in Table 7. These measures denote the rate of violation of the isotonic property in the data sets. The results reveal that the violation of isotonicity is reduced after feature selection, and this reduction helps improve the classification performance of isotonic separation.

Table 7 Violation of isotonicity in data sets with all features and reduced features

Table 8 reports the mean number of constraints in the LPP, epochs for convergence, training time, and classification time for MeHeIS–CPSO with and without feature selection. Results show that the mean number of constraints, epochs, training time, and classification time all increase in MeHeIS–CPSO with feature selection. This indicates that the isotonicity of the data sets increases, since NMI is reduced after feature selection. As a result of the increase in isotonicity, more ordered pairs enter the quasi-ordering relation R and the number of constraints in the LPP also increases. Hence, the proposed model takes more iterations to find the optimum or near-optimum solution for the LPP generated from the data set with reduced features than from the data set with all features, and training and classification time increase accordingly. A paired t test is also done for each measure between the proposed algorithm with and without feature selection. It is observed that there is no significant difference between the two in terms of mean epochs for convergence, training time, and classification time.

Table 8 Comparative study of mean constraints, iterations, training time and classification time of MeHeIS–CPSO with feature selection and without feature selection on data sets

Predictive measures are reported in Table 9 using a 95% confidence interval on the mean values [37]. Statistical validation is also done to find the significance of the feature selection algorithm using a paired t test for each measure; * in a column indicates that the proposed algorithm with feature selection is statistically significant at the 95% confidence level. The experimental results reveal that the proposed hybrid algorithm gives better results with feature selection than without. The paired t test results show that the proposed MeHeIS–CPSO with feature selection is statistically significant in terms of the predictive measures.

Table 9 Experimental and statistical results of predictive measures of the proposed MeHeIS–CPSO with and without feature selection on all data sets (* indicates that the proposed MeHeIS–CPSO with feature selection algorithm is statistically significant in 95% confidence level.)

We conjecture that MeHeIS–CPSO with feature selection selects isotonic features and reduces non-monotonicity in the data set. As a result, it generates an LPP with more constraints, and the framework consequently takes more iterations to converge. Training and classification times depend to a great extent on the number of iterations to convergence, so MeHeIS–CPSO with feature selection consumes more time for model construction than MeHeIS–CPSO without feature selection. Because feature selection increases isotonicity, however, more precise and broader isotonic regions are obtained from training. Even though the reduced-feature version increases the number of iterations, the training time, and the classification time, the difference is negligible: the statistical analysis shows no significant difference in the number of constraints, training time, or classification time between MeHeIS–CPSO with and without feature selection.
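To make the link between isotonicity and problem size concrete, the following toy formulation (a hypothetical LP written in the spirit of isotonic separation and solved with a generic LP solver, not the paper's exact model or its PSO-based solver) adds one ordering constraint per comparable pair, so a reduced-feature data set with more comparable pairs yields a larger constraint matrix.

```python
import numpy as np
from scipy.optimize import linprog

def toy_isotonic_lp(X, y):
    """Toy LP in the spirit of isotonic separation (hypothetical form):
    find scores z in [0, 1]^n that are monotone w.r.t. componentwise
    dominance while minimizing a simple misclassification cost."""
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    # Minimize sum_{y_i=0} z_i + sum_{y_i=1} (1 - z_i),
    # equivalent to minimizing c.z with c_i = +1 (class 0) or -1 (class 1).
    c = np.where(y == 0, 1.0, -1.0)
    rows = []
    for i in range(n):
        for j in range(n):
            if i != j and np.all(X[i] <= X[j]):
                row = np.zeros(n)
                row[i], row[j] = 1.0, -1.0      # z_i - z_j <= 0
                rows.append(row)
    A_ub = np.array(rows) if rows else None
    b_ub = np.zeros(len(rows)) if rows else None
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n, method="highs")
    return res.x, len(rows)   # scores and the number of ordering constraints
```

In this sketch the second return value is the constraint count; it is this count that grows when feature selection makes more pairs comparable, which mirrors the behavior discussed above.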

5.3.2 Efficiency of threshold in convergence criterion and topological sorting in MeHeIS–CPSO

To validate the use of the threshold in the convergence criterion and of topological sorting in the MeHeIS–CPSO framework, experiments are conducted in two scenarios: MeHeIS–CPSO (with feature selection) and MeHeIS-TPSO (MeHeIS with traditional PSO, i.e., MeHeIS–CPSO without feature selection, topological sorting, or the threshold). In MeHeIS-TPSO, the convergence criterion is that the fitness value of the global best particle reaches either zero or some arbitrary threshold value. Convergence details, predictive measures, mean epochs, and mean classification time are reported.

Figure 3 shows the convergence behavior of MeHeIS–CPSO, MeHeIS-TPSO, and EIS. The results reveal that MeHeIS–CPSO converges faster than EIS and MeHeIS-TPSO on all the data sets. This fast convergence is attributed to the small swarm size, the topological ordering of decision variables, and the high probability that the swarm converges to an optimum solution under the proposed convergence criterion. In EIS, convergence is slow because of the large population size, the absence of a specific ordering of decision variables, and the lack of an exact convergence criterion. Even though the experimental setup and parameters are the same for MeHeIS–CPSO and MeHeIS-TPSO, the latter takes many more iterations to converge because it neither orders the decision variables nor fixes an exact convergence criterion.

Fig. 3 Comparison of average iterations between MeHeIS–TPSO, MeHeIS–CPSO and EIS

Figure 4 shows the significance of topological sorting and the threshold. It clearly demonstrates that the number of iterations is reduced when topological sorting and the threshold are used. It is evident that topological sorting and the threshold value in the convergence criterion play a critical role in improving the convergence of the MeHeIS–CPSO framework.

Fig. 4 Comparison of average iterations in the evolution of MeHeIS–TPSO to MeHeIS–CPSO

Figure 5 shows the average training time of the IS, EIS, MeHeIS–CPSO, and MeHeIS-TPSO algorithms. The results show that IS takes the least time to construct a model on WBCD, whereas MeHeIS–CPSO performs significantly better on WEBSPAM-UK2007 and the synthetic data sets. When isotonic separation is applied to such large data sets, the LPP becomes large, and isotonic separation cannot generate the optimum solution and construct a model. Since MeHeIS-TPSO sets the threshold in the convergence criterion to zero or an arbitrary user-specified value, it consumes much time to construct a model. EIS also takes more time because of its large population size and arbitrary threshold value in the convergence criterion. Owing to the small swarm size, the topological ordering of the decision variables in the LPP, and the optimum threshold value in the convergence criterion, MeHeIS–CPSO provides significant results in terms of training time.

Fig. 5 Comparison of mean training time between MeHeIS–TPSO, MeHeIS–CPSO, EIS, and isotonic separation

Table 10 reports the mean predictive measures for MeHeIS–CPSO, MeHeIS-TPSO, and their predecessors EIS and IS. On WEBSPAM-UK2007 and the synthetic data sets, a solution could not be obtained from the LPP solvers because of the large number of constraints and decision variables in the LPP, so the accuracy measures of isotonic separation could not be generated and compared there. The results show that the proposed MeHeIS–CPSO with feature selection is superior to all its predecessors in terms of predictive measures on all data sets.

Table 10 Comparison of predictive measures between the proposed MeHeIS–CPSO and MeHeIS-TPSO and predecessors

For the statistical analysis, a one-way ANOVA is performed with the null hypothesis that there is no significant difference in the mean training times of the algorithms on all three data sets. Training time is chosen for this analysis because it varies across IS, EIS, MeHeIS-TPSO, and the proposed MeHeIS–CPSO, whereas classification time is essentially constant once the model is constructed. The F test shows that there is at least one significant difference among the mean training times of the isotonic separation variants (p = 0.0001 for all three data sets). Post hoc t tests are then conducted, and the results are reported in Table 11. The post hoc multiple comparisons [38, 39] show that the training time of IS is statistically significant on small data sets such as Haberman, Australian credit approval, Pima Indians, and WBCD, while MeHeIS–CPSO is statistically significant on WEBSPAM-UK2007 and the synthetic data sets at the 0.95 confidence level.

Table 11 Statistical analysis of training time between variants of isotonic separation on different data sets
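The ANOVA-plus-post-hoc procedure can be reproduced with standard routines; the sketch below uses hypothetical training-time samples (the actual figures are in Table 11) and plain pairwise t tests, whereas the paper applies a multiple-comparison correction [38, 39].

```python
from itertools import combinations
from scipy import stats

# Hypothetical training times (s) per run on one data set; not the paper's data.
times = {
    "IS": [12.1, 11.8, 12.4, 12.0, 12.2],
    "EIS": [30.5, 31.2, 29.8, 30.9, 30.4],
    "MeHeIS-TPSO": [25.3, 24.9, 25.8, 25.1, 25.5],
    "MeHeIS-CPSO": [14.2, 14.6, 13.9, 14.4, 14.1],
}

# One-way ANOVA: is at least one mean training time different?
f_stat, p_value = stats.f_oneway(*times.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Post hoc pairwise comparisons (uncorrected here, unlike the paper).
if p_value < 0.05:
    for a, b in combinations(times, 2):
        _, p = stats.ttest_ind(times[a], times[b])
        print(f"{a} vs {b}: p = {p:.4f}")
```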

The above study confirms that the proposed MeHeIS–CPSO with the new correlation-based feature selection is significant for classifying large data sets. Topological sorting arranges the decision variables in an order that helps make the data set isotonically consistent; this ordering is used to select the next variable to be labeled in PSO and to generate a near-optimum solution for the LPP. The threshold in the convergence criterion provides a well-defined stopping condition. As a result, MeHeIS–CPSO yields a near-optimum solution for the LPP with significant performance and faster convergence.
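As an illustration of the kind of ordering involved, the sketch below topologically sorts the componentwise dominance graph over the training points using Kahn's algorithm; the paper's exact ordering procedure and the way the order feeds the labeling step in PSO may differ in detail.

```python
from collections import defaultdict, deque
import numpy as np

def dominance_topological_order(X):
    """Kahn's algorithm on the dominance graph: edge i -> j whenever
    x_i <= x_j componentwise (ties skipped to avoid cycles). Returns
    one admissible processing order for the decision variables."""
    X = np.asarray(X)
    n = len(X)
    succ, indeg = defaultdict(list), [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and np.all(X[i] <= X[j]) and not np.all(X[i] == X[j]):
                succ[i].append(j)
                indeg[j] += 1
    queue = deque(k for k in range(n) if indeg[k] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order
```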

Results from the theoretical, empirical, and statistical analyses show that MeHeIS–CPSO with correlation-based feature selection is superior to isotonic separation in predictive performance on all data sets. In terms of training time, the results indicate that isotonic separation remains a good choice for smaller data sets. However, as the data set grows in features and data points, isotonic separation suffers from the high computational complexity of solving a large-scale LPP, and MeHeIS–CPSO becomes superior to isotonic separation and the other algorithms in both training time and performance measures.

5.3.3 Efficiency of MeHeIS–CPSO with state-of-the-art algorithms

Performance measures are compared between MeHeIS–CPSO and state-of-the-art machine learning algorithms [40] such as C4.5 [41], SVM [42], the Bayesian classifier, the back-propagation network, and k-nearest neighbor. The results in Table 12 show that MeHeIS–CPSO is superior to the other algorithms on all data sets. On WEBSPAM-UK2007 and the synthetic data sets, a solution could not be obtained from the LPP solvers in MATLAB because of the large number of constraints and decision variables in the LPP, so the accuracy measures of isotonic separation could not be generated and compared there. It is thus evident that the proposed MeHeIS–CPSO with feature selection is superior to its predecessors and to state-of-the-art machine learning algorithms in terms of predictive accuracy on all data sets.

Table 12 Comparative study of mean performance measures of MeHeIS–CPSO, with other machine learning algorithms on different data sets

Table 13 presents a comparative study of the accuracy measures of MeHeIS–CPSO and other monotonic classifiers from the literature. The results show that MeHeIS–CPSO outperforms the other algorithms.

Table 13 Comparative study of accuracy measures among MeHeIS–CPSO with other isotonic classifiers MONIPS, MoNGEL, and monotonic random forest

The experimental results confirm that the proposed MeHeIS–CPSO plays a vital role in learning an isotonic classification function from the data. The feature selection algorithm used in MeHeIS–CPSO reduces the non-monotonicity in the data set, which significantly improves the performance of the algorithm. The proposed PSO-based meta-heuristic framework solves the large-scale LPP formulated in isotonic separation and yields a near-optimum solution with the help of PSO, the threshold, and the topological ordering of decision variables. The PSO framework solves the LPP with a minimum swarm size; the topological ordering enables PSO to choose the right sequence of variables without overlap during the evaluation of constraints; and the threshold provides an exact convergence criterion. These properties make MeHeIS–CPSO a suitable technique for applying isotonic separation to large data. It converges faster than EIS with better performance measures and is also superior to other state-of-the-art machine learning algorithms in terms of performance measures.

6 Conclusion and future work

The proposed MeHeIS–CPSO classifier is a hybrid algorithm that deploys a meta-heuristic framework and a correlation-based feature selection metric to address, respectively, the problems of finding a near-optimum solution to the large-scale LPP and improving the predictive measures.

MeHeIS–CPSO overcomes the drawback of its predecessor EIS by reducing the time to solve the LPP through the ordering of the decision variables of the LPP in PSO. It also improves the performance of the classifier using a new correlation-based feature selection strategy. To assess the predictive ability of MeHeIS–CPSO empirically, experiments are conducted on data sets with the isotonic property. Experimental results show that MeHeIS–CPSO is superior to EIS and isotonic separation in training time, classification time, and classification measures on large data sets, and that it outperforms other state-of-the-art machine learning techniques in terms of performance measures. Although feature selection increases the size of the LPP, the proposed hybrid classifier handles the enlarged LPP and constructs the model effectively.

Topics that remain to be explored include evaluating meta-heuristic isotonic separation in other domains, such as internet content filtering, mail spam filtering, and other publicly available large data sets with isotonic-consistent features. MeHeIS–CPSO can be generalized to multi-way classification. Graph-based approaches can be explored along with semi-supervised learning techniques to find the optimum solution for a large-scale LPP with minimum time complexity, and reinforcement learning techniques can be investigated for solving the LPP. Conformal prediction can be applied to isotonic separation to predict outputs with an appropriate non-conformity measure; we are currently exploring such a non-conformity measure for isotonic separation.

In addition, determining whether isotonic properties exist between the features and the output class variable constitutes a future research problem. Evolutionary and swarm optimization techniques are being studied for detecting the isotonic property in a data set. New feature selection metrics can also be proposed to select isotonic features using stepwise forward selection or backward sequential elimination, and parallel and randomized algorithms can be explored to scale up isotonic separation.