1 Introduction

Multiple sequence alignment (MSA) is a crucial tool in molecular biology and genome analysis, and it is considered one of the most important tasks in bioinformatics [1]. It helps to construct phylogenetic trees of related DNA sequences, to predict the function and structure of unknown protein sequences by aligning them with sequences whose function and structure are already known, and to compare the structural relationships between sequences by aligning multiple sequences simultaneously and establishing correspondences between their elements [2].

Discovering the optimal alignment of multiple biological sequences is known to be an NP-complete problem [3]. It has been formulated as a combinatorial optimization problem [4], which can be solved using exact or approximate algorithms. These algorithms make it possible to exploit genetic information to determine evolutionary relationships among living organisms [3].

Recently, the trend has shifted toward iterative algorithms to tackle the MSA problem. These approaches improve a given initial alignment through a series of iterations until a stopping criterion is reached. They include the genetic algorithm (GA) [5], the simulated annealing algorithm (SA) [6], particle swarm optimization (PSO) [7], the GA-ACO algorithm [8], the ant colony algorithm [9], and others. Generally, these metaheuristics are able to find nearly optimal solutions for large instances in a reasonable processing time.

In this study, we propose a hybrid approach, called the SPSO algorithm, to solve the MSA problem. The developed model combines the PSO algorithm with the simulated annealing technique. The remainder of the paper is organized as follows: Sect. 2 presents a brief review of the research related to the proposed framework. In Sect. 3, the PSO and SA concepts are described. In Sect. 4, our proposed SPSO algorithm is explained in detail. In Sect. 5, the simulation results are provided. Finally, the study is concluded in Sect. 6.

2 Background

A brief review of related work on multiple sequence alignment using iterative methods is presented in this section. Riaz et al. [10] presented a tabu search algorithm to align multiple sequences. Their framework implements the adaptive memory features typical of tabu search to obtain multiple sequence alignments, where the quality of an alignment is measured by the COFFEE objective function. In [11], the authors proposed a novel approach to multiple sequence alignment based on particle swarm optimization (PSO) to improve a sequence alignment previously obtained with Clustal X.

In Ref. [12], the authors addressed the MSA problem by applying a genetic algorithm with a reserve selection mechanism to avoid premature convergence in the GA, obtaining better results than the classical GA. The authors in [13] proposed an algorithm based on binary PSO to address the multiple sequence alignment problem. Simulation results using the SP score measure and nine BAliBASE test cases showed that the proposed BPSO algorithm outperforms the ClustalW and SAGA algorithms.

An artificial bee colony algorithm for solving the MSA problem is introduced in [14]. In Ref. [15], Cutello et al. presented an immune-inspired algorithm (IMSA) to tackle the multiple sequence alignment problem using ad hoc mutation operators. Experimental results on BAliBASE v.1.0 show that IMSA is superior to PRRP, CLUSTALX, SAGA, DIALIGN, PIMA, MULTIALIGN and PILEUP8. In [16], the simulated annealing technique was applied to the MSA problem using a set of DNA benchmarks of HIV genes from humans and simians.

In Ref. [17], the authors proposed a hybrid of the GA and the cuckoo search algorithm to improve multiple sequence alignment; the obtained results were compared with ClustalW on five different datasets. Recently, an efficient method using a multi-objective genetic algorithm (MSAGMOGA) to discover optimal alignments was proposed in [18]. Experiments on the BAliBASE 2.0 database confirmed that MSAGMOGA obtains better results than the MUSCLE, SAGA and MSA-GA methods.

3 Preliminaries

3.1 Outline of the Particle Swarm Optimization (PSO)

Particle swarm optimization (PSO) is a population-based algorithm first developed by Kennedy and Eberhart [19], inspired by bird flocking and fish schooling. PSO uses a population of individuals, called particles, and two primary operators: velocity update and position update. During each generation, each particle moves through the search space according to its own best position and the global best position. A new velocity value for each particle is calculated based on its current velocity, its distance from its previous best position, and its distance from the global best position. The evolution of the swarm is governed by the following equations:

$$ V^{(k+1)} = w \cdot V^{(k)} + c_{1} \cdot rand_{1} \cdot \left( pbest^{(k)} - X^{(k)} \right) + c_{2} \cdot rand_{2} \cdot \left( gbest^{(k)} - X^{(k)} \right). $$
(1)
$$ X^{(k+1)} = X^{(k)} + V^{(k+1)}. $$
(2)

where:

X is the position of the particle,

V is the velocity of the particle,

w is the inertia weight,

pbest is the best position of the particle,

gbest is the global best position of the swarm,

rand1, rand2 are random values between 0 and 1,

c1, c2 are positive constants which determine the impact of the personal best solution and the global best solution on the search process, respectively,

k is the iteration number.

Concerning the stopping condition, the PSO algorithm generally terminates after a set number of iterations or when a minimum error is achieved. All parameters of the PSO algorithm are fixed experimentally in order to obtain a good compromise between the convergence time of the algorithm and the quality of the final solution.
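To make the update concrete, the following is a minimal sketch of one PSO step applying Eqs. (1) and (2) to a real-valued particle; the parameter values (w, c1, c2) and the two-dimensional search space are assumptions chosen only for the example.

```java
import java.util.Random;

// Minimal sketch of one PSO iteration over a two-dimensional real-valued particle.
public class PsoUpdateSketch {
    public static void main(String[] args) {
        Random rnd = new Random();
        double w = 0.7, c1 = 1.5, c2 = 1.5;                 // assumed values, not taken from the paper
        double[] x = {0.3, -1.2}, v = {0.0, 0.0};           // current position and velocity
        double[] pbest = {0.5, -1.0}, gbest = {1.0, 0.0};   // personal and global best positions

        for (int d = 0; d < x.length; d++) {
            double r1 = rnd.nextDouble(), r2 = rnd.nextDouble();
            // Eq. (1): inertia term + cognitive term + social term
            v[d] = w * v[d] + c1 * r1 * (pbest[d] - x[d]) + c2 * r2 * (gbest[d] - x[d]);
            // Eq. (2): move the particle with the new velocity
            x[d] += v[d];
        }
        System.out.println(java.util.Arrays.toString(x));
    }
}
```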

3.2 Outline of the Simulated Annealing (SA)

Simulated annealing (SA) is a general probabilistic local search algorithm proposed by Kirkpatrick et al. [20] to solve difficult optimization problems. It is inspired by the annealing of solids in physics: SA models the slow cooling of a solid toward its minimum-energy state as an analogy for reaching the minimum of a function. It attains an optimal or near-optimal solution through an iterative cooling process that starts from a high temperature, at which the solid's particles are in the liquid phase. Simulated annealing uses a control parameter, the temperature T, for the cooling process. The solid is allowed to reach thermal equilibrium at each temperature T, where its energy E is probabilistically distributed as given in Eq. (3), with \( k_{b} \) the Boltzmann constant.

$$ P(E) = e^{\left( \frac{-E}{k_{b} T} \right)}. $$
(3)

In the combinatorial optimization context, the search moves from a solution to one of its neighbors in the search space according to a probabilistic criterion. If the cost decreases, the move is accepted and the new solution is retained. Otherwise, the move is accepted only with a probability that depends on the cost increase and the temperature parameter T [20].
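This acceptance rule can be summarized in a short sketch; it is the standard Metropolis criterion for a minimization problem, with arbitrary example values, rather than code taken from any specific implementation.

```java
import java.util.Random;

// Metropolis acceptance rule: always keep improving moves, keep worsening
// moves with probability exp(-delta / T).
public class MetropolisSketch {
    static boolean accept(double currentCost, double candidateCost, double temperature, Random rnd) {
        double delta = candidateCost - currentCost;     // minimization: positive delta means worse
        if (delta <= 0) return true;                    // cost decreases (or stays equal): accept
        return rnd.nextDouble() < Math.exp(-delta / temperature);
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        System.out.println(accept(10.0, 12.0, 5.0, rnd));   // a worsening move, sometimes accepted
    }
}
```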

4 Proposed Method

PSO performs well at global search but is less efficient at local search: it suffers from weak local search ability and can become trapped in local minima. SA, on the other hand, is good at local search but weaker at global search. By accepting worsening candidate solutions through the Metropolis criterion, it is able to escape local optima and explore other regions of the solution space.

In order to construct an algorithm that avoids the weaknesses and exploits the advantages of both PSO and SA, we propose a hybrid approach that combines PSO with simulated annealing. The new hybrid algorithm, called Simulated Particle Swarm Optimization (SPSO), conducts both global search and local search in every iteration, which significantly increases the probability of obtaining better solutions. At each iteration, the proposed hybrid SPSO algorithm applies the PSO algorithm to guide the global search and uses SA to improve gbest, which helps PSO escape from local optima and increases the convergence speed of SPSO. The flowchart of the proposed SPSO is presented in Fig. 1.

Fig. 1. SPSO algorithm flowchart.

4.1 PSO Components of MSA Problem

Particle Representation.

Each particle represents a potential solution to the MSA problem, i.e., a sequence alignment. A particle is represented as a set of vectors, where each vector specifies the positions of the gaps in one of the sequences to be aligned [21].
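A minimal sketch of this encoding, with hypothetical sequences and gap positions, could look as follows: each particle stores one vector of gap positions per sequence, from which the gapped (aligned) sequences can be rebuilt.

```java
// One particle = one gap-position vector per sequence.
public class ParticleSketch {
    int[][] gapPositions;   // gapPositions[i] holds the column indices of the gaps in sequence i

    ParticleSketch(int[][] gapPositions) { this.gapPositions = gapPositions; }

    // Rebuild the aligned (gapped) string of one sequence from its gap-position vector.
    static String insertGaps(String seq, int[] gaps) {
        StringBuilder sb = new StringBuilder(seq);
        int[] sorted = gaps.clone();
        java.util.Arrays.sort(sorted);
        for (int g : sorted) sb.insert(g, '-');          // positions refer to the growing gapped string
        return sb.toString();
    }

    public static void main(String[] args) {
        ParticleSketch p = new ParticleSketch(new int[][]{{1, 4}, {0, 2, 5}});  // hypothetical particle
        String[] seqs = {"ACGT", "CGT"};
        for (int i = 0; i < seqs.length; i++)
            System.out.println(insertGaps(seqs[i], p.gapPositions[i]));         // prints A-CG-T and -C-GT-
    }
}
```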

Swarm Initialization.

The size of the swarm is determined by the user. The initial set of particles is generated by adding gaps to each sequence at random positions, so that all the sequences have the same length L, whose value is 1.2 times the length of the longest sequence [21].
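The sketch below illustrates this initialization under the simplest reading of the rule: every sequence is padded with gaps at random positions until it reaches L, taken as 1.2 times the longest sequence length rounded up; the toy sequences are hypothetical.

```java
import java.util.Random;

// Random initialization: pad every sequence with gaps up to a common length L.
public class InitSketch {
    static String[] randomAlignment(String[] seqs, Random rnd) {
        int longest = 0;
        for (String s : seqs) longest = Math.max(longest, s.length());
        int L = (int) Math.ceil(1.2 * longest);               // common alignment length
        String[] aligned = new String[seqs.length];
        for (int i = 0; i < seqs.length; i++) {
            StringBuilder sb = new StringBuilder(seqs[i]);
            while (sb.length() < L)
                sb.insert(rnd.nextInt(sb.length() + 1), '-'); // gap at a random position
            aligned[i] = sb.toString();
        }
        return aligned;
    }

    public static void main(String[] args) {
        String[] seqs = {"ACGTAC", "ACTAC", "AGTC"};          // hypothetical input sequences
        for (String s : randomAlignment(seqs, new Random())) System.out.println(s);
    }
}
```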

Fitness Evaluation.

The fitness value determines which alignments survive into the next generation. The sum-of-pairs (SPS) score of a multiple sequence alignment is used to compute the fitness. The score assigned to an alignment is the sum of the scores (SP) of the alignments of each pair of sequences, and the score of each pair of sequences is the sum of the scores assigned to each pair of matched symbols, as given by the substitution matrix. The score of a multiple alignment is given as follows:

$$ Score(A) = \sum\limits_{i=1}^{k-1} \sum\limits_{j=i+1}^{k} S(A_{i}, A_{j}). $$
(4)

where \( S(A_{i}, A_{j}) \) is the alignment score between two given sequences \( A_{i} \) and \( A_{j} \).
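A minimal sketch of Eq. (4) follows. A real implementation would score residue pairs with a substitution matrix such as PAM or BLOSUM; here a simple match/mismatch/gap scheme with arbitrary values is assumed purely for illustration.

```java
// Sum-of-pairs score of an alignment, Eq. (4).
public class SumOfPairsSketch {
    // Score of one pair of equally long, gapped sequences.
    static double pairScore(String a, String b) {
        double s = 0;
        for (int c = 0; c < a.length(); c++) {
            char x = a.charAt(c), y = b.charAt(c);
            if (x == '-' && y == '-') continue;       // gap-gap column: no contribution
            else if (x == '-' || y == '-') s -= 1;    // gap penalty (assumed value)
            else s += (x == y) ? 2 : -1;              // match / mismatch scores (assumed values)
        }
        return s;
    }

    // Eq. (4): sum the pairwise scores over all pairs of sequences in the alignment.
    static double sumOfPairs(String[] alignment) {
        double total = 0;
        for (int i = 0; i < alignment.length - 1; i++)
            for (int j = i + 1; j < alignment.length; j++)
                total += pairScore(alignment[i], alignment[j]);
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumOfPairs(new String[]{"AC-GT", "ACAGT", "A--GT"}));
    }
}
```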

Particle Move.

In the PSO algorithm, each particle moves towards the leader at a speed proportional to the distance between the particle and the leader. In this paper, this distance is measured as the proportion of gaps that do not match between the two alignments, according to the formula:

$$ Distance = \frac{\text{non-matching gaps}}{\text{total gaps}}. $$
(5)

To move particles towards the leader, an operator similar to the crossover operator of genetic algorithms is used [21]. It consists of selecting a crossover point that divides the alignment into two segments; a segment of the particle is then replaced with the corresponding segment of the leader. This replacement is achieved by removing from the particle the gaps that lie in the segment and then adding the gaps from the leader's segment.
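The sketch below illustrates the distance of Eq. (5) together with a simplified interpretation of the crossover-like move: the particle keeps its columns left of the cut and redistributes its remaining residues according to the leader's gap pattern, padding with trailing gaps to keep the alignment rectangular. The exact gap bookkeeping of the original operator may differ.

```java
import java.util.Random;

// Distance to the leader (Eq. (5)) and a crossover-like move towards it.
public class MoveSketch {
    // Eq. (5): fraction of the particle's gap positions that the leader does not share.
    static double distance(String[] particle, String[] leader) {
        int totalGaps = 0, nonMatching = 0;
        for (int i = 0; i < particle.length; i++)
            for (int c = 0; c < particle[i].length(); c++)
                if (particle[i].charAt(c) == '-') {
                    totalGaps++;
                    if (c >= leader[i].length() || leader[i].charAt(c) != '-') nonMatching++;
                }
        return totalGaps == 0 ? 0.0 : (double) nonMatching / totalGaps;
    }

    // One-point, crossover-like move towards the leader (simplified interpretation).
    static String[] moveTowardsLeader(String[] particle, String[] leader, Random rnd) {
        int minLen = Integer.MAX_VALUE;
        for (int i = 0; i < particle.length; i++)
            minLen = Math.min(minLen, Math.min(particle[i].length(), leader[i].length()));
        int cut = rnd.nextInt(minLen);                                    // crossover column

        String[] child = new String[particle.length];
        int maxLen = 0;
        for (int i = 0; i < particle.length; i++) {
            String tail = particle[i].substring(cut).replace("-", "");    // residues after the cut, gaps removed
            StringBuilder sb = new StringBuilder(particle[i].substring(0, cut));
            int r = 0;
            for (char ch : leader[i].substring(cut).toCharArray())        // follow the leader's gap pattern
                sb.append(ch == '-' || r >= tail.length() ? '-' : tail.charAt(r++));
            while (r < tail.length()) sb.append(tail.charAt(r++));        // leftover residues, if any
            child[i] = sb.toString();
            maxLen = Math.max(maxLen, sb.length());
        }
        for (int i = 0; i < child.length; i++) {                          // keep the alignment rectangular
            StringBuilder sb = new StringBuilder(child[i]);
            while (sb.length() < maxLen) sb.append('-');
            child[i] = sb.toString();
        }
        return child;
    }
}
```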

4.2 SA Components of MSA Problem

According to the basic elements of simulated annealing described above, the components of our proposed algorithm for the MSA problem are as follows:

Cost Function.

The cost function is used to evaluate the quality of each alignment in the swarm. To keep the same measure of solution quality, we use the sum-of-pairs (SPS) score as the cost function, i.e., the same one used in the PSO algorithm.

Initial Solution.

The generation of an initial solution is an important step towards obtaining a final improved alignment. In our method, the initial solution is constructed by randomly inserting gaps into each sequence of the alignment.

Generation of Neighbors.

In the multiple sequence alignment context, neighbors of a current solution are obtained by perturbing the gap positions in the different sequences. In this work, we apply a simple but efficient strategy to change the positions of these gaps, called the LocalShuffle operator. Its main idea is as follows: first, a random amino acid is picked from a randomly chosen sequence of the alignment, and the operator checks whether one of its neighbors is a gap. If so, the selected amino acid is swapped with the neighboring gap. If both neighbors are gaps, one of them is picked at random [22].
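A minimal sketch of the LocalShuffle operator as described above is given below; it assumes every sequence of the alignment contains at least one residue, which holds for the initialization used here.

```java
import java.util.Random;

// LocalShuffle: swap a randomly chosen residue with a neighbouring gap.
public class LocalShuffleSketch {
    static String[] localShuffle(String[] alignment, Random rnd) {
        String[] neighbour = alignment.clone();
        int s = rnd.nextInt(alignment.length);                // randomly chosen sequence
        char[] row = alignment[s].toCharArray();

        int p;                                                // randomly chosen residue (non-gap) position
        do { p = rnd.nextInt(row.length); } while (row[p] == '-');

        boolean leftGap  = p > 0 && row[p - 1] == '-';
        boolean rightGap = p < row.length - 1 && row[p + 1] == '-';
        int g = -1;
        if (leftGap && rightGap) g = rnd.nextBoolean() ? p - 1 : p + 1;   // both neighbours are gaps
        else if (leftGap)  g = p - 1;
        else if (rightGap) g = p + 1;

        if (g >= 0) {                                         // swap residue and gap
            row[g] = row[p];
            row[p] = '-';
            neighbour[s] = new String(row);
        }
        return neighbour;
    }
}
```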

Choice of Cooling Schedule.

An effective cooling schedule is essential to reducing the amount of time required by the algorithm to find an optimal solution. Several temperature-decreasing schemes have been proposed in the literature; they include static schedules and adaptive schedules [16]. Here, we use the most common cooling function, defined by \( T_{k+1} = \alpha \cdot T_{k} \). This function decreases the temperature by a factor α, where α ∈ [0.70, 1.0). The pseudo-code of our SA is given below.

figure a. Pseudo-code of the SA procedure.
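The figure itself is not reproduced here; the following is a hedged sketch of the SA component as described in the text, reusing the sumOfPairs and localShuffle sketches above: LocalShuffle moves, geometric cooling, and Metropolis acceptance applied to the SPS score (which is maximized, so the sign of the acceptance test is flipped with respect to the minimization form of Sect. 3.2). Parameter values are left to the caller.

```java
import java.util.Random;

// Sketch of the SA component for MSA: LocalShuffle moves, SPS objective, geometric cooling.
public class SaSketch {
    static String[] anneal(String[] initial, double t0, double tMin, double alpha,
                           int movesPerLevel, Random rnd) {
        String[] current = initial, best = initial;
        double currentScore = SumOfPairsSketch.sumOfPairs(current), bestScore = currentScore;

        for (double t = t0; t > tMin; t *= alpha) {            // geometric cooling, alpha in [0.70, 1.0)
            for (int m = 0; m < movesPerLevel; m++) {
                String[] candidate = LocalShuffleSketch.localShuffle(current, rnd);
                double candScore = SumOfPairsSketch.sumOfPairs(candidate);
                double delta = candScore - currentScore;
                // Metropolis rule for a maximized score: accept improvements, and
                // accept worsening moves with probability exp(delta / t).
                if (delta >= 0 || rnd.nextDouble() < Math.exp(delta / t)) {
                    current = candidate;
                    currentScore = candScore;
                }
                if (currentScore > bestScore) { best = current; bestScore = currentScore; }
            }
        }
        return best;
    }
}
```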

Having described the key components of both the PSO and SA algorithms, the pseudo-code of our SPSO procedure is summarized as follows:

figure b. Pseudo-code of the SPSO procedure.
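Again, the figure is not reproduced; the following is a hedged sketch of the hybrid loop described in Sect. 4, reusing the sketches above: every particle takes one crossover-like step towards gbest per iteration, after which SA refines gbest. The toy sequences and all parameter values are assumptions.

```java
import java.util.Random;

// Sketch of the SPSO main loop: PSO global search followed by SA refinement of gbest.
public class SpsoSketch {
    public static void main(String[] args) {
        Random rnd = new Random();
        String[] seqs = {"ACGTACGT", "ACGACGT", "AGTACG"};    // hypothetical input sequences
        int swarmSize = 10, iterations = 100;                 // assumed parameter values

        // Random initial swarm; gbest is the best initial particle.
        String[][] swarm = new String[swarmSize][];
        String[][] pbest = new String[swarmSize][];
        String[] gbest = null;
        for (int i = 0; i < swarmSize; i++) {
            swarm[i] = InitSketch.randomAlignment(seqs, rnd);
            pbest[i] = swarm[i];
            if (gbest == null
                    || SumOfPairsSketch.sumOfPairs(swarm[i]) > SumOfPairsSketch.sumOfPairs(gbest))
                gbest = swarm[i];
        }

        for (int it = 0; it < iterations; it++) {
            // Global search: move every particle towards the leader and update pbest/gbest.
            for (int i = 0; i < swarmSize; i++) {
                swarm[i] = MoveSketch.moveTowardsLeader(swarm[i], gbest, rnd);
                if (SumOfPairsSketch.sumOfPairs(swarm[i]) > SumOfPairsSketch.sumOfPairs(pbest[i]))
                    pbest[i] = swarm[i];
                if (SumOfPairsSketch.sumOfPairs(pbest[i]) > SumOfPairsSketch.sumOfPairs(gbest))
                    gbest = pbest[i];
            }
            // Local search: refine the current gbest with the SA component.
            gbest = SaSketch.anneal(gbest, 100.0, 0.1, 0.9, 20, rnd);
        }
        System.out.println("Best SPS score: " + SumOfPairsSketch.sumOfPairs(gbest));
    }
}
```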

5 Simulation and Results

The proposed approach is implemented in the Java language. All tests were carried out on a PC with a 2.66 GHz Intel Pentium IV processor and 4 GB of RAM. We conducted several experiments to demonstrate the effectiveness of our SPSO algorithm. For this purpose, a set of benchmark sequences with different identities and lengths was chosen from the BAliBASE 1.0 library [23]. The characteristics of the data used and the parameter settings of the SPSO algorithm are summarized in Tables 1 and 2, respectively.

Table 1. Characteristics of benchmark sequences.
Table 2. Parameter settings for experiments.

The results obtained by the GA, PSO, ABC and our SPSO algorithm, reported as the best, worst and average SPS values over 10 runs, are summarized in Tables 3, 4 and 5, respectively.

Table 3. Comparison results for short sequences.
Table 4. Comparison results for medium sequences.
Table 5. Comparison results for long sequences.

The results reported in Tables 3, 4 and 5 clearly show the considerable improvement in scores achieved by the proposed SPSO approach. Indeed, it produces alignments of much better quality than those of the other cited methods in all the specified categories.

To further assess the potential of our SPSO algorithm, a second experiment using other datasets is performed, where the goal is to compare our SPSO approach with the TLPSO-MSA [24] technique. In this experiment, three different sets of proteins from the BAliBASE 3.0 database [25] are selected. The average SP and TC scores [26] are reported in Table 6.

Table 6. Comparison results on the selected test cases.

From Table 6, it can be seen that our SPSO clearly outperforms the TLPSO-MSA method in terms of TC score in all cases. In terms of SP score, it finds better results than TLPSO-MSA for the RV11 and RV20 protein families and produces competitive results on the RV12 dataset.

6 Conclusion

In this work, we contributed to the ongoing research by proposing a hybrid model for finding optimized alignments for the MSA problem. The developed approach combines the random search and global convergence characteristics of PSO with the power of simulated annealing to update the global best solution. The performance of the proposed SPSO algorithm is evaluated on a set of BAliBASE benchmark problems and compares favorably with other algorithms from the literature. The results demonstrate that the proposed approach is, overall, more effective at finding better alignments in a reasonable processing time.

In the future, we will revise our score function to make it more realistic. We may also use another intelligent heuristic to generate the initial swarm in order to increase the convergence speed of the algorithm. In addition, other efficient neighborhood mechanisms could be incorporated to improve the quality of the global solution. A comparison of the proposed method with other aligners such as ClustalW, SAGA or MULTALIGN could further verify its effectiveness.