An OpenMP-based tool for finding longest common subsequence in bioinformatics
Abstract
Objective
Finding the longest common subsequence (LCS) among sequences is NP-hard. This is an important problem in bioinformatics for DNA sequence alignment and pattern discovery. In this research, we propose new CPU-based parallel implementations that can provide significant advantages in terms of execution times, monetary cost, and pervasiveness in finding LCS of DNA sequences in an environment where Graphics Processing Units are not available. For general purpose use, we also make the OpenMP-based tool publicly available to end users.
Result
In this study, we develop three novel parallel versions of the LCS algorithm on: (i) a distributed memory machine using the message passing interface (MPI); (ii) a shared memory machine using OpenMP; and (iii) a hybrid platform that utilizes both distributed and shared memory using MPI-OpenMP. The experimental results with both simulated and real DNA sequence data show that the shared memory OpenMP implementation provides at least a two-fold absolute speedup over the best sequential version of the algorithm and a relative speedup of almost 7. We provide a detailed comparison of the execution times among the implementations on different platforms with different versions of the algorithm. We also show that removing branch conditions negatively affects the performance of the CPU-based parallel algorithm on the OpenMP platform.
Keywords
Longest common subsequence (LCS); DNA sequence alignment; Parallel algorithms for LCS; LCS on MPI and OpenMP; Tool for finding LCS
Abbreviations
CUDA: compute unified device architecture
GPU: graphics processing unit
LCS: longest common subsequence
MPI: message passing interface
OpenMP: open multi-processing
UCR: University of California Riverside
NCBI: National Center for Biotechnology Information
Introduction
Finding the Longest Common Subsequence (LCS) is a classic problem in the field of computer algorithms and has diverse application domains. A subsequence of a string is another string that can be derived from the original string by deleting zero or more characters (contiguous or non-contiguous). A longest common subsequence of two given strings is the longest string that is a subsequence of both strings. The sequential version of the LCS algorithm using “equal-unequal” comparisons takes \(\varOmega \left( {\text{mn}} \right)\) time, where m and n represent the lengths of the two sequences being compared [1, 2]. Note that the problem of finding the LCS of an arbitrary number of strings is NP-hard [3, 4].
LCS has applications in many fields, including DNA sequence alignment in bioinformatics [5, 6, 7], speech and image recognition [8, 9], file comparison, and optimization of database queries [10]. In bioinformatics, pattern discovery helps to find common patterns among DNA sequences of interest, which might suggest that they are biologically related (e.g., have similar biological functions) [11]. In discovering patterns between sequences, LCS plays an important role in finding the longest common region between two sequences. Although considerable effort has been devoted to pattern discovery, algorithms face performance bottlenecks as sequence lengths increase [12]. Furthermore, with the advent of next-generation sequencing technologies, sequence data is increasing rapidly [13], which demands algorithms with the minimum possible execution time. Parallel algorithms can play a vital role in this regard.
1. A new, publicly available OpenMP-based tool for finding the length of the LCS of DNA sequences, intended for end users.
2. A detailed benchmarking of the newly developed CPU-based parallel algorithms using different performance metrics on both simulated and real DNA sequence data, where we found that our OpenMP-based algorithm provides at least 2 times absolute speedup (compared to the best sequential version) and 7 times relative speedup (compared to using only 1 thread).
3. A comparison of the newly developed OpenMP-based LCS algorithm with and without branch conditions.
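The two speedup metrics named in the second contribution can be written compactly as follows. The symbols are notation introduced here for illustration and are not taken from the original text: \(T_{\text{seq}}\) is the execution time of the best sequential version, \(T_{1}\) is the execution time of the parallel implementation with one thread, and \(T_{p}\) is its execution time with \(p\) threads.

\[
S_{\text{abs}}(p) = \frac{T_{\text{seq}}}{T_{p}}, \qquad S_{\text{rel}}(p) = \frac{T_{1}}{T_{p}}
\]

In this notation, the reported results for the OpenMP implementation correspond to \(S_{\text{abs}} \ge 2\) and \(S_{\text{rel}} \approx 7\).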
Main text
Preliminaries
Here, \(R\) is a score table consisting of the lengths of the longest common subsequences of all the possible prefixes of the two strings. The length of the longest common subsequence of \(A\) and \(B\) can be found in the cell \(R\left[ {m,n} \right]\) of table \(R\). From Eq. 1, we can see that the value of a cell \(R\left[ {i,j} \right]\) in the score table \(R\) depends on \(R\left[ {i - 1,j - 1} \right]\), \(R\left[ {i,j - 1} \right]\) and \(R\left[ {i - 1,j} \right]\).
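The dependency pattern described above can be illustrated with a minimal C sketch of the textbook LCS recurrence, which is assumed here to correspond to Eq. 1 (the equation itself is not reproduced in this text). The function name lcs_length is hypothetical and is not taken from the authors' tool.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: LCS length of a and b via the textbook recurrence.
 * Returns -1 if memory allocation fails. */
int lcs_length(const char *a, const char *b)
{
    size_t m = strlen(a), n = strlen(b);
    size_t w = n + 1;                          /* row width of the score table R */
    int *R = calloc((m + 1) * w, sizeof *R);   /* row 0 and column 0 start at 0 */
    if (!R) return -1;

    for (size_t i = 1; i <= m; i++) {
        for (size_t j = 1; j <= n; j++) {
            int diag = R[(i - 1) * w + (j - 1)];   /* R[i-1, j-1] */
            int up   = R[(i - 1) * w + j];         /* R[i-1, j]   */
            int left = R[i * w + (j - 1)];         /* R[i, j-1]   */
            R[i * w + j] = (a[i - 1] == b[j - 1]) ? diag + 1
                                                  : (up > left ? up : left);
        }
    }
    int len = R[m * w + n];                    /* the answer sits in R[m, n] */
    free(R);
    return len;
}
```

For example, lcs_length("AGCAT", "GAC") returns 2. Because each cell reads R[i, j-1] from its own row, the inner loop of this formulation cannot be split among threads directly, which is the motivation for a row-wise independent formulation.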
Row-wise independent algorithm (Version 1)
Here, \(c\) denotes the index of character \(A\left[ {i - 1} \right]\) in string \(C\).
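Since the Version 1 equations are not reproduced in this text, the following C sketch shows one common way to write a row-wise independent update with branching, following the general idea in [16]. It is an illustration under stated assumptions, not the authors' code: the alphabet string C is assumed to be "ACGT", and the names build_p and lcs_rowwise_v1 are hypothetical. P[c, j] is taken to be the last position (1-based) up to j at which B contains the c-th alphabet character, or 0 if there is none.

```c
#include <stdlib.h>
#include <string.h>

static const char ALPHABET[] = "ACGT";   /* assumed alphabet string C */
enum { ALPHA_LEN = 4 };

/* P[c*(n+1)+j] = last 1-based position <= j where b holds ALPHABET[c], else 0. */
void build_p(const char *b, size_t n, int *P)
{
    for (int c = 0; c < ALPHA_LEN; c++) {
        int *Pc = P + (size_t)c * (n + 1);
        Pc[0] = 0;
        for (size_t j = 1; j <= n; j++)
            Pc[j] = (b[j - 1] == ALPHABET[c]) ? (int)j : Pc[j - 1];
    }
}

/* LCS length with the branching row update; -1 on allocation failure.
 * Characters of a outside ALPHABET are treated as never matching. */
int lcs_rowwise_v1(const char *a, const char *b)
{
    size_t m = strlen(a), n = strlen(b), w = n + 1;
    int *P = malloc(ALPHA_LEN * w * sizeof *P);
    int *R = calloc((m + 1) * w, sizeof *R);
    int len = -1;
    if (P && R) {
        build_p(b, n, P);
        for (size_t i = 1; i <= m; i++) {
            const int *prev = R + (i - 1) * w;   /* row i-1 */
            int *cur = R + i * w;                /* row i   */
            const char *hit = memchr(ALPHABET, a[i - 1], ALPHA_LEN);
            for (size_t j = 1; j <= n; j++) {
                int p = hit ? P[(size_t)(hit - ALPHABET) * w + j] : 0;
                if (p == 0) {
                    cur[j] = prev[j];             /* A[i-1] absent from B[0..j-1] */
                } else {
                    int cand = prev[p - 1] + 1;   /* match A[i-1] with B[p-1] */
                    cur[j] = prev[j] > cand ? prev[j] : cand;
                }
            }
        }
        len = R[m * w + n];
    }
    free(P);
    free(R);
    return len;
}
```

In this sketch every cell of row i reads only row i-1 and the P table, so the cells of one row do not depend on each other.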
Row-wise independent algorithm (Version 2)
From the two versions of the row-wise independent algorithm, we can see that the calculation of the values of table P depends only on the same row. In contrast, the calculation of the values of score table R depends only on the previous row.
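To make the difference between the two versions concrete, the C sketch below updates one row of R in two ways: with an explicit branch, and with a branch-free expression. The exact Version 2 formula is not reproduced in this text, so the branch-free variant is only one possible formulation and not necessarily the authors'. The function names are hypothetical. Here prev and cur are rows i-1 and i of R, and Pc is the row of P for the current character of A.

```c
/* Branching update of one row. prev, cur and Pc each hold n+1 cells. */
void row_update_branch(const int *prev, int *cur, const int *Pc, int n)
{
    cur[0] = 0;
    for (int j = 1; j <= n; j++) {
        if (Pc[j] == 0) {
            cur[j] = prev[j];
        } else {
            int cand = prev[Pc[j] - 1] + 1;
            cur[j] = prev[j] > cand ? prev[j] : cand;
        }
    }
}

/* Branch-free update of the same row: comparison results are used as 0/1 values. */
void row_update_branchless(const int *prev, int *cur, const int *Pc, int n)
{
    cur[0] = 0;
    for (int j = 1; j <= n; j++) {
        int t = Pc[j] > 0;                /* 1 if the character occurs in B[0..j-1] */
        int cand = prev[Pc[j] - t] + 1;   /* index falls back to 0 when t == 0 */
        int s = cand > prev[j];           /* 1 if using the match improves the score */
        cur[j] = prev[j] + (t & s) * (cand - prev[j]);
    }
}
```

Both functions produce the same row. The branch-free variant replaces the conditional with extra arithmetic on every cell, so it does more work per cell than the branching variant.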
Methodology
For the calculation of the P table, each row is independent and can be calculated in parallel. Therefore, in our MPI implementation, we scattered the P table to all the processes at the beginning. After each process calculates its chunk of values, process zero gathers the partial results from all the other processes. For the calculation of score table R, the elements of each row can be scattered among the processes and gathered afterwards. These scatter and gather operations must be performed for every row. Hence, the communication and synchronization overheads are expected to be higher for the MPI implementation.
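As a small illustration of the per-row scatter described above, the C sketch below computes how the cells of one row could be divided among processes. It is plain C with no MPI calls, and the helper name partition_row is hypothetical. The resulting counts and displs arrays have the shape that MPI_Scatterv and MPI_Gatherv take for their counts and displacements arguments; whether the authors' implementation uses those particular calls is an assumption.

```c
/* Hypothetical helper: split `total` row cells over `nprocs` ranks.
 * counts[r] receives the chunk length of rank r, displs[r] its start offset. */
void partition_row(int total, int nprocs, int *counts, int *displs)
{
    int base = total / nprocs;     /* every rank gets at least this many cells */
    int extra = total % nprocs;    /* the first `extra` ranks get one more */
    int offset = 0;
    for (int r = 0; r < nprocs; r++) {
        counts[r] = base + (r < extra ? 1 : 0);
        displs[r] = offset;
        offset += counts[r];
    }
}
```

With 10 cells and 4 processes, the chunk lengths are 3, 3, 2, 2 and the offsets are 0, 3, 6, 8. Since a scatter and a gather of this kind are repeated for every row of R, their cost grows with the number of rows.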
A shared memory implementation can largely mitigate the communication and synchronization overheads of distributed memory implementations, which inspired us to develop the shared memory (OpenMP) implementation. In the OpenMP implementation, we used the worksharing construct #pragma omp parallel for (an OpenMP directive for sharing the iterations of a loop among the available threads) to compute the elements of a single row of the score table R in parallel. We tried different scheduling strategies (static, dynamic, and guided) for sharing work among the threads. The calculation of the P table was also shared among threads. Here, the outer loop was parallelized using the #pragma omp parallel for construct, as the rows are independent of each other.
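The placement of the two directives can be sketched in C as follows. This is a minimal illustration assuming the alphabet "ACGT" and a branching row update; the function name lcs_openmp is hypothetical and the code is not the authors' implementation. Compiled without OpenMP support (for example, without gcc's -fopenmp flag), the pragmas are ignored and the code runs sequentially.

```c
#include <stdlib.h>
#include <string.h>

static const char ALPHABET[] = "ACGT";   /* assumed alphabet */
enum { ALPHA_LEN = 4 };

/* Hypothetical helper: LCS length with OpenMP worksharing; -1 on allocation failure. */
int lcs_openmp(const char *a, const char *b)
{
    long m = (long)strlen(a), n = (long)strlen(b), w = n + 1;
    int *P = malloc((size_t)(ALPHA_LEN * w) * sizeof *P);
    int *R = calloc((size_t)((m + 1) * w), sizeof *R);
    if (!P || !R) { free(P); free(R); return -1; }

    /* Rows of P are independent of each other: share the outer loop. */
    #pragma omp parallel for schedule(static)
    for (int c = 0; c < ALPHA_LEN; c++) {
        int *Pc = P + c * w;
        Pc[0] = 0;
        for (long j = 1; j <= n; j++)
            Pc[j] = (b[j - 1] == ALPHABET[c]) ? (int)j : Pc[j - 1];
    }

    for (long i = 1; i <= m; i++) {
        const int *prev = R + (i - 1) * w;
        int *cur = R + i * w;
        const char *hit = memchr(ALPHABET, a[i - 1], ALPHA_LEN);
        const int *Pc = hit ? P + (hit - ALPHABET) * w : NULL;
        /* Row i reads only row i-1, so its cells can be computed concurrently.
         * schedule(dynamic) or schedule(guided) can be substituted here. */
        #pragma omp parallel for schedule(static)
        for (long j = 1; j <= n; j++) {
            int p = Pc ? Pc[j] : 0;
            int cand = p ? prev[p - 1] + 1 : 0;   /* 0 never exceeds prev[j] */
            cur[j] = prev[j] > cand ? prev[j] : cand;
        }
    }
    int len = R[m * w + n];
    free(P);
    free(R);
    return len;
}
```

The loop over i stays sequential because each row needs the previous one. In this sketch a new parallel region is opened for every row, which is the simplest arrangement but not necessarily the fastest one.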
In the hybrid MPI-OpenMP approach, we selected the optimum number of processes and threads from the experiments with the MPI and OpenMP approaches. We then scattered every row among the processes, and inside a single process we further shared the chunk among threads using #pragma omp parallel for. To account for longer DNA sequences, we optimized the space complexity of all three implementations by keeping only the current and the previous row of the score table.
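The space optimization mentioned above can be sketched in C as a pair of row buffers that swap roles after each row. This is a sequential illustration assuming the alphabet "ACGT"; the function name lcs_two_rows is hypothetical and the code is not taken from the authors' tool.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: LCS length over the assumed alphabet "ACGT",
 * keeping only two rows of R. Returns -1 on allocation failure. */
int lcs_two_rows(const char *a, const char *b)
{
    static const char alphabet[] = "ACGT";
    size_t m = strlen(a), n = strlen(b), w = n + 1;
    int *P = malloc(4 * w * sizeof *P);
    int *prev = calloc(w, sizeof *prev);   /* row i-1, starts as row 0 */
    int *cur = calloc(w, sizeof *cur);     /* row i */
    int len = -1;
    if (P && prev && cur) {
        for (size_t c = 0; c < 4; c++) {   /* build the P table */
            P[c * w] = 0;
            for (size_t j = 1; j <= n; j++)
                P[c * w + j] = (b[j - 1] == alphabet[c]) ? (int)j : P[c * w + j - 1];
        }
        for (size_t i = 1; i <= m; i++) {
            const char *hit = memchr(alphabet, a[i - 1], 4);
            for (size_t j = 1; j <= n; j++) {
                int p = hit ? P[(size_t)(hit - alphabet) * w + j] : 0;
                int cand = p ? prev[p - 1] + 1 : 0;
                cur[j] = prev[j] > cand ? prev[j] : cand;
            }
            int *tmp = prev;               /* the row just filled becomes "previous" */
            prev = cur;
            cur = tmp;
        }
        len = prev[n];                     /* last completed row, column n */
    }
    free(P);
    free(prev);
    free(cur);
    return len;
}
```

With this layout the memory needed for R grows with the length of one sequence rather than with the product of both lengths. Only the length of the LCS is available at the end, not the subsequence itself.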
Results and discussion
Data sets and specifications of the computer
Information of real DNA sequence data sets collected from NCBI [19]
#  Species type  Sequence A  Sequence B
1  Virus  Potato spindle tuber viroid (360 bp)  Tomato apical stunt viroid (359 bp)
2  Virus  Rottboellia yellow mottle virus (4194 bp)  Carrot mottle virus (4193 bp)
3  Virus  Rehmannia mosaic virus (6395 bp)  Tobacco mosaic virus (6395 bp)
4  Virus  Potato virus A (9588 bp)  Soybean mosaic virus N (9585 bp)
5  Virus  Chicken megrivirus (9566 bp)  Chicken picornavirus 4 (9564 bp)
6  Virus  Microbacterium phage VitulaEligans (17,534 bp)  Rhizoctonia cerealis alphaendornavirus 1 (17,486 bp)
7  Virus  Lucheng Rn rat coronavirus (28,763 bp)  Helicobacter phage Pt1918U (28,760 bp)
8  Virus  Lactococcus phage ASCC368 (32,276 bp)  Uncultured mediterranean phage uvMED (32,133 bp)
9  Eukaryotes  Athene cunicularia (Chromosome 25, 1,505,370 bp)  Bombus terrestris (Chromosome LG B18, 3,078,061 bp)
10  Eukaryotes  Athene cunicularia (Chromosome 25, 1,505,370 bp)  Bombus terrestris (Chromosome LG B01, 16,199,981 bp)
All the experiments were run on the University of Manitoba’s on-campus cluster computing system (Mercury machine). The cluster consists of four fully connected computing nodes with 2-gigabit Ethernet lines between every pair of nodes. Each node consists of two 14-core Intel Xeon E5-2680 v4 2.40 GHz CPUs with 128 GB of RAM. With a total of 28 cores and hyper-threading, each node can run twice as many hardware threads (56 threads) at a time.
Comparison among different approaches
Comparison between the two versions of the algorithm in OpenMP approach
In the above experiments, we used version 2 (without branching) of the row-wise independent algorithm. To compare the execution times of the two versions, we also implemented version 1. Figure 2c shows the execution times of the two versions for varying sequence sizes with 1 thread only; version 1 performs relatively better than version 2. Although version 2 removes the branching conditions, it adds more computation, which might be the reason for its longer execution times. Furthermore, CPU architectures are much better at branch prediction than GPUs. Therefore, the second version of the row-wise independent parallel algorithm performed well on GPUs [16] but not on CPUs.
Limitations
Our study investigated parallelization of the row-wise independent version of the LCS algorithm only, as it is easy to parallelize using the MPI and OpenMP frameworks. As we found that the version of the row-wise independent algorithm with branching performs better than the version without, we will investigate this version in more detail in the future. We will also investigate other versions of the algorithm with the goal of finding better parallelization strategies.
Availability and requirements
Project name: LCS row parallel (CPU)
Project home page:
Operating systems: Platform independent
Programming language: C
Other requirements: gcc 4.8.5 or later, OpenMPI version 1.10.7 or later, OpenMP version 3.1 or later
License: MIT License
Any restrictions to use by non-academics: None.
Notes
Authors’ contributions
RS formulated the problem, developed the implementations and drafted the manuscript. PT and PH conceived the study design. PH directed the data collection and analysis procedure. PT, PH and PI interpreted the results and significantly revised the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We would like to thank all the members of the Hu Lab for their valuable suggestions.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
The source code, the data sets used, and the documentation are available at https://github.com/RayhanShikder/lcs_parallel.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Funding
This work was supported in part by the Natural Sciences and Engineering Research Council of Canada and the University of Manitoba, which provided the research assistantship for Rayhan Shikder to perform the study.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
 1. Ullman JD, Aho AV, Hirschberg DS. Bounds on the complexity of the longest common subsequence problem. J ACM. 1976;23:1–12.
 2. Wagner RA, Fischer MJ. The string-to-string correction problem. J ACM. 1974;21:168–73.
 3. Maier D. The complexity of some problems on subsequences and supersequences. J ACM. 1978;25:322–36.
 4. Garey MR, Johnson DS. Computers and intractability: A guide to the theory of NP-completeness (series of books in the mathematical sciences), ed. Comput Intractability. 1979. p. 340.
 5. Ossman M, Hussein LF. Fast longest common subsequences for bioinformatics dynamic programming. Population (Paris). 2012;5:7.
 6. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci. 1988;85:2444–8.
 7. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
 8. Guo A, Siegelmann HT. Time-warped longest common subsequence algorithm for music retrieval. In: ISMIR. 2004.
 9. Petrakis EGM. Image representation, indexing and retrieval based on spatial relationships and properties of objects. Rethymno: University of Crete; 1993.
 10. Kruskal JB. An overview of sequence comparison: time warps, string edits, and macromolecules. SIAM Rev. 1983;25(2):201–37.
 11. Ning K, Ng HK, Leong HW. Analysis of the relationships among longest common subsequences, shortest common supersequences and patterns and its application on pattern discovery in biological sequences. Int J Data Min Bioinform. 2011;5:611–25.
 12. Hu J, Li B, Kihara D. Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res. 2005;33:4899–913.
 13. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, et al. Big data: astronomical or genomical? PLoS Biol. 2015;13:e1002195.
 14. Babu KN, Saxena S. Parallel algorithms for the longest common subsequence problem. In: HiPC. 1997. p. 120–5.
 15. Crochemore M, Iliopoulos CS, Pinzon YJ, Reid JF. A fast and practical bit-vector algorithm for the longest common subsequence problem. Inf Process Lett. 2001;80:279–85.
 16. Yang J, Xu Y, Shang Y. An efficient parallel algorithm for longest common subsequence problem on GPUs. In: Proceedings of the world congress on engineering. 2010. p. 499–504.
 17. Li Z, Goyal A, Kimm H. Parallel Longest Common Sequence Algorithm on Multicore Systems Using OpenACC, OpenMP and OpenMPI. In: 2017 IEEE 11th international symposium on embedded multicore/many-core systems-on-chip (MCSoC). 2017. p. 158–65.
 18. Random DNA Sequence Generator. http://www.faculty.ucr.edu/~mmaduro/random.htm. Accessed 2 Apr 2018.
 19. National Center for Biotechnology Information (NCBI). https://www.ncbi.nlm.nih.gov/. Accessed 20 Sept 2018.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.