libFLASM: a software library for fixedlength approximate string matching
 1.4k Downloads
 1 Citations
Abstract
Background
Approximate string matching is the problem of finding all factors of a given text that are at a distance at most k from a given pattern. Fixedlength approximate string matching is the problem of finding all factors of a text of length n that are at a distance at most k from any factor of length ℓ of a pattern of length m. There exist bitvector techniques to solve the fixedlength approximate string matching problem in time \(\mathcal {O}(m\lceil \ell /w \rceil n)\) and space \(\mathcal {O}(m\lceil \ell /w\rceil)\) under the edit and Hamming distance models, where w is the size of the computer word; as such these techniques are independent of the distance threshold k or the alphabet size. Fixedlength approximate string matching is a generalisation of approximate string matching and, hence, has numerous direct applications in computational molecular biology and elsewhere.
Results
We present and make available libFLASM, a free opensource C++ software library for solving fixedlength approximate string matching under both the edit and the Hamming distance models. Moreover we describe how fixedlength approximate string matching is applied to solve real problems by incorporating libFLASM into established applications for multiple circular sequence alignment as well as single and structured motif extraction. Specifically, we describe how it can be used to improve the accuracy of multiple circular sequence alignment in terms of the inferred likelihoodbased phylogenies; and we also describe how it is used to efficiently find motifs in molecular sequences representing regulatory or functional regions. The comparison of the performance of the library to other algorithms show how it is competitive, especially with increasing distance thresholds.
Conclusions
Fixedlength approximate string matching is a generalisation of the classic approximate string matching problem. We present libFLASM, a free opensource C++ software library for solving fixedlength approximate string matching. The extensive experimental results presented here suggest that other applications could benefit from using libFLASM, and thus further maintenance and development of libFLASM is desirable.
Keywords
Approximate string matching Fixedlength approximate string matching Dynamic programming Software libraryAbbreviations
 ACSM
Approximate circular string matching
 ASM
Approximate string matching
 FLASM
Fixedlength approximate string matching
 MCSA
Multiple circular sequence alignment
 ME
Motif extraction
 MSA
Multiple sequence alignment
Background
Computational problem
Fixedlength approximate string matching (FLASM) is a generalisation of the classic approximate string matching (ASM) problem. While numerous algorithms exist to tackle the latter [1], few algorithms exist to tackle the former. Given two strings, a pattern x and text t, and a positive integer k, the ASM problem consists of finding all factors of t that are at a distance at most k from x with respect to a distance model. With FLASM, the problem instead focuses on identifying all factors of t that are at a distance at most k from any factor of some fixedlength ℓ≤m of x. For example, given a text t=ATGGCAAGT, a pattern x=AAGATG, and a factor length ℓ=3, the factors of length ℓ in x are AAG,AGA,GAT,ATG. Of these factors, only the first and last find exact matches in t. Given a positive integer k<ℓ, as the distance threshold for a match, there could be more matches. If k=1, the factor AGA would match AGT with one error. This is the FLASM problem.
In molecular biology, errors occur at a steady rate in genetic codes with the average singlenucleotide substitution mutation rate in humans estimated to be 1.20×10^{−8} per nucleotide per generation [2]. Apart from singlenucleotide substitution, there are nucleotide deletion and insertion mutations, but these occur much less frequently [2]. Algorithms that operate on genetic sequences need to work with datasets containing substitutions, insertions, and deletions. Although specialised error models exist for molecular biology [3], the edit [4] and Hamming distance [5] models are often used as reasonable accuracyspeed tradeoffs; these models have practical applications beyond the scope of biology.
The edit distance model uses substitutions, deletions, and insertions of letters to find the minimum number of operations to make two strings match. While each of these different operations can have a different edit cost, an edit cost of 1 is often used for all operations; and we simply call this cost the “edit distance”. The Hamming distance model is a simplified variant of edit distance which only allows substitutions and likewise, a substitution has a cost of 1.
The FLASM problem can be solved efficiently under the edit [6] or the Hamming [7] distance model; both algorithms make extensive use of bitvector techniques to compute the solution. These techniques take advantage of lowlevel processor operations on computer words. Fast bitvector operations can substitute regular operations, slashing the time complexity by a factor of w, the computer word size. Furthermore, bitvectors are used to encode alignment information succinctly, thus improving the space complexity of the algorithms by the same factor. Specifically, the problem can be solved in time \(\mathcal {O}(m\lceil \ell /w \rceil n)\) and space \(\mathcal {O}(m\lceil \ell /w\rceil)\) under both distance models. FLASM is wellsuited to solving numerous problems in computational molecular biology, including, for instance, multiple circular sequence alignment (MCSA) [8] and motif extraction [9, 10, 11]. FLASM is also very useful for solving problems outside the confines of computational biology, such as approximate circular string matching [12] as well as implementing the Chang and Marr index [13]. A brief description of these applications follows below.
Biological motivation
Circular sequences are found in various places in the domain of biology [14, 15]. The bulk of bacterial DNA is stored in a circular chromosome but bacteria have additional genes stored in plasmids, circular loops of doublestranded DNA. Bacteria may harbour a variety of plasmids, giving them distinctive properties such as the ability to produce toxins or resist the effects of antibiotics [16]. Viral genomes also happen to be circular and the ability to identify regions of interest helps us in the discovery of druggable sites. Circular DNA is also present in specialised organelles, chloroplasts and mitochondria, of eukaryotic plant and animal cells. Mitochondrial DNA (mtDNA), can be found inside mitochondria [17] and is commonly used in phylogenetic reconstruction and research into ancestry and evolution. Additionally, some proteins are noted to share homologous domains but do not align because of swaposins, domains that are relocatable and have the property of circular permutability [18, 19]. Additionally, some proteins bind on their N and C termini in order to form a circular chain [15]. The wide presence of the circular structures in biology attests the importance of analysing circular sequences and finding algorithms suitable for its study [20].
Circular sequences have no point of reference by which they are sequenced or aligned to one another and treating them as linear sequences leads to poor alignments. By identifying the correct rotation for a pair of circular sequences, sequence alignment can be carried out to produce more reliable results. This is evident when analysing the linearised human (NC_001807) and chimpanzee (NC_001643) mtDNA sequences which start at different biological regions. Without refining the sequences, the pairwise sequence alignment of the mtDNA using EMBOSS Needle [21] gives a similarity score of 85.1 % with 1,195 gaps. Aligning different rotations of the same sequences yields a similarity of 91 % with only 77 gaps [8]. MCSA involves aligning three or more circular sequences simultaneously, which is a common task in computational molecular biology. As similar to the standard setting, this alignment can be used to find patterns within protein sequences and specifically, identify homology between new and existing groups of related sequences [22]. Just as importantly, it helps in identifying novel regions or mutations that give a species or breed its distinctive properties or highlights the cause of disease. A few tools exist to tackle the MCSA problem [8, 23, 24].
Motif extraction (ME), or motif discovery, involves detecting overrepresented DNA motifs as well as conserved DNA motifs in a set of orthologous DNA sequences. Such conserved motifs may serve as potential candidates for transcription factor binding sites for a regulatory protein [25]. The pattern, which is usually fairly short, 5 to 20 baselong, can be located in different genes or several times within the same gene. ME, however, may also be relevant for extracting longer regions within DNA sequences. A study in [26] shows that there exist 481 regions longer than 200 bases that are absolutely conserved in the genomes of the human, rat, and mouse. This fact suggests the possibility of the existence of long motifs in the presence of substitutions, insertions or deletions, underscoring the necessity of ME for larger lengths. Many tools exist to tackle the ME problem for single motifs [9, 27, 28, 29].
In addition to this simple form of single motifs, structured motifs are another special type of DNA motifs. A structured DNA motif is made up of two (or even more) smaller conserved sites with a spacer (gap) located between these sites. The spacer is found in the middle of the motif due to the transcription factors binding as a dimer. This means that the transcription factor is made up of two subunits, having two separate contact points with the DNA sequence. A nonconserved spacer of mostly fixed or slightly variable length separates these contact points. Such conserved structured motifs may serve as potential candidates for transcription factor binding sites for a composite regulatory protein [30]. A few tools exist to tackle the ME problem for structured motifs [11, 31, 32].
ME can also be used on immunoglobulins and T cell receptors, which aid in the immunity of humans and other vertebrates. Immunoglobulins, or antibodies, have the main function of eliminating foreign objects and pathogens such as bacteria, by attaching to them and neutralizing their effects on the body [33]. Obtaining the structural information of an antigen is essential for monoclonal antibody engineering, which has proven successful for the treatment of diseases such as Rheumatoid Arthritis [34]. These antibodies are created by introducing synthetic DNA into specific regions of the antibody, proving that ME aids with the discovery of common regions in multiple antigens.
It is wellknown that ASM has numerous applications in computational molecular biology [1]. Averagecase optimal algorithms for ASM [35] are based on a preprocessing step, namely, implementing the Chang and Marr index [13]. This step involves constructing an array D[0..σ ^{ q }−1] of integers, such that for every string s of length q over an alphabet of size σ, s is searched for in pattern x, and D[s] contains the smallest distance between s and any match found in x. Storing D requires space \(\mathcal {O}(\sigma ^{q})\) and computing it naïvely takes time \(\mathcal {O}(\sigma ^{q} q m)\) for the edit and Hamming distance models.
Approximate circular string matching (ACSM) is the problem of finding all factors of a given text that are at a distance k from any of the m rotations of a given pattern of length m. ACSM has various applications such as finding permutations in proteins [36], but it has other uses outside of biology and has been applied extensively in pattern recognition (see [37], for instance). Many averagecase algorithms exist to tackle the ACSM problem efficiently for low error ratios k/m [12, 35, 38, 39].
Our contributions
In this article, we present and make available libFLASM, a free opensource C++ software library for solving FLASM under both the edit and the Hamming distance models. For the Hamming distance model, we use our own implementation (a preliminary version appeared in [40]) of the MaxShiftM algorithm [7], and for the edit distance model we generalise the SeqAn [41] implementation of Myers’ bitvector algorithm [6] for approximate string matching. Both of these implementations are based on dynamicprogramming approaches accelerated by the use of bitvector techniques to achieve their goals. Also, both implementations are able to handle arbitrary factor lengths; in particular, factor lengths greater than the computer word size. Moreover we make the following contributions by providing extensive experimental results to support our claims.

a) We incorporate libFLASM into a stateoftheart tool to improve the accuracy of MCSA in terms of the inferred likelihoodbased phylogenies;

b) We incorporate libFLASM into a stateoftheart tool for ME of patterns longer than what was previously possible;

c) We show how libFLASM can be used efficiently for implementing the Chang and Marr index;

d) We show how libFLASM can be used efficiently for ACSM with high error ratios.
Technical background
We provide some definitions following Crochemore et al. [42]. We define a string x of length m as an array x[0..m−1] where every x[i], 0≤i<m is a letter taken from an alphabet Σ of size Σ=σ. String ε denotes the empty string which has length 0. Given a string y, we call string x a factor of y if there exist two strings u and v, such that y=uxv. Consider the strings x,y,u, and v, such that y=uxv. We call x a prefix of y if u=ε; we call x a suffix of y if v=ε. When x is a factor of y, we say that x occurs in y. Each occurrence of x can be denoted by a position in y. Specifically, we say that x occurs at the starting position i in y when y[i..i+m−1]=x. It is often relevant to note the ending position i+m−1.
A circular string of length m may be informally defined as a standard linear string where the firstoccurring and lastoccurring letters are wrapped around and positioned next to each other. Considering this definition, the same circular string can be regarded as a set of m linear strings, which are all considered to be equivalent. Similarly, given a string x of length m, we denote by x ^{ i }=x[i..m−1]x[0..i−1], 0<i<m, the ith rotation of x and x ^{0}=x. For instance, by looking at the string x=x ^{0}=baababac, we obtain the following rotations of x: x ^{1}=aababacb, x ^{2}=ababacba, x ^{3}=babacbaa, etc.
A qgram is defined as any string of length q over alphabet Σ. Given an alphabet Σ of size σ and a value q, a debruijn sequence B(q,σ) is a circular string of length σ ^{ q } such that every possible qgram appears as a sequence of consecutive letters exactly once [43]. Several algorithms exist for this construction requiring time linear with respect to the length of the output sequence; for instance, see [44].
Definition 1.

Insertion: insert a letter in y, not present in x; (ε,b), b≠ε

Deletion: delete a letter in y, present in x; (a,ε), a≠ε

Substitution: replace a letter in y with a letter in x; (a,b), a≠b,and a,b≠ε.
Here we only consider edit distance with unitcost operations. Given an integer k>0, if δ _{ E }(x,y)≤k we say that x and y have at most k differences.
Definition 2.
Given a string x of length m and a string y of length m, the Hamming distance, denoted by δ _{ H }(x,y), is defined as the minimum number of substitution operations required to transform one string into the other.
Given an integer k>0, if δ _{ H }(x,y)≤k we say that x and y have at most k mismatches.
We provide some further definitions following Carvalho et al. [31]. We define a single motif as a string of letters on an alphabet Σ. We are given an integer value k denoting a distance threshold (error threshold). We then say that a motif on Σ koccurs in another string s on Σ, if there is a distance (edit or Hamming, for instance) of at most k between the motif and a factor of s. A set s _{1},…,s _{ N } of strings on Σ, where N≥2, the quorum 1≤q≤N, the error threshold k, and the length ℓ for the motifs are taken as input for the single motif extraction problem. Specifically, it involves identifying all motifs of length ℓ, with each motif koccurring in at least q input strings. In this case, such single motifs are called valid.
 1.
\(m^{\prime }_{1}, \ldots, m'_{\beta }\) occur in s and
 2.
the distance between the ending position of \(m^{\prime }_{i}\) and the starting position of \(m^{\prime }_{i+1}\) in s is in interval \([d_{\min _{i}}, d_{\max _{i}}]\), for all 1≤i<β.
We are now in a position to define the FLASM problem.
Definition 3.
Given a pattern x of length m, a text t of length n≥m, an integer ℓ≤m, and an integer k<ℓ, the FLASM problem under the Hamming distance model finds all factors u of t such that δ _{ H }(u,v)≤k, where v is any factor of length ℓ of x.
We extend the above definition to the edit distance model.
Definition 4.
Given a pattern x of length m, a text t of length n≥m, an integer ℓ≤m, and an integer k<ℓ, the FLASM problem under the edit distance model finds all factors u of t such that δ _{ E }(u,v)≤k, where v is any factor of length ℓ of x.
Theorem 1 ([7]).
The FLASM problem under the Hamming distance model can be solved in time \(\mathcal {O}(m\lceil \frac {\ell }{w}\rceil n)\) and space \(\mathcal {O}(m\lceil \frac {\ell }{w}\rceil)\), where w is the size of the computer word.
Given a pattern x of length m, a text t of length n≥m, and an integer k<m, Myers bitvector algorithm solves the ASM problem under the edit distance model in time \(\mathcal {O}(n\lceil \frac {m}{w}\rceil)\) and space \(\mathcal {O}(\lceil \frac {m}{w}\rceil)\) [6]. Applying this algorithm for all \(\mathcal {O}(m)\) factors of length ℓ of x separately solves the FLASM problem under the edit distance model.
Theorem 2.
The FLASM problem under the edit distance model can be solved in time \(\mathcal {O}(m\lceil \frac {\ell }{w}\rceil n)\) and space \(\mathcal {O}(m\lceil \frac {\ell }{w}\rceil)\), where w is the size of the computer word.
Applications
In this section we explain how Theorems 1 and 2 can be applied to solve problems in computational molecular biology and elsewhere.
Application I: multiple circular sequence alignment
MCSA is a generalisation of the multiple sequence alignment (MSA) problem. Informally, in MSA, the goal is to compare and visualise a set of N input sequences s _{1},…,s _{ N } such that comparing their bases with each other reduces the total cost of this alignment through a minimal application of substitutions, insertions, and deletions. As obtaining an optimal MSA is computationally expensive (using dynamicprogramming techniques) [45], an alternative approach is to use heuristic techniques [22, 46]. These techniques make an implicit assumption: the left and rightmost position of each sequence is relevant; however this does not necessarily apply to circular sequences. Circular genomes are split and sequenced at a possibly random position. So when it comes to comparing them, if they do not start with the same biological region, it is quite possible to obtain very suboptimal MSAs. To resolve this and reduce the total cost of the produced MSA, the sequences can be refined to different rotations before going through the MSA algorithm. Theorems 1 and 2 can be applied to find most similar factors among pairs of sequences (s _{ i },s _{ j }), which can then determine suitable rotations for all input sequences via agglomerative hierarchical clustering (see [8], for details).
Application II: motif extraction
For single ME, a factor of length ℓ can be considered as a single motif, extracted once or several times from a subset of N input sequences s _{1},…,s _{ N }. The extraction of motifs under the Hamming and edit distance models can be performed by applying Theorems 1 and 2, respectively, for all pairs of input sequences. Specifically, it suffices to perform all pairwise sequence comparisons (s _{ i },s _{ j }) and store the number of occurrences for each factor of length ℓ of every sequence. We can thus determine all single motifs of length ℓ, such that each motif koccurs in at least q of the input sequences. The assumption of this approach is that all reported motifs are strictly valid: they occur at least once in some sequence with no error (see [9], for details).
Structured ME is a generalisation of single ME. A structured motif is made up of single motifs with distance intervals of varying size between them. Theorems 1 and 2 can be applied to find occurrences of individual motifs m _{1},m _{2},…,m _{ β }, where β is the number of single motifs within a structured motif; and then merge these occurrences efficiently to form structured motifs. The same validity assumption holds for structured motifs (see [11], for details).
Application III: Chang and Marr index
The Chang and Marr index aims to find the minimal distance between every qgram over Σ and any factor of pattern x. We propose here the following implementation. Given a pattern x of length m consisting of σ distinct letters and a value for q, the debruijn sequence B(q,σ) is constructed using the algorithm of [44]. We also initialise an integer array D: a numerical representation of the σ ^{ q } qgrams permits constant time access to D. The minimal distance for each qgram represented in D can be computed using the Hamming or edit distance between a factor of length q from the debruijn sequence and any factor of x. A linearised version of the debruijn sequence and pattern x can be used as the input pattern and text, respectively, for FLASM; we also need to set ℓ:=q. Hence the computation can be done in time \(\mathcal {O}(\sigma ^{q} \lceil q / w \rceil m)\), instead of \(\mathcal {O}(\sigma ^{q} q m)\), by applying Theorems 1 and 2.
Application IV: approximate circular string matching
ACSM is the problem of finding all factors of a text t that are at a distance k from any of the rotations of a pattern x. We can consider all rotations of x of length m by creating string x ^{′}=x[0..m−1]x[0..m−2] and sliding a window of length m across x ^{′}. Consider the pattern x=AAGATG; we obtain rotations: AAGATG,AGATGA,GATGAA,ATGAAG,TGAAGA,GAAGAT. The ACSM problem consists in searching for these rotations in t. Theorems 1 and 2 can be directly applied by using x ^{′} as the input pattern and setting ℓ:=m. This means all factors of length ℓ=m of x ^{′}, hence all rotations of x, will be used to search in t.
Implementation
libFLASM was implemented and packaged as a dynamic library in the C++ programming language and was compiled with gcc v.4.7.3 using optimisation flags O3, funrollloops, and msse4.2. The implementation is distributed under the GNU General Public License (GPL), and it is available freely at repository https://github.com/webmasterar/libFLASM.

t The text to search

n The length of the text

x The pattern text

m The length of the pattern

ℓ The length of the factor

k The maximum distance allowed between a factor of x and a factor of t

r A flag to indicate if all or the best matches should be returned

j is the ending position of the match in t

i is the ending position of the match in x

e is the distance of the match
The edit distance function was trivially implemented using SeqAn [41], a free opensource C++ library of algorithms for sequence analysis. Specifically, we used the implementation of Myers’ bitvector algorithm [6] to perform approximate string matching under the edit distance model. We adapted it to search for all factors of x of length ℓ in t in order to solve the FLASM problem. For the Hamming distance function, we used our own implementation [40] of the MaxShiftM algorithm [7], which solves the FLASM problem under the Hamming distance model. Notice that both functions provided by the library work for an arbitrary length ℓ using multiple computer words, thus delivering the \(\mathcal {O}(m \lceil \frac {\ell }{w}\rceil n)\)time and \(\mathcal {O}(m\lceil \frac {\ell }{w}\rceil)\)space complexity of the proposed algorithms.
libFLASM may easily be incorporated into any computational pipeline; some examples are given below.
Incorporation of libFLASM into BEAR
libFLASM was incorporated into BEAR (BEst Aligned Rotations) [8], a stateoftheart tool for improving MCSA. BEAR uses the library to find most similar factors, under a prespecified distance model, between pairs of the input sequences, which can then determine suitable rotations for all input sequences. The output of BEAR can then be used as input for any MSA program.
BEAR takes a MultiFASTA file containing a set of N sequences s _{1},s _{2},…,s _{ N } as input, performs all pairwise sequence comparisons, and stores the results in a matrix M of size N×N. Each cell M[i,j] stores information about the rotation and distance of every pairwise sequence comparison (s _{ i },s _{ j }). This information is then used as input for standard agglomerative hierarchical clustering [47] in order to group closelyrelated sequences together and apply the rotations to output a refined dataset of the N sequences. This refined dataset can in turn be passed to an MSA program such as MUSCLE [46] or Clustal Ω [48] to produce the final MSA.
BEAR provides a selection of algorithms to do the pairwise sequence comparison. One of these approaches is based on FLASM. This was previously restricted to using only factors of length ℓ≤w, but with the incorporation of libFLASM, it is now possible to use arbitrary values for ℓ under the edit or the Hamming distance model.
Incorporation of libFLASM into MoTeXII
libFLASM was incorporated into MoTeXII [11] (the successor of MoTeX [9]), a stateoftheart tool to identify single and structured motifs. MoTeXII uses the library to find occurrences, under a prespecified distance model, of each factor of length ℓ of every sequence in the input sequences.
MoTeXII takes a MultiFASTA file containing a set of N sequences s _{1},s _{2},…,s _{ N } as input. For single ME, it performs all pairwise sequence comparisons and stores the number of occurrences for each factor of length ℓ of every sequence. Hence it can determine all valid motifs of length ℓ. This was previously restricted to finding motifs of length ℓ≤w, but with the incorporation of libFLASM, it is now possible to find motifs of any length ℓ. A similar approach is followed for structured ME.
Using libFLASM to implement the Chang and Marr index
The libFLASM library repository contains an example showing the use of libFLASM to implement the Chang and Marr index, that is represented by an integer array D of size σ ^{ q }. Given a pattern x and a value q, the index can be implemented under the Hamming or edit distance model. The minimal distance between each qgram in the linearised version of the debruijn sequence and any factor of x is computed using libFLASM by setting the factor length to ℓ:=q. A list of tuples of the form <i,e> can thus be created, where i is the ending position of a qgram s in the debruijn sequence and e is the minimal distance found between s and any factor in x. By using a numerical representation of s, we can easily construct array D.
Using libFLASM for performing approximate circular string matching
The libFLASM library repository comes with another example program that can be used to perform FLASM under a prespecified distance model. There was no need to modify this program to make it suitable for ACSM. All that is needed is to doubleup the pattern x of length m, by concatenating it with itself to create string x ^{′}=x[0..m−1]x[0..m−2], and then set the factor length to ℓ:=m.
Results and discussion
The experiments were run on a 64bit GNU/Linux operating system on a machine with a quadcore 64bit Intel 2.8GHz CoreI7 processor with 8GB of RAM.
Experiment I: performance
The results clearly show the linear growth of the time required to complete the computation with respect to the factor length ℓ in accordance with the time complexity of \(\mathcal {O}(m\lceil \frac {\ell }{w}\rceil n)\). Note also that the first three points, which use factors of length 32, 48, and 64 are represented on the graph as a straight horizontal line, indicating that there is no increase in the time required to compute them. This is trivially explained by the \(\lceil \frac {\ell }{w}\rceil \) factor in the time complexity.
Experiment II: approximate circular string matching
We then evaluated the performance of libFLASM against stateoftheart algorithms for solving the ACSM problem. Two fast averagecase algorithms, both of which support the edit and Hamming distance models, were identified from the literature: CMFN [35] and ACB [38]. The corresponding implementation of the algorithms were obtained via communication with the authors.
ACB is an algorithm designed with the purpose of solving the ACSM problem. CMFN is an algorithm designed to solve the general problem of multiple approximate string searching—that is, searching for a set of N patterns x _{1},x _{2},…,x _{ N } simultaneously in the text. We thus adapted CMFN to search for all rotations of x in order to compare the speed of the three programs. This was done by considering all rotations of x as the set of input patterns. Unfortunately, the implementation of ACB is restricted to searching patterns of length less than or equal to the computer word size (m≤w).
We ran the three programs with the following settings. ACB was run with the option k to set the distance threshold and the list of patterns were read in on the console from a file. CMFN was run with the options: D B Sb t 6 k k. This uses a qgram length of 6, which is noted for its suitability in [35]. Hamming distance was enabled using the option s. To test libFLASM, we used the example application that is packaged with the library. The distance model, the factor length ℓ, and the distance threshold k options were set accordingly.
The results of this experiment show that ACB and CMFN are fast for small values of k, but once k is increased we find that libFLASM becomes competitive and goes on to perform significantly better. This is explained by the fact that the time complexity of ACB and CMFN depends on k, while libFLASM is independent of k. This is clearly captured by the graphs in Fig. 2. Apart from its speed and robustness, another benefit of using libFLASM, not shown here, is alphabet independence.
Experiment III: multiple circular sequence alignment
Nine synthetic datasets were generated using INDELible [49] in order to test the accuracy of BEAR with libFLASM for improving MCSA. INDELible is a program which generates DNA or protein sequences with substitutions, insertions, and deletions at rates defined by the user. For input datasets we used files containing 12, 25, or 50 DNA sequences (number denoted by α) of length approximately 2,500 (length denoted by γ). Three unique substitution rates (denoted by θ) were applied, per dataset, using the substitution model JC69 [50], at 5 %, 20 %, and 35 %. The insertion and deletion rates were set, respectively, to 4 % and 6 % (denoted by κ and ω), relative to a substitution rate of 1. We will call these datasets the Original datasets.
We then proceeded to randomly rotate each of the sequences in the datasets to create a new set of files. We call these the Random datasets. The goal of this experiment was to use BEAR with libFLASM under the edit distance model to refine the random rotation of each of the sequences in the Random datasets. The refined datasets we would obtain after rotating the sequences are called the Restored datasets. We ran BEAR using the FLASM method for pairwise sequence comparisons under the edit distance model. We used two combinations of factor length ℓ and distance threshold k to run the experiments: ℓ=40, k=10; and ℓ=100, k=45.
We then used MUSCLE [46], a fast and accurate MSA program, to produce the alignments in PHYLIP format for each Original dataset. The alignments were then passed to RAxML [51], a program for inferring a phylogenetic tree, under the maximum likelihood criterion, for a given alignment. RAxML was used again to compare the trees against each other via calculating the pairwise Robinson and Foulds (RF) distance [52]. In particular we calculated the RF distance between the Original trees and the Random trees, as well as the distance between the Original trees and the Restored ones, to measure how well the programs had performed in refining the sequences in each of the datasets.
RF distance between the Original and Random datasets as well as the RF distance between the Original and Restored datasets using Cyclope and BEAR
Dataset <α,γ,θ,κ,ω>  Random  Cyclope  BEAR  BEAR 

ℓ=40  ℓ=100  
<12,2500,0.05,0.06,0.04>  0.000  0.000  0.000  0.000 
<12,2500,0.20,0.06,0.04>  0.000  0.000  0.000  0.000 
<12,2500,0.35,0.06,0.04>  0.000  0.000  0.000  0.000 
<25,2500,0.05,0.06,0.04>  0.000  0.000  0.000  0.000 
<25,2500,0.20,0.06,0.04>  0.000  0.000  0.000  0.000 
<25,2500,0.35,0.06,0.04>  0.045  0.045  0.000  0.000 
<50,2500,0.05,0.06,0.04>  0.085  0.000  0.021  0.000 
<50,2500,0.20,0.06,0.04>  0.043  0.000  0.000  0.000 
<50,2500,0.35,0.06,0.04>  0.043  0.000  0.021  0.000 
Elapsedtime comparison in seconds of Cyclope and BEAR
Dataset <α,γ,θ,κ,ω>  Cyclope  BEAR ℓ=40  BEAR ℓ = 100 

<12,2500,0.05,0.06,0.04>  79.09  15.92  46.53 
<12,2500,0.20,0.06,0.04>  77.47  15.06  44.52 
<12,2500,0.35,0.06,0.04>  76.76  14.85  45.44 
<25,2500,0.05,0.06,0.04>  332.69  69.81  203.78 
<25,2500,0.20,0.06,0.04>  342.94  69.28  208.85 
<25,2500,0.35,0.06,0.04>  344.50  71.14  208.82 
<50,2500,0.05,0.06,0.04>  1,317.81  293.45  851.07 
<50,2500,0.20,0.06,0.04>  1,303.51  300.37  837.66 
<50,2500,0.35,0.06,0.04>  1,359.90  286.88  854.83 
Experiment IV: motif extraction
We carried out experiments on real data retrieved from the Kyoto Encyclopedia of Genes and Genomes (KEGG) Database [53]. Three sets of data were obtained made up of: RNA polymerase proteins; viruses; and hypothetical proteins. Single ME was carried out on these datasets. All datasets contain 11 sequences: the first dataset is made up of RNA Polymerase sequences, with all sequences containing 4 distinct matching motifs of varying length; the second dataset is made up of virus sequences, all containing 4 matching motifs; the third dataset is made up of hypothetical proteins, containing 2 distinct matching motifs.
Single motif extraction from real datasets
Dataset  Parameters  Motif  Quorum (%) 

RNA  <350,110>  RNA polymerase  100 
Rpb2, domain 6  
Polymerase  <60,28>  RNA polymerase  100 
Rpb2, domain 4  
<40,12>  RNA polymerase  100  
Rpb2, domain 5  
<90,30>  RNA polymerase  100  
Rpb2, domain 7  
Viruses  <350,150>  Viral methyltransferase  100 
<130,50>  Cucumber mosaic  100  
virus 1a protein  
<70,48>  Cucumber mosaic virus 1a  100  
protein C terminal  
<250,130>  Viral (Superfamily 1) RNA  100  
helicase  
Hypothetical  <130,45>  Type III restriction enzyme,  100 
res subunit  
Proteins  <60,30>  Helicase conserved  100 
Cterminal domain 
Structured motif extraction from synthetic datasets
Parameters  Implanted  Implanted 

structured  structured  
motifs  motifs  
extracted  
<(80,15)[5,15](60,10)[5,20](230,20)>  25  25 
<(100,15)[5,15](80,10)[5,20](250,20)>  25  25 
<(120,15)[5,15](100,10)[5,20](270,20)>  25  25 
<(140,15)[5,15](120,10)[5,20](290,20)>  25  25 
<(160,15)[5,15](140,10)[5,20](310,20)>  25  25 
<(180,15)[5,15](160,10)[5,20](330,20)>  25  25 
<(200,15)[5,15](180,10)[5,20](350,20)>  25  25 
Experiment V: Chang and Marr index
Elapsedtime comparison in seconds for implementing the Chang and Marr index using a pattern of length 32
Edit distance  Hamming distance  

qgram length  Naïve (s)  libFLASM (s)  Naïve (s)  libFLASM (s) 
5  0.01  0.00  0.01  0.00 
6  0.08  0.02  0.08  0.01 
7  0.67  0.9  0.55  0.05 
8  6.20  0.50  4.81  0.25 
9  34.00  2.74  23.99  1.43 
10  145.56  11.76  96.71  6.24 
Elapsedtime comparison in seconds for implementing the Chang and Marr index using a pattern of length 64
Edit distance  Hamming distance  

qgram length  Naïve (s)  libFLASM (s)  Naïve (s)  libFLASM (s) 
5  0.04  0.01  0.04  0.00 
6  0.23  0.03  0.22  0.02 
7  1.45  0.15  1.31  0.09 
8  10.76  0.82  9.27  0.46 
9  95.01  5.29  76.21  2.76 
10  673.17  24.51  520.12  12.51 
The time taken to compute the Chang and Marr index using a naïve implementation for the Hamming and edit distance can be seen under the columns titled Naïve. That which is computed using the libFLASM library can be seen under the columns titled libFLASM. It is evident that using libFLASM significantly decreases the time taken to compute the Chang and Marr index. This can clearly be seen as the value of q increases, when we obtain an implementation which is faster by more than an order of magnitude.
Conclusions
FLASM is a generalisation of the classic ASM problem and, hence, has numerous direct applications in computational molecular biology and elsewhere. In this article, we presented libFLASM, a free opensource C++ library aimed at solving the FLASM problem under the edit and Hamming distance models. Specifically, given a pattern x of length m, a text t of length n≥m, an integer ℓ≤m, and an integer k<ℓ, it finds all factors of t that are at a distance at most k from any factor of length ℓ of x. The main advantage of libFLASM is that it implements stateoftheart algorithms to achieve a time complexity of \(\mathcal {O}\left (m\lceil \frac {\ell }{w}\rceil n\right)\), and a space complexity of \(\mathcal {O}\left (m\lceil \frac {\ell }{w}\rceil \right)\), which are independent of the distance threshold k and the alphabet size.
libFLASM is freely distributed and can be incorporated easily into any computational pipeline. As proof of concept, we incorporated libFLASM into BEAR, a stateoftheart tool to improve the accuracy of MCSA in terms of the inferred likelihoodbased phylogenies. Furthermore, we incorporated libFLASM into MoTeXII, a stateoftheart tool for ME of patterns longer than was previously possible. Finally, we showed how libFLASM can be used efficiently for ACSM with high error ratios as well as implementing the Chang and Marr index. The extensive experimental results presented here suggest that other applications could benefit from using libFLASM, and thus further maintenance and development of libFLASM is desirable.
Availability and requirements

Project name: libFLASM

Project home page: https://github.com/webmasterar/libFLASM

Operating system: GNU/Linux

Programming language: C++

Other requirements: N/A

License: GNU GPL
Notes
Acknowledgments
We would like to acknowledge the EPSRC and the Department of Informatics at King’s College London for their financial support.
Funding
LAKA is supported by an EPSRC grant (Doctoral Training Grant #EP/M50788X/1). AR is supported by the Graduate Teaching Scholarship scheme of the Department of Informatics at King’s College London.
Availability of data and materials
The datasets generated during and/or analysed during the current study are available at the GitHub repository, https://github.com/webmasterar/libFLASM.
Authors’ contributions
SPP conceived and designed the study. LAKA and AR developed the library and conducted the experiments. LAKA and AR wrote the manuscript with the contribution of SPP. The final version of the manuscript is approved by all authors.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
References
 1.Navarro G. A guided tour to approximate string matching. ACM Comput Surv. 2001; 33(1):31–88. doi: 10.1145/375360.375365.CrossRefGoogle Scholar
 2.Kong A, Frigge ML, Masson G, Besenbacher S, Sulem P, Magnusson G, Gudjonsson SA, Sigurdsson A, Jonasdottir A, Jonasdottir A, Wong WS, Sigurdsson G, Walters GB, Steinberg S, Helgason H, Thorleifsson G, Gudbjartsson DF, Helgason A, Magnusson OTT, Thorsteinsdottir U, Stefansson K. Rate of de novo mutations and the importance of father’s age to disease risk. Nature. 2012; 488(7412):471–5. doi: 10.1038/nature11396.CrossRefPubMedPubMedCentralGoogle Scholar
 3.Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol. 1982; 162(3):705–8. doi: 10.1016/00222836(82)903989.CrossRefPubMedGoogle Scholar
 4.Damerau FJ. A technique for computer detection and correction of spelling errors. Commun ACM. 1964; 7:171–6. doi: 10.1145/363958.363994.CrossRefGoogle Scholar
 5.Hamming RW. Error detecting and error correcting codes. Bell Syst Tech J. 1950; 29:147–60. doi: 10.1002/j.15387305.1950.tb00463.x.CrossRefGoogle Scholar
 6.Myers G. A fast bitvector algorithm for approximate string matching based on dynamic programming. J ACM (JACM). 1999; 46:395–415. doi: 10.1145/316542.316550.CrossRefGoogle Scholar
 7.Crochemore M, Iliopoulos CS, Pissis SP. A parallel algorithm for fixedlength approximate stringmatching with kmismatches In: Elomaa T, Mannila H, Orponen P, editors. Algorithms and Applications: Essays Dedicated to Esko Ukkonen on the Occasion of His 60th Birthday. Berlin, Heidelberg: Springer: 2010. p. 92–101, doi: 10.1007/9783642124761_6.Google Scholar
 8.Barton C, Iliopoulos CS, Kundu R, Pissis SP, Retha A, Vayani F. Accurate and Efficient Methods to Improve Multiple Circular Sequence Alignment In: Bampis E, editor. Experimental Algorithms: 14th International Symposium, SEA 2015, Paris, France, June 29 – July 1, 2015, Proceedings. Cham: Springer: 2015. p. 247–58, doi: 10.1007/9783319200866_19.Google Scholar
 9.Pissis SP, Stamatakis A, Pavlidis P. Motex: A wordbased HPC tool for motif extraction In: Gao J, editor. ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics. ACMBCB 2013. Washington, DC, USA: ACM: 2013. p. 13, doi: 10.1145/2506583.2506587.Google Scholar
 10.Pissis SP, Goll C, Pavlidis P, Stamatakis A. Accelerating string matching on MIC architecture for motif extraction In: Wyrzykowski R, Dongarra J, Karczewski K, Wasniewski J, editors. Parallel Processing and Applied Mathematics  10th International Conference, PPAM 2013. Lecture Notes in Computer Science. Warsaw, Poland: Springer: 2013. p. 258–67, doi: 10.1007/9783642551956_24.Google Scholar
 11.Pissis SP. MoTeXII: structured motif extraction from largescale datasets. BMC Bioinformatics. 2014; 15(1):1–12. doi: 10.1186/1471210515235.CrossRefGoogle Scholar
 12.Barton C, Iliopoulos CS, Pissis SP. Fast algorithms for approximate circular string matching. Algorithms Mol Biol. 2014; 9:9. doi: 10.1186/1748718899.CrossRefPubMedPubMedCentralGoogle Scholar
 13.Chang WI, Marr TG. Approximate string matching and local similarity. In: Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching. CPM ’94. London, UK: Springer: 1994. p. 259–73.Google Scholar
 14.Helinski DR, Clewell DB. Circular DNA. Annu Rev Biochem. 1971; 40(1):899–942. doi: 10.1146/annurev.bi.40.070171.004343.CrossRefPubMedGoogle Scholar
 15.Craik DJ, Allewell NM. Thematic minireview series on circular proteins. J Biol Chem. 2012; 287:26999–7000. doi: 10.1074/jbc.R112.390344.CrossRefPubMedPubMedCentralGoogle Scholar
 16.Davison J. Genetic exchange between bacteria in the environment. Plasmid. 1999; 42(2):73–91. doi: 10.1006/plas.1999.1421.CrossRefPubMedGoogle Scholar
 17.Wang Z, Wu M. Phylogenomic reconstruction indicates mitochondrial ancestor was an energy parasite. PLoS ONE. 2014; 9(10):1–11. doi: 10.1371/journal.pone.0110685.Google Scholar
 18.Ponting CP, Russell RB. Swaposins: circular permutations within genes encoding saposin homologues. Trends Biochem Sci. 1995; 20(5):179–80. doi: 10.1016/S09680004(00)890039.CrossRefPubMedGoogle Scholar
 19.Lo WC, Lee CC, Lee CY, Lyu PC. CPDB: a database of circular permutation in proteins. Nucleic Acids Res. 2009; 37:328–32. doi: 10.1093/nar/gkn679.CrossRefGoogle Scholar
 20.Grossi R, Iliopoulos CS, Mercas R, Pisanti N, Pissis SP, Retha A, Vayani F. Circular sequence comparison: algorithms and applications. Algorithms Mol Biol. 2016; 11:12. doi: 10.1186/s1301501600766.CrossRefPubMedPubMedCentralGoogle Scholar
 21.Rice P, Longden I, Bleasby A. EMBOSS: the European molecular biology open software suite. Trends Genet TIG. 2000; 16(6):276–7. doi: 10.1016/s01689525(00)020242.CrossRefPubMedGoogle Scholar
 22.Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positionspecific gap penalties and weight matrix choice. Nucleic Acids Res. 1994; 22:4673–680. doi: 10.1093/nar/22.22.4673.CrossRefPubMedPubMedCentralGoogle Scholar
 23.Mosig A, Hofacker IL, Stadler PF. Comparative analysis of cyclic sequences: Viroids and other small circular RNAs In: Giegerich R, Stoye J, editors. Lecture Notes in Informatics: 2006. p. 93–102. doi:10.1.1.87.406. Proceedings GCB.
 24.Fernandes F, Pereira L, Freitas AT. CSA: an efficient algorithm to improve circular DNA multiple alignment. BMC Bioinformatics. 2009;10. doi: 10.1186/1471210510230.
 25.van Helden J, André B, ColladoVides J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol. 1998; 281(5):827–42. doi: 10.1006/jmbi.1998.1947.CrossRefPubMedGoogle Scholar
 26.Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D. Ultraconserved elements in the human genome. Science. 2004; 304(5675):1321–5. doi: 10.1126/science.1098119.CrossRefPubMedGoogle Scholar
 27.Sinha S, Tompa M. YMF: A program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 2003; 31(13):3586–8.CrossRefPubMedPubMedCentralGoogle Scholar
 28.Pavesi G, Mereghetti P, Mauri G, Pesole G. Weeder web: discovery of transcription factor binding sites in a set of sequences from coregulated genes. Nucleic Acids Res. 2004; 32(WebServerIssue):199–203.CrossRefGoogle Scholar
 29.Marschall T, Rahmann S. Efficient exact motif discovery. Bioinformatics. 2009;25(12). doi: 10.1093/bioinformatics/btp188.
 30.Eskin E, Pevzner PA. Finding composite regulatory patterns in DNA sequences. In: Proceedings of the Tenth International Conference on Intelligent Systems for Molecular Biology, August 3–7, 2002. Edmonton, Alberta, Canada: 2002. p. 354–63, doi: 10.1093/bioinformatics/18.suppl_1.S354.
 31.Carvalho AM, Marsan L, Pisanti N, Sagot MF. RISOTTO: fast extraction of motifs with mismatches. In: Proceedings of the 7th Latin American Symposium on Theoretical Informatics (LATIN’06). Lecture Notes in Computer Science. Valdivia, Chile: Springer: 2006. p. 757–68.Google Scholar
 32.Zhang Y, Zaki M. EXMOTIF: efficient structured motif extraction. Algorithms Mol Biol. 2006; 1(1):1–18.CrossRefGoogle Scholar
 33.Lefranc M, Pommié C, Ruiz M, Giudicelli V, Foulquier E, Truong L, ThouveninContet V, Lefranc G. IMGT unique numbering for immunoglobulin and T cell receptor variable domains and Ig superfamily Vlike domains. Dev Comp Immunol. 2002; 27:55–77. doi: 10.1016/j.dci.2004.07.003.CrossRefGoogle Scholar
 34.Campbell J, Lowe D, Sleeman MA. Developing the next generation of monoclonal antibodies for the treatment of rheumatoid arthritis. Br J Pharmacol. 2011; 162:1470–84. doi: 10.1111/j.14765381.2010.01183.x.CrossRefPubMedPubMedCentralGoogle Scholar
 35.Fredriksson K, Navarro G. Averageoptimal single and multiple approximate string matching. J Exp Algorithmics. 2004;9. doi: 10.1145/1005813.1041513.
 36.Weiner J, BornbergBauer E. Evolution of circular permutations in multidomain proteins. Mol Biol Evol. 2006;23. doi: 10.1093/molbev/msj091.
 37.Maes M. Polygonal shape recognition using stringmatching techniques. Pattern Recognit. 1991; 24(5):433–40. doi: 10.1016/00313203(91)90056B.CrossRefGoogle Scholar
 38.Hirvola T, Tarhio J. Approximate Online Matching of Circular Strings In: Gudmundsson J, Katajainen J, editors. Experimental Algorithms: 13th International Symposium, SEA 2014, Copenhagen, Denmark, June 29 – July 1, 2014. Proceedings. Cham: Springer: 2014. p. 315–25, doi: 10.1007/9783319079592_27.Google Scholar
 39.Barton C, Iliopoulos CS, Pissis SP. Averagecase optimal approximate circular string matching In: Dediu A. H., Formenti E, MartínVide C, Truthe B, editors. Language and Automata Theory and Applications  9th International Conference, LATA 2015. Lecture Notes in Computer Science. Nice, France: Springer: 2015. p. 85–96, doi: 10.1007/9783319155791_6.Google Scholar
 40.Pissis S, Retha A. Generalised implementation for fixedlength approximate string matching under Hamming distance and applications. In: Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International: 2015. p. 367–74, doi: 10.1109/IPDPSW.2015.106.
 41.Doring A, Weese1 D, Rausch T, Reinert K. SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinformatics. 2008:1–9. doi: 10.1186/14712105911.
 42.Crochemore M, Hancart C, Lecroq T. Algorithms on Strings. New York, NY, USA: Cambridge University Press; 2007.CrossRefGoogle Scholar
 43.Bruijn NGD. A combinatorial problem. Indagationes Mathematicae. 1946; 8(4):461–7.Google Scholar
 44.Ford LR. A cyclic arrangement of mtuples. Report No. P1071, Rand Corporation, Santa Monica, California (April 23, 1957).Google Scholar
 45.Wang L, Jiang T. On the complexity of multiple sequence alignment. J Comput Biol. 1994; 1(4):337–48. doi: 10.1089/cmb.1994.1.337.CrossRefPubMedGoogle Scholar
 46.Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004; 32(5):1792–7. doi: 10.1093/nar/gkh340.CrossRefPubMedPubMedCentralGoogle Scholar
 47.Sokal R. A statistical method for evaluating systematic relationships. Univ Kans Sci Bull. 1958; 38:1409–38.Google Scholar
 48.Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG. Fast, scalable generation of highquality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011; 7(1). doi: 10.1038/msb.2011.75. http://msb.embopress.org/content/7/1/539.full.pdf.
 49.Fletcher W, Yang Z. INDELible: A flexible simulator of biological sequence evolution. Mol Biol Evol. 2009; 26(8):1879–88. doi: 10.1093/molbev/msp098.CrossRefPubMedPubMedCentralGoogle Scholar
 50.Jukes TH, Cantor CR. CHAPTER 24  evolution of protein molecules In: Munro HN, editor. Mammalian Protein Metabolism. New York: Academic Press: 1969. p. 21–132, doi: 10.1016/B9781483232119.500097.Google Scholar
 51.Stamatakis A. RAxML version 8: A tool for phylogenetic analysis and postanalysis of large phylogenies. Bioinformatics. 2014. doi: 10.1093/bioinformatics/btu033.
 52.Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981; 53(1–2):131–47. doi: 10.1016/00255564(81)900432.CrossRefGoogle Scholar
 53.Kanehisa M, Goto S, Sato K, Fujibuchi W, Bono H. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 1999; 27:29–34. doi: 10.1093/nar/27.1.29.CrossRefPubMedPubMedCentralGoogle Scholar
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.