# A short note on dynamic programming in a band

**Part of the following topical collections:**

## Abstract

### Background

Third generation sequencing technologies generate long reads that exhibit high error rates, in particular for insertions and deletions which are usually the most difficult errors to cope with. The only exact algorithm capable of aligning sequences with insertions and deletions is a dynamic programming algorithm.

### Results

In this note, for the sake of efficiency, we consider dynamic programming in a band. We show how to choose the band width in function of the long reads’ error rates, thus obtaining an \(O(N^{\frac {3}{2}})\) algorithm in space and time. We also propose a procedure to decide whether this algorithm, when applied to semi-global alignments, provides the optimal score.

### Conclusions

We suggest that dynamic programming in a band is well suited to the problem of aligning long reads between themselves and can be used as a core component of methods for obtaining a consensus sequence from the long reads alone.

The function implementing the dynamic programming algorithm in a band is available, as a standalone program, at: https://forgemia.inra.fr/jean-francois.gibrat/BAND_DYN_PROG.git

## Keywords

NGS Read correction Semi-global alignment## Background

High throughput sequencing technologies are progressing at a very fast pace. Recently, third generation sequencing technologies have been launched on the market and are becoming available to the life science community. These novel sequencing technologies are based on single molecule techniques and are characterized by very long reads exhibiting a high error rate. At the time of writing, Pacific Biosciences (http://www.pacb.com) machines produce reads whose length distribution has a mean of 15 kbp. The longest sequenced reads have a length of about 55 kbp. These reads have an error rate of 10-15% (typically, 3% mismatches, 7% insertions and 4% deletions). These figures are likely to change with the techniques’ improvements, but give a general idea of the characteristics of 3rd generation technologies so far.

These characteristics have a strong impact on the algorithms used to assemble or map the reads produced by these technologies. Current algorithms have been designed to cope with 2nd generation sequencing data, i.e., a very large number of short reads (at most a few hundred-base pair-long) with very low error rates (< 1%). The length of 3rd generation reads and their high error rates, particularly for insertions and deletions, is likely to make these algorithms ill-adapted to efficiently process 3rd generation sequencing technology data. Algorithms better able to cope with reads exhibiting large numbers of insertions and deletions are needed.

## Method

Until now, the only exact algorithm for aligning sequences with insertions and deletions remains the dynamic programming (DP) algorithm proposed by Needleman & Wunsch almost 50 years ago [8]. This algorithm has been modified by Smith & Waterman [9] to provide local alignments in addition to the original global and semi-global alignments. This canonical algorithm is *O*(*N**M*) in time and space, where *N* and *M* are the lengths of the two sequences to be aligned. To illustrate this point, if one wants to compare two long reads of length 50 kbp as described above, this requires allocating enough memory for a matrix of 2.5 10^{9} elements. If this is a matrix of floats or integers (4 bytes), this requires 10 GB of memory.

A number of works have been carried out to try alleviating this problem. Following a proposal of Hirschberg [3], different authors have presented algorithms to reduce the space requirement of the algorithm to *O*(*N*) [4, 7]. The time requirement, though, remains *O*(*N**M*).

Other groups have addressed the time requirement issue. Masek & Paterson [6] have proposed an algorithm displaying an \(O\left (\frac {NM}{\log {N}}\right)\) runtime, but this algorithm was based on a particular scoring scheme consisting of rational numbers only and did not allow local alignments. Crochemore & Rytter [1] proposed a more general-purpose algorithm that overcame these limitations and had an \(O(\frac {hNM}{\log {N}})\) time requirement, *h* being the entropy of the sequences. However, so far, there is no known algorithm that achieves the above subquadratic runtime and whose space requirement is linear.

*O*(

*w*

*N*) time and space requirement, where 2

*w*+1 is the width of the band and

*w*<<

*N*. If the alignment path leaves the band, one must restart the algorithm with an increased value of

*w*, for instance, by doubling it. This process continues until one is certain that the path remains within the band. In the worst case, the runtime is still

*O*(

*N*

*M*).

Then two questions arise. How should the band width be chosen? How does one know that this algorithm produces an optimal alignment, i.e., that the alignment path remains within the band?

### Optimal choice of the band width

As shown in Fig. 1, each insertion in the first sequence (or deletion in the second sequence) moves the path one position to the right, conversely a deletion in the first sequence (or insertion in the second sequence) moves the path one position downwards. Given the observed sequencing technology error rates, what is the probability that the path leaves the band of width 2*w*+1 when aligning reads of length *N* due to a random accumulation of steps in a particular direction?

*x*axis, starting at position 0. The probability to take one step to the right is

*p*

_{ r }, the probability to take one step to the left is

*p*

_{ l }and the probability to remain stationary is 1−

*p*

_{ r }−

*p*

_{ l }. The probability to take

*r*steps to the right,

*l*steps to the left and to remain stationary

*s*times during a walk of length

*N*is given by the multinomial distribution:

*r*+

*l*+

*s*=

*N*. To answer the above question, we consider the travelled distance:

*d*=

*r*−

*l*and compute its variance: \(\mathbb {V}\text {ar}(d) = \mathbb {E}(d^{2}) - [\mathbb {E}(d)]^{2}\) with:

*r*, the second derivative of the MGF is:

*t*

_{ r }=0 in this expression gives the second moment:

*p*

_{ r }=

*p*

_{ l }=

*p*. Then:

*N*

*p*

_{ i }insertions and

*N*

*p*

_{ d }deletions (

*p*

_{ i }and

*p*

_{ d }are respectively the probabilities of insertion and deletion). However, not all these indels produce gaps in the sequence alignment. For instance, 2 deletions at the same position in the sequences do not result in a gap. Also, an insertion in one sequence and a nearby deletion in the other sequence can result in a mismatch depending on the local sequence configuration (since it is better to incur a penalty for a mismatch than for 2 indels). Therefore,

*p*is given by:

where *O*(*p*_{ i },*p*_{ d }) is a term gathering the probabilities of cases similar to the last one above that are difficult to estimate beforehand. As shown in section “Results and discussion”, this term is small compared to the others in *p*.

Having estimated the standard deviation of *d*, *σ*_{ d }, we can choose *w*=3*σ*_{ d }. With this value of *w*, there is < 0.3% chance that the alignment path leaves the band. For the sake of illustration, *σ*_{ d }≃111 with reads of length *N*=30,000, using the PacBio error rates mentioned in the introduction (7% insertions, 4% deletions, i.e., *p*≃0.1035). The algorithm is thus \(O\left (N^{\frac {3}{2}}\right)\) in time and space.

*w*=3

*σ*

_{ d }, leading to an algorithm whose execution time is proportional to 6

*σ*

_{ d }

*N*(neglecting the above < 0.3% cases). Another possibility is, first, to choose

*w*=

*σ*

_{ d }. On average, the alignment path remains in the band for 68% of the studied cases. For the remaining 32%, one restarts the computation with

*w*=2

*σ*

_{ d }. For a further 28% of the cases, the alignment path remains in the band. For the last 4%, one restarts again the computation with

*w*=3

*σ*

_{ d }. The total computation time is proportional to:

This scheme allows a gain of a factor ∼ 2 on the computing time.

### Is the score found by the banded DP algorithm optimal?

The usual answer found in text books and articles (e.g., [5]), valid for *global* alignments only, is the following.

*l*

_{1}and the length of the second one is

*l*

_{2}(with

*l*

_{1}>

*l*

_{2}), a cell

*M*[

*i*,

*j*] of the dynamic programming matrix is within the band if −

*w*−(

*l*

_{1}−

*l*

_{2})≤

*i*−

*j*≤+

*w*. As shown on Fig. 1, the alignment path leaves the band if some of the cells on this path have

*i*and

*j*indexes such that

*i*−

*j*>

*w*(dashed path) or

*i*−

*j*<−

*w*−(

*l*

_{1}−

*l*

_{2}) (solid path). The best score obtained for such a path is given by:

where *g* is the gap penalty and *m* is the match score. Indeed, this path generates *w*+1−(*l*_{1}−*l*_{2}) gaps in the horizontal sequence, *w*+1 in the vertical sequence and at most *l*_{2}−(*w*+1) matches. Let *s*_{ band } be the best score found by applying the dynamic programming algorithm in a band and *s*_{ opt } be the best score obtained with the complete matrix. If the path does not leave the band, then *s*_{ band }=*s*_{ opt }. If *s*_{ band }≥*b**e**s**t*_{ out } than *s*_{ band } is optimal since it is larger than the best score of any possible alignment that travels outside of the band. Notice that the above formula provides an upper bound of *b**e**s**t*_{ out }, since it is assumed that all the aligned characters in the 2 sequences correspond to matches, excluding the possibility of mismatches.

*w*=6, is not sufficient to fit the path. In the right panel, the band width has been increased to

*w*=7 and the path remains in the band. Although this change is marginal, it has a dramatic impact on the alignment provided by the algorithm.

What criterion should one use to detect instances where the path leaves the band in semi-global alignments? In this case, the above global alignment criterion is not applicable since one does not know in advance where will be the ends of the path.

*N*

_{ m }, in alignments of sequences of length

*N*and the corresponding variance are given by:

where *p*_{ u }=1−*p*_{ m }−*p*_{ i }−*p*_{ d } is the probability of no modification of the nucleotides in the sequences. *p*_{ u } is an upper bound of the true probability. Empirically, we checked that this provides a very good approximation of the mean of the number of matches. We consider that the path has left the band if the corresponding score is less than \(\mathbb {E}(N_m) - 3\sigma _{N_{m}}\) (lower half of the 99% confidence interval).

With these 2 criteria, we propose the following procedure to decide whether the banded DP alignment has provided the optimal solution: i) we check whether the path reaches the band edge. If it is true, we restart the alignment with a larger value of *w*. If it is false, we further check whether the number of matches is outside the 99% confidence half interval. If it is true, we restart the alignment with a larger value of *w*, else we consider that the banded DP algorithm has found the optimal solution.

However, as shown in Fig. 2, the path can reach the edge of the band then move along it providing the optimal solution (right panel). Conversely, the alignment can provide a suboptimal solution although the path, apparently, does not reach the band edge (middle panel). To measure the magnitude of these effects, we performed a number of simulations as described in the next section.

## Results and discussion

### Test of the theoretical standard deviation

*p*

_{ u }), a mismatch (

*p*

_{ m }), an insertion (

*p*

_{ i }) and a deletion (

*p*

_{ d }) of a nucleotide. We aligned all the pairs using the dynamic programming algorithm and we determined the maximal run of indels, in a particular direction, in the alignments, i.e.,

*d*above. For each set of 1000 aligned pairs of sequences we computed the mean and standard deviation of

*d*. As awaited, the computed means are very close to zero. Simulation results for the standard deviations are shown in Table 1.

Comparison of simulated and theoretical standard deviations

error rate set 1 | error rate set 2 | error rate set 3 | ||||
---|---|---|---|---|---|---|

Length | | | | | | |

2000 | 28.8 ± 0.3 | 30.0 | 20.2 ± 0.3 | 21.5 | 25.1 ± 0.3 | 26.1 |

3000 | 35.6 ± 0.5 | 36.7 | 25.2 ± 0.4 | 26.4 | 30.8 ± 0.3 | 32.0 |

4000 | 41.4 ± 0.5 | 42.4 | 28.8 ± 0.3 | 30.5 | 35.6 ± 0.5 | 37.0 |

6000 | 50.1 ± 0.7 | 51.9 | 35.5 ± 0.3 | 37.3 | 44.0 ± 0.5 | 45.3 |

As expected, theoretical standard deviations, *σ*_{ th }, calculated without the *O*(*p*_{ i },*p*_{ d }) term are larger than the experimental ones (they are all outside the 99% confidence interval around the experimental mean value, *σ*_{ ex }). However, the difference is sufficiently small to be of no consequence for our purpose.

### Test of the two criteria for score optimality

*w*=

*σ*

_{ d }).

Two-way table of banded DP alignments

optimal score | suboptimal score | |
---|---|---|

did not reach the band edge | 68.4% | 0.5% |

reached the band edge | 2.2% | 28.9% |

The proposed algorithm, would be optimal if the 2 off-diagonal elements contained 0% alignments. Here, 2.2% of the alignments provide the optimal solution and reach the band edge. It means that one will restart the alignment with a larger value of *w* although the result is correct. This wastes some time, but has no incidence on the correctness of the final alignment. On the contrary, when the path of the alignments does not reach the band and provides a suboptimal score (which is the case for 0.5% of the alignments) one will wrongly accept a suboptimal alignment.

Applying the 2nd criterion after the 1st criterion, the percentage of alignments that reach the band edge but provide an optimal score decreases from 2.2 to 1.8% and the percentage of alignments that do not reach the band edge but provide a suboptimal score drops from 0.5 to 0.2%, improving by 60% the latter problematic cases.

The procedure to determine whether the DP algorithm in a band finds the optimal score proposed in this note is thus not completely foolproof, but provides the correct answer in the vast majority of cases.

## Conclusion

The advent of third generation sequencing technologies that are characterized by long reads exhibiting high error rates, most notably for insertions and deletions, which are the most difficult errors to cope with, call for the development of new methods, better adapted to these features. Although the DP algorithm can hardly be defined as “new”, the problem at hand, aligning long reads that differ “only” by random sequencing errors, is ideally suited for using dynamic programming in a band.

In this note, we showed how to choose the band width in function of the reads’ error rates, resulting in a subquadratic time and space algorithm and proposed a procedure to determine whether the alignment path stays in the band, thus providing the optimal score. Therefore, the DP algorithm in a band is a good contender to align long reads between themselves, and can constitute a key component of methods for correcting long read sequencing errors and obtaining a consensus sequence (to be published).

## Notes

### Acknowledgements

The author would like to thank Dr M. Mariadassou for helpful suggestions regarding the 1D random walk.

### Funding

This work has been supported by the French National Institute for Agriculture Research (INRA).

### Author’s contributions

JFG conceived the analysis, performed the computations and wrote the article. The author read and approved the final manuscript.

### Ethics approval and consent to participate

Not applicable

### Competing interests

The author declares that he has no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## References

- 1.Crochemore M, Rytter W. Text Algorithms. New York: Oxford University Press; 1994. p. 412. ISBN 0-19-508609-0.Google Scholar
- 2.Fickett J.Fast optimal alignment. Nucleic Acids Res. 1984; 12:175–9.CrossRefPubMedPubMedCentralGoogle Scholar
- 3.Hirschberg DS. A linear space algorithm for computing maximal common subsequences. Commun ACM. 1975; 18:341–3.CrossRefGoogle Scholar
- 4.Huang X, Hardison RC, Miller W. A space-efficient algorithm for local similarities. Comput Appl Biosci. 1990; 6:373–81.PubMedGoogle Scholar
- 5.Jackson BN, Aluru S. Pairwise Sequence Alignment In: Aluru S, editor. Handbook of Computational Molecular Biology. Chapman and Hall/CRC: 2005. p. 1–32.Google Scholar
- 6.Masek WJ, Paterson MS. A faster algorithm computing string edit distances. J Comput Syst Sci. 1980; 20:18–31.CrossRefGoogle Scholar
- 7.Myers EW, Miller W. Optimal alignments in linear space. Comput Appl Biosci. 1988; 4:11–7.PubMedGoogle Scholar
- 8.Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970; 48:443–53.CrossRefPubMedGoogle Scholar
- 9.Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981; 147:195–7.CrossRefPubMedGoogle Scholar

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.