Gap Filling as Exact Path Length Problem
One of the last steps in a genome assembly project is filling the gaps between consecutive contigs in the scaffolds. This problem can be naturally stated as finding an \(s\)-\(t\) path in a directed graph whose sum of arc costs belongs to a given range (the estimate of the gap length). Here \(s\) and \(t\) are any two contigs flanking a gap. This problem is known to be NP-hard in general. We derive a dynamic programming solution that is simpler than those previously known and is pseudo-polynomial in the maximum value of the input range. We implemented various practical optimizations to it, and compared our exact gap filling solution experimentally to popular gap filling tools. Summed over all the bacterial assemblies considered in our experiments, our method fills 28% more gaps than the best previous tool, and the gaps it fills span 80% more sequence. Furthermore, the error level of the newly introduced sequence is comparable to that of the previous tools.
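To make the problem statement concrete, the following is a minimal sketch of a pseudo-polynomial dynamic program of the kind the abstract describes. It is not the authors' implementation and includes none of their practical optimizations. It assumes that arc costs are positive integers and that a path may revisit vertices (that is, it searches for walks); the function name `exact_length_path` and the `(tail, head, cost)` arc format are hypothetical choices made for illustration.

```python
from collections import defaultdict


def exact_length_path(arcs, s, t, d_min, d_max):
    """Find an s-t walk whose total arc cost lies in [d_min, d_max].

    arcs: iterable of (tail, head, cost) with positive integer costs
          (an assumption of this sketch).
    Returns (list_of_vertices, total_cost), or None if no such walk exists.
    """
    # Group arcs by head vertex so each state can look up its predecessors.
    incoming = defaultdict(list)
    for u, v, cost in arcs:
        if cost <= 0:
            raise ValueError("this sketch assumes positive integer arc costs")
        incoming[v].append((u, cost))

    # pred[(v, length)] is the previous state on some s-v walk of that cost.
    # A key is present exactly when such a walk exists.
    pred = {(s, 0): None}

    # Costs are positive, so every state of cost `length` depends only on
    # states of strictly smaller cost, which are already final.
    for length in range(1, d_max + 1):
        for v, in_arcs in incoming.items():
            for u, cost in in_arcs:
                if (u, length - cost) in pred:
                    pred[(v, length)] = (u, length - cost)
                    break

    # Report the shortest feasible total cost in the range, if any.
    for length in range(max(d_min, 0), d_max + 1):
        if (t, length) in pred:
            walk = []
            state = (t, length)
            while state is not None:
                walk.append(state[0])
                state = pred[state]
            walk.reverse()
            return walk, length
    return None


if __name__ == "__main__":
    toy_arcs = [("s", "a", 2), ("a", "t", 3), ("s", "t", 9), ("a", "a", 1)]
    print(exact_length_path(toy_arcs, "s", "t", 6, 7))
```

The table has one entry per (vertex, length) pair, so the running time is proportional to the number of arcs times the upper end of the range, which is why the approach is pseudo-polynomial rather than polynomial.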
Keywords: Source Vertex; Multidimensional Knapsack Problem; Assembly Graph; Dynamic Programming Table; Previous Tool