On the Complexity of Sequence to Graph Alignment

  • Chirag Jain
  • Haowen Zhang
  • Yu Gao
  • Srinivas Aluru
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11467)

Abstract

Availability of extensive genetics data across multiple individuals and populations is driving the growing importance of graph-based reference representations. Aligning sequences to graphs is a fundamental operation on several types of sequence graphs (variation graphs, assembly graphs, pan-genomes, etc.) and their biological applications. Though research on sequence to graph alignments is nascent, it can draw from related work on pattern matching in hypertext. In this paper, we study sequence to graph alignment problems under Hamming and edit distance models, and linear and affine gap penalty functions, for multiple variants of the problem that allow changes in the query alone, the graph alone, or in both. We prove that when changes are permitted in graphs, either alone or in conjunction with changes in the query, the sequence to graph alignment problem is \(\mathcal{NP}\)-complete under both Hamming and edit distance models for alphabets of size \(\ge 2\). For the case where only changes to the sequence are permitted, we present an \(O(|V|+m|E|)\) time algorithm, where \(m\) denotes the query size, and \(V\) and \(E\) denote the vertex and edge sets of the graph, respectively. Our result is generalizable to both linear and affine gap penalty functions, and improves upon the run-time complexity of existing algorithms.
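
To make the polynomial-time case concrete, the following is a minimal sketch of sequence to graph alignment when edits are charged to the query only. It is not the paper's \(O(|V|+m|E|)\) algorithm: it runs Dijkstra's algorithm over the implicit alignment graph, which costs an extra logarithmic factor from the heap. The function name `align_to_graph`, the one-character-per-vertex labelling, the unit edit costs, and the convention that a path may start and end at any vertex are all assumptions made for illustration.

```python
import heapq

def align_to_graph(query, labels, edges):
    """Minimum unit-cost edit distance between `query` and any path in the graph.

    labels[v] is the single character on vertex v (an assumed representation);
    edges is a list of directed (u, v) pairs. Cycles are allowed.
    """
    n, m = len(labels), len(query)
    succ = [[] for _ in range(n)]
    for u, v in edges:
        succ[u].append(v)

    START = n  # virtual vertex: no graph vertex has been consumed yet
    # State (i, v): the first i query characters are aligned and the path ends at v.
    dist = {(0, START): 0}
    heap = [(0, 0, START)]
    while heap:
        d, i, v = heapq.heappop(heap)
        if d > dist.get((i, v), float("inf")):
            continue  # stale heap entry
        if i == m:
            return d  # first settled state that has consumed the whole query
        nxt = range(n) if v == START else succ[v]
        moves = [(d + 1, i + 1, v)]  # leave query[i] unmatched
        for w in nxt:
            # align query[i] with vertex w: cost 0 on match, 1 on substitution
            moves.append((d + (query[i] != labels[w]), i + 1, w))
            if v != START:
                moves.append((d + 1, i, w))  # walk over vertex w without matching it
        for nd, ni, nv in moves:
            if nd < dist.get((ni, nv), float("inf")):
                dist[(ni, nv)] = nd
                heapq.heappush(heap, (nd, ni, nv))
    return m  # not reached: the "unmatched" moves alone always reach i == m
```

For example, on the chain A → C → G → T, `align_to_graph("AGT", "ACGT", [(0, 1), (1, 2), (2, 3)])` returns 1. Because all edge weights are non-negative, cycles in the graph need no special handling in this formulation.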

Notes

Acknowledgements

This work is supported in part by US National Science Foundation grant CCF-1816027. Yu Gao was supported by the ACO Program at Georgia Institute of Technology.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Chirag Jain (1)
  • Haowen Zhang (1)
  • Yu Gao (1)
  • Srinivas Aluru (1)
  1. College of Computing, Georgia Institute of Technology, Atlanta, USA
