Enhancing Statistical Multiple Sequence Alignment and Tree Inference Using Structural Information

  • Joseph L. Herman
Part of the Methods in Molecular Biology book series (MIMB, volume 1851)


For highly divergent sequences, there is often insufficient information to reliably construct alignments and phylogenetic trees. Since protein structure may be strongly conserved despite large divergences in sequence, structural information can be used to help identify homology in such cases.

Although well-studied models of sequence evolution exist, structurally informed alignment methods have typically relied on geometric measures of deviation that do not take into account the underlying mutational processes. To integrate structural information into sequence-based evolutionary models, we recently developed a stochastic model of structural evolution on a phylogenetic tree and implemented it as the StructAlign plugin for the StatAlign statistical alignment package.

In this chapter, we will outline the types of analyses that can be carried out using StructAlign, illustrating how the inclusion of structural information can be used to inform joint estimation of alignments and trees. StructAlign can also be used to infer branch-specific rates of structural evolution, and analysis of an example globin dataset highlights strong variation in the inferred rate across the tree. While structure is more highly conserved within clades, the rate of structural divergence as a function of sequence variation is larger between functionally divergent proteins. Allowing for the rate of structural divergence to vary over the tree results in an improved fit to the empirically observed pairwise RMSD values.

Key words

Protein structure, Structural alignment, RMSD, Statistical alignment, Alignment uncertainty, Bayesian hierarchical models, MCMC, Parallel tempering, Molecular phylogenetics, Globins



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Biomedical Informatics, Harvard Medical School, Boston, USA
