Detecting Amino Acid Coevolution with Bayesian Graphical Models

  • Mariano Avino
  • Art F. Y. Poon
Part of the Methods in Molecular Biology book series (MIMB, volume 1851)


The comparative study of homologous proteins can provide abundant information about the functional and structural constraints on protein evolution. For example, an amino acid substitution that is deleterious on its own may become tolerated in the presence of another substitution at a second site of the protein. A popular approach for detecting coevolving residues is to look for correlated substitution events on branches of the molecular phylogeny relating the protein-coding sequences. Here we describe a machine learning method (Bayesian graphical models), implemented in the open-source phylogenetic software package HyPhy, for extracting a network of coevolving residues from a sequence alignment.
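
As a rough illustration of what "correlated substitution events on branches" means, the sketch below scores pairs of sites by how often their substitutions fall on the same branches. It is not the Bayesian graphical model itself and does not use HyPhy. It assumes that substitutions have already been mapped to branches and summarised as a binary matrix (rows are branches, columns are sites). The matrix `events` and the functions `contingency`, `phi`, and `pairwise_phi` are hypothetical names introduced for this example only.

```python
from itertools import combinations


def contingency(col_a, col_b):
    """Count branches by (substitution at site a?, substitution at site b?)."""
    pairs = list(zip(col_a, col_b))
    n11 = sum(1 for a, b in pairs if a and b)          # both sites changed
    n10 = sum(1 for a, b in pairs if a and not b)      # only site a changed
    n01 = sum(1 for a, b in pairs if not a and b)      # only site b changed
    n00 = sum(1 for a, b in pairs if not a and not b)  # neither changed
    return n11, n10, n01, n00


def phi(col_a, col_b):
    """Phi coefficient of the 2x2 table; 0.0 when a margin is empty."""
    n11, n10, n01, n00 = contingency(col_a, col_b)
    denom = ((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)) ** 0.5
    if denom == 0:
        return 0.0
    return (n11 * n00 - n10 * n01) / denom


def pairwise_phi(events):
    """Return {(i, j): phi} for every pair of site columns in the matrix."""
    n_sites = len(events[0])
    cols = [[row[j] for row in events] for j in range(n_sites)]
    return {
        (i, j): phi(cols[i], cols[j])
        for i, j in combinations(range(n_sites), 2)
    }


# Hypothetical toy data: 6 branches (rows) by 3 sites (columns).
# A 1 means a substitution was mapped to that branch at that site.
events = [
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
]

if __name__ == "__main__":
    for pair, score in pairwise_phi(events).items():
        print(pair, round(score, 3))
```

In this toy matrix, sites 0 and 1 always change on the same branches, so their score is the maximum possible, while sites 0 and 2 never change together and score negatively. A pairwise score of this kind cannot separate direct from indirect associations among sites, which is the motivation for fitting a network model over all sites jointly.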

Key words

Amino acid coevolution, Bayesian graphical model, Hepatitis C virus, HyPhy, Epistasis



This study was supported in part by the Government of Canada through Genome Canada and the Ontario Genomics Institute (OGI-131), and by grants from the Canadian Institutes of Health Research (PJT-153391 and BOP-149562). AFYP was supported by a CIHR New Investigator Award (FRN-130609).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Pathology and Laboratory Medicine, Western University, London, Canada
