Detecting Amino Acid Coevolution with Bayesian Graphical Models

  • Protocol
Computational Methods in Protein Evolution

Part of the book series: Methods in Molecular Biology (MIMB, volume 1851)

Abstract

The comparative study of homologous proteins can provide abundant information about the functional and structural constraints on protein evolution. For example, an amino acid substitution that is deleterious may become permissible in the presence of another substitution at a second site of the protein. A popular approach for detecting coevolving residues is to look for correlated substitution events on branches of the molecular phylogeny relating the protein-coding sequences. Here we describe a machine learning method (Bayesian graphical models), implemented in the open-source phylogenetic software package HyPhy (http://hyphy.org), for extracting a network of coevolving residues from a sequence alignment.

Notes

  1.

    The scripts in this chapter were tested with HyPhy version 2.220170201beta and release 2.2.7. HyPhy is a large and complex software package that is constantly undergoing development by a small team of researchers and programmers, and some of the more specialized features such as BGMs may temporarily break as newer versions are released. If you compiled HyPhy from source, make sure that you are using a single-threaded (HYPHYSP) or multiprocessing-enabled (HYPHYMP) build and not a message passing interface (MPI)-enabled (HYPHYMPI) build; at the time of writing, there were residual issues in the source code related to MPI processing. If you encounter any other problems, please submit an issue at https://github.com/veg/hyphy/issues and we will attend to it as soon as possible.

  2.

    For this type of analysis, we prefer using maximum likelihood (ML) methods to reconstruct trees. If it is not feasible to use ML methods because the alignment contains too many sequences and/or the sequences are too long, we suggest using the approximate ML program FastTree 2 [37], which can be orders of magnitude faster than the standard ML programs. Neighbor-joining (NJ) methods also scale favorably with larger alignments, but tend to be less accurate for reconstructing branch lengths. While there are NJ and ML tree reconstruction methods implemented in HyPhy, they are not as efficient as these specialized programs and we do not recommend using them for larger data sets.

  3.

    A bootstrap support value is an empirical measure of confidence in a specific clade given the data. Most phylogeny reconstruction programs should have an option to omit these values. If you already have a Newick tree file and you just need to remove the support values, you can use the following UNIX command: sed -E 's/\)[0-9.]+:/):/g' [input] > [output].
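
For readers who prefer to stay within Python, the same substitution can be sketched with the re module. This is a minimal illustration rather than part of the published workflow; the function name strip_support_values and the example tree are hypothetical.

```python
import re


def strip_support_values(newick: str) -> str:
    """Remove numeric support labels that directly follow a closing parenthesis."""
    # ")0.95:" becomes "):"; the branch length after the colon is left untouched.
    return re.sub(r"\)[0-9.]+:", "):", newick)


# Hypothetical tree with support values written as a proportion and as a percentage.
tree = "((A:0.1,B:0.2)0.95:0.3,(C:0.1,D:0.2)87:0.4);"
print(strip_support_values(tree))
```

Like the sed command, this only removes labels that are followed by a colon, so tip names and branch lengths are not affected.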

  4.

    From this point onward, we assume that you are using the command-line interface. Unfortunately, this script may not work properly with the GUI because of how HyPhy handles file paths. Even on the command line, this is not straightforward. For example, we used the following invocation in the macOS Terminal:

    HYPHYMP BASEPATH=/usr/local/lib/hyphy/ `pwd`/fit_codon_model.bf

    If you want to take advantage of a multi-core CPU, you can add the argument CPU=[number of cores] immediately after HYPHYMP. Note that not all steps in this analysis are able to utilize multiple threads.
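
If you launch HyPhy from a script, the argument order described in this note can be assembled programmatically. The sketch below only builds the argument list and does not run HyPhy. The helper name build_hyphy_command is hypothetical, and the default BASEPATH is simply the one from our macOS example, so it may differ on your system.

```python
import os


def build_hyphy_command(batch_file, basepath="/usr/local/lib/hyphy/", cpus=None):
    """Assemble an argument list mirroring the invocation shown in this note."""
    cmd = ["HYPHYMP"]
    if cpus is not None:
        # The CPU argument goes immediately after the executable name.
        cmd.append(f"CPU={cpus}")
    cmd.append(f"BASEPATH={basepath}")
    # Use an absolute path to the batch file, as `pwd` does in the shell version.
    cmd.append(os.path.abspath(batch_file))
    return cmd


print(build_hyphy_command("fit_codon_model.bf", cpus=4))
```

The resulting list could be passed to subprocess.run, assuming that HYPHYMP is on your PATH.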

  5.

    If you want to examine this scaling factor, you can find it in the serialized likelihood function generated by this script by searching for the parameter name scalingB.

  6.

    If you’re using an operating system with a desktop environment, it’s often easier to drag the icon representing your file into the terminal window instead of typing out the corresponding path. This works when running HyPhy on the command line, but you need to use backspace to remove the space that is automatically appended to the end of the path. HyPhy won’t be able to locate the file otherwise.

  7.

    Prior to version 2.3.4, the text in HyPhy implies that these options allow rates to vary among branches, not sites: “…branch lengths come from a user-chosen distribution.” We have revised this help text as of version 2.3.4 to indicate that the distributions are used to model rate variation across sites, not branches.

  8.

    A standard codon model is described by a 61-by-61 transition rate matrix and a single parameter R that corresponds to the ratio of the nonsynonymous substitution rate to the synonymous substitution rate. The model assumes that the system moves from one codon to another by single nucleotide substitutions; codon substitutions that require more than one nucleotide change are not allowed.
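
To make the structure of this rate matrix concrete, the sketch below enumerates the 61 sense codons of the standard genetic code and classifies a pair of codons as synonymous, nonsynonymous, or disallowed (more than one nucleotide difference). It is an illustration only and does not reproduce the HyPhy implementation; the names substitution_class and SENSE_CODONS are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons ordered by base in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Stop codons are excluded, which leaves the 61 states of the codon model.
SENSE_CODONS = [codon for codon, aa in CODON_TABLE.items() if aa != "*"]


def substitution_class(codon_a, codon_b):
    """Classify the change from codon_a to codon_b under a single-step codon model."""
    n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
    if n_diff == 0:
        return "identical"
    if n_diff > 1:
        # Changes requiring two or three nucleotide substitutions have rate zero.
        return "disallowed"
    if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
        return "synonymous"
    return "nonsynonymous"


print(len(SENSE_CODONS))
print(substitution_class("CTT", "CTC"), substitution_class("ATG", "ATA"))
```

In a model of this kind, only the entries classified as nonsynonymous would be scaled by the parameter R.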

  9.

    Some phylogeny reconstruction programs truncate sequence labels and cause an error at this stage. For example, neither RAxML nor FastTree 2 will read sequence labels beyond a whitespace character. A quick fix in this situation is to replace all whitespace characters with underscores in a text editor or with sed.
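
The same fix can be scripted. The sketch below assumes that your sequences are in FASTA format (labels on lines beginning with ">"), which may not match your files; the function name sanitize_fasta_labels and the example label are hypothetical.

```python
import re


def sanitize_fasta_labels(lines):
    """Yield FASTA lines with runs of whitespace in each label replaced by "_"."""
    for line in lines:
        if line.startswith(">"):
            label = line[1:].strip()
            # Collapse each run of spaces or tabs into a single underscore.
            yield ">" + re.sub(r"\s+", "_", label)
        else:
            # Sequence lines are passed through without the trailing newline.
            yield line.rstrip("\n")


records = [">HCV 1a isolate H77\n", "ATGGCC\n"]
print(list(sanitize_fasta_labels(records)))
```

Apply the same renaming before building the tree so that the labels in the alignment and in the tree remain identical.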

  10.

    By convention, we use the file extension .lf and keep the same basename as the codon data file. This makes it easier to track files that belong to the same workflow.

  11.

    NEXUS is a widespread format with known issues with standardization and usability, and has been implemented in diverse and often incompatible ways by multiple programs.

  12.

    We have previously found this list output to be a more convenient format for debugging the script. It’s usually a good idea to manually compare entries in this list against your sequence alignment to make sure that things make sense.

  13.

    Most phylogenetic tree reconstruction methods, such as maximum likelihood or neighbor-joining, will output an unrooted tree. For an unrooted tree, the labels will be generated for the deepest internal node.

  14.

    For example, you can customize on a node-by-node basis the number of “parental” nodes on which a given node can be conditionally dependent. You can also load a serialized BGM from an XML Bayesian Interchange format file and use this model to simulate additional data sets. For more details, please refer to the file bayesgraph.ibf and the batch file tests/hbltests/BayesianGraphicalModels/TestBGM.bf in the HyPhy source code distribution.

  15.

    As a general rule of thumb, we try not to build a BGM that has many more nodes than observations. The number of substitutions at each site provides a meaningful criterion for reducing the dimensionality of our data.
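
The filtering idea can be sketched as follows. The data layout is an assumption made for illustration: a list of rows, one per branch, with a 1 where a substitution was mapped to a site and 0 otherwise. The function name filter_sites and the threshold are hypothetical, and this is not the code used by the HyPhy script.

```python
def filter_sites(substitution_matrix, site_labels, min_substitutions):
    """Keep only the sites (columns) with at least min_substitutions events."""
    # Column sums give the number of branches with a substitution at each site.
    counts = [sum(column) for column in zip(*substitution_matrix)]
    keep = [j for j, n in enumerate(counts) if n >= min_substitutions]
    filtered = [[row[j] for j in keep] for row in substitution_matrix]
    return filtered, [site_labels[j] for j in keep]


# Four branches (rows) by three sites (columns).
matrix = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
]
filtered, labels = filter_sites(matrix, ["site1", "site2", "site3"], min_substitutions=2)
print(labels)
```

Raising the threshold reduces the number of nodes in the network while the number of observations (branches) stays the same.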

  16.

    This is where the ability to customize the analysis implemented in the bayesgraph.bf script can be very useful. If you have prior information that a subset of codon sites is involved in a large number of interactions, the computational complexity of increasing the number of parents can be greatly reduced by modifying this parameter for only these sites.

  17.

    In an MCMC run, autocorrelation arises when successive samples lie very close together in parameter space, so that a short run of samples is unrepresentative of the true underlying posterior distribution. We therefore try to reduce autocorrelation so that the MCMC sample provides a more precise estimate of the posterior distribution. One way to accomplish this is to thin the chain by keeping only every n-th step.
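
The effect of thinning can be illustrated with a small simulation. The chain below is not MCMC output from HyPhy; it is a first-order autoregressive series used as a stand-in for a strongly autocorrelated trace, and the function names are hypothetical.

```python
import random


def lag1_autocorrelation(x):
    """Estimate the correlation between consecutive values of a series."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    return cov / var


def thin(chain, step):
    """Keep every step-th sample of the chain."""
    return chain[::step]


# Stand-in for an autocorrelated trace: each value depends on the previous one.
random.seed(1)
chain = [0.0]
for _ in range(19999):
    chain.append(0.9 * chain[-1] + random.gauss(0.0, 1.0))

raw = lag1_autocorrelation(chain)
thinned = lag1_autocorrelation(thin(chain, 10))
print(round(raw, 2), round(thinned, 2))
```

Thinning discards samples, so it trades sample size for lower autocorrelation among the samples that are kept.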

  18.

    We have provided most of the data files in this example on our GitHub repository at https://github.com/PoonLab/comet-prot/tree/master/data.

  19.

    To generate an amino acid sequence from the column labels, we used the regular expression "[0-9]+,*" to replace all instances with an empty string. In Python, this can be achieved with the re module: seq = re.sub('[0-9]+,*', '', header.strip()), where header is a string variable containing the first line of the CSV file.
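
As a self-contained version of this one-liner, the sketch below wraps the substitution in a function. The header string is a made-up example that assumes each column label is a residue letter followed by its position; check this against the first line of your own CSV file.

```python
import re


def sequence_from_header(header):
    """Strip site numbers and separating commas, leaving the residue letters."""
    return re.sub("[0-9]+,*", "", header.strip())


# Hypothetical header line: residue letter followed by the site number.
header = "M1,S2,T3,N4\n"
print(sequence_from_header(header))  # MSTN
```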

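  20.

    This can be accomplished with the following R commands:

    library(coda)  # provides mcmc(), mcmc.list(), and gelman.diag()
    # read each trace file (no header row), one file per chain
    chain1 <- read.csv("chain1.trace.csv", header=FALSE)
    chain2 <- read.csv("chain2.trace.csv", header=FALSE)
    # combine the first column of each file into a list of chains
    chains <- mcmc.list(mcmc(chain1$V1), mcmc(chain2$V1))
    # compute the Gelman-Rubin diagnostic without discarding a burn-in
    gelman.diag(chains, autoburnin=FALSE)

    where the file names may be different for your run.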
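
To show what this diagnostic measures, the sketch below computes the basic potential scale reduction factor from the within-chain and between-chain variances. It is a simplified illustration: it assumes chains of equal length and omits the degrees-of-freedom correction that coda applies, so its values can differ slightly from those of gelman.diag. The simulated chains and function names are hypothetical.

```python
import math
import random


def variance(x):
    """Sample variance with the usual n - 1 denominator."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)


def gelman_rubin(chains):
    """Basic potential scale reduction factor for chains of equal length."""
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    within = sum(variance(c) for c in chains) / len(chains)   # W
    between = n * variance(means)                             # B
    pooled = (n - 1) / n * within + between / n               # pooled variance estimate
    return math.sqrt(pooled / within)


random.seed(42)
# Two chains sampling the same distribution, and one chain stuck elsewhere.
a = [random.gauss(0.0, 1.0) for _ in range(2000)]
b = [random.gauss(0.0, 1.0) for _ in range(2000)]
c = [random.gauss(3.0, 1.0) for _ in range(2000)]
print(round(gelman_rubin([a, b]), 3), round(gelman_rubin([a, c]), 3))
```

Values close to 1 suggest that the chains are sampling the same distribution, whereas larger values indicate that they have not converged.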

References

  1. Kihara D (2005) The effect of long-range interactions on the secondary structure formation of proteins. Protein Sci 14(8):1955–1963

  2. Sprinzak E, Margalit H (2001) Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol 311(4):681–692

  3. Horner DS, Pirovano W, Pesole G (2007) Correlated substitution analysis and the prediction of amino acid structural contacts. Brief Bioinform 9(1):46–56

  4. Taylor WR, Hamilton RS, Sadowski MI (2013) Prediction of contacts from correlated sequence substitutions. Curr Opin Struct Biol 23(3):473–479

  5. Marks DS, Hopf TA, Sander C (2012) Protein structure prediction from sequence variation. Nat Biotechnol 30(11):1072–1080

  6. De Juan D, Pazos F, Valencia A (2013) Emerging methods in protein co-evolution. Nat Rev Genet 14(4):249

  7. Göbel U, Sander C, Schneider R, Valencia A (1994) Correlated mutations and residue contacts in proteins. Proteins Struct Funct Bioinf 18(4):309–317

  8. Korber B, Farber RM, Wolpert DH, Lapedes AS (1993) Covariation of mutations in the V3 loop of human immunodeficiency virus type 1 envelope protein: an information theoretic analysis. Proc Natl Acad Sci 90(15):7176–7180

  9. Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K (2002) A comprehensive review of genetic association studies. Genet Med 4(2):45–61

  10. Kowarsch A, Fuchs A, Frishman D, Pagel P (2010) Correlated mutations: a hallmark of phenotypic amino acid substitutions. PLoS Comput Biol 6(9):e1000923

  11. Weinreich DM, Delaney NF, DePristo MA, Hartl DL (2006) Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312(5770):111–114

  12. Ivankov DN, Finkelstein AV, Kondrashov FA (2014) A structural perspective of compensatory evolution. Curr Opin Struct Biol 26:104–112

  13. Neher E (1994) How frequent are correlated changes in families of protein sequences? Proc Natl Acad Sci 91(1):98–102

  14. Olmea O, Rost B, Valencia A (1999) Effective use of sequence correlation and conservation in fold recognition. J Mol Biol 293(5):1221–1239

  15. Atchley WR, Wollenberg KR, Fitch WM, Terhalle W, Dress AW (2000) Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol Biol Evol 17(1):164–178

  16. Tillier ER, Lui TW (2003) Using multiple interdependency to separate functional from phylogenetic correlations in protein alignments. Bioinformatics 19(6):750–755

  17. Martin L, Gloor GB, Dunn S, Wahl LM (2005) Using information theory to search for co-evolving residues in proteins. Bioinformatics 21(22):4116–4124

  18. Gouveia-Oliveira R, Pedersen AG (2007) Finding coevolving amino acid residues using row and column weighting of mutual information and multi-dimensional amino acid representation. Algorithms Mol Biol 2(1):12

  19. Fernandes AD, Gloor GB (2010) Mutual information is critically dependent on prior assumptions: would the correct estimate of mutual information please identify itself? Bioinformatics 26(9):1135–1139

  20. Jeong CS, Kim D (2012) Reliable and robust detection of coevolving protein residues. Protein Eng Des Sel 25(11):705–713

  21. Felsenstein J (1985) Phylogenies and the comparative method. Am Nat 125(1):1–15

  22. Shindyalov IN, Kolchanov NA, Sander C (1994) Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng 7(3):349–358

  23. Wollenberg KR, Atchley WR (2000) Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proc Natl Acad Sci 97(7):3288–3291

  24. Gloor GB, Martin LC, Wahl LM, Dunn SD (2005) Mutual information in protein multiple sequence alignments reveals two classes of coevolving positions. Biochemistry 44(19):7156–7165

  25. Pollock DD, Taylor WR, Goldman N (1999) Coevolving protein residues: maximum likelihood identification and relationship to structure. J Mol Biol 287(1):187–198

  26. Tufféry P, Darlu P (2000) Exploring a phylogenetic approach for the detection of correlated substitutions in proteins. Mol Biol Evol 17(11):1753–1759

  27. Poon AFY, Lewis FI, Pond SLK, Frost SDW (2007) An evolutionary-network model reveals stratified interactions in the V3 loop of the HIV-1 envelope. PLoS Comput Biol 3(11):e231

  28. Talavera D, Lovell SC, Whelan S (2015) Covariation is a poor measure of molecular coevolution. Mol Biol Evol 32(9):2456–2468

  29. Fodor AA, Aldrich RW (2004) Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins Struct Funct Bioinf 56(2):211–221

  30. Pearl J (1986) Fusion, propagation, and structuring in belief networks. Artif Intell 29(3):241–288

  31. Friedman N, Koller D (2003) Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Mach Learn 50(1–2):95–125

  32. Pond SLK, Frost SDW, Muse SV (2005) HyPhy: hypothesis testing using phylogenies. Bioinformatics 21(5):676–679

  33. Delport W, Poon AFY, Frost SDW, Kosakovsky Pond SL (2010) Datamonkey 2010: a suite of phylogenetic analysis tools for evolutionary biology. Bioinformatics 26(19):2455–2457

  34. Poon AFY, Lewis FI, Frost SDW, Kosakovsky Pond SL (2008) Spidermonkey: rapid detection of co-evolving sites using Bayesian graphical models. Bioinformatics 24(17):1949–1950

  35. Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9):1312–1313

  36. Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59(3):307–321

  37. Price MN, Dehal PS, Arkin AP (2010) FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS ONE 5(3):e9490

  38. Holmes S (2003) Bootstrapping phylogenetic trees: theory and methods. Stat Sci 18:241–255

  39. Muse SV, Gaut BS (1994) A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol 11(5):715–724

  40. Yang Z (1993) Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol 10(6):1396–1401

  41. Felsenstein J, Churchill GA (1996) A hidden Markov model approach to variation among sites in rate of evolution. Mol Biol Evol 13(1):93–104

  42. Swofford D, Begle DP (1993) PAUP: Phylogenetic analysis using parsimony, Version 3.1, March 1993. Center for Biodiversity, Illinois Natural History Survey

  43. Tamura K, Nei M (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol 10(3):512–526

  44. Posada D (2003) Using MODELTEST and PAUP* to select a model of nucleotide substitution. Curr Protoc Bioinformatics 6–5. https://doi.org/10.1002/0471250953.bi0605s00

  45. Maddison DR, Swofford DL, Maddison WP (1997) NEXUS: an extensible file format for systematic information. Syst Biol 46(4):590–621

  46. Joy JB, Liang RH, McCloskey RM, Nguyen T, Poon AFY (2016) Ancestral reconstruction. PLoS Comput Biol 12(7):e1004763

  47. Nielsen R (2002) Mapping mutations on phylogenies. Syst Biol 51(5):729–739

  48. Pupko T, Pe'er I, Shamir R, Graur D (2000) A fast algorithm for joint reconstruction of ancestral amino acid sequences. Mol Biol Evol 17(6):890–896

  49. Ellson J, Gansner E, Koutsofios L, North SC, Woodhull G (2001) Graphviz—open source graph drawing tools. In: International symposium on graph drawing. Springer, Berlin, pp 483–484

  50. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504

  51. Bastian M, Heymann S, Jacomy M et al (2009) Gephi: an open source software for exploring and manipulating networks. In: Proceedings of the third international ICWSM conference, vol 8, pp 361–362

  52. Simmonds P (2004) Genetic diversity and evolution of hepatitis C virus–15 years on. J Gen Virol 85(11):3173–3188

  53. Blach S, Zeuzem S, Manns M, Altraif I, Duberg AS, Muljono DH, Waked I, Alavian SM, Lee MH, Negro F et al (2017) Global prevalence and genotype distribution of hepatitis C virus infection in 2015: a modelling study. Lancet Gastroenterol Hepatol 2(3):161–176

  54. Campo D, Dimitrova Z, Mitchell RJ, Lara J, Khudyakov Y (2008) Coordinated evolution of the hepatitis C virus. Proc Natl Acad Sci 105(28):9685–9690

  55. Aurora R, Donlin MJ, Cannon NA, Tavis JE (2009) Genome-wide hepatitis C virus amino acid covariance networks can predict response to antiviral therapy in humans. J Clin Invest 119(1):225–236

  56. McCloskey RM, Liang RH, Joy JB, Krajden M, Montaner JS, Harrigan PR, Poon AF (2014) Global origin and transmission of hepatitis C virus nonstructural protein 3 Q80K polymorphism. J Infect Dis 211(8):1288–1295

  57. Poveda E, Wyles DL, Mena Á, Pedreira JD, Castro-Iglesias Á, Cachay E (2014) Update on hepatitis C virus resistance to direct-acting antiviral agents. Antivir Res 108:181–191

  58. Combet C, Garnier N, Charavay C, Grando D, Crisan D, Lopez J, Dehne-Garcia A, Geourjon C, Bettler E, Hulo C et al (2006) euHCVdb: the European hepatitis C virus database. Nucleic Acids Res 35(Suppl_1):D363–D366

  59. Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30(4):772–780

  60. Larsson A (2014) AliView: a fast and lightweight alignment viewer and editor for large datasets. Bioinformatics 30(22):3276–3278

  61. Darriba D, Taboada GL, Doallo R, Posada D (2012) jModelTest 2: more models, new heuristics and parallel computing. Nat Methods 9(8):772

  62. Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52(5):696–704

  63. Yu G, Smith DK, Zhu H, Guan Y, Lam TTY (2017) ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol Evol 8(1):28–36

  64. Plummer M, Best N, Cowles K, Vines K (2006) CODA: convergence diagnosis and output analysis for MCMC. R News 6(1):7–11

  65. Gelman A, Rubin DB (1992) Inference from iterative simulation using multiple sequences. Stat Sci 7:457–472

  66. Ranjith-Kumar C, Kao CC (2006) Biochemical activities of the HCV NS5B RNA-dependent RNA polymerase. In: Tan S (ed) Hepatitis C viruses: genomes and molecular biology. Horizon Bioscience, Norfolk, pp 293–310

  67. Hong Z, Cameron CE, Walker MP, Castro C, Yao N, Lau JY, Zhong W (2001) A novel mechanism to ensure terminal initiation by hepatitis C virus NS5B polymerase. Virology 285(1):6–11

Acknowledgements

This study was supported in part by the Government of Canada through Genome Canada and the Ontario Genomics Institute (OGI-131), and by grants from the Canadian Institutes of Health Research (PJT-153391 and BOP-149562). AFYP was supported by a CIHR New Investigator Award (FRN-130609).

Author information

Corresponding author

Correspondence to Mariano Avino.

Copyright information

© 2019 Springer Science+Business Media, LLC, part of Springer Nature

Cite this protocol

Avino, M., Poon, A.F.Y. (2019). Detecting Amino Acid Coevolution with Bayesian Graphical Models. In: Sikosek, T. (eds) Computational Methods in Protein Evolution. Methods in Molecular Biology, vol 1851. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-8736-8_6

  • DOI: https://doi.org/10.1007/978-1-4939-8736-8_6

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-8735-1

  • Online ISBN: 978-1-4939-8736-8

  • eBook Packages: Springer Protocols
