Detecting Amino Acid Coevolution with Bayesian Graphical Models

Avino, Mariano; Poon, Art F. Y.

doi:10.1007/978-1-4939-8736-8_6

Part of the book series: Methods in Molecular Biology (MIMB, volume 1851)

Abstract

The comparative study of homologous proteins can provide abundant information about the functional and structural constraints on protein evolution. For example, an amino acid substitution that is deleterious may become permissive in the presence of another substitution at a second site of the protein. A popular approach for detecting coevolving residues is by looking for correlated substitution events on branches of the molecular phylogeny relating the protein-coding sequences. Here we describe a machine learning method (Bayesian graphical models) implemented in the open-source phylogenetic software package HyPhy, http://hyphy.org, for extracting a network of coevolving residues from a sequence alignment.

Notes

1.
The scripts in this chapter were tested with HyPhy version 2.220170201beta and release 2.2.7. HyPhy is a large and complex software package that is constantly undergoing development by a small team of researchers and programmers, and some of the more specialized features such as BGMs may temporarily break as newer versions are released. If you compiled HyPhy from source, make sure that you are using a single-threaded (HYPHYSP) or multiprocessing-enabled (HYPHYMP) build and not a message passing interface (MPI)-enabled (HYPHYMPI) build; at the time of writing, there were residual issues in the source code related to MPI processing. If you encounter any other problems, please submit an issue at https://github.com/veg/hyphy/issues and we will attend to it as soon as possible.
2.
For this type of analysis, we prefer using maximum likelihood (ML) methods to reconstruct trees. If it is not feasible to use ML methods because there are too many sequences and/or the sequences are too long, we suggest using the approximate ML program FastTree 2 [37], which can be orders of magnitude faster than the standard ML programs. Neighbor-joining (NJ) methods also scale favorably with larger alignments, but tend to be less accurate for reconstructing branch lengths. While there are NJ and ML tree reconstruction methods implemented in HyPhy, they are not as efficient as these specialized programs and we do not recommend using them for larger data sets.
3.
A bootstrap support value is an empirical measure of confidence in a specific clade given the data. Most phylogeny reconstruction programs have an option to omit these values. If you already have a Newick tree file and only need to remove the support values, you can use the following UNIX command: sed -E 's/\)[0-9.]+:/):/g' [input] > [output]. The command must be typed with straight single quotes. The closing parenthesis in the pattern is escaped here so that it is read as a literal character by different versions of sed.
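For readers who prefer a scripting language, the substitution described in this note can be sketched in Python. This is a minimal sketch and not part of the original workflow: the function name strip_support_values and the example tree are hypothetical, and it assumes that each support value is a plain number written immediately after a closing parenthesis and followed by a branch length.

```python
import re


def strip_support_values(newick: str) -> str:
    """Remove numeric labels that sit between ')' and ':' in a Newick string."""
    return re.sub(r"\)[0-9.]+:", "):", newick)


# Hypothetical tree with support values 0.95 and 87 on two internal nodes.
example = "((A:0.1,B:0.2)0.95:0.3,(C:0.1,D:0.2)87:0.4);"
cleaned = strip_support_values(example)
print(cleaned)
```

Internal node labels that are not followed by a branch length, such as a label on the root, are left untouched by this pattern.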
4.
From this point onward, we assume that you are using the command-line interface. Unfortunately, this script may not work properly with the GUI because of how HyPhy handles file paths. Even on the command line, this is not straightforward. For example, we used the following invocation in the macOS Terminal: HYPHYMP BASEPATH=/usr/local/lib/hyphy/ `pwd`/fit_codon_model.bf (the backticks around pwd substitute the current working directory). If you want to take advantage of a multi-core CPU, you can add the argument CPU=[number of cores] immediately after HYPHYMP. Note that not all steps in this analysis are able to utilize multiple threads.
5.
If you want to examine this scaling factor, you can find it in the serialized likelihood function generated by this script by searching for the parameter name scalingB.
6.
If you’re using an operating system with a desktop environment, it’s often easier to drag the icon representing your file into the terminal window instead of typing out the corresponding path. This works when running HyPhy on the command line, but you need to use backspace to remove the space that is automatically appended to the end of the path. HyPhy won’t be able to locate the file otherwise.
7.
Prior to version 2.3.4, the text in HyPhy implies that these options allow rates to vary among branches, not sites: “…branch lengths come from a user-chosen distribution.” We have revised this help text as of version 2.3.4 to indicate that the distributions are used to model rate variation across sites, not branches.
8.
A standard codon model is described by a 61-by-61 transition rate matrix and a single parameter R that corresponds to the ratio of non-synonymous and synonymous substitution rates. The model assumes that the system moves from one codon to another by single nucleotide substitutions; codon substitutions that require more than one nucleotide change are not allowed.
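The single-nucleotide restriction described in this note can be illustrated with a short Python sketch. It is not HyPhy code: it assumes the standard genetic code with its three stop codons, and the names sense_codons, n_differences, and allowed are hypothetical. It only marks which entries of the 61-by-61 matrix may be nonzero and does not compute any rates.

```python
from itertools import product

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code

# The 61 sense codons index the rows and columns of the rate matrix.
sense_codons = [
    "".join(triplet)
    for triplet in product(BASES, repeat=3)
    if "".join(triplet) not in STOP_CODONS
]


def n_differences(codon_a: str, codon_b: str) -> int:
    """Count the nucleotide positions at which two codons differ."""
    return sum(a != b for a, b in zip(codon_a, codon_b))


# allowed[i][j] is True when codon i can change into codon j in one step,
# that is, when the two codons differ at exactly one nucleotide position.
allowed = [
    [n_differences(a, b) == 1 for b in sense_codons]
    for a in sense_codons
]

n_allowed = sum(sum(row) for row in allowed)
print(len(sense_codons), n_allowed)
```

Under these assumptions, only 526 of the 3,660 off-diagonal entries are allowed to be nonzero; all other codon pairs require two or three nucleotide changes.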
9.
Some phylogeny reconstruction programs truncate sequence labels and cause an error at this stage. For example, neither RAxML nor FastTree 2 will read sequence labels beyond a whitespace character. A quick fix in this situation is to replace all whitespace characters with underscores in a text editor or with sed.
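The whitespace replacement described in this note could also be scripted. The following Python sketch assumes that the sequences are stored in FASTA format, which is an assumption on our part; the function names and the example labels are hypothetical.

```python
import re


def sanitize_label(label: str) -> str:
    """Replace every whitespace character in a sequence label with '_'."""
    return re.sub(r"\s", "_", label.strip())


def sanitize_fasta(lines):
    """Yield FASTA lines, changing header lines only (those starting with '>')."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield ">" + sanitize_label(line[1:])
        else:
            yield line


# Hypothetical records whose labels contain a space or a tab.
example = [">HCV 1a isolate H77\n", "ATGAGC\n", ">HCV_1b\tJ4\n", "ATGAGT\n"]
cleaned = list(sanitize_fasta(example))
print("\n".join(cleaned))
```

The labels must be sanitized before the tree is reconstructed, so that the tip labels in the tree match the labels in the alignment.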
10.
By convention, we use the file extension .lf and keep the same basename as the codon data file. This makes it easier to track files that belong to the same workflow.
11.
NEXUS is a widespread format with known issues with standardization and usability, and has been implemented in diverse and often incompatible ways by multiple programs.
12.
We have previously found this list output to be a more convenient format for debugging the script. It’s usually a good idea to manually compare entries in this list against your sequence alignment to make sure that things make sense.
13.
Most phylogenetic tree reconstruction methods, such as maximum likelihood or neighbor-joining, will output an unrooted tree. For an unrooted tree, the labels will be generated for the deepest internal node.
14.
For example, you can customize on a node-by-node basis the number of “parental” nodes on which a given node can be conditionally dependent. You can also load a serialized BGM from an XML Bayesian Interchange format file and use this model to simulate additional data sets. For more details, please refer to the file bayesgraph.ibf and the batch file tests/hbltests/BayesianGraphicalModels/TestBGM.bf in the HyPhy source code distribution.
15.
As a general rule of thumb, we try not to build a BGM that has many more nodes than observations. The number of substitutions provides a meaningful criterion for reducing the dimensionality of our data.
16.
This is where the ability to customize the analysis implemented in the bayesgraph.bf script can be very useful. If you have prior information that a subset of codon sites is involved in a large number of interactions, the computational cost of increasing the number of parents can be greatly reduced by modifying this parameter for only these sites.
17.
In an MCMC run, we observe autocorrelation when successive samples are very close to each other in the parameter space, which makes the sample unrepresentative of the true underlying posterior distribution. Therefore, we try to decrease autocorrelation so that the MCMC sample provides a more precise estimate of the posterior distribution. One way to accomplish this is by down-sampling (thinning) the chain to every n-th step.
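The effect of down-sampling can be illustrated with a toy example in Python. This sketch does not use HyPhy output: the chain is simulated from a first-order autoregressive process as a stand-in for an autocorrelated MCMC trace, and the function names, the seed, and the thinning interval of 20 are all arbitrary choices made for illustration.

```python
import random


def thin(chain, step):
    """Keep every step-th sample of a chain, starting from the first one."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return chain[::step]


def lag1_autocorrelation(chain):
    """Estimate the correlation between consecutive samples of a chain."""
    n = len(chain)
    mean = sum(chain) / n
    total = sum((x - mean) ** 2 for x in chain)
    lagged = sum((chain[i] - mean) * (chain[i + 1] - mean) for i in range(n - 1))
    return lagged / total


# Simulate a strongly autocorrelated trace (toy stand-in for an MCMC chain).
random.seed(1)
chain = [0.0]
for _ in range(4999):
    chain.append(0.9 * chain[-1] + random.gauss(0.0, 1.0))

thinned = thin(chain, 20)
print(len(chain), len(thinned))
print(lag1_autocorrelation(chain), lag1_autocorrelation(thinned))
```

Thinning reduces the number of retained samples, so the interval is a trade-off between autocorrelation and sample size.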
18.
We have provided most of the data files in this example on our GitHub repository at https://github.com/PoonLab/comet-prot/tree/master/data.
19.
To generate an amino acid sequence from the column labels, we used the regular expression "[0-9]+,*" to replace all instances with an empty string. In Python, this can be achieved with the re module: seq = re.sub('[0-9]+,*', '', header.strip()), where header is a string variable containing the first line of the CSV file.
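A self-contained version of this step is sketched below. The header string is a hypothetical example: we assume here that each column label consists of a one-letter amino acid code followed by the site number, which should be checked against your own CSV file.

```python
import re


def labels_to_sequence(header: str) -> str:
    """Strip site numbers and commas from a CSV header of residue labels."""
    return re.sub(r"[0-9]+,*", "", header.strip())


# Hypothetical first line of the CSV file (residue letter + site number).
header = "S1,M2,S3,Y4\n"
seq = labels_to_sequence(header)
print(seq)
```

If the labels in your file are quoted or use a different layout, the regular expression will need to be adjusted accordingly.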
20.
This can be accomplished with the following R commands:

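library(coda)  # MCMC convergence diagnostics
# each trace file holds one chain; V1 is its first column
chain1 <- read.csv("chain1.trace.csv", header=FALSE)
chain2 <- read.csv("chain2.trace.csv", header=FALSE)
chains <- mcmc.list(mcmc(chain1$V1), mcmc(chain2$V1))
# Gelman-Rubin diagnostic computed on the full chains (no automatic burn-in)
gelman.diag(chains, autoburnin=FALSE)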

where the file names may be different for your run.
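To make the quantity reported by this diagnostic more concrete, the following Python sketch computes a basic form of the Gelman-Rubin potential scale reduction factor from two or more chains of equal length. It is not the coda implementation: coda applies an additional degrees-of-freedom correction, so its point estimates will differ slightly. The function name and the toy chains are hypothetical.

```python
from statistics import mean, variance


def potential_scale_reduction(chains):
    """Basic Gelman-Rubin statistic for a list of equal-length chains."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or any(len(chain) != n for chain in chains):
        raise ValueError("need at least two chains of equal length")
    chain_means = [mean(chain) for chain in chains]
    within = mean([variance(chain) for chain in chains])  # W
    between = n * variance(chain_means)                   # B
    # Pooled estimate of the posterior variance.
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Toy chains: the first pair agrees, the second pair is far apart.
agreeing = potential_scale_reduction([[1, 2, 3, 4], [1, 2, 3, 4]])
diverging = potential_scale_reduction([[1, 2, 3, 4], [11, 12, 13, 14]])
print(agreeing, diverging)
```

Values close to 1 indicate that the between-chain variance is small relative to the within-chain variance, while values well above 1 suggest that the chains have not converged to the same distribution.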

References

Kihara D (2005) The effect of long-range interactions on the secondary structure formation of proteins. Protein Sci 14(8):1955–1963
Sprinzak E, Margalit H (2001) Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol 311(4):681–692
Horner DS, Pirovano W, Pesole G (2007) Correlated substitution analysis and the prediction of amino acid structural contacts. Brief Bioinform 9(1):46–56
Taylor WR, Hamilton RS, Sadowski MI (2013) Prediction of contacts from correlated sequence substitutions. Curr Opin Struct Biol 23(3):473–479
Marks DS, Hopf TA, Sander C (2012) Protein structure prediction from sequence variation. Nat Biotechnol 30(11):1072–1080
De Juan D, Pazos F, Valencia A (2013) Emerging methods in protein co-evolution. Nat Rev Genet 14(4):249
Göbel U, Sander C, Schneider R, Valencia A (1994) Correlated mutations and residue contacts in proteins. Proteins Struct Funct Bioinf 18(4):309–317
Korber B, Farber RM, Wolpert DH, Lapedes AS (1993) Covariation of mutations in the V3 loop of human immunodeficiency virus type 1 envelope protein: an information theoretic analysis. Proc Natl Acad Sci 90(15):7176–7180
Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K (2002) A comprehensive review of genetic association studies. Genet Med 4(2):45–61
Kowarsch A, Fuchs A, Frishman D, Pagel P (2010) Correlated mutations: a hallmark of phenotypic amino acid substitutions. PLoS Comput Biol 6(9):e1000923
Weinreich DM, Delaney NF, DePristo MA, Hartl DL (2006) Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312(5770):111–114
Ivankov DN, Finkelstein AV, Kondrashov FA (2014) A structural perspective of compensatory evolution. Curr Opin Struct Biol 26:104–112
Neher E (1994) How frequent are correlated changes in families of protein sequences? Proc Natl Acad Sci 91(1):98–102
Olmea O, Rost B, Valencia A (1999) Effective use of sequence correlation and conservation in fold recognition. J Mol Biol 293(5):1221–1239
Atchley WR, Wollenberg KR, Fitch WM, Terhalle W, Dress AW (2000) Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol Biol Evol 17(1):164–178
Tillier ER, Lui TW (2003) Using multiple interdependency to separate functional from phylogenetic correlations in protein alignments. Bioinformatics 19(6):750–755
Martin L, Gloor GB, Dunn S, Wahl LM (2005) Using information theory to search for co-evolving residues in proteins. Bioinformatics 21(22):4116–4124
Gouveia-Oliveira R, Pedersen AG (2007) Finding coevolving amino acid residues using row and column weighting of mutual information and multi-dimensional amino acid representation. Algorithms Mol Biol 2(1):12
Fernandes AD, Gloor GB (2010) Mutual information is critically dependent on prior assumptions: would the correct estimate of mutual information please identify itself? Bioinformatics 26(9):1135–1139
Jeong CS, Kim D (2012) Reliable and robust detection of coevolving protein residues. Protein Eng Des Sel 25(11):705–713
Felsenstein J (1985) Phylogenies and the comparative method. Am Nat 125(1):1–15
Shindyalov IN, Kolchanov NA, Sander C (1994) Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng 7(3):349–358
Wollenberg KR, Atchley WR (2000) Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proc Natl Acad Sci 97(7):3288–3291
Gloor GB, Martin LC, Wahl LM, Dunn SD (2005) Mutual information in protein multiple sequence alignments reveals two classes of coevolving positions. Biochemistry 44(19):7156–7165
Pollock DD, Taylor WR, Goldman N (1999) Coevolving protein residues: maximum likelihood identification and relationship to structure. J Mol Biol 287(1):187–198
Tufféry P, Darlu P (2000) Exploring a phylogenetic approach for the detection of correlated substitutions in proteins. Mol Biol Evol 17(11):1753–1759
Poon AFY, Lewis FI, Pond SLK, Frost SDW (2007) An evolutionary-network model reveals stratified interactions in the V3 loop of the HIV-1 envelope. PLoS Comput Biol 3(11):e231
Talavera D, Lovell SC, Whelan S (2015) Covariation is a poor measure of molecular coevolution. Mol Biol Evol 32(9):2456–2468
Fodor AA, Aldrich RW (2004) Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins Struct Funct Bioinf 56(2):211–221
Pearl J (1986) Fusion, propagation, and structuring in belief networks. Artif Intell 29(3):241–288
Friedman N, Koller D (2003) Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Mach Learn 50(1–2):95–125
Pond SLK, Frost SDW, Muse SV (2005) HyPhy: hypothesis testing using phylogenies. Bioinformatics 21(5):676–679
Delport W, Poon AFY, Frost SDW, Kosakovsky Pond SL (2010) Datamonkey 2010: a suite of phylogenetic analysis tools for evolutionary biology. Bioinformatics 26(19):2455–2457
Poon AFY, Lewis FI, Frost SDW, Kosakovsky Pond SL (2008) Spidermonkey: rapid detection of co-evolving sites using Bayesian graphical models. Bioinformatics 24(17):1949–1950
Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9):1312–1313
Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59(3):307–321
Price MN, Dehal PS, Arkin AP (2010) FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS ONE 5(3):e9490
Holmes S (2003) Bootstrapping phylogenetic trees: theory and methods. Stat Sci 18:241–255
Muse SV, Gaut BS (1994) A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol 11(5):715–724
Yang Z (1993) Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol 10(6):1396–1401
Felsenstein J, Churchill GA (1996) A hidden Markov model approach to variation among sites in rate of evolution. Mol Biol Evol 13(1):93–104
Swofford D, Begle DP (1993) PAUP: Phylogenetic analysis using parsimony, Version 3.1, March 1993. Center for Biodiversity, Illinois Natural History Survey
Tamura K, Nei M (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol 10(3):512–526
Posada D (2003) Using MODELTEST and PAUP* to select a model of nucleotide substitution. Curr Protoc Bioinformatics 6–5. https://doi.org/10.1002/0471250953.bi0605s00
Maddison DR, Swofford DL, Maddison WP (1997) NEXUS: an extensible file format for systematic information. Syst Biol 46(4):590–621
Joy JB, Liang RH, McCloskey RM, Nguyen T, Poon AFY (2016) Ancestral reconstruction. PLoS Comput Biol 12(7):e1004763
Nielsen R (2002) Mapping mutations on phylogenies. Syst Biol 51(5):729–739
Pupko T, Pe'er I, Shamir R, Graur D (2000) A fast algorithm for joint reconstruction of ancestral amino acid sequences. Mol Biol Evol 17(6):890–896
Ellson J, Gansner E, Koutsofios L, North SC, Woodhull G (2001) Graphviz—open source graph drawing tools. In: International symposium on graph drawing. Springer, Berlin, pp 483–484
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504
Bastian M, Heymann S, Jacomy M et al (2009) Gephi: an open source software for exploring and manipulating networks. In: Proceedings of the third international ICWSM conference, vol 8, pp 361–362
Simmonds P (2004) Genetic diversity and evolution of hepatitis C virus–15 years on. J Gen Virol 85(11):3173–3188
Blach S, Zeuzem S, Manns M, Altraif I, Duberg AS, Muljono DH, Waked I, Alavian SM, Lee MH, Negro F et al (2017) Global prevalence and genotype distribution of hepatitis C virus infection in 2015: a modelling study. Lancet Gastroenterol Hepatol 2(3):161–176
Campo D, Dimitrova Z, Mitchell RJ, Lara J, Khudyakov Y (2008) Coordinated evolution of the hepatitis C virus. Proc Natl Acad Sci 105(28):9685–9690
Aurora R, Donlin MJ, Cannon NA, Tavis JE (2009) Genome-wide hepatitis C virus amino acid covariance networks can predict response to antiviral therapy in humans. J Clin Invest 119(1):225–236
McCloskey RM, Liang RH, Joy JB, Krajden M, Montaner JS, Harrigan PR, Poon AF (2014) Global origin and transmission of hepatitis C virus nonstructural protein 3 Q80K polymorphism. J Infect Dis 211(8):1288–1295
Poveda E, Wyles DL, Mena Á, Pedreira JD, Castro-Iglesias Á, Cachay E (2014) Update on hepatitis C virus resistance to direct-acting antiviral agents. Antivir Res 108:181–191
Combet C, Garnier N, Charavay C, Grando D, Crisan D, Lopez J, Dehne-Garcia A, Geourjon C, Bettler E, Hulo C et al (2006) euHCVdb: the European hepatitis C virus database. Nucleic Acids Res 35(Suppl_1):D363–D366
Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30(4):772–780
Larsson A (2014) AliView: a fast and lightweight alignment viewer and editor for large datasets. Bioinformatics 30(22):3276–3278
Darriba D, Taboada GL, Doallo R, Posada D (2012) jModelTest 2: more models, new heuristics and parallel computing. Nat Methods 9(8):772
Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52(5):696–704
Yu G, Smith DK, Zhu H, Guan Y, Lam TTY (2017) ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol Evol 8(1):28–36
Plummer M, Best N, Cowles K, Vines K (2006) CODA: convergence diagnosis and output analysis for MCMC. R News 6(1):7–11
Gelman A, Rubin DB (1992) Inference from iterative simulation using multiple sequences. Stat Sci 7:457–472
Ranjith-Kumar C, Kao CC (2006) Biochemical activities of the HCV NS5B RNA-dependent RNA polymerase. In: Tan S (ed) Hepatitis C viruses: genomes and molecular biology. Horizon Bioscience, Norfolk, pp 293–310
Hong Z, Cameron CE, Walker MP, Castro C, Yao N, Lau JY, Zhong W (2001) A novel mechanism to ensure terminal initiation by hepatitis C virus NS5B polymerase. Virology 285(1):6–11

Acknowledgements

This study was supported in part by the Government of Canada through Genome Canada and the Ontario Genomics Institute (OGI-131), and by grants from the Canadian Institutes of Health Research (PJT-153391 and BOP-149562). AFYP was supported by a CIHR New Investigator Award (FRN-130609).

Author information

Authors and Affiliations

Department of Pathology and Laboratory Medicine, Western University, London, Canada
Mariano Avino & Art F. Y. Poon

Corresponding author

Correspondence to Mariano Avino.

Editor information

Editors and Affiliations

GlaxoSmithKline, Cellzome – a GSK company Meyerhofstrasse 1, Heidelberg, Baden-Württemberg, Germany
Tobias Sikosek

About this protocol

Cite this protocol

Avino, M., Poon, A.F.Y. (2019). Detecting Amino Acid Coevolution with Bayesian Graphical Models. In: Sikosek, T. (eds) Computational Methods in Protein Evolution. Methods in Molecular Biology, vol 1851. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-8736-8_6

DOI: https://doi.org/10.1007/978-1-4939-8736-8_6
Published: 27 September 2018
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-8735-1
Online ISBN: 978-1-4939-8736-8
eBook Packages: Springer Protocols