A Graph-Based Approach for Detecting Sequence Homology in Highly Diverged Repeat Protein Families

Computational Methods in Protein Evolution

Part of the book series: Methods in Molecular Biology (MIMB, volume 1851)

Abstract

Reconstructing evolutionary relationships in repeat proteins is notoriously difficult because duplicated repeats typically diverge strongly in sequence. The problem is compounded because proteins with a large number of similar repeats are more likely to produce significant local sequence alignments than proteins with fewer copies of the repeat motif. Furthermore, biologically correct sequence alignments are sometimes impossible to achieve when insertion or translocation events disrupt the order of repeats in one of the sequences being aligned. Together, these attributes make traditional phylogenetic methods for studying protein families unreliable for repeat proteins, because such methods depend on accurate sequence alignment.

We present here a practical solution to this problem that combines graph clustering with the open-source software package HH-suite, which enables highly sensitive detection of sequence relationships. Multiple rounds of homology searches, carried out by aligning profile hidden Markov models, generate large sets of related proteins. Representing the relationships between proteins in these sets as graphs and then clustering them with the Markov cluster algorithm enables robust detection of repeat protein subfamilies.
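To make the clustering step concrete, the following is a minimal pure-Python sketch of the Markov cluster (MCL) procedure applied to a small similarity graph. It is an illustration only, not the protocol's implementation: the edge list, the protein names, the edge weights (which one might derive from profile HMM alignment scores), and the function names `build_matrix`, `normalize_columns`, `clusters_from_matrix`, and `mcl` are all assumptions introduced here. For real data sets the dedicated `mcl` program would be the more appropriate tool.

```python
from collections import defaultdict


def build_matrix(edges, nodes):
    """Symmetric weighted adjacency matrix (nested lists) with self-loops."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    m = [[0.0] * n for _ in range(n)]
    for a, b, w in edges:
        i, j = index[a], index[b]
        # Keep the strongest weight if a pair is reported more than once.
        m[i][j] = m[j][i] = max(m[i][j], w)
    for i in range(n):
        m[i][i] = 1.0  # self-loops, commonly added before running MCL
    return m


def normalize_columns(m):
    """Rescale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    totals = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / totals[j] for j in range(n)] for i in range(n)]


def clusters_from_matrix(m, nodes, eps=1e-6):
    """Group nodes whose columns carry weight on the same attractor rows."""
    n = len(nodes)
    groups = defaultdict(list)
    for j, name in enumerate(nodes):
        attractors = frozenset(i for i in range(n) if m[i][j] > eps)
        groups[attractors].append(name)
    return sorted(sorted(group) for group in groups.values())


def mcl(edges, nodes, inflation=2.0, max_iter=100, tol=1e-9):
    """Alternate expansion (matrix squaring) and inflation until stable."""
    n = len(nodes)
    m = normalize_columns(build_matrix(edges, nodes))
    for _ in range(max_iter):
        # Expansion: random-walk flow spreads along paths of length two.
        expanded = [
            [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        # Inflation: element-wise power, then renormalise each column,
        # which strengthens strong flows and weakens weak ones.
        new = normalize_columns([[x ** inflation for x in row] for row in expanded])
        delta = max(abs(new[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = new
        if delta < tol:
            break
    return clusters_from_matrix(m, nodes)


if __name__ == "__main__":
    # Hypothetical graph: two tightly linked groups joined by one weak hit.
    example_edges = [
        ("A", "B", 1.0), ("A", "C", 1.0), ("B", "C", 1.0),
        ("D", "E", 1.0), ("D", "F", 1.0), ("E", "F", 1.0),
        ("C", "D", 0.1),
    ]
    print(mcl(example_edges, ["A", "B", "C", "D", "E", "F"]))
```

The `inflation` parameter controls cluster granularity: higher values tend to split the graph into more, smaller clusters. The grouping step assumes that no node ends up attracted to two distinct attractor systems, which is a simplification of how overlapping clusters are normally handled.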

Corresponding author

Correspondence to Jonathan N. Wells.

Copyright information

© 2019 Springer Science+Business Media, LLC, part of Springer Nature

Cite this protocol

Wells, J.N., Marsh, J.A. (2019). A Graph-Based Approach for Detecting Sequence Homology in Highly Diverged Repeat Protein Families. In: Sikosek, T. (eds) Computational Methods in Protein Evolution. Methods in Molecular Biology, vol 1851. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-8736-8_13

  • DOI: https://doi.org/10.1007/978-1-4939-8736-8_13

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-8735-1

  • Online ISBN: 978-1-4939-8736-8

  • eBook Packages: Springer Protocols
