Learning Models for Aligning Protein Sequences with Predicted Secondary Structure

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2009)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5541)

Abstract

Accurately aligning distant protein sequences is notoriously difficult. A recent approach to improving alignment accuracy is to use additional information such as predicted secondary structure. We introduce several new models for scoring alignments of protein sequences with predicted secondary structure, which use the predictions and their confidences to modify both the substitution and gap cost functions. We present efficient algorithms for computing optimal pairwise alignments under these models, all of which run in near-quadratic time. We also review inverse alignment, an approach to learning the values of the parameters in these models. We then evaluate the accuracy of these models by studying how well an optimal alignment under each model recovers known benchmark reference alignments. Our experiments show that, using parameters learned by inverse alignment, these new secondary-structure-based models provide a significant improvement in alignment accuracy for distant sequences. The best model improves upon the accuracy of the standard sequence alignment model for pairwise alignment by as much as 15% for sequences with less than 25% identity, and improves the accuracy of multiple alignment by 20% for difficult benchmarks whose average accuracy under standard tools is less than 40%.
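
To make the idea of modifying substitution and gap costs concrete, the following is a minimal sketch, not the models or algorithms from the paper. It assumes a simple linear gap cost, a three-state prediction (H for helix, E for strand, C for coil) with a confidence in [0, 1] per residue, and hypothetical constants and function names (`sub_cost`, `gap_cost`, `alignment_cost`) chosen only for illustration. The dynamic program is the textbook quadratic-time global alignment recurrence.

```python
from typing import Sequence

# All constants are illustrative assumptions, not parameter values from the paper.
MISMATCH = 1.0      # base cost of substituting two different residues
SS_MISMATCH = 1.0   # extra cost when the predicted structure states disagree
GAP = 2.0           # base cost of a gap opposite one residue
SS_GAP = 1.0        # extra gap cost inside a predicted helix (H) or strand (E)


def sub_cost(a: str, b: str, sa: str, sb: str, ca: float, cb: float) -> float:
    """Substitution cost, raised when confident structure predictions disagree."""
    cost = 0.0 if a == b else MISMATCH
    if sa != sb:
        cost += SS_MISMATCH * ca * cb  # weight the modifier by both confidences
    return cost


def gap_cost(s: str, c: float) -> float:
    """Cost of a gap opposite a residue; higher in confidently structured regions."""
    return GAP + (SS_GAP * c if s in ("H", "E") else 0.0)


def alignment_cost(x: str, y: str, ssx: str, ssy: str,
                   cx: Sequence[float], cy: Sequence[float]) -> float:
    """Minimum total cost of a global alignment of x and y (O(len(x) * len(y)))."""
    n, m = len(x), len(y)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # x[:i] aligned entirely against gaps
        d[i][0] = d[i - 1][0] + gap_cost(ssx[i - 1], cx[i - 1])
    for j in range(1, m + 1):          # y[:j] aligned entirely against gaps
        d[0][j] = d[0][j - 1] + gap_cost(ssy[j - 1], cy[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1], ssx[i - 1],
                                           ssy[j - 1], cx[i - 1], cy[j - 1]),
                d[i - 1][j] + gap_cost(ssx[i - 1], cx[i - 1]),  # gap in y
                d[i][j - 1] + gap_cost(ssy[j - 1], cy[j - 1]),  # gap in x
            )
    return d[n][m]


if __name__ == "__main__":
    # Deleting a residue costs more in a predicted helix than in predicted coil.
    print(alignment_cost("AC", "A", "HH", "H", [1.0, 1.0], [1.0]))
    print(alignment_cost("AC", "A", "CC", "C", [1.0, 1.0], [1.0]))
```

In this sketch the same deletion is penalized more heavily when the deleted residue lies in a confidently predicted helix than when it lies in coil, which is the kind of behavior a structure-aware gap cost is meant to capture. In the setting the abstract describes, constants such as these would be learned from reference alignments by inverse alignment rather than chosen by hand.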

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kim, E., Wheeler, T., Kececioglu, J. (2009). Learning Models for Aligning Protein Sequences with Predicted Secondary Structure. In: Batzoglou, S. (ed.) Research in Computational Molecular Biology. RECOMB 2009. Lecture Notes in Computer Science, vol 5541. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02008-7_36

  • DOI: https://doi.org/10.1007/978-3-642-02008-7_36

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-02007-0

  • Online ISBN: 978-3-642-02008-7

  • eBook Packages: Computer Science (R0)
