Learning Models for Aligning Protein Sequences with Predicted Secondary Structure

Kim, Eagu; Wheeler, Travis; Kececioglu, John

doi:10.1007/978-3-642-02008-7_36


Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5541)

Included in the following conference series:

Annual International Conference on Research in Computational Molecular Biology

Abstract

Accurately aligning distant protein sequences is notoriously difficult. A recent approach to improving alignment accuracy is to use additional information such as predicted secondary structure. We introduce several new models for scoring alignments of protein sequences with predicted secondary structure, which use the predictions and their confidences to modify both the substitution and gap cost functions. We present efficient algorithms for computing optimal pairwise alignments under these models, all of which run in near-quadratic time. We also review inverse alignment, an approach to learning the values of the parameters in these models. We then evaluate the accuracy of these models by studying how well an optimal alignment under the model recovers known benchmark reference alignments. Our experiments show that, using parameters learned by inverse alignment, these new secondary-structure-based models provide a significant improvement in alignment accuracy for distant sequences. The best model improves upon the accuracy of the standard sequence alignment model for pairwise alignment by as much as 15% for sequences with less than 25% identity, and improves the accuracy of multiple alignment by 20% for difficult benchmarks whose average accuracy under standard tools is less than 40%.
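
To make the idea in the abstract concrete, the following is a minimal sketch of a pairwise alignment cost in which predicted secondary structure and its confidence adjust both the substitution and the gap costs. It is not one of the paper's models. The cost functions, the parameter values (mismatch, bonus, base_gap, penalty), the three-state alphabet H/E/C, and the simple per-residue gap cost are all assumptions chosen for illustration, and in the paper's setting such parameters would be learned by inverse alignment rather than set by hand.

```python
from typing import Sequence


def sub_cost(a: str, b: str, sa: str, sb: str, ca: float, cb: float,
             mismatch: float = 1.0, bonus: float = 0.5) -> float:
    """Cost of aligning residue a to residue b (lower is better).

    Hypothetical modifier: when the predicted states sa and sb agree,
    the cost is reduced in proportion to both prediction confidences.
    """
    cost = 0.0 if a == b else mismatch
    if sa == sb:
        cost -= bonus * ca * cb
    return cost


def gap_cost(s: str, c: float,
             base_gap: float = 1.0, penalty: float = 0.5) -> float:
    """Cost of leaving one residue unaligned.

    Hypothetical modifier: a gap opposite a residue predicted to be in a
    helix (H) or strand (E) costs more, scaled by the confidence c.
    """
    return base_gap + (penalty * c if s in ("H", "E") else 0.0)


def align_cost(x: str, sx: str, cx: Sequence[float],
               y: str, sy: str, cy: Sequence[float]) -> float:
    """Minimum total cost of a global alignment of x and y.

    sx, sy are predicted state strings (one of H, E, C per residue) and
    cx, cy are confidences in [0, 1]. Standard quadratic-time dynamic
    programming with a per-residue (linear) gap cost.
    """
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + gap_cost(sx[i - 1], cx[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + gap_cost(sy[j - 1], cy[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                # align x[i-1] with y[j-1]
                D[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1],
                                           sx[i - 1], sy[j - 1],
                                           cx[i - 1], cy[j - 1]),
                # x[i-1] is left unaligned
                D[i - 1][j] + gap_cost(sx[i - 1], cx[i - 1]),
                # y[j-1] is left unaligned
                D[i][j - 1] + gap_cost(sy[j - 1], cy[j - 1]),
            )
    return D[n][m]
```

With these assumed parameters, calling align_cost("ACD", "HHH", [1.0, 1.0, 1.0], "ACD", "HHH", [1.0, 1.0, 1.0]) rewards each identical, confidently predicted helix column, while setting all confidences to 0.0 reduces the score to that of a plain sequence-only comparison. A per-residue gap cost keeps the sketch short; affine or other length-dependent gap costs would need additional dynamic programming tables.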

Author information

Authors and Affiliations

Department of Computer Science, The University of Arizona, Tucson, AZ 85721, USA
Eagu Kim, Travis Wheeler & John Kececioglu

Editor information

Editors and Affiliations

Computer Science Department, James H. Clark Center, 318 Campus Drive, RM S266, Stanford, CA 94305-5428, USA
Serafim Batzoglou

Cite this paper

Kim, E., Wheeler, T., Kececioglu, J. (2009). Learning Models for Aligning Protein Sequences with Predicted Secondary Structure. In: Batzoglou, S. (ed.) Research in Computational Molecular Biology. RECOMB 2009. Lecture Notes in Computer Science, vol 5541. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02008-7_36

DOI: https://doi.org/10.1007/978-3-642-02008-7_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02007-0
Online ISBN: 978-3-642-02008-7
eBook Packages: Computer Science, Computer Science (R0)