Abstract
Accurately aligning distant protein sequences is notoriously difficult. A recent approach to improving alignment accuracy is to use additional information such as predicted secondary structure. We introduce several new models for scoring alignments of protein sequences with predicted secondary structure, which use the predictions and their confidences to modify both the substitution and gap cost functions. We present efficient algorithms for computing optimal pairwise alignments under these models, all of which run in near-quadratic time. We also review an approach to learning the values of the parameters in these models called inverse alignment. We then evaluate the accuracy of these models by studying how well an optimal alignment under the model recovers known benchmark reference alignments. Our experiments show that using parameters learned by inverse alignment, these new secondary-structure-based models provide a significant improvement in alignment accuracy for distant sequences. The best model improves upon the accuracy of the standard sequence alignment model for pairwise alignment by as much as 15% for sequences with less than 25% identity, and improves the accuracy of multiple alignment by 20% for difficult benchmarks whose average accuracy under standard tools is less than 40%.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Aydin, Z., Altunbasak, Y., Borodovsky, M.: Protein secondary structure prediction for a single-sequence using hidden semi-Markov models. BMC Bioinformatics 7(178), 1–15 (2006)
Bahr, A., Thompson, J.D., Thierry, J.C., Poch, O.: BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Research 29(1), 323–326 (2001)
Balaji, S., Sujatha, S., Kumar, S.S.C., Srinivasan, N.: PALI: a database of alignments and phylogeny of homologous protein structures. Nucleic Acids Research 29(1), 61–65 (2001)
de Berg, M., van Kreveld, M., Overmars, M., Schwarzkopf, O.: Computational Geometry: Algorithms and Applications, 2nd edn. Springer, Berlin (2000)
Cook, W., Cunningham, W., Pulleyblank, W., Schrijver, A.: Combinatorial Optimization. John Wiley and Sons, New York (1998)
Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C.: A model of evolutionary change in proteins. In: Dayhoff, M.O. (ed.) Atlas of Protein Sequence and Structure, vol. 5(3), pp. 345–352. National Biomedical Research Foundation, Washington DC (1978)
Do, C.B., Mahabhashyam, M.S., Brudno, M., Batzoglou, S.: ProbCons: probabilistic consistency based multiple sequence alignment. Genome Research 15, 330–340 (2005)
Do, C.B., Gross, S., Batzoglou, S.: CONTRAlign: discriminative training for protein sequence alignment. In: Apostolico, A., Guerra, C., Istrail, S., Pevzner, P.A., Waterman, M. (eds.) RECOMB 2006. LNCS (LNBI), vol. 3909, pp. 160–174. Springer, Heidelberg (2006)
Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge (1998)
Edgar, R.C.: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32, 1792–1797 (2004)
Galil, Z., Giancarlo, R.: Speeding up dynamic programming with applications to molecular biology. Theoretical Computer Science 64, 107–118 (1989)
Gotoh, O.: An improved algorithm for matching biological sequences. Journal of Molecular Biology 162, 705–708 (1982)
Griggs, J.R., Hanlon, P., Odlyzko, A.M., Waterman, M.S.: On the number of alignments of k sequences. Graphs and Combinatorics 6, 133–146 (1990)
Grötschel, M., Lovász, L., Schrijver, A.: Geometric Algorithms and Combinatorial Optimization. Springer, Berlin (1988)
Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York (1997)
Henikoff, S., Henikoff, J.G.: Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences USA 89, 10915–10919 (1992)
Jones, D.T.: Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology 292, 195–202 (1999)
Katoh, K., Kuma, K.I., Toh, H., Miyata, T.: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Research 33, 511–518 (2005)
Kececioglu, J., Kim, E.: Simple and fast inverse alignment. In: Apostolico, A., Guerra, C., Istrail, S., Pevzner, P.A., Waterman, M. (eds.) RECOMB 2006. LNCS (LNBI), vol. 3909, pp. 441–455. Springer, Heidelberg (2006)
Kececioglu, J., Starrett, D.: Aligning alignments exactly. In: Proceedings of the 8th ACM Conference on Research in Computational Molecular Biology, pp. 85–96 (2004)
Kim, E., Kececioglu, J.: Inverse sequence alignment from partial examples. In: Giancarlo, R., Hannenhalli, S. (eds.) WABI 2007. LNCS (LNBI), vol. 4645, pp. 359–370. Springer, Heidelberg (2007)
Kim, E., Kececioglu, J.: Learning scoring schemes for sequence alignment from partial examples. IEEE/ACM Transactions on Computational Biology and Bioinformatics 5(4), 546–556 (2008)
Lu, Y., Sze, S.-H.: Multiple sequence alignment based on profile alignment of intermediate sequences. Journal of Computational Biology 15(7), 676–777 (2008)
Lüthy, R., McLachlan, A.D., Eisenberg, D.: Secondary structure-based profies: use of structure-conserving scoring tables in searching protein sequence databases for structural similarities. Proteins: Structure, Function, and Genetics 10, 229–239 (1991)
Makhorin, A.: GNU Linear Programming Kit, release 4.8 (2005), http://www.gnu.org/software/glpk
Miller, W., Myers, E.W.: Sequence comparison with concave weighting functions. Bulletin of Mathematical Biology 50, 97–120 (1988)
Mizuguchi, K., Deane, C.M., Blundell, T.L., Overington, J.P.: HOMSTRAD: a database of protein structure alignments for homologous families. Protein Science 7, 2469–2471 (1998)
Notredame, C., Higgins, D.G., Heringa, J.: T-Coffee: a novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology 302, 205–217 (2000)
Pei, J., Grishin, N.V.: MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Research 34, 4364–4374 (2006)
Pei, J., Grishin, N.V.: PROMALS: towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics 23(7), 802–808 (2007)
Sander, C., Schneider, R.: Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins: Structure, Function, and Genetics 9, 56–68 (1991)
Simossis, V.A., Heringa, J.: PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Research 33, W289–W294 (2005)
Söding, J.: Protein homology detection by HMM-HMM comparison. Bioinformatics 21(7), 951–960 (2005)
Thompson, J.D., Higgins, D.G., Gibson, T.J.: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673–4680 (1994)
Van Walle, I., Lasters, I., Wyns, L.: Align-m: A new algorithm for multiple alignment of highly divergent sequences. Bioinformatics 20, 1428–1435 (2004)
Wheeler, T.J., Kececioglu, J.D.: Multiple alignment by aligning alignments. In: Proceedings of the 15th ISCB Conference on Intelligent Systems for Molecular Biology (ISMB), Bioinformatics, vol. 23, pp. i559–i568 (2007)
Wheeler, T.J., Kececioglu, J.D.: Opal: software for aligning multiple biological sequences. Version 0.3.7 (2007), http://opal.cs.arizona.edu
Yu, C.-N., Joachims, T., Elber, R., Pillardy, J.: Support vector training of protein alignment models. Journal of Computational Biology 15(7), 867–880 (2008)
Zhou, H., Zhou, Y.: SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures. Bioinformatics 21, 3615–3621 (2005)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kim, E., Wheeler, T., Kececioglu, J. (2009). Learning Models for Aligning Protein Sequences with Predicted Secondary Structure. In: Batzoglou, S. (eds) Research in Computational Molecular Biology. RECOMB 2009. Lecture Notes in Computer Science(), vol 5541. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02008-7_36
Download citation
DOI: https://doi.org/10.1007/978-3-642-02008-7_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02007-0
Online ISBN: 978-3-642-02008-7
eBook Packages: Computer ScienceComputer Science (R0)