Learning Models for Aligning Protein Sequences with Predicted Secondary Structure

  • Eagu Kim
  • Travis Wheeler
  • John Kececioglu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5541)


Accurately aligning distant protein sequences is notoriously difficult. A recent approach to improving alignment accuracy is to use additional information such as predicted secondary structure. We introduce several new models for scoring alignments of protein sequences with predicted secondary structure, which use the predictions and their confidences to modify both the substitution and gap cost functions. We present efficient algorithms for computing optimal pairwise alignments under these models, all of which run in near-quadratic time. We also review inverse alignment, an approach to learning the values of the parameters in these models. We then evaluate the accuracy of these models by studying how well an optimal alignment under each model recovers known benchmark reference alignments. Our experiments show that, with parameters learned by inverse alignment, these new secondary-structure-based models provide a significant improvement in alignment accuracy for distant sequences. The best model improves upon the accuracy of the standard sequence alignment model for pairwise alignment by as much as 15% for sequences with less than 25% identity, and improves the accuracy of multiple alignment by 20% for difficult benchmarks whose average accuracy under standard tools is less than 40%.
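
To make the idea of structure-modified costs concrete, the following is a minimal sketch and not the models from the paper, whose definitions are not given in this abstract. It assumes each residue carries a predicted probability triple (helix, strand, coil), and it uses two hypothetical cost modifiers: `substitution_cost` adds a penalty when the two residues are unlikely to share a structural state, and `gap_open_cost` makes opening a gap more expensive at residues predicted to lie in a helix or strand. All function names and parameter values are illustrative assumptions. The alignment itself is the standard three-state dynamic program for affine gap costs, which runs in quadratic time for this simplified setting.

```python
from typing import List, Tuple

INF = float("inf")
SS = Tuple[float, float, float]  # assumed (p_helix, p_strand, p_coil) per residue


def substitution_cost(a: str, b: str, ss_a: SS, ss_b: SS,
                      mismatch: float = 1.0, ss_penalty: float = 1.0) -> float:
    """Hypothetical modifier: base cost plus a penalty for structural disagreement."""
    base = 0.0 if a == b else mismatch
    # Probability that both residues are in the same state, treating predictions as independent.
    agree = sum(p * q for p, q in zip(ss_a, ss_b))
    return base + ss_penalty * (1.0 - agree)


def gap_open_cost(ss: SS, gap_open: float = 2.0, structure_weight: float = 1.0) -> float:
    """Hypothetical modifier: opening a gap costs more at residues unlikely to be coil."""
    return gap_open * (1.0 + structure_weight * (1.0 - ss[2]))


def align_cost(a: str, b: str, ss_a: List[SS], ss_b: List[SS],
               gap_ext: float = 0.5) -> float:
    """Minimum cost of a global alignment of a and b under the sketch costs above."""
    n, m = len(a), len(b)
    # M: last column aligns a[i-1] with b[j-1]; X: a[i-1] against a gap; Y: gap against b[j-1].
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                # Either extend a gap already open, or open a new one at residue a[i-1].
                open_a = gap_open_cost(ss_a[i - 1])
                X[i][j] = gap_ext + min(X[i - 1][j],
                                        min(M[i - 1][j], Y[i - 1][j]) + open_a)
            if j > 0:
                open_b = gap_open_cost(ss_b[j - 1])
                Y[i][j] = gap_ext + min(Y[i][j - 1],
                                        min(M[i][j - 1], X[i][j - 1]) + open_b)
            if i > 0 and j > 0:
                sub = substitution_cost(a[i - 1], b[j - 1], ss_a[i - 1], ss_b[j - 1])
                M[i][j] = sub + min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
    return min(M[n][m], X[n][m], Y[n][m])


HELIX, COIL = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)
identical = align_cost("ACD", "ACD", [HELIX] * 3, [HELIX] * 3)
coil_gap = align_cost("AC", "A", [COIL, COIL], [COIL])
helix_gap = align_cost("AC", "A", [COIL, HELIX], [COIL])
```

In this sketch, deleting a residue predicted to be coil is cheaper than deleting one predicted to be helix, so the optimal alignment tends to place gaps in loop regions. In a learned model the weights (`mismatch`, `ss_penalty`, `gap_open`, `structure_weight`, `gap_ext` here) would be fit from reference alignments rather than set by hand.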


Sequence alignment; protein secondary structure; inverse parametric alignment; substitution score matrices; affine gap penalties





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Eagu Kim (1)
  • Travis Wheeler (1)
  • John Kececioglu (1)
  1. Department of Computer Science, The University of Arizona, Tucson, USA
