A Maximum-Likelihood Formulation and EM Algorithm for the Protein Multiple Alignment Problem

  • Valentina Sulimova
  • Nikolay Razin
  • Vadim Mottl
  • Ilya Muchnik
  • Casimir Kulikowski
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6282)


A given group of protein sequences of different lengths is considered as resulting from random transformations of independent random ancestor sequences of the same preset smaller length, each produced in accordance with an unknown common probabilistic profile. We describe the process of transformation by a Hidden Markov Model (HMM) which is a direct generalization of the PAM model for amino acids. We formulate the problem of finding the maximum likelihood probabilistic ancestor profile and demonstrate its practicality. The proposed method of solving this problem allows for obtaining simultaneously the ancestor profile and the posterior distribution of its HMM, which permits efficient determination of the most probable multiple alignment of all the sequences. Results obtained on the BAliBASE 3.0 protein alignment benchmark indicate that the proposed method is generally more accurate than popular methods of multiple alignment such as CLUSTALW, DIALIGN and ProbAlign.
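The maximum-likelihood idea in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' algorithm: it assumes that every observed residue has already been assigned to an ancestor position (no insertions or deletions, hence no HMM over alignments), and that a PAM-like substitution matrix is known. Under those assumptions, EM estimates only the ancestor profile. The names `em_profile`, `columns`, and `subst`, the two-letter alphabet, and the matrix values are all hypothetical and chosen only for illustration.

```python
import math

def em_profile(columns, subst, alphabet, n_iter=50):
    """Toy EM for an ancestor profile (illustrative assumption, not the paper's method).

    columns[j] : list of observed symbols assigned to ancestor position j
    subst[a][x]: probability that ancestor symbol a is observed as x (PAM-like)
    Returns the estimated profile and the log-likelihood recorded at each iteration.
    """
    k = len(alphabet)
    # Start from a uniform profile at every ancestor position.
    profile = [{a: 1.0 / k for a in alphabet} for _ in columns]
    log_liks = []
    for _ in range(n_iter):
        total = 0.0
        new_profile = []
        for theta, col in zip(profile, columns):
            counts = {a: 0.0 for a in alphabet}
            for x in col:
                # E-step: posterior over the hidden ancestor symbol given x.
                joint = {a: theta[a] * subst[a][x] for a in alphabet}
                z = sum(joint.values())
                total += math.log(z)
                for a in alphabet:
                    counts[a] += joint[a] / z
            # M-step: the new profile is the average posterior over the column.
            n = len(col)
            new_profile.append({a: counts[a] / n for a in alphabet})
        log_liks.append(total)  # log-likelihood of the profile before this update
        profile = new_profile
    return profile, log_liks

# Hypothetical two-letter alphabet and substitution probabilities.
alphabet = "AB"
subst = {"A": {"A": 0.9, "B": 0.1},
         "B": {"A": 0.2, "B": 0.8}}
columns = [list("AAAB"), list("BBBA")]

profile, log_liks = em_profile(columns, subst, alphabet)
for j, theta in enumerate(profile):
    print(j, {a: round(p, 3) for a, p in theta.items()})
```

In the full problem described in the abstract, the assignment of residues to ancestor positions is itself hidden, so the E-step would have to compute posteriors over HMM paths rather than over single symbols.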


Keywords: Multiple alignment problem, protein sequences analysis, EM-algorithm, HMM, common ancestor



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Valentina Sulimova (1)
  • Nikolay Razin (2)
  • Vadim Mottl (3)
  • Ilya Muchnik (4)
  • Casimir Kulikowski (5)

  1. Tula State University, Tula, Russia
  2. MIPT, Moscow, Russia
  3. Computing Center of the RAS, Moscow, Russia
  4. DIMACS, Rutgers University, New Brunswick
  5. Department of Computer Science, Rutgers University, New Brunswick
