Abstract
We treat a given group of protein sequences of different lengths as the result of random transformations of independent random ancestor sequences that share the same preset, smaller length, each generated according to an unknown common probabilistic profile. The transformation process is described by a Hidden Markov Model (HMM) that directly generalizes the PAM model for amino acids. We formulate the problem of finding the maximum-likelihood probabilistic ancestor profile and demonstrate that it is practical to solve. The proposed solution method yields the ancestor profile and the posterior distribution of its HMM simultaneously, which permits efficient determination of the most probable multiple alignment of all the sequences. Results on the BAliBASE 3.0 protein alignment benchmark indicate that the proposed method is generally more accurate than popular multiple alignment methods such as CLUSTALW, DIALIGN and ProbAlign.
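As a rough illustration of the maximum-likelihood profile idea, the Python sketch below runs EM on a deliberately simplified model. It is not the HMM of the paper. It assumes substitutions only (no insertions or deletions), so every observed sequence has the same length as the ancestor. The 4-letter alphabet, the `STAY` probability, the toy substitution rule and the function names `subst_prob`, `em_profile` and `log_likelihood` are all hypothetical choices made for this example.

```python
import math

ALPHABET = "ACDE"  # toy 4-letter alphabet (assumption; real proteins use 20 residues)
STAY = 0.85        # toy probability that an ancestor residue is observed unchanged


def subst_prob(a, x):
    """Toy P(observed x | ancestor a); a stand-in, not a real PAM matrix."""
    return STAY if a == x else (1.0 - STAY) / (len(ALPHABET) - 1)


def em_profile(seqs, n_iter=50):
    """Estimate a per-position ancestor profile by EM (substitution-only sketch)."""
    length = len(seqs[0])
    assert all(len(s) == length for s in seqs), "sketch assumes equal lengths"
    # Start from a uniform profile at every position.
    profile = [{a: 1.0 / len(ALPHABET) for a in ALPHABET} for _ in range(length)]
    for _ in range(n_iter):
        new_profile = []
        for j in range(length):
            counts = {a: 0.0 for a in ALPHABET}
            for s in seqs:
                # E-step: posterior over the hidden ancestor residue at position j.
                joint = {a: profile[j][a] * subst_prob(a, s[j]) for a in ALPHABET}
                total = sum(joint.values())
                for a in ALPHABET:
                    counts[a] += joint[a] / total
            # M-step: the updated profile is the average posterior over sequences.
            new_profile.append({a: counts[a] / len(seqs) for a in ALPHABET})
        profile = new_profile
    return profile


def log_likelihood(seqs, profile):
    """Log-likelihood of the sequences under the substitution-only model."""
    ll = 0.0
    for s in seqs:
        for j, x in enumerate(s):
            ll += math.log(sum(profile[j][a] * subst_prob(a, x) for a in ALPHABET))
    return ll


toy_seqs = ["ACD", "ACD", "ACE", "ACD"]  # hypothetical, already equal-length toy data
fitted = em_profile(toy_seqs)
print([max(p, key=p.get) for p in fitted])  # most probable ancestor residue per position
```

In this reduced setting the E-step posteriors are per-position and independent. In a model with insertions and deletions, as the abstract describes, the corresponding posteriors would instead come from the HMM, which is what makes the alignment recoverable alongside the profile.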
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sulimova, V., Razin, N., Mottl, V., Muchnik, I., Kulikowski, C. (2010). A Maximum-Likelihood Formulation and EM Algorithm for the Protein Multiple Alignment Problem. In: Dijkstra, T.M.H., Tsivtsivadze, E., Marchiori, E., Heskes, T. (eds) Pattern Recognition in Bioinformatics. PRIB 2010. Lecture Notes in Computer Science, vol 6282. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16001-1_15
DOI: https://doi.org/10.1007/978-3-642-16001-1_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16000-4
Online ISBN: 978-3-642-16001-1
eBook Packages: Computer Science, Computer Science (R0)