Abstract
Sequence to structure alignment is an important step in homology modeling of protein structures. Incorporating features like secondary structure, solvent accessibility, or evolutionary information improves sequence to structure alignment accuracy, but conventional generative estimation techniques for alignment models impose independence assumptions that make these features difficult to include in a principled way. In this paper, we overcome this problem using a Support Vector Machine (SVM) method that provides a well-founded way of estimating complex alignment models with hundreds of thousands of parameters. Furthermore, we show that the method can be trained using a variety of loss functions. In a rigorous empirical evaluation, the SVM algorithm outperforms SSALN, a highly accurate generative alignment model that incorporates structural information. The alignment model learned by the SVM aligns 47% of the residues correctly and aligns over 70% of the residues within a shift of 4 positions.
Keywords: Machine learning, Pairwise sequence alignment, Protein structure prediction.
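The two accuracy figures in the abstract (residues aligned correctly, and residues aligned within a shift of 4 positions) can be illustrated with a small sketch. The abstract does not define the measures, so the code below assumes one common reading: each alignment is a mapping from a residue index in one sequence to its partner index in the other, and a reference pair counts as a hit when the predicted partner lies within `max_shift` positions of the reference partner. The function name `shift_accuracy` and the toy alignments are hypothetical and not taken from the paper.

```python
def shift_accuracy(reference, predicted, max_shift=0):
    """Fraction of reference aligned pairs reproduced within max_shift.

    Assumption (not from the paper): both alignments are dicts mapping a
    residue index in the first sequence to its aligned index in the second;
    residues aligned to gaps are simply absent from the dict.
    """
    if not reference:
        raise ValueError("reference alignment has no aligned pairs")
    hits = 0
    for i, j_ref in reference.items():
        j_pred = predicted.get(i)  # None if residue i is aligned to a gap
        if j_pred is not None and abs(j_pred - j_ref) <= max_shift:
            hits += 1
    return hits / len(reference)


# Toy data, invented for illustration only.
reference = {0: 0, 1: 1, 2: 2, 3: 5, 4: 6}
predicted = {0: 0, 1: 2, 2: 3, 3: 5, 4: 12}

exact = shift_accuracy(reference, predicted)                 # exact matches
within4 = shift_accuracy(reference, predicted, max_shift=4)  # shift of up to 4
print(exact, within4)
```

With the toy data, two of the five reference pairs are matched exactly and four are matched within a shift of 4, so the second measure is always at least as large as the first.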
© 2007 Springer Berlin Heidelberg
Cite this paper
Yu, CN.J., Joachims, T., Elber, R., Pillardy, J. (2007). Support Vector Training of Protein Alignment Models. In: Speed, T., Huang, H. (eds) Research in Computational Molecular Biology. RECOMB 2007. Lecture Notes in Computer Science, vol 4453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71681-5_18
DOI: https://doi.org/10.1007/978-3-540-71681-5_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71680-8
Online ISBN: 978-3-540-71681-5
eBook Packages: Computer Science