Abstract
Protein threading is one of the most successful methods for protein structure prediction. Most threading methods measure the quality of a sequence-template alignment with a scoring function that linearly combines sequence and structure features, so that the scoring function can be optimized by a dynamic programming algorithm. However, a linear scoring function cannot fully exploit the interdependency among features and thus limits alignment accuracy.
This paper presents a nonlinear scoring function for protein threading that can model interactions among different protein features and can still be optimized efficiently by dynamic programming. We achieve this by modeling the threading problem as a Conditional Random Field (CRF), a probabilistic graphical model, and training the model with the gradient tree boosting algorithm. The resulting model is a nonlinear scoring function consisting of a collection of regression trees, each of which models one type of nonlinear relationship among sequence and structure features. Experimental results indicate that this new threading model can effectively leverage weak biological signals and greatly improve both alignment accuracy and fold recognition rate.
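The idea that a sum of regression trees can still be optimized by dynamic programming can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single match state with a constant gap penalty rather than a full CRF, and the two hand-written trees, the feature pair (profile similarity, secondary-structure agreement), and the names `ensemble_score` and `align` are all hypothetical. The point it shows is that the score is nonlinear in the features but depends only on one aligned position pair, so the usual recurrence applies unchanged.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical per-position features for (sequence residue i, template residue j):
# (profile similarity in [0, 1], secondary-structure agreement in [0, 1]).
Features = Tuple[float, float]
Tree = Callable[[Features], float]


def tree_interaction(x: Features) -> float:
    """Toy regression tree: the value of profile similarity depends on
    whether the secondary structure agrees (a feature interaction)."""
    sim, ss = x
    if ss > 0.5:
        return 2.0 if sim > 0.3 else -0.5
    return 0.5 if sim > 0.8 else -1.0


def tree_similarity(x: Features) -> float:
    """Toy regression tree (a stump) on profile similarity alone."""
    sim, _ = x
    return 1.0 if sim > 0.5 else -1.0


def ensemble_score(trees: Sequence[Tree], x: Features) -> float:
    """Nonlinear match score: the sum of all regression-tree outputs."""
    return sum(tree(x) for tree in trees)


def align(
    features: List[List[Features]],
    trees: Sequence[Tree],
    gap: float = -1.0,  # assumed constant gap penalty, for illustration only
) -> Tuple[float, List[Tuple[int, int]]]:
    """Global alignment by dynamic programming over a position-local score.

    features[i][j] holds the features for aligning sequence residue i
    with template residue j. Returns the best score and the aligned pairs.
    """
    n = len(features)
    m = len(features[0]) if features else 0
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0], back[i][0] = i * gap, "U"
    for j in range(1, m + 1):
        dp[0][j], back[0][j] = j * gap, "L"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (dp[i - 1][j - 1] + ensemble_score(trees, features[i - 1][j - 1]), "M"),
                (dp[i - 1][j] + gap, "U"),  # sequence residue left unaligned
                (dp[i][j - 1] + gap, "L"),  # template residue left unaligned
            ]
            dp[i][j], back[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "M":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "U":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs


TREES = [tree_interaction, tree_similarity]
TOY_FEATURES = [
    [(0.9, 1.0), (0.1, 0.0)],
    [(0.1, 0.0), (0.4, 1.0)],
]
print(align(TOY_FEATURES, TREES))
```

In this toy input the pair (1, 1) has only moderate profile similarity (0.4), so the similarity stump alone penalizes it, yet the interaction tree rewards it because the secondary structure agrees. That is the kind of weak, context-dependent signal that a single linear weight per feature cannot express.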
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Peng, J., Xu, J. (2009). Boosting Protein Threading Accuracy. In: Batzoglou, S. (eds) Research in Computational Molecular Biology. RECOMB 2009. Lecture Notes in Computer Science, vol 5541. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02008-7_3
DOI: https://doi.org/10.1007/978-3-642-02008-7_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02007-0
Online ISBN: 978-3-642-02008-7
eBook Packages: Computer Science, Computer Science (R0)