Protein threading is one of the most successful protein structure prediction methods. Most protein threading methods score a sequence-template alignment with a linear combination of sequence and structure features, so that a dynamic programming algorithm can be used to optimize the scoring function. However, a linear scoring function cannot fully exploit the interdependency among features and thus limits alignment accuracy.

This paper presents a nonlinear scoring function for protein threading that can both model interactions among different protein features and be optimized efficiently by a dynamic programming algorithm. We achieve this by modeling the threading problem as a Conditional Random Field (CRF), a probabilistic graphical model, and training the model with the gradient tree boosting algorithm. The resulting model is a nonlinear scoring function consisting of a collection of regression trees, each of which models one type of nonlinear relationship among sequence and structure features. Experimental results indicate that this new threading model can effectively leverage weak biological signals and substantially improve both alignment accuracy and fold recognition rate.
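The idea of a tree-based score that is still optimized by dynamic programming can be illustrated with a minimal sketch. This is not the authors' implementation: the paper's model is a CRF whose regression trees are learned by gradient tree boosting, whereas the two trees below (`tree_1`, `tree_2`) are hand-written stand-ins, and the two features (a profile similarity and a secondary structure agreement, both in [0, 1]), the constant `GAP` penalty, and the function names are all assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical per-cell features: (profile_similarity, ss_agreement).
Features = Tuple[float, float]

def tree_1(f: Features) -> float:
    """Hand-written stand-in for a learned regression tree.
    Profile similarity is rewarded much more when secondary structure agrees."""
    sim, ss = f
    if sim > 0.5:
        return 2.0 if ss > 0.5 else 0.5
    return -1.0

def tree_2(f: Features) -> float:
    """Second stand-in tree: a small bonus or penalty driven mainly by ss_agreement."""
    sim, ss = f
    if ss > 0.5:
        return 0.5
    return -0.5 if sim < 0.2 else 0.0

TREES = [tree_1, tree_2]
GAP = -1.0  # assumed constant gap penalty, not taken from the paper

def match_score(f: Features) -> float:
    """Nonlinear score of aligning one sequence residue to one template residue:
    the sum of the outputs of all regression trees."""
    return sum(tree(f) for tree in TREES)

def align(features: List[List[Features]]) -> Tuple[float, List[Tuple[int, int]]]:
    """Global alignment by dynamic programming.
    features[i][j] describes sequence position i against template position j.
    Returns the best total score and the list of aligned (i, j) pairs."""
    n = len(features)
    m = len(features[0]) if n else 0
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * GAP, "U"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * GAP, "L"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (score[i - 1][j - 1] + match_score(features[i - 1][j - 1]), "D"),
                (score[i - 1][j] + GAP, "U"),  # gap in the template
                (score[i][j - 1] + GAP, "L"),  # gap in the sequence
            ]
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "D":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "U":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs
```

Because each cell score depends only on the features of that cell, the recursion is the same as for a linear scoring function; only `match_score` changes. The trees make the score non-additive in the features: here a high profile similarity contributes far more when the secondary structure also agrees, which a weighted sum of the two features cannot express.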


Keywords: protein threading; conditional random fields; gradient tree boosting; regression tree; nonlinear scoring function





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Jian Peng (1)
  • Jinbo Xu (1)
  1. Toyota Technological Institute at Chicago, Chicago, USA
