Segment assembly, structure alignment and iterative simulation in protein structure prediction
Keywords: Protein Data Bank; Structure Alignment; Protein Structure Prediction; Template Identification; Folding Algorithm
It has been 50 years since Anfinsen first showed that the native structure of a protein molecule is determined solely by its amino acid sequence, with the folded state representing a unique and kinetically accessible minimum of the free energy. This finding, known as Anfinsen's thermodynamic hypothesis, motivated the belief that the solution to the protein structure prediction problem should be based on physicochemical principles; that is, all one has to do to find the native structure of a protein is to identify its lowest free-energy state. However, the success of such first-principles methods has been modest at best. Knowledge-based approaches, which predict structural models using the regularities and rules observed in the known protein structures of the Protein Data Bank (PDB) library, have enjoyed more extensive success in protein structure prediction [2, 3, 4]. Among these approaches, TASSER (Threading ASSEmbly Refinement) is a hierarchical structure modeling method designed to predict full-length atomic models from primary amino acid sequences. For a given query sequence, TASSER first threads the sequence through the PDB to identify template proteins that may have a topology similar to that of the query. Continuous segments are then excised from the top-ranked template structures following the query-template alignments, and these are reassembled into full-length atomic models by Monte Carlo simulations. An essential advantage of TASSER over traditional comparative modeling methods, which often degrade the quality of the template models, is its ability to produce models that are closer to the native structure than the input templates. This is mainly attributed to its highly optimized knowledge-based potential and its efficiency in combining complementary threading alignments from multiple template structures.
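The segment-excision step described above can be illustrated with a small sketch. This is not the TASSER implementation; it only shows one plausible way to find continuous aligned segments in a query-template alignment. The alignment encoding (a list giving, for each query position, the aligned template residue index or `None` for a gap), the function name `continuous_segments` and the `min_len` cutoff are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def continuous_segments(
    alignment: List[Optional[int]], min_len: int = 3
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) query ranges of continuous aligned runs.

    alignment[i] is the template residue index aligned to query position i,
    or None if the query position is unaligned (hypothetical encoding).
    A run is "continuous" when consecutive query positions map to
    consecutive template positions. min_len is an illustrative cutoff.
    """
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None  # query index where the open run began

    for i, t in enumerate(alignment):
        prev = alignment[i - 1] if i > 0 else None
        extends = (
            t is not None
            and start is not None
            and prev is not None
            and t == prev + 1
        )
        if extends:
            continue
        # The open run (if any) ends at i - 1; keep it if long enough.
        if start is not None and i - start >= min_len:
            segments.append((start, i - 1))
        # Open a new run here only if this position is aligned.
        start = i if t is not None else None

    # Close a run that reaches the end of the query.
    if start is not None and len(alignment) - start >= min_len:
        segments.append((start, len(alignment) - 1))
    return segments


example = [None, 10, 11, 12, None, None, 20, 21, 30, 31, 32, 33]
segments = continuous_segments(example)
```

In this example the run aligned to template residues 20 and 21 is dropped because it is shorter than the assumed cutoff, while the runs covering query positions 1 to 3 and 8 to 11 are kept as candidate fragments for reassembly.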
The I-TASSER (Iterative Threading ASSEmbly Refinement) method that we published in BMC Biology in 2007 is an extension of the TASSER algorithm for iterative structure assembly and refinement of protein molecules. The idea of I-TASSER was inspired by the encouraging refinement of template structures seen in the TASSER simulations. An appealing question was whether the quality of protein structural models could be improved continuously by repeatedly re-folding the query sequence, starting each time from the result of the previous assembly simulation. The results of the initial tests were not encouraging: the final structural models stayed essentially the same as the first-round TASSER models, although the local structural quality, including hydrogen-bonding networks and steric clashes of backbone atoms, was generally improved. The reason for the failure in structural refinement became obvious once it was realized that simply restarting from the last-round TASSER models introduces no new structural information into the reassembly simulations. As long as the search of conformational space is complete in the first round of simulations, further iterations should in principle lead to exactly the same modeling results. More pronounced topology-level improvements were achieved when new structural templates identified from the PDB were incorporated into the folding iterations; these templates were detected by the structure alignment program TM-align, which matches the TASSER models against each of the known proteins in the PDB to identify the templates that are structurally closest to the TASSER models. In a benchmark test on a set of known proteins, the TM-score (a measure of similarity between model and native structure, with values in [0, 1]) of the template structures identified by TM-align was generally lower than that of the TASSER models; only 21% of the TM-align alignments had a higher TM-score, as seen in Figure 4A of the original I-TASSER article. Nevertheless, incorporating the new structural alignment information into the I-TASSER simulations eventually resulted in final models with an improved TM-score for 77% of the test proteins, or an overall TM-score increase of approximately 3%, together with improved local structure quality.
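To make the TM-score comparisons above more concrete, the following sketch evaluates the commonly quoted TM-score expression, TM = (1/L) Σ 1/(1 + (d_i/d_0)^2) with d_0 = 1.24 (L − 15)^(1/3) − 1.8, for a fixed superposition. It is a simplified illustration, not the TM-score or TM-align program: the real score is maximized over superpositions, whereas here the per-residue distances are assumed to be given. The function names and the 0.5 Å floor on d_0 for short chains are assumptions of this sketch.

```python
from typing import Sequence


def d0(target_length: int) -> float:
    """Length-dependent distance scale (in angstroms) used by the TM-score.

    The 0.5 floor for short chains is an assumption of this sketch, added so
    the scale stays positive when target_length is close to or below 15.
    """
    if target_length <= 15:
        return 0.5
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances: Sequence[float], target_length: int) -> float:
    """TM-score for one fixed superposition.

    distances holds the distance (in angstroms) between each aligned residue
    pair after superposition; unaligned residues are simply omitted and so
    contribute zero. The sum is normalized by the target (native) length.
    """
    scale = d0(target_length)
    total = sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances)
    return total / target_length


# A perfect model of a 100-residue protein, and one covering only half of it.
perfect = tm_score([0.0] * 100, 100)
half_covered = tm_score([0.0] * 50, 100)
```

Because the sum is divided by the target length rather than by the number of aligned residues, a template that matches only part of the protein is penalized even if the aligned part superimposes well. This is consistent with the observation above that a partial template found by TM-align can score lower than a full-length model.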
The major goal of protein structure prediction is to help understand the biological roles of protein molecules in living cells. Since the launch, at the beginning of this century, of structural genomics projects that aim to solve experimentally the structures of a set of proteins covering all representative structural types in nature, the general challenges to the field of structure prediction have been the development of methods for better template identification and subsequent structural refinement. Progress has been substantial, but significant difficulties remain in distant-homology identification and atomic-level structure refinement, despite the fact that the PDB library is approaching completeness in structural space [12, 13]. Meanwhile, it is now known that nearly 10% of all proteins (or partial sequences in 40% of eukaryotic proteins) do not follow Anfinsen's dogma and do not fold into unique states; these sequences remain unfolded or intrinsically disordered in order to carry out their physiological functions. Template-based approaches cannot be used to deduce the structural and functional characteristics of these molecules. While recent progress has demonstrated the promise of structure-based iteration driven by ab initio modeling in both fold-recognition and structure refinement procedures, the development of efficient ab initio folding algorithms will remain a major theme in the field and should have an important impact on all aspects of protein structure prediction.
This article is part of the BMC Biology tenth anniversary series. Other articles in this series can be found at http://www.biomedcentral.com/bmcbiol/series/tenthanniversary.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.