Comparison of molecular dynamics and superfamily spaces of protein domain deformation
The strong relationship between protein structure and flexibility, on the one hand, and biological function, on the other, is well known. Technically, exploring protein flexibility is an essential task in many applications, such as protein structure prediction and modeling. In this contribution we have compared two approaches to exploring the flexibility space of protein domains: i) molecular dynamics (MD-space), and ii) the study of the structural changes within a superfamily (SF-space).
Our analysis indicates that the MD-space and the SF-space display a significant overlap, but are still different enough to be considered complementary. The SF-space is wider but less complex than the MD-space, irrespective of the number of members in the superfamily. Also, the SF-space does not sample all the possibilities offered by the MD-space, but often introduces very large changes along just a few deformation modes, whose number tends to a plateau as the number of related folds in the superfamily increases.
Theoretically, we obtained two conclusions. First, function restricts evolution's access to some flexibility patterns: when a superfamily member changes to become another, the path does not completely overlap with the physical deformability. Second, conformational changes arising from variation within a superfamily are larger and much simpler than those allowed by physical deformability. Methodologically, the conclusion is that the two spaces studied are complementary and differ in size and complexity. We expect this fact to have applications in fields such as 3D-EM/X-ray hybrid models or ab initio protein folding.
Keywords: Molecular Dynamic; Singular Vector; Molecular Dynamic Trajectory; Superfamily Member; Deformation Space
The central dogma of structural biology asserts that the amino acid sequence has all the information needed for a protein to adopt a structure, and that structure determines function. The connection between sequence and structure has been the focus of a great amount of work, and detailed theories of protein folding exist, but predicting structure or function from sequence is still an extremely complex task except in cases of high sequence identity between the target protein and a well-annotated homolog. There are many cases of non-homologous proteins sharing a given fold or function, as well as proteins with reasonably similar sequences having quite different structures.
Flexibility seems to play an important role in protein function, as in many cases movements are key for activity. Unfortunately, even less information exists on the connection between flexibility and function and, specifically, on the conformational changes that a protein must undergo to perform its biological function [3, 4, 5]. Just as evolution conserves structures that are able to perform a specific function by not tolerating mutations that seriously modify them, it is plausible that mutations disrupting the flexibility pattern of a given protein are not accepted either [3, 6, 7, 8, 9].
i) If physical deformability is crucial to protein function, conformational changes introduced by sequence modifications will occur as orthogonally as possible to the physical deformation pattern.
ii) The physical deformation pattern traces movements that allow quite significant conformational changes without disruption of the function(s) associated with a fold. Mutations leading to conformational changes along this pattern of flexibility are going to be better tolerated, as they won't affect the function. This would suggest a good overlap between the physical space studied by MD and the conformational space explored by the members of a superfamily.
Our results show that the relative flexibility among domains of a given superfamily is restricted to just a few "directions of change" (SF-space), which overlap only partially with the "directions of change" indicated by MD (MD-space). For technical purposes, the conclusion is that both spaces can be combined to increase the dimensionality of the search space when performing any kind of computational-biology task that requires the exploration of possible protein deformations.
Results and discussion
Putting together all the analyses discussed above, we conclude that many deformation patterns that are physically possible are not explored within a superfamily, and that the overlap between MD- and SF-spaces is only partial. These findings could be related to the bias of the SF-space towards insertions, deletions, and amino acid substitutions, which lead to bigger deformations in the structure than the simple variation of the torsion angles explored in the physical space. Other reasons are probably related to the inability of the SF-space to explore movements that might challenge protein functionality.
The structural changes inside a superfamily can be large in extent but are easily represented by a few essential movements. We cannot completely rule out the possibility that the overlap between spaces would increase if the structures of more members of a given superfamily were solved, but according to our results there seems to be an inherent limit. In summary, as suggested in the complexity analysis, the SF-space is quickly saturated.
Taking into account local and global behavior together, we distinguish three groups among the 55 studied superfamilies:
i) Superfamilies (with both small and large numbers of members) showing poor overlap between SF- and MD-spaces (Hess index < 0.15, Additional file 1) and low correspondence between B-factor plots (Figure 6a). This group is largely enriched in enzymes of the α+β structural class. We can expect flexibility to be a crucial issue in these proteins, so the deformation pattern should be very well preserved, which means that changes in the SF-space occur as orthogonally as possible to the functionally relevant MD-space [24, 25, 26].
ii) Superfamilies with a high number of members (n > 40), good overlap of SF- and MD-spaces (Hess index > 0.25, Additional file 1) and relatively good correspondence between the B-factor plots (Figure 6c). Here we find domains with structural or binding roles and fewer enzymes, with a preference for α and β motifs. In this group the superfamilies have been able to explore many physically available deformation modes of the MD-space which do not interfere with function.
iii) Superfamilies with a low number of members (n < 40), some overlap in the deformation spaces (Hess index > 0.15, Additional file 1) and poor correspondence between B-factor plots (Figure 6b). This group shows diverse families in both structural and functional terms. The physical deformability space has been explored only to a small extent, but the residues that are not essential for function introduce large local structural changes, reflected in poor B-factor correspondence.
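The three groups above can be read as a simple decision rule. The following is a minimal sketch, assuming that only the Hess index and the number of members are used; the function name `classify_superfamily` and the "unassigned" label are hypothetical, and the B-factor correspondence criterion is deliberately left out because it is described only qualitatively in the text.

```python
def classify_superfamily(hess_index, n_members):
    """Assign a superfamily to group i, ii or iii from the thresholds quoted above.

    Assumption: only the Hess index and the member count are considered.
    Cases not covered by the quoted thresholds are returned as "unassigned".
    """
    if hess_index < 0.15:
        return "i"    # poor overlap, any number of members
    if n_members > 40 and hess_index > 0.25:
        return "ii"   # many members, good overlap
    if n_members < 40 and hess_index > 0.15:
        return "iii"  # few members, some overlap
    return "unassigned"


# Illustrative values only; these are not data from the study.
examples = [(0.10, 60), (0.30, 55), (0.20, 12), (0.20, 60)]
labels = [classify_superfamily(h, n) for h, n in examples]
```

Borderline cases, such as a large superfamily with a Hess index between 0.15 and 0.25, are not addressed by the thresholds as quoted, which is why the sketch returns "unassigned" for them.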
Our technical analysis comparing the spaces of structural variation within superfamilies (SF-space) and along atomistic MD simulations (MD-space) sheds light on the connection between physical flexibility and the conformational variation that accompanies changes in the amino acid sequence. The overall picture is more complex than we originally thought, in part because we are comparing a set of different structures in a superfamily with the MD of just one of them. First, we have observed that when the sequence of a protein changes to become another member of the superfamily, the change follows a path that does not completely overlap with that expected from the intrinsic physical deformability of the protein, which suggests that functional restriction limits evolution's access to some flexibility patterns. This effect is especially clear for enzymes, which show the worst overlap between SF- and MD-spaces. Second, our analysis shows that conformational changes resulting from sequence variation tend to be larger and much simpler than those allowed by individual physical flexibility. Interestingly, the threshold for achieving the maximum overlap between the SF- and MD-spaces seems to lie around 40 superfamily members (Figure 3b), suggesting some saturation in the deformation along the superfamily when compared to the physical space.
MD- and SF-spaces are comparable, but they also have important differences, and some words of caution are necessary. Superfamily members vary in sequence, in some cases quite dramatically, and are expected to have different structures, whereas an MD simulation samples the flexibility of a single sequence. It is therefore not surprising that MD does not explain instances where there are specific chemical interactions.
The strength of our analysis lies in its methodological implications. As the deformation spaces have different size and complexity and do not fully overlap, they can be considered complementary. Flexibility analysis derived from the study of the structural variation along superfamilies can provide useful and easy-to-manage descriptions [21, 27], although these are limited in the physical complexity that they can describe. In much the same way, physical descriptions of isolated domains that do not consider their possible interactions have a limited capability to predict flexibility in the context of protein-protein complexes, and variation along domains in a superfamily is a good way of obtaining that information. In other words, taken together, the SF- and MD-spaces enrich our view of the conformational freedom of proteins.
This is expected to be of special interest in the areas of 3D-EM/X-ray hybrid models or ab initio protein folding, where exploring the physical conformational space exclusively with high-dimensionality methods such as Molecular Dynamics or Normal Mode Analysis could be over-conservative. We suggest that using the most important singular vectors of the SF-space (about 6) will provide a complementary deformation space that can be very useful in sampling, since it will draw quite distant structures towards the common fold. Combining both spaces sequentially can help to improve these areas of protein structure prediction.
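The suggestion of sampling along the leading SF-space singular vectors can be illustrated with a short sketch. It assumes that the singular vectors are stored as columns of a matrix `U` with coordinates ordered x, y, z per atom, and that displacements are expressed in units of the corresponding singular value; the function name `deform_along_modes` and the synthetic inputs are hypothetical, not taken from the study.

```python
import numpy as np


def deform_along_modes(reference, U, s, amplitudes):
    """Displace a reference structure along the first k singular vectors.

    reference  : (n_atoms, 3) coordinates
    U          : (3 * n_atoms, >= k) matrix with orthonormal columns
    s          : singular values, at least k of them
    amplitudes : k coefficients, in units of the matching singular value
    """
    ref = np.asarray(reference, dtype=float)
    amps = np.asarray(amplitudes, dtype=float)
    k = amps.size
    displacement = U[:, :k] @ (amps * s[:k])
    return ref + displacement.reshape(ref.shape)


# Synthetic stand-ins for a reference domain and six SF-space modes.
rng = np.random.default_rng(2)
n_atoms, n_modes = 20, 6
reference = rng.normal(size=(n_atoms, 3))
U, _ = np.linalg.qr(rng.normal(size=(3 * n_atoms, n_modes)))
s = np.linspace(6.0, 1.0, n_modes)

candidate = deform_along_modes(reference, U, s, rng.uniform(-1, 1, size=n_modes))
```

Restricting the search to about six coefficients keeps the sampling problem low-dimensional; a subsequent refinement in the MD-space would be a separate step and is not shown here.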
Superfamily space of flexibility
where x, y, z stand for the coordinates of the same backbone atom n (Cα, O, N and C) in two structurally aligned amino acids, each one belonging to one domain (i for the reference, j for the aligned). A CDV vector was created by using all the cdv's obtained for the atoms of a given aligned domain, placing x, y, z coordinates in consecutive indexes. Then a CDV matrix was built with all the CDVs as its columns (one per aligned domain). The CDV matrix was decomposed with the incremental singular value decomposition (ISVD) algorithm [29] to capture the main axes of variation (Figure 1). The use of ISVD, a variant of the singular value decomposition (SVD) method [30], allows us to manage superfamilies with incomplete information in the core due to gaps in the alignment, since it can handle matrices in which some elements are unknown. In any case, amino acids in the reference domain that cannot be aligned in any of the pairwise alignments using MAMMOTH (black box, Figure 9) were excluded from further analysis. When ISVD is applied to the CDV matrix it produces:
$\mathrm{CDV} = U \cdot S \cdot V^{T}$
U - a (4·3·m) × (n−1) matrix containing an orthogonal basis for the multi-dimensional space defined by the CDVs, where m is the number of amino acids in the core and n is the number of superfamily members used in the procedure. The factor 4 comes from the 4 backbone atoms employed and the factor 3 from the x, y, z coordinates.
S - an (n−1) × (n−1) diagonal matrix containing the n−1 singular values of the decomposition.
V - an (n−1) × (n−1) matrix containing an orthogonal basis for the space of the rows of CDV.
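The construction of the CDV matrix and the shapes of U, S and V can be sketched as follows. This is a simplified illustration, not the procedure used in the study: it uses plain SVD from NumPy on complete synthetic data instead of ISVD (so it cannot handle alignment gaps), it assumes that each difference is taken as aligned minus reference, and the function name `build_cdv_matrix` is hypothetical.

```python
import numpy as np


def build_cdv_matrix(reference, aligned):
    """Stack coordinate-difference vectors as columns, one per aligned domain.

    reference : (n_atoms, 3) backbone coordinates of the reference domain
    aligned   : list of (n_atoms, 3) arrays already superposed on the reference
    Assumption: the difference is taken as aligned minus reference.
    """
    ref = np.asarray(reference, dtype=float)
    # ravel() on an (n_atoms, 3) array places x, y, z in consecutive indexes.
    columns = [(np.asarray(dom, dtype=float) - ref).ravel() for dom in aligned]
    return np.column_stack(columns)


# Synthetic example: m = 10 core residues, 4 backbone atoms each, n = 5 members.
rng = np.random.default_rng(0)
m_residues, n_members = 10, 5
reference = rng.normal(size=(4 * m_residues, 3))
aligned = [reference + 0.5 * rng.normal(size=reference.shape)
           for _ in range(n_members - 1)]

cdv = build_cdv_matrix(reference, aligned)          # (4*3*m) x (n-1)
U, s, Vt = np.linalg.svd(cdv, full_matrices=False)  # plain SVD, not ISVD
```

With these inputs `U` has 4·3·m rows and n−1 columns, `s` holds the n−1 singular values, and `Vt` is the transpose of the (n−1) × (n−1) matrix V.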
where n is the number of snapshots used for the decomposition, $l_i$ is the PCA eigenvalue and $s_i$ is the [I]SVD singular value. Note that the original protein Cartesian coordinates appear now as projections onto the space defined by the singular vectors without any loss of structural information.
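The link between PCA eigenvalues and singular values, and the lossless projection mentioned above, can be checked numerically. The sketch below assumes the usual textbook relation for mean-centred data with the sample covariance, namely eigenvalue = (singular value)² / (n − 1); the normalisation used in the original equation may differ. Snapshots are stored as rows here, purely for convenience.

```python
import numpy as np

rng = np.random.default_rng(1)
n_snapshots, n_coords = 50, 12

# Synthetic correlated "coordinates", mean-centred per column.
X = rng.normal(size=(n_snapshots, n_coords)) @ rng.normal(size=(n_coords, n_coords))
Xc = X - X.mean(axis=0)

# SVD route: singular values of the centred data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eig_from_svd = s**2 / (n_snapshots - 1)

# PCA route: eigenvalues of the sample covariance matrix, sorted descending.
cov = np.cov(Xc, rowvar=False)  # np.cov divides by n - 1 by default
eigvals = np.linalg.eigvalsh(cov)[::-1]

# Projecting onto all singular vectors and mapping back recovers the data.
projections = Xc @ Vt.T
reconstructed = projections @ Vt
```

Because the full set of singular vectors is kept, `reconstructed` matches `Xc` to numerical precision, which is the sense in which no structural information is lost.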
Molecular-dynamics space of flexibility
The range of conformations accessible to a protein under normal physiological conditions can be well explored by molecular dynamics (MD) simulations. The technique samples the movements of macromolecules by integration of Newton's equations of motion, with the forces being obtained from an accurate potential function (the force field) fitted to reproduce highly accurate quantum mechanical data in small model systems [31, 32]. In contrast to Normal Mode Analysis, atomistic MD does not assume that the protein is confined in a harmonic well around the experimental structure, and thus allows large conformational transitions if the physics of the system requires them. It is the best technique to explore the physical deformation space for proteins.
The reference protein domains were simulated in the context of the whole native protein. All protein structures were titrated, neutralized by ions, minimized, hydrated, heated and equilibrated (for at least 0.5 ns) using a well-established protocol. Trajectories were collected using the AMBER parm99 force field [33] in conjunction with Jorgensen's TIP3P model [34, 35] for representing water molecules. The Particle Mesh Ewald approach was used to deal with long-range effects. The equations of motion were integrated every 1 fs, with the vibrations of bonds involving hydrogen atoms removed by the SHAKE algorithm. Production runs were obtained with the program AMBER8 and were extended for 10 ns. The computational effort reported here corresponds to more than 20 CPU years and was made possible by access to large supercomputer resources.
Statistical descriptors for comparison
The MD- and SF-spaces were subjected, for comparison purposes, to a modified version of the essential dynamics procedure using SVD (with MD-space) and ISVD (with SF-space) decompositions. Many comparisons can be easily made using the singular vectors and values provided by the decomposition algorithms:
1) The size of deformability space was measured by the variance in MD or superfamily ensembles, summing the square of the singular values obtained after the decomposition. To avoid bias related to the limited number of structures in most superfamilies, the analysis of MD variance was repeated also using as many equally spaced MD snapshots as superfamily members (partial-MD space; MDp). The average values for 100 windows were computed.
2) The complexity of the deformability space was determined by the number of singular vectors needed to explain 90% of the variance.
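Both descriptors in the list above follow directly from the singular values. The sketch below is a minimal illustration; the function names are hypothetical, and the way the partial-MD windows are built (equally spaced frames shifted by an offset) is an assumption about a detail the text does not spell out.

```python
import numpy as np


def space_size(singular_values):
    """Size of the deformability space: sum of squared singular values."""
    s = np.asarray(singular_values, dtype=float)
    return float(np.sum(s**2))


def space_complexity(singular_values, fraction=0.90):
    """Number of singular vectors needed to explain `fraction` of the variance."""
    s2 = np.sort(np.asarray(singular_values, dtype=float) ** 2)[::-1]
    cumulative = np.cumsum(s2) / np.sum(s2)
    # Index of the first entry reaching the threshold, converted to a count.
    return int(np.searchsorted(cumulative, fraction) + 1)


def equally_spaced_indices(n_snapshots, n_members, offset=0):
    """Pick as many equally spaced MD frames as there are superfamily members.

    Assumption: a different `offset` (smaller than the spacing) gives a
    different window for the partial-MD space.
    """
    step = n_snapshots // n_members
    return offset + step * np.arange(n_members)


size = space_size([3.0, 2.0, 1.0])              # 9 + 4 + 1
complexity = space_complexity([3.0, 2.0, 1.0])  # 13/14 of the variance in 2 modes
frames = equally_spaced_indices(10000, 25)
```

Averaging `space_size` over many offsets would give a partial-MD estimate comparable in sample size to the superfamily.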
Pure random models were obtained by decomposition of a pseudo-covariance matrix obtained by random permutation of the backbone atoms for each snapshot in a trajectory, and the standard deviation (std) was obtained by considering 500 different pseudo-covariance matrices.
Additional Z-scores* (labeled with * to avoid confusion with the previous Z-scores derived from purely random models) showing the relevance of the values for H in a more chemically sound environment were computed from models where the chemical connectivity was maintained and steric collapses were avoided. For this purpose, we performed several 10 ns discrete dynamics simulations for each protein with a simplified force field defined by covalent bonds plus a hard-sphere potential for each atom. Essential dynamics from these trajectories provided sets of singular vectors that are representative of random movements but still consistent with the basic physics of the protein. The standard deviations needed for the Z-score calculations were evaluated from independent discrete dynamics simulations.
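The overlap index H and its Z-score can be sketched as below. The defining equation is not reproduced in this text, so the sketch assumes the commonly used subspace-overlap form, the sum of squared dot products between two sets of k orthonormal vectors divided by k. The random reference here is a set of generic random orthonormal bases, which is a stand-in and not the permutation or discrete dynamics models described above; all function names are hypothetical.

```python
import numpy as np


def subspace_overlap(A, B):
    """Overlap between two subspaces spanned by orthonormal columns.

    A, B : (n_coords, k) arrays. Returns 1 for identical subspaces and 0 for
    mutually orthogonal ones. Assumed form: sum((A^T B)^2) / k.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    k = A.shape[1]
    return float(np.sum((A.T @ B) ** 2) / k)


def z_score(observed, random_values):
    """Distance of an observed value from a random model, in standard deviations."""
    r = np.asarray(random_values, dtype=float)
    return float((observed - r.mean()) / r.std(ddof=1))


def random_orthonormal(n_coords, k, rng):
    """Generic random orthonormal basis (illustrative random model only)."""
    q, _ = np.linalg.qr(rng.normal(size=(n_coords, k)))
    return q


rng = np.random.default_rng(3)
n_coords, k = 60, 5
A = random_orthonormal(n_coords, k, rng)
background = [subspace_overlap(A, random_orthonormal(n_coords, k, rng))
              for _ in range(500)]
z_identical = z_score(subspace_overlap(A, A), background)
```

For generic random bases the expected overlap is close to k divided by the number of coordinates, so an observed value well above that level yields a large positive Z-score.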
where $\langle \Delta r^{2} \rangle$ stands for the oscillations of atoms around equilibrium positions.
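A short sketch of how B-factors can be derived from atomic fluctuations is given here. It assumes the standard isotropic relation B = (8π²/3)·⟨Δr²⟩, which is presumed to be the expression referred to above, and it assumes that the frames have already been superposed on a common reference; the function name `b_factors` is hypothetical.

```python
import numpy as np


def b_factors(trajectory):
    """Per-atom B-factors from a set of superposed structures.

    trajectory : (n_frames, n_atoms, 3) coordinates
    Uses B = (8 * pi^2 / 3) * <dr^2>, where <dr^2> is the mean squared
    deviation of each atom from its average position.
    """
    traj = np.asarray(trajectory, dtype=float)
    mean_pos = traj.mean(axis=0)
    msf = np.mean(np.sum((traj - mean_pos) ** 2, axis=2), axis=0)
    return (8.0 * np.pi**2 / 3.0) * msf


# One atom oscillating between two positions 2 length units apart.
toy = np.array([[[0.0, 0.0, 0.0]], [[2.0, 0.0, 0.0]]])
toy_b = b_factors(toy)
```

The same function can be applied to MD snapshots or to the aligned members of a superfamily, which gives the two per-residue profiles that are compared in the B-factor plots.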
where n is the number of superfamily members, the displacement term stands for a displacement along a given mode (i) in the space X, and the stiffness constant associated with a deformation mode is computed as $k_B T/(2 l_i)$, with $k_B$ being Boltzmann's constant, $l_i$ the corresponding PCA eigenvalue and T the absolute temperature.
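The stiffness constants can be computed directly from the PCA eigenvalues. In the sketch below, only the expression k_B·T/(2·l_i) comes from the text; the harmonic energy (stiffness times squared displacement, summed over modes), the temperature of 300 K, the units (kcal/mol, eigenvalues in Å²) and the function names are all assumptions made for illustration.

```python
import numpy as np

K_B = 0.0019872041  # Boltzmann's constant in kcal/(mol K)


def stiffness_constants(eigenvalues, temperature=300.0):
    """Stiffness per mode, k_B * T / (2 * l_i), as stated in the text."""
    l = np.asarray(eigenvalues, dtype=float)
    return K_B * temperature / (2.0 * l)


def deformation_energy(displacements, eigenvalues, temperature=300.0):
    """Assumed harmonic energy: sum over modes of stiffness * displacement^2."""
    d = np.asarray(displacements, dtype=float)
    return float(np.sum(stiffness_constants(eigenvalues, temperature) * d**2))


k = stiffness_constants([1.0, 0.5])
energy = deformation_energy([1.0, 1.0], [1.0, 0.5])
```

Modes with small eigenvalues are stiff, so the same displacement along them costs more energy than along a soft, high-variance mode.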
The authors thank Tim Meyer for helpful suggestions. This work was partially funded by the European Union (FP6-502828 and UE-512092), the USA National Institutes of Health (HL740472), the Spanish Comisión Interministerial de Ciencia y Tecnología (BIO2007-67150-C03-01, BIO2007-67150-C03-03, BIO2006-01602, CONSOLIDER CSD2006-23, CONSOLIDER CSDooC-06.0892), the Spanish Ministry of Health (COMBIOMED RD07/0067/0009), the Government of Madrid (S-Gen-0166/2006), the National Institute of Bioinformatics (a project of Genoma España), and the Fundación Marcelino Botín. JAVM and MR are supported by a MEC Postdoctoral Fellowship, IC is supported by a Spanish Postdoctoral Fellowship (FIS-CD07/00131) and APM is supported by the Spanish Ramón y Cajal program. We acknowledge the Barcelona Supercomputing Center for providing us with computer resources.
- 10. Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 2008, (36 Database):D419–425.
- 11. Pearl F, Todd A, Sillitoe I, Dibley M, Redfern O, Lewis T, Bennett C, Marsden R, Grant A, Lee D, et al.: The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Res 2005, (33 Database):D247–251.
- 16. Maguid S, Fernandez-Alberti S, Echave J: Evolutionary conservation of protein vibrational dynamics. Gene 2008.
- 29. Brand ME: Incremental Singular Value Decomposition of Uncertain Data with Missing Values. In Lecture Notes in Computer Science. Volume 2350. European Conference on Computer Vision (ECCV); 2002:707–720.
- 30. Press WH, Flannery BP, Teukolsky SA, Vetterling WT: Numerical Recipes in C: The Art of Scientific Computing. 1st edition. UK: Cambridge University Press; 1988.
- 33. Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz KM Jr, Ferguson DM, Spellmeyer DC, Fox T, Caldwell JW, Kollman PA: A Second Generation Force Field for the Simulation of Proteins, Nucleic Acids, and Organic Molecules. Journal of the American Chemical Society 1995, 117(19):5179–5197.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.