1 Introduction

Before a protein structure can be analyzed in light of its biological function, it must be validated, i.e., its reliability must be clearly understood in terms of both the overall structure and its details at the per-residue level. However, accurate and fast validation of protein structures is a long-standing problem in Nuclear Magnetic Resonance (NMR) spectroscopy [1–4]. For this reason, investigators have proposed a plethora of methods to determine the accuracy and reliability of protein structures in recent years [5–12]. Despite this progress, there is a growing need for more sophisticated, physics-based and fast structure-validation methods [1, 2, 6, 7, 11].

The 13Cα chemical shifts provide important information about conformations of peptides and proteins in solution [13–39] and, therefore, can be used as an exquisitely sensitive probe with which to assess the quality of protein models. We recently developed a new, physics-based methodology [34] that makes use of observed and computed (at the density functional theory (DFT) level [40]) 13Cα chemical shifts for an accurate validation of protein structures in solution and in crystal [41]. The first step in the development of this new methodology involved determining the factors that affect 13Cα shielding calculations, such as the protonation/deprotonation state of distant ionizable groups, sequential nearest-neighbor or covalent geometry effects (i.e., due to variations in the bond lengths and bond angles of residues) and the sensitivity of the shielding/deshielding of 13Cα nuclei to changes in side-chain conformation. Once all these factors affecting 13Cα shielding have been properly identified and considered, a very important test is to determine the accuracy and speed of the computation of the 13Cα shielding as a function of the size of the basis set chosen and the DFT model adopted. These are important tests because DFT-based quantum mechanical (QM) calculations are very CPU demanding, despite the ever-increasing computational power available.

The new DFT-based method has been applied to study a number of problems, such as unblocked statistical-coil tetrapeptides in aqueous solution [32], polyproline II helix conformation in a proline-rich environment [31], the 13Cα and 13Cβ chemical shifts of cysteines in disulfide-bonded cysteine [42] or determination of the fraction of the tautomeric forms of histidine in proteins as a function of pH [43]. This new strategy also provides a unified, self-consistent method to determine high-quality protein structures, without relying on knowledge-based information [44]. Thus, a β-sheet or an all α-helical protein structure can be accurately determined by simply identifying a set of conformations which simultaneously satisfy a number of constraints, namely 13Cα-dynamically-derived torsional angle constraints and Nuclear Overhauser Effect (NOE) derived distance constraints [29, 44].

The currently used 13Cα chemical shift-based validation and determination protocol [29, 33, 44, 45, 34] exploits the following features: (a) the assignment of chemical shifts is a fundamental step in protein structure determination by NMR spectroscopy [46], and no extra experimental work is needed; (b) in addition to the impact of the covalent structure, 13Cα chemical shifts are modulated mainly by the intraresidue backbone and side-chain dihedral angles [16, 17, 19, 20–22, 27, 47, 35, 39], with no significant influence of the amino acid sequence [48]; (c) 13Cα is ubiquitous in proteins; and, (d) 13Cα chemical shifts can be computed with high accuracy at the QM level of theory.

This chapter is intended to be an overview of the author’s contribution to the field of protein structure determination and validation using, mainly, information decoded from the 13Cα chemical shifts. Consequently, the chapter is organized as follows: first, the methods used to compute the 13Cα chemical shifts and to analyze the results are briefly described; second, the main factors affecting the computation of the 13Cα chemical shifts are enumerated and discussed; third, the capabilities of the computed 13Cα chemical shifts, as a rich source of encoded structural information, are illustrated by a series of applications that involve, but are not limited to, the determination of protein structures; and finally, a new protein-structure validation server, CheShift-2 [49], with which NMR spectroscopists can assess the quality of their protein models before they are deposited in the Protein Data Bank (PDB) [50], is presented. It is worth noting that the theory and details behind alternative protein structure determination and validation methods are not discussed here and, hence, the reader is referred instead to an extensive collection of such methods [1, 5–12, 26, 51–61].

2 Methods

2.1 Calculation of 13Cα Chemical Shifts

All the experimentally determined conformations, unless noted otherwise, were regularized, i.e., all residues were replaced by the standard Empirical Conformational Energy Program for Peptides (ECEPP) [62] residues in which bond lengths and bond angles are fixed (rigid-body geometry approximation) at the standard values [62] and hydrogen atoms were added, if necessary.

Computations of the 13Cα chemical shifts involve a series of approximations. For each amino acid residue X in the protein sequence: (a) the 13Cα shielding depends, mainly, on its own backbone [21, 27] and side-chain [19, 20, 35] conformations, with no significant influence of either the amino acid sequence or the position of the given residue in the sequence, except for residues preceding proline [48]; (b) each amino acid residue X in the protein sequence can be treated as a terminally-blocked tripeptide with the sequence Ac-GXG-NMe, with X in the conformation of the protein structure; (c) the 13Cα isotropic shielding values (σ) for each amino acid residue X can be computed at the OB98/6-311 + G(2d,p) level of theory [28] with the Gaussian 03 package [63], while the remaining residues in each tripeptide are treated at the OB98/3-21G level of theory, i.e., by using the locally-dense basis set approach [64]; (d) all ionizable residues can be considered neutral during the QM calculations [45], unless noted otherwise; (e) no geometry optimization is necessary because such optimization by ab initio (HF) or DFT methods has only a small effect on the computed chemical shifts [19].

The computed 13Cα shieldings (\( \sigma_{subst,th} \)) are converted to 13Cα chemical shifts (δ) by employing the equation \( \delta_{th} = \sigma_{ref} - \sigma_{subst,th} \), where the indices denote a theoretical (th) computation, the reference substance (ref), and the substance of interest (subst), i.e., the 13Cα shielding of a given amino acid residue X. The observed shielding value of tetramethylsilane (TMS) in the gas phase [65], namely 188.1 ppm, was adopted as an initial (see below) reference value. All the computed 13Cα shielding (\( \sigma_{subst,th} \)) values are calculated using the Gauge-Invariant Atomic Orbital method at the DFT level of theory as implemented in the GAUSSIAN 03/09 suite of programs (Frisch et al., 2003). For all purposes, in this chapter, we have used only one exchange-correlation functional, OB98, because it was shown [30] to be one of the most accurate and fast functionals with which to reproduce the observed 13Cα chemical shifts of proteins in solution (see Sect. 3.2).
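The conversion from shielding to chemical shift can be illustrated with a minimal Python sketch. The function name and the sample shielding values are hypothetical and chosen only for illustration; the only quantities taken from the text are the relation δth = σref − σsubst,th and the gas-phase TMS value of 188.1 ppm.

```python
# Observed gas-phase TMS shielding quoted in the text (ppm).
TMS_GAS_PHASE_SHIELDING_PPM = 188.1


def shielding_to_shift(sigma_subst_th, sigma_ref=TMS_GAS_PHASE_SHIELDING_PPM):
    """Return delta_th = sigma_ref - sigma_subst_th, with all values in ppm."""
    return sigma_ref - sigma_subst_th


# Hypothetical computed 13C-alpha shieldings (ppm); illustrative numbers only.
example_shieldings = [130.0, 125.5, 133.2]
example_shifts = [shielding_to_shift(s) for s in example_shieldings]
```

Passing a different `sigma_ref` lets the same function be reused with a functional-specific reference value.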

2.2 Determination of an Effective TMS Shielding Value

Determination of a proper TMS shielding value for each functional is crucial for an accurate computation of the 13Cα chemical shifts because it enables us to minimize systematic errors that might bias the chemical shift-based analysis. From this point of view, the effective TMS value provides the most accurate approach to the problem because it requires no further adjustments. Consequently, computation of an effective TMS value is central to our calculations.

By adopting the observed TMS value of 188.1 ppm (Jameson and Jameson, 1987) as a reference it is possible to find for any functional, the characteristic mean (xo) and standard deviation (σ) of the Normal (or Gaussian) fit of the frequency of the errors distribution. For all functionals tested in our work the characteristic mean value (xo) appears displaced from its ideal value of 0.0 by a positive, or negative, amount, e.g., for OB98 a xo = + 3.6 ppm was found. Further analysis [30] indicates that for any of the 10 functionals tested a straightforward use of the observed TMS shielding value (188.1 ppm) is not appropriate, if no further corrections are introduced. Hence, for each functional and basis set chosen it is feasible to find an ‘effective’ TMS shielding value for which the Normal (or Gaussian) fit shows a zero displacement, i.e., an effective TMS value that gives a xo = 0.0. For example, use of OB98 with a large [6-311 + G(2d,p)/3-21G] basis set leads to an effective TMS of 184.5 ppm, i.e., by subtracting 3.6 ppm from 188.1 ppm [30], that gives a xo = 0.0 ppm. Likewise, use of a small (6-31G/3-21G) basis set leads to an effective TMS of 195.4 ppm.
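The idea of an effective TMS value can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the error is taken as computed minus observed (the sign that reproduces 188.1 − 3.6 = 184.5 ppm for OB98), and the arithmetic mean of the errors stands in for the center xo of the Normal fit described above. The function name and the sample data are hypothetical.

```python
from statistics import mean


def effective_tms(observed, computed, sigma_ref=188.1):
    """Return (effective reference shielding, mean error x0), both in ppm.

    Assumption: error = computed - observed, and the arithmetic mean is used
    in place of the center of a fitted Normal distribution.
    """
    x0 = mean(c - o for c, o in zip(computed, observed))
    # Shifting the reference by x0 moves the mean error to zero, because
    # every computed shift is sigma_ref minus a computed shielding.
    return sigma_ref - x0, x0


# Hypothetical shifts (ppm) with a uniform +3.6 ppm offset in the computed values.
observed_shifts = [50.0, 55.0, 60.0]
computed_shifts = [53.6, 58.6, 63.6]
sigma_eff, x0 = effective_tms(observed_shifts, computed_shifts)
```

With these illustrative inputs the offset is 3.6 ppm, so the effective reference comes out at 184.5 ppm, matching the arithmetic quoted for OB98 with the large basis set.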

2.3 Computation of the Ca-RMSD Model

The observed chemical shift for each residue i, \( {}^{13}{\text{C}}^{\alpha }_{observed,i} \), represents contributions from an ensemble of rapidly interconverting conformers that coexist in solution. Hence, an accurate comparison between the observed and computed 13Cα chemical shifts requires consideration of an ensemble of NMR-derived conformers, rather than of a single conformation [41, 33]. Consequently, for each amino acid residue in the sequence, i, the average of the chemical shifts calculated for the individual residues in the ensemble of Ω conformers representing the NMR structure, \( {<} {}^{13}{\text{C}}^{\alpha } {>}_{i} \), is computed as:

$$ < {^{13}{\text{C}}^{\alpha }} >_{i} = \left( {1/\varOmega } \right)\sum\limits_{k = 1}^{\varOmega } {^{13} {\text{C}}^{\alpha }_{i, \, k} ,} $$
(1)

where \( {}^{13}{\text{C}}^{\alpha }_{i,k} \) is the computed chemical shift for residue i in conformer k, with 1 ≤ i ≤ N, where N is the number of residues in the sequence. Equation (1) was obtained through the following approximation: for each residue i the quantity to be computed must, in principle, be \( {<} {}^{13}{\text{C}}^{\alpha } {>}_{i} = \sum\nolimits_{k = 1}^{\varOmega } {\lambda_{k} \, {}^{13}{\text{C}}^{\alpha }_{i,k} } \), where λk is the Boltzmann factor for conformer k, with \( \sum\nolimits_{k = 1}^{\varOmega } {\lambda_{k} } \equiv 1 \). However, computation of the Boltzmann factors at the QM level of theory is not possible with the existing computational facilities, because it would require computation of the total energy at the QM level of theory for each of the conformers in the ensemble used to represent the NMR structure. Therefore, the following approximation was used: λk = 1/Ω [48]; in other words, in this approximation each conformer contributes equally to the average chemical shift obtained by fast conformational averaging. Whether a computation of a Boltzmann average, rather than the arithmetic average, would lead to a more accurate representation of the 13Cα chemical shifts needs further investigation.

The \( {<} {}^{13}{\text{C}}^{\alpha } {>}_{i} \) value obtained from Eq. (1) is used to compute the conformational-average difference Δi between the observed and computed 13Cα chemical shifts for each amino acid residue i,

$$ \Delta_{i} = \left( {{}^{13}C_{observed,i}^{\alpha } - < {}^{13}C_{{}}^{\alpha } >_{i} } \right) $$
(2)

Hereafter, the conformational-average root-mean-square-deviation (rmsd) parameter, ca-rmsd [48], is obtained as:

$$ {\text{ca-rmsd}} = \left[ {\left( {1/N} \right)\sum\limits_{i = 1}^{N} {\Delta_{i}^{2} } } \right]^{1/2} , $$
(3)

which is a global property of the protein NMR structure, given as the root-mean-square of the differences between the experimental 13Cα chemical shifts and the \( {<} {}^{13}{\text{C}}^{\alpha } {>}_{i} \) values for all the residues in the protein.
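Equations (1) to (3) can be combined in a short Python sketch. The data layout (one list of per-residue shifts per conformer) and the function names are assumptions made for illustration; the arithmetic follows the three equations as written, with each conformer weighted by 1/Ω.

```python
import math


def ensemble_average(shifts_by_conformer):
    """Eq. (1): arithmetic average over conformers for each residue.

    shifts_by_conformer[k][i] is the computed shift of residue i in conformer k.
    """
    n_conformers = len(shifts_by_conformer)
    n_residues = len(shifts_by_conformer[0])
    return [
        sum(conformer[i] for conformer in shifts_by_conformer) / n_conformers
        for i in range(n_residues)
    ]


def ca_rmsd(observed, shifts_by_conformer):
    """Eqs. (2) and (3): conformational-average rmsd in ppm."""
    averages = ensemble_average(shifts_by_conformer)
    # Eq. (2): observed minus conformational average, residue by residue.
    deltas = [obs - avg for obs, avg in zip(observed, averages)]
    # Eq. (3): root of the mean squared difference over all N residues.
    return math.sqrt(sum(d * d for d in deltas) / len(deltas))


# Hypothetical two-residue, two-conformer example (ppm).
observed_example = [58.0, 60.0]
computed_example = [[57.0, 61.0], [59.0, 63.0]]
example_ca_rmsd = ca_rmsd(observed_example, computed_example)
```

In this toy case the per-residue averages are 58.0 and 62.0 ppm, so the differences are 0 and −2 ppm and the ca-rmsd equals the square root of 2.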

2.4 13Cα-Based Protein Structure Determination Method

The 13Cα-based procedure used for determination of protein structures consists of three steps. The flow chart of this protocol [44] is shown in Fig. 1 and a brief description of each step follows.

Fig. 1 Flow-chart of the 13Cα-based protein structure determination protocol described in the Methods section. Figure adapted from Vila et al. [44]. Copyright 2007 American Chemical Society

Step 1: The Variable-Target-Function (VTF) approach with a simplified soft-sphere potential function [66] is used to generate, at random, an ensemble of conformations that simultaneously satisfy a set of long-range distance constraints derived from the experimental NOEs and (φ, ψ) torsional constraints derived from the observed 13Cα and 13Cβ conformational shifts [27]. The derived torsional constraints apply only to those amino acid residues in the sequence that pertain to a regular structure, i.e., to an α-helix or a β-sheet. Consequently, these (φ,ψ)α,β torsional constraints (shown in Fig. 1) are limited to, on average, ~50% of the amino acid residues in proteins because the remaining ones populate non-regular structures.

Then, a clustering procedure, e.g., the Minimal Spanning Tree method [67], is used to select a small subset of the total number of the VTF-derived conformations, namely those possessing a maximum NOE-derived distance violation lower than some arbitrary fixed value. For each of these conformations the 13Cα chemical shifts are computed as described in Sect. 2.1. Examination of the chemical shifts of all the amino acids in the ensemble of conformations enables us to identify, at each position in the sequence, the amino acid conformation whose computed chemical shift most closely matches the observed one, among all these conformations. This identified set of individual amino acid conformations corresponds to only one conformation of the whole chain: the ‘theoretical minimal-rmsd model’ [33]. In this model, the 13Cα chemical shift of each residue individually best matches the experimental one, thereby providing a new set of ϕ, ψ, and χ torsional angle constraints for all amino acid residues in the sequence, i.e., not just for the amino acid residues in regular structures. Because the chemical shifts are a multivalued function of the ϕ, ψ, and χ torsional angles, the set of torsional angles derived from the ‘theoretical minimal-rmsd model’ does not necessarily represent a unique solution to a given set of observed 13Cα chemical shift values.
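The per-residue selection that defines the ‘theoretical minimal-rmsd model’ can be sketched as follows. This is a minimal illustration, not the published implementation: the function name, the data layout and the sample values are hypothetical, and ties are resolved in favor of the first conformer, which is an arbitrary choice.

```python
def minimal_rmsd_model(observed, shifts_by_conformer):
    """For each residue, return the index of the best-matching conformer.

    shifts_by_conformer[k][i] is the computed shift of residue i in conformer k;
    observed[i] is the experimental shift of residue i (all in ppm).
    """
    conformer_indices = range(len(shifts_by_conformer))
    picks = []
    for i, obs in enumerate(observed):
        # Choose the conformer whose computed shift is closest to experiment.
        best_k = min(
            conformer_indices,
            key=lambda k: abs(shifts_by_conformer[k][i] - obs),
        )
        picks.append(best_k)
    return picks


# Hypothetical three-residue, two-conformer example (ppm).
observed_example = [58.0, 60.0, 55.0]
computed_example = [[57.0, 63.0, 55.2], [59.5, 60.4, 52.0]]
picks_example = minimal_rmsd_model(observed_example, computed_example)
```

The torsional angles of residue i would then be read from conformer `picks_example[i]`, which is how a single chain conformation is assembled from residues taken out of different conformers.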

Step 2: Only one conformation among all the conformations produced in Step 1 is selected, for example, the conformation possessing the lowest rmsd between the computed and observed 13Cα chemical shifts. The selected conformation is used as the starting point of a new conformational search with the Monte Carlo with Minimization (MCM) method [68, 69]. The MCM search is carried out with two types of constraints: the original set of NOE-derived distance constraints and the new set of ϕ, ψ, χ torsional angles derived in Step 1. This time the conformational search is carried out using a complete force-field including the internal potential energy described by ECEPP/05 [70], the solvent free energy calculated by using a solvent-accessible surface area model [71], and additional energy terms aimed at penalizing violations of the distance and torsional angle constraints [72]. Convergence of the determination protocol is monitored using the ca-rmsd between the computed and observed 13Cα chemical shifts.

Step 3: If the computed ca-rmsd is lower than a certain, arbitrarily chosen, cutoff value (ξ), then the procedure is ended. Otherwise, Step 2 is repeated using a new set of (ϕ,ψ,χ) torsional angle constraints derived from the minimal-rmsd model of the previous step.
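The control flow of Steps 2 and 3 can be summarized in a schematic Python loop. Everything here is a placeholder: `run_mcm`, `score` and `derive_constraints` are hypothetical callables standing in for the MCM search, the ca-rmsd evaluation and the extraction of new torsional constraints, and the `max_cycles` safeguard is an assumption added for the sketch, not part of the published protocol.

```python
def refine(start, constraints, run_mcm, score, derive_constraints,
           cutoff, max_cycles=10):
    """Repeat Step 2 until the score drops below the cutoff (Step 3).

    Returns (conformation, last score, number of cycles performed).
    """
    conformation = start
    value = float("inf")
    for cycle in range(1, max_cycles + 1):
        # Step 2: constrained conformational search from the current model.
        conformation = run_mcm(conformation, constraints)
        value = score(conformation)
        # Step 3: stop once the ca-rmsd is below the chosen cutoff.
        if value < cutoff:
            return conformation, value, cycle
        # Otherwise derive a new set of torsional constraints and repeat.
        constraints = derive_constraints(conformation)
    return conformation, value, max_cycles


# Toy stand-ins: the "conformation" is a number that each search halves.
result = refine(8.0, None, lambda c, cons: c / 2.0, abs,
                lambda c: None, cutoff=1.0)
```

The toy callables only exercise the loop logic; in practice each call to the search would be a full constrained MCM run.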

It is worth noting that, after our physics-based protocol was published [44], an alternative knowledge-based method that makes use of 1H, 13Cα, 13Cβ and 15N chemical shifts as restraints was successfully applied to structure determination of several proteins [53]. A blind test of computational methods aimed at fully automated determination of protein structures, including several that also use chemical shifts as restraints, has been carried out recently [60].

2.5 Computation of the 13Cα Chemical Shifts as a Function of the pH

For a given residue i of a protein in a conformation k, the average charge distribution, \( < \rho_{i,k} > \), could be determined by solving the Poisson equation by considering the \( 2^{\xi } \) ionization states, with ξ being the number of ionizable groups in the molecule. Regarding this problem, it is worth noting that ξ could be a large number because ~30% of all residues in a protein sequence are, on average, ionizable and, hence, an accurate solution would require a fast algorithm. Consequently, in all the applications mentioned in this chapter, we used the Multiple Boundary Element (MBE) method [73, 74], in which the free energy associated with the state of ionization of the ionizable groups at a fixed pH value, namely 6.5, is calculated with the general multi-site titration formalism [75, 76]. The charges and atomic radii from the PARSE (Parameters for Solvation Energy) algorithm [77] were used for the solvation free energy calculations using the MBE method, and the internal (εint) and solvent (εsolv) dielectric constants of 2 and 80, respectively [76], were adopted for the calculations of \( < \rho_{i,k} > \). The value of εint = 2 is consistent with the use of PARSE charges [78] and is also commonly assumed to be an adequate representation of the protein interior. Following these approximations, for a given conformation k, the average degree of ionization of the ith ionizable group of this conformation is computed as:

$$ < \rho_{i,k} > = Z^{ - 1} \sum\limits_{n = 1}^{{2^{\xi } }} {\rho_{i,k}^{n} } \exp [ - \Delta G(P_{k} ,x_{k}^{n} )/k_{B} T] $$
(4)

where Z is the partition function, kB is the Boltzmann constant, T is the absolute temperature, and \( x_{k}^{n} = (\rho_{1,k}^{n} , \ldots ,\rho_{i,k}^{n} , \ldots ,\rho_{N,k}^{n} ) \), with \( \rho_{i,k}^{n} \) = (1 or 0), is the nth protonation microstate of conformation k for protein Pk. \( \Delta G(P_{k} ,x_{k}^{n} ) \) is the free energy of ionization of the nth microstate of protein Pk in conformation k [75].
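For a very small number of ionizable groups, the average of Eq. (4) can be evaluated by brute-force enumeration, with each microstate weighted by its Boltzmann factor exp(−ΔG/kBT) and Z obtained as the sum of those factors. The Python sketch below is only illustrative: `free_energy` is a hypothetical callable (the chapter obtains these free energies with the MBE method), energies are assumed to be in kcal/mol, and explicit enumeration is not how a production calculation would treat a protein with many ionizable groups.

```python
import itertools
import math

# Boltzmann constant in kcal/(mol K); the energy unit is an assumption here.
K_B = 0.0019872041


def average_ionization(n_groups, free_energy, temperature=298.15):
    """Return the Boltzmann-averaged degree of ionization of each group.

    free_energy(state) gives the free energy of a microstate, where state is
    a tuple of 0/1 flags, one per ionizable group.
    """
    beta = 1.0 / (K_B * temperature)
    partition = 0.0
    weighted = [0.0] * n_groups
    # Loop over all 2**n_groups protonation microstates.
    for state in itertools.product((0, 1), repeat=n_groups):
        weight = math.exp(-beta * free_energy(state))
        partition += weight
        for i, rho in enumerate(state):
            weighted[i] += rho * weight
    return [w / partition for w in weighted]


# Toy model: a single group whose ionized state is favored by kT ln 3.
kT = K_B * 298.15
single_group = average_ionization(1, lambda s: -kT * math.log(3.0) * s[0])
```

In the toy model the ionized state carries three times the weight of the neutral one, so the averaged degree of ionization is 3/(1 + 3) = 0.75, a non-integer value of the kind discussed in the text.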

It should be noted that for any ionizable residue i of a single conformation k, Eq. (4) can lead to a non-integer average degree of charge, although we know that such non-integer charges do not make physical sense. Due to the Boltzmann nature of the averaged value computed by Eq. (4), a fractional charge should physically be interpreted as follows: for a given conformation k, there are many identical replicas of such a conformation in solution and, hence, a fractional charge computed by Eq. (4), e.g., 0.75, means that 75% of these replicas possess the ionizable group i protonated/deprotonated with an integral charge while the remaining 25% of the replicas possess the same ionizable group as deprotonated/protonated, depending on whether the ionizable group is basic or acidic.

Assuming that the protonation/deprotonation reactions are instantaneous on the NMR time scale, i.e., microsecond to millisecond [79], the theoretical 13Cα chemical shifts, \( \delta_{i}^{computed} (pH) \), for a given residue i in the sequence (except for histidine, which possesses two tautomers) are computed as a function of the pH using the following equation:

$$ \delta_{i}^{computed} (pH) = (1/\Omega )\sum\limits_{k = 1}^{\Omega } {\{ < \rho_{i,k} } > \delta^{ + ,i,k} + (1 - < \rho_{i,k} > )\delta^{0,i,k} \} $$
(5)

where \( \delta^{ + ,i,k} \) and \( \delta^{0,i,k} \) are the computed 13Cα chemical shifts, for the amino acid i in conformation k, with fully charged and neutral side chains, respectively, Ω is the number of conformers in the protein ensemble, and \( < \rho_{i,k} > \) is the averaged degree of charge, as given by Eq. (4).
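Equation (5) amounts to a charge-weighted mixture of the two limiting shifts for each conformer, followed by an arithmetic average over the ensemble. A minimal Python sketch follows; the function name, argument layout and sample numbers are hypothetical.

```python
def ph_dependent_shift(rho, delta_charged, delta_neutral):
    """Eq. (5) for one residue i.

    rho[k] is the averaged degree of charge in conformer k (from Eq. 4);
    delta_charged[k] and delta_neutral[k] are the shifts (ppm) computed with
    the side chain fully charged and neutral, respectively.
    """
    n_conformers = len(rho)
    total = sum(
        r * d_plus + (1.0 - r) * d_zero
        for r, d_plus, d_zero in zip(rho, delta_charged, delta_neutral)
    )
    return total / n_conformers


# Hypothetical two-conformer example (ppm).
shift_example = ph_dependent_shift(
    rho=[0.75, 0.25],
    delta_charged=[56.0, 57.0],
    delta_neutral=[54.0, 55.0],
)
```

Both conformers contribute 55.5 ppm in this example, so the ensemble value is also 55.5 ppm; setting every `rho[k]` to zero recovers the neutral-side-chain approximation used elsewhere in the chapter.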

3 Factors Affecting the Calculation of 13Cα Chemical Shifts

3.1 Transferability of the Results

The current methodology [33, 34] relies on a crucial observation: once residue conformations are established by their interactions with the rest of the protein the 13Cα shielding of each residue depends, mainly, on its backbone and side-chain conformations, with no significant influence by the nature of the nearest-neighbor amino acids, except for residues immediately preceding proline [48].

The above observation allows us to parallelize the 13Cα shielding calculations in proteins and, hence, to make them computationally feasible. Moreover, a given set of accurately-determined amino acid residue conformations representing the accessible conformational space for all the 20 naturally occurring amino acids and showing a good distribution of side-chain conformations will constitute a reasonable ensemble with which to carry out tests of the current methodology. The results of these tests should be transferable to proteins of any class or size. Consequently, we used structures of three proteins solved by NMR and X-ray, namely PDB id 1D3Z, 2JVD and 1NS1 to evaluate the performance of different DFT functionals and basis sets, as explained below.

3.2 Performance of Different DFT Functionals to Reproduce Observed 13Cα Chemical Shifts

DFT has become a method of choice for QM calculations of the electronic structure and properties of many molecular and solid systems. Because the exact exchange-correlation functional is unknown, a large number of approximations have been proposed in the literature, and the search for more accurate and reliable approximate functionals continues; which functional performs best, however, depends on the application. Selection of the most appropriate density functional model for a particular application thus becomes one of the main problems of the DFT method. For this reason we decided [28] to test several density functional models (namely B3LYP, OLYP, PBE1PBE, OPBE, O3LYP, OPW91, OB98, BPW91, BPBE and B971). The benchmarking was intended to find not only the most accurate functional with which to reproduce the observed 13Cα chemical shifts in solution but also the fastest one, in terms of CPU time, because the speed of DFT calculations could severely limit their applicability to proteins. The test was applied to 10 NMR-derived conformations of the 76-residue α/β protein ubiquitin (PDB id 1D3Z).

Comparison of the observed and computed 13Cα chemical shifts shows that there are five functionals, namely OPW91, OB98, OPBE, OLYP, and O3LYP, which are among the faster ones and, even more importantly, behave very similarly in their ability to reproduce accurately the observed 13Cα chemical shifts. In particular, we observe that OB98 appears to be slightly better than the other four functionals in terms of both the correlation coefficient, R (or Pearson coefficient), between the observed and the conformational-averaged 13Cα chemical shifts and the standard deviation of the computed conformational-averaged 13Cα chemical shifts from a linear regression. Consequently, we chose OB98 for all the applications [30].

We also compared the results obtained using OB98 with those obtained with B3LYP, a very popular functional that has been used extensively in our group and elsewhere. The correlation between the averaged 13Cα chemical shift values obtained for the 10 conformations of 1D3Z with the OB98 and B3LYP functionals is excellent [30], with a correlation coefficient R = 0.998 and a standard deviation of 0.300 ppm. This test provides solid evidence that the results and conclusions obtained using B3LYP do not need to be revised if the OB98 functional is adopted [30].

3.3 Performance of Different Basis Sets to Reproduce Observed 13Cα Chemical Shifts

To study the dependence of the accuracy and speed of DFT calculations of the 13Cα chemical shifts in proteins on the size of the basis set used, six basis sets, viz., the 6-31G/3-21G, 6-31G(d)/3-21G, 6-311G(d, p)/3-21G, 6-311 + G(d, p)/3-21G, and 6-311 + G(2d,p)/3-21G locally-dense basis-set approximations, and the uniform 3-21G/3-21G set, were initially applied [28] to 10 NMR-derived conformations of ubiquitin [54]. For each of these six basis sets, combined with the OB98 functional, the 13Cα shielding was computed for 760 amino acid residues by treating each amino acid X in the sequence as a terminally-blocked tripeptide with the sequence Ac-GXG-NMe in the conformation of the regularized experimental protein structure. Analysis of the results [28], in terms of the agreement between the computed and observed 13Cα chemical shifts, shows that the accuracy with which the observed 13Cα chemical shifts are reproduced by using either the small basis set (6-31G/3-21G) or the larger basis set [6-311 + G(2d,p)/3-21G] is very similar, although use of the small basis set leads to a significant decrease in computational time.

The results also indicate that the 13Cα chemical shifts computed with the large [6-311 + G(2d,p)/3-21G] basis set can be reproduced accurately (within an average error of ~0.4 ppm) and faster (by ~9 times) by using the small (6-31G/3-21G) basis set after extrapolating it with: \( {}^{13}C^{\alpha } = - 1.597 + 1.040 \times {}^{13}C_{\mu }^{\alpha } \), where \( {}^{13}C_{\mu }^{\alpha } \) denotes the shift computed with the small basis set. In effect, the correlation between the averaged 13Cα chemical shift values computed for the 32 conformations of 1NS1 with these two basis sets is excellent [28], with a correlation coefficient R = 0.999 and a standard deviation of 0.284 ppm. Even more important, an analysis of the magnitude of the errors and their distribution carried out for Val and Arg hypersurfaces, constructed by calculating a grid of 6864 and 6794 points, respectively, corresponding to different combinations of the ϕ, ψ, χ1, and χ2 (only for Arg) torsional angles, indicates that ~70% of them are within ~0.6 ppm and that the most populated regions of the Ramachandran map are not affected by errors higher than ~1.0 ppm [28].
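The linear extrapolation from the small to the large basis set is a one-line calculation, shown here as a Python sketch. The coefficients are those quoted in the text; the function name and the sample input are hypothetical.

```python
# Intercept (ppm) and slope of the small-to-large basis-set extrapolation.
INTERCEPT_PPM = -1.597
SLOPE = 1.040


def extrapolate_to_large_basis(shift_small_basis):
    """Estimate the large-basis-set shift (ppm) from a small-basis-set shift."""
    return INTERCEPT_PPM + SLOPE * shift_small_basis


# Hypothetical shift (ppm) computed with the small 6-31G/3-21G basis set.
estimated_shift = extrapolate_to_large_basis(55.0)
```

For an input of 55.0 ppm the estimate is 55.603 ppm, i.e., a correction of about 0.6 ppm for this illustrative value.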

In conclusion, the described analysis enabled us to select the smaller basis set (6-31G/3-21G) that provides accuracy similar to that of a ‘basis set limit’ [6-311 + G(2d,p)/3-21G] to reproduce the computed chemical shifts, but at a significantly lower computational cost [28].

3.4 Effect of Sequential Nearest-Neighbors on the 13Cα Chemical Shifts Calculations

The 13Cα chemical shifts for a residue X in the model peptide Ac-G-X-G-NMe have always been computed [44, 34] considering that all the torsional angles of the residue X are exactly those of the residue in the protein conformation and that the surrounding Gly residues and the end-blocking groups are free to rotate. It is implicit in this approach that the 13Cα chemical shifts of residue X do not depend on the identity of the nearest-neighbor residues. This assumption needs to be proved.

The structure of the Nucleic Acid Binding (NAB) protein of the SARS coronavirus [80], a 116-residue α/β protein containing 9 prolines (Pro) and with 50% of its residues in loops and turns, was chosen to further evaluate the origin of differences between computed and observed 13Cα chemical shifts, as well as to study the influence of the nearest-neighbor residues on the computed 13Cα chemical shifts.

The results [48] indicate that computation of the 13Cα chemical shifts of a given residue in the sequence of the NAB protein is not influenced significantly, i.e., within ~0.5 ppm, by the nature of the nearest-neighbor amino acids, except for residues immediately preceding proline (see Fig. 2a). For such residues, Pro must be considered during the computation of the 13Cα chemical shifts; otherwise, an overestimation of the computed 13Cα chemical shifts by about +1.7 ppm occurs. This finding is in good agreement with both the experimental evidence [36, 81, 82] and the empirical observations [37, 81]. It is equally important to emphasize the physical nature of this effect: “…an imide bond formed by an XxxPro pairing is generally thought to be much less electron-withdrawing than an amide bond…” [37].

Fig. 2 Histogram of the average, over all 20 conformers of the protein PDB id 2K87, second-order differences ΔΔ: (a) with ΔΔ = < (ΔX − ΔYX) > arising from the nature of the sequentially preceding residue-type (Yyy). ΔX and ΔYX are the differences between the observed chemical shifts and those computed using the Ac-Gly-Xxx-Gly-NMe and Ac-Gly-Yyy-Xxx-Gly-NMe model peptides, respectively; (b) with ΔΔ = < (ΔX − ΔXY) > for the differences arising from the nature of the subsequent residue-type, i.e., with ΔXY computed with Ac-Gly-Xxx-Yyy-Gly-NMe. Figure adapted from [48] (with permission of Springer)

Overall, except for the Pro effects, use of the Ac-G-X-G-NMe model peptide for the computation of the 13Cα chemical shifts of residue X is a good approximation because the computed values are accurate within ±0.5 ppm for all residue-types, even when neither the subsequent nor the preceding residue-type effects are taken into account (see Fig. 2).

3.5 Rigid-Geometry Approximation and Accuracy of the Calculations of 13Cα Chemical Shifts

Experimental protein structures are often solved using force fields which allow variation of bond lengths and bond angles. However, it is known that QM calculations are very sensitive to bond lengths and bond angles [16]. Therefore, we have explored the dependence of the computed 13Cα-chemical shifts on the bond lengths and bond angles to establish whether a rigid- rather than non-rigid geometry approximation is a more accurate representation with which to compute the chemical shifts.

For this test, the structure of ubiquitin deposited in the PDB (PDB id 1UBQ) was chosen because it possesses non-regularized geometry and has been solved by X-ray diffraction at 1.8 Å resolution [83]. We have also examined the corresponding structure with regularized geometry, i.e., the one with all the residues replaced by the standard ECEPP residue geometry [62], named here 1UBQregular. Analysis of the differences between the computed and observed 13Cα chemical shifts for the 1UBQ and 1UBQregular structures leads to rmsd values of 3.28 ppm and 2.38 ppm, respectively. The better agreement obtained with 1UBQregular, rather than 1UBQ, is consistent with the long-standing recognition that the bond lengths and bond angles of both X-ray and NMR-derived structures are not as accurately defined as in studies of small molecules [16], with which the ECEPP geometry [62] has been parameterized. Further analysis of the agreement of the two ubiquitin structures with the deposited electron density data [83] of 1UBQ, in terms of the R-factor, leads to 19.2 and 23.1% for 1UBQ and 1UBQregular, respectively, while the all-heavy-atom rmsd between these two structures is 0.142 Å [34].

Overall, the use of regularized geometry, i.e., ECEPP geometry, is an accurate approximation with which to compute the 13Cα chemical shifts in proteins and, hence, is used in most of the applications discussed in this chapter.

3.6 13Cα Chemical Shifts as a Function of the Charge Distribution

Among the factors that affect 13Cα-shielding, which are important for an accurate computation of chemical shifts, is the sensitivity of 13Cα nuclei to the shielding/deshielding induced by changes in the protonation/deprotonation of distant ionizable groups [84–87]. However, these factors have not been taken into account explicitly in current computations of 13Cα chemical shifts in proteins at the QM level of theory because, usually, the calculations are carried out in the gas phase, and the ionizable residues are treated as neutral groups.

The question of whether the use of neutral, rather than charged, side chains is more accurate for computation of the 13Cα chemical shifts of ubiquitin, at a given fixed pH, was investigated as follows [45]. For a given ionizable residue i in a conformation k, first, the average charge distribution, \( < \rho_{i,k} > \), was computed by using Eq. (4), i.e., by explicit consideration of the \( 2^{\xi } \) ionization states for every conformation [75], with ξ being the number of ionizable groups in the molecule, namely 22; and second, the 13Cα chemical shifts as a function of the pH, \( \delta_{i}^{computed} (pH) \), were computed by using Eq. (5). This analysis was applied to 139 conformations of ubiquitin: 138 are NMR-derived conformations (10 from PDB id 1D3Z plus 128 from PDB id 1XQQ) [54, 88], while the remaining one is an X-ray structure (PDB id 1UBQ) solved at 1.8 Å resolution [83].

Additionally, an extra set of 50 randomly generated conformations for each amino acid residue X, in the terminally-blocked tripeptide with the sequence Ac-GXG-NMe, with X being Lysine (Lys), Ornithine (Orn), Diaminobutyric acid (Dab), Glutamic acid (Glu) or Aspartic acid (Asp), was also obtained. This set of randomly generated conformations was used to determine: (i) the range of shielding/deshielding of the 13Cα nucleus of free acidic/basic amino acid residues in solution, in their fully charged and neutral forms, respectively; (ii) how these ranges of shielding/deshielding variations compare with those derived from 3058 ionizable groups of the 139 conformations of the protein ubiquitin; and (iii) how the computed ranges of shielding/deshielding variations are influenced by the distance between the charged side-chain group and the 13Cα nucleus (for example, there are two chemical bonds in Asp, rather than three in Glu, separating the deprotonated carboxyl group from the 13Cα nucleus). To examine an analogous effect for a basic side-chain group, such as Lys, use was made of the non-natural amino acids Orn and Dab because, for these amino acids, the protonated amino group is separated from the 13Cα nucleus by four and three chemical bonds, rather than by five in Lys.

The results of this study [45], based on the analysis of 139 conformations of ubiquitin at pH 6.5, indicate that use of neutral, rather than charged, amino acids is a significantly better approximation of the observed 13Cα chemical shifts in solution for the acidic groups, and a slightly better representation, though significantly less expensive computationally, for the basic groups (see Fig. 3).

Fig. 3

Figure adapted from [45] (with permission of John Wiley and Sons)

Average difference, Δ, computed over a set of 9 conformations of protein ubiquitin using Eq. (2) for: (a) acidic and (b) basic groups, respectively. Grey and white bars denote charged and neutral side chains, respectively.

Additionally, our analysis of Lys, Orn and Dab revealed a significantly greater deshielding of the 13Cα nucleus (due to the deprotonation of the acidic groups) than the shielding due to the protonation of the basic groups. The origin of such a difference can be found in the distance between the ionizable groups and the 13Cα nucleus, which is shorter for the acidic than for the basic groups.

3.7 13Cα Chemical Shifts as a Function of Side-Chain Flexibility

To what extent are the chemical shifts of the amino acid residues in a protein affected by the side-chain orientation? The basis for such a query arises from the fact that the three torsion angles ϕ, ψ and χ1 are not independent of each other over the whole range because they involve a common N-Cα bond [89, 90]. To find an answer to this question, the dependence of the 13C chemical shifts on side-chain orientation was investigated [35], at the DFT level of theory, for a two-strand antiparallel β-sheet model peptide with the amino acid sequence Ac-A3-X-A12-NH2, where X represents any of the 17 naturally-occurring amino acids considered here, i.e., not including alanine, glycine and proline. Because the majority of β-sheets are twisted, rather than planar, with a right-hand twist in the approximately ±30° range for the backbone dihedral angles [91–94], conformational parameters for β-sheets may deviate from those for planar pleated sheets and, hence, are difficult to model by using canonical values. The fact that β-sheets in proteins appear as parallel or antiparallel strands, or a combination of both, only exacerbates the modeling problem. For this reason, the dihedral angles adopted for the backbone were taken, and kept fixed, from the experimental structure of an antiparallel β-sheet, specifically from the 16-residue segment (G41-G56) of the B3 binding domain of protein G (PDB id 1P7E).

For the 17 naturally occurring amino acids considered, the analysis indicates that there is: (a) good agreement between computed and observed 13Cα and 13Cβ chemical shifts, i.e., with correlation coefficients, R, of 0.95 and 0.99, respectively; (b) significant variability of the computed 13Cα and 13Cβ chemical shifts as a function of χ1 for all 17 residues, except for Ser; and (c) a smaller, although still significant, dependence of the computed 13Cα chemical shifts on χξ (with ξ ≥ 2) for 11 out of 17 residues.

The above results obtained by Villegas et al. [35] for an antiparallel (16-residue segment) β-sheet were later validated on a 76-residue α/β protein, i.e., by exploring the effects of side-chain conformation on the computed 13Cα chemical shifts [45]. This validation process involved an exhaustive conformational search, starting from an arbitrarily selected conformation of the NMR-determined ubiquitin protein (PDB id 1D3Z), in which only the torsional angles of the side chains were allowed to vary, i.e., all backbone dihedral angles (ϕ, ψ, ω) were fixed at their corresponding observed values. Furthermore, the correlation coefficient, R, between the computed (by using the Karplus equation [95]) and observed vicinal coupling constants 3JN-Cγ and 3JC′-Cγ of 17 valine, threonine and isoleucine residues was used to check the accuracy of the side-chain conformational search.
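The check based on vicinal coupling constants can be sketched as follows, assuming the usual three-term form of the Karplus equation, 3J(θ) = A cos²θ + B cos θ + C, and the Pearson correlation coefficient for R. The coefficients A, B and C, the dihedral angles and the "observed" couplings below are placeholders chosen for illustration only; they are not the parameters or data used in [45] or [95].

```python
import math


def karplus(theta_deg, a, b, c):
    """Three-term Karplus relation for a vicinal coupling constant (Hz)."""
    t = math.radians(theta_deg)
    return a * math.cos(t) ** 2 + b * math.cos(t) + c


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical coefficients (Hz) and dihedral angles (degrees).
A, B, C = 2.0, -1.0, 0.5
dihedrals = [60.0, 180.0, -60.0, 170.0]
computed_j = [karplus(theta, A, B, C) for theta in dihedrals]

# Hypothetical "observed" couplings, used only to exercise pearson_r.
observed_j = [0.6, 3.3, 0.4, 3.1]
r_value = pearson_r(computed_j, observed_j)
```

A value of R close to 1 would indicate that the side-chain dihedral angles found by the conformational search reproduce the pattern of the measured couplings.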

The obtained results on an antiparallel β-sheet segment and the ubiquitin protein enabled us to determine the role and impact of a proper side-chain conformation for an accurate computation of the observed 13Cα chemical shifts in solution.

4 Use of the Structural Information Decoded from 13C Chemical Shifts

We have chosen three examples to illustrate how the structural information decoded from the observed 13C chemical shifts can be used in practice: (1) to determine the fraction of the tautomeric forms of the imidazole ring of histidine (His) in proteins as a function of pH, provided that the observed 13Cγ and 13Cδ2 chemical shifts and the protein structure, or the fraction of H+ form are known; (2) to determine either all α-helical or all β-sheet protein structures in solution; and (3) to assess the reliability of NMR-determined protein models before they are published or deposited in the PDB. Each of these applications is described in the following subsections.

4.1 The Importance of Being His

In 1965 Mandel [96], in a pioneering NMR experiment, detected the imidazole (C2) protons of histidine (His) residues in Ribonuclease A, and in 1966 Bradbury and Scheraga [97] were able to distinguish between the histidine residues of Ribonuclease A, i.e., they resolved the NMR peaks of three out of four histidines of this enzyme. Subsequently, NMR spectroscopy, X-ray crystallography and theoretical studies based on QM calculations have continuously evolved in their ability to determine properties of the histidine residues in solution and in the solid state [43, 79, 98–116]. The reason for this persistent interest in His is that this residue is unique among all 20 naturally occurring amino acids because ~50% of all enzymes use His in their active sites [117]. This is, mainly, because of the versatility of the imidazole ring of His, which includes two neutral, chemically-distinct forms, referred to as the Nδ1-H and Nε2-H tautomers, and a protonated form, the charged H+ form, with one form favored over the other two by the protein environment and pH. In addition, His, with a pK° of 6.6 [118], is the only ionizable residue that titrates around neutral pH, allowing the non-protonated nitrogen of its imidazole ring to serve as an effective ligand for metal binding [79], or to play a crucial role in the proton-transfer process [103].

Certainly, determination of the fraction of the tautomeric forms of the imidazole ring of His in proteins in solution is an important problem for a number of reasons. At a given fixed pH, proteins in solution exist as an ensemble of conformations and, hence, the form of each His residue among different protein conformers may vary significantly because the tautomeric equilibrium is determined by the environment [43]. Moreover, because the exchange between different protonation states is assumed to occur in the fast exchange regime [79], the NMR resonances of a given nucleus arising from the different exchange processes, which include rotation, protonation and tautomerization, merge into a single average signal. Decoding the information from these exchange processes offers the possibility to determine the extent to which the His residues in proteins behave as free His, where the Nε2-H tautomer is favored over the Nδ1-H tautomer in a ratio of 4:1 [108].

To find a solution to this long-standing problem in the biophysical chemistry of proteins, first, each form of His was treated as a terminally-blocked model tripeptide with the sequence: Ac-GHξG-NMe, with Hξ in the Nδ1-H, the Nε2-H tautomeric form or the protonated form H+, respectively. For each of the forms, a set of ~35,000 conformations, representing a uniform sampling of the whole Ramachandran map as a function of the ϕ, ψ, ω, χ1 and χ2 torsional angles, was generated. Afterward, the gas-phase, isotropic shielding value was computed using the method described in Sect. 2.1. Finally, the distribution of the computed shielding of the imidazole ring of His was analyzed in terms of all 13C nuclei, namely 13Cγ, 13Cδ2, and 13Cε1 (see Fig. 4). Specifically, the histogram of the shielding distribution (among all ~35,000 conformations) was fit by a Gaussian function with a mean value σo (shown as bars in Fig. 4) and standard deviation sd (data not shown). A visual inspection of the histogram shown in Fig. 4 revealed that the mean σo shielding values obtained for the 13Cε1 nucleus are not sensitive to changes in the form of the imidazole ring and, therefore, we confine our interest to those nuclei that are sensitive to such changes, namely 13Cδ2 and 13Cγ.

Fig. 4

Figure adapted from Vila et al., 2011 (with permission of PNAS)

Bar diagram of the average σo shielding values computed for each carbon of the imidazole ring of His for each of the two tautomers: Nδ1-H, Nε2-H, and for the H+ form. The values were averaged over ~35,000 conformations of histidine in the model tripeptide Ac-GHG-NMe. Grey, black and white colors indicate the results obtained for the 13Cγ, 13Cδ2 and 13Cε1 nuclei, respectively.

Use of first-order shielding differences for a pair of selected nuclei, 13Cδ2 and 13Cγ, rather than chemical shifts, is a very convenient approach because the experimental referencing problem may be a source of errors [99]. Consequently, we define the first-order shielding difference, Δξ, as \( \Delta^{\xi} = |\sigma_{\delta 2}^{o} - \sigma_{\gamma}^{o}|^{\xi} \), with ξ denoting the form of the imidazole ring, and with \( \sigma_{\delta 2}^{o} \) and \( \sigma_{\gamma}^{o} \) being the computed mean values of the shielding distribution for the 13Cδ2 and 13Cγ nuclei, respectively. The following convention is adopted: ξ = δ, ε, or +, to designate the Nδ1-H, Nε2-H or the H+ form, respectively.

Analysis of the first-order shielding differences indicates that the following inequality holds: \( \Delta^{\varepsilon} > \Delta^{+} > \Delta^{\delta} \), with \( \Delta^{\delta} \sim 0 \). Therefore, once the fraction of the protonated H+ form, \( f^{+} = \langle \rho \rangle \), computed with Eq. (4), and \( \Delta^{obs} = |{}^{13}C^{\delta 2} - {}^{13}C^{\gamma}| \), with 13Cδ2 and 13Cγ being the observed chemical shifts in solution, at a given pH, are known, the fraction of the Nε2-H tautomer (\( f^{\varepsilon} \)) can be obtained assuming: (a) that all forms are in fast exchange on the NMR chemical shift time-scale [79], i.e., \( \Delta^{obs} = f^{\varepsilon} \Delta^{\varepsilon} + f^{+} \Delta^{+} + f^{\delta} \Delta^{\delta} \); and (b) that \( \Delta^{\delta} \equiv 0 \).

Using these assumptions, together with some physical constraints, enables us to find an analytical expression with which to compute \( f^{\varepsilon} \), namely: \( f^{\varepsilon} = \frac{\Delta^{obs} (1 - \langle \rho \rangle)}{\Delta^{\varepsilon}} \), with \( \Delta^{\varepsilon} \) the single-valued first-order shielding difference computed for the Nε2-H tautomer (\( \Delta^{\varepsilon} \sim 31 \) ppm). The fraction of the Nδ1-H tautomer, \( f^{\delta} \), is obtained straightforwardly as: \( f^{\delta} = 1 - \langle \rho \rangle - f^{\varepsilon} \).
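As a worked illustration, the two expressions for the tautomeric fractions can be evaluated with a few lines of code. This is only a sketch: the function name `his_fractions` and the input values are hypothetical, and the clamping of the Nε2-H fraction to the interval [0, 1 − ⟨ρ⟩] is an assumption made here so that no fraction becomes negative; the physical constraints used in [43] are not specified in the text.

```python
def his_fractions(delta_obs, rho, delta_eps=31.0):
    """Return (f_eps, f_delta, f_plus) for one His residue.

    delta_obs : observed |13C(delta2) - 13C(gamma)| difference, in ppm
    rho       : fraction of the protonated H+ form, <rho>
    delta_eps : first-order shielding difference of the N(eps2)-H tautomer,
                ~31 ppm according to the text
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie between 0 and 1")
    f_eps = delta_obs * (1.0 - rho) / delta_eps
    # Assumption: keep the result physically meaningful (not from the source).
    f_eps = min(max(f_eps, 0.0), 1.0 - rho)
    f_delta = 1.0 - rho - f_eps
    return f_eps, f_delta, rho


# Hypothetical inputs: a 15.5 ppm observed difference and a half-protonated ring.
f_eps, f_delta, f_plus = his_fractions(delta_obs=15.5, rho=0.5)
```

With these illustrative inputs the three fractions are 0.25, 0.25 and 0.5, and they sum to one by construction.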

The above formulation was used to determine the tautomeric forms of His for each of 8 selected proteins for which both the structure and the 13Cδ2 and 13Cγ chemical shifts of the imidazole ring of His, are available. In each of these applications the average degree of protonation < ρ > for all ionizable residues was computed by using Eq. (4). The tautomeric forms of His are determined by using the expressions for f δ and f ε given above [43]. Likewise, using the observed values, Δobs, obtained from solid-state NMR for unblocked dipeptides, with the sequence His-Leu, His-Met, Gly-His, Leu-His, His-Ala, His-Glu, Ala-His and His-Asp [99], we also determined the tautomeric fractions of the imidazole ring of His for each of these 8 compounds.

Results obtained from the 8 proteins indicate that the protonated form is the most populated one, while the distribution of the tautomeric forms for the imidazole ring varies significantly among different histidine residues in the same protein (see Fig. 5a). Thus, His226 and His250 show a comparable degree of protonation, < ρ >, although their tautomeric distributions are very different (see Fig. 5a), which shows the importance of the environment of the histidines in determining the tautomeric forms. Let us explain the origin of this observation. On one hand, the Nδ1 nucleus of H250 is located only 2.9 Å from the carbonyl backbone oxygen of S248 (see Fig. 5b), presumably forming a hydrogen bond (green dots in Fig. 5b), while the Nε2 nucleus is exposed to the solvent but the imidazole ring is surrounded by fully protonated R264 and R266 (data not shown), which lowers the probability that a proton binds to Nε2, in good agreement with the computed tautomeric distribution for H250 in Fig. 5a. On the other hand, the Nε2 nucleus of the imidazole ring of H226 is at 3.3 Å from a backbone carbonyl oxygen of W246 (see Fig. 5c), while the Nδ1 is at 3.1 Å from a backbone amino group of H226 (see Fig. 5c). As a result, a preference of the Nε2-H over the Nδ1-H tautomeric form for H226 is expected, in agreement with the computed tautomeric fractions for this residue in Fig. 5a.

Fig. 5

Figure a adapted from Vila and Arnautova [43] (with permission of PNAS)

(a) Fraction of His form distribution for 3 out of 6 His residues in protein PDB id 1E1A, for which the chemical shifts were determined in solution at pH 6.5. Blue and green bars represent the fraction of the Nε2-H and Nδ1-H tautomers, respectively, and the red bars represent the fraction of the protonated form, H+. The dotted horizontal line indicates the fraction of the H+ form that a free His residue would have in solution at pH 6.5; (b) Ball and stick representation of H250 in protein 1E1A. The grey, blue and red colors designate carbon, nitrogen and oxygen atoms, respectively. The background shows a ribbon diagram of part of protein 1E1A. The Nδ1 nucleus of H250 is located at only 2.9 Å from the carbonyl backbone oxygen of S248, presumably forming a hydrogen bond (indicated by green dotted line); (c) Same as (b) for H226. All displayed distances are in Angstroms.

In addition, our results show that for ~70% of the neutral histidine-containing dipeptides the method leads to fairly good agreement between the calculated and the experimental tautomeric form. Co-existence of different tautomeric forms in the same crystal structure may explain the disagreement obtained for the remaining 30% of dipeptides.

4.2 Protein Structure Determination

In this section we illustrate, with two examples, how the structural information encoded in the 13Cα chemical shifts can be used to determine an ensemble of conformations, provided that a set of NOE-derived distance constraints is available. However, since the chemical shifts are sensitive to the dynamics of a protein on the microsecond time scale [88], the question of whether a single conformation, rather than an ensemble of conformations, is a better representation of the NMR observables, such as the chemical shifts, must be investigated first.

4.2.1 The Crystallographer's Dilemma: A Single Structure or an Ensemble of Conformations?

In protein crystallography it is conventional to represent the conformation of a protein by a single structure, although proteins are very flexible in solution. Hence, the question of whether a single structure, rather than an ensemble of conformations, is a more accurate representation of the observed 13Cα chemical shifts in solution deserves to be investigated.

Proteins in solution are flexible molecules which exhibit anisotropic motion and exist as a dynamic ensemble of conformations. Although protein flexibility in the crystalline state is reduced (compared to solution) as a result of crystal packing, some dynamics and heterogeneity still remain [119, 120] because of the high solvent content in most protein crystals [104]. Despite this, protein structures solved by X-ray diffraction are traditionally represented by a single conformation. Crystallographic temperature (B) factors, which contain information about atomic displacements arising from the combined effects of dynamic, static and lattice disorders within the crystal lattice, provide an important indication of protein motions in the crystalline state.

Consequently, consideration of an ensemble of protein conformations generated by using B-factor values as a guide may potentially improve the agreement between the NMR- and X-ray-derived protein models in terms of some NMR observables, such as 13Cα chemical shifts. To explore such a possibility we selected ubiquitin, a 76-residue α/β protein. The structure of this protein was solved by X-ray (PDB id 1UBQ [83]) and NMR (PDB id 1D3Z [54]) methods, with the latter providing the available 13Cα chemical shifts.

Since the deposited PDB structure 1UBQ was solved and refined by using software and force-field parameters different from those employed in our method, a new set of conformations was generated using MCM and rigid geometry, starting from the corresponding regularized experimental X-ray structure (1UBQregular). During the MCM search, variations of the (ϕ, ψ, χ) torsional angles were allowed for all the residues in the sequence. The reported B-factors for 1UBQ were used to estimate the upper limit of the torsional angle variation adopted (±10°). The generated set of conformations was subjected to several rounds of refinement using a standard procedure in X-ray crystallography, i.e., the Crystallography and NMR System (CNS) program [51, 52]. As a result, 5 conformations were selected.

All 5 generated models are quite different among themselves and from the corresponding starting structure, with an all-atom rmsd of 0.36–1.13 Å. Moreover, for all 5 models, no residues were in disallowed regions of the Ramachandran plot [8] and all unfavorable contacts occur between the atoms from the last five residues in the sequence, which were not visible in the electron-density map. In addition, the R and Rfree factors of the 5 models are equivalent to or better than those obtained for a Simulated Annealing Refined (SAR) structure of PDB 1UBQ. This refinement of the deposited 1UBQ structure, which yields the structure named here SAR, is a necessary step for a consistent comparison between the chemical shifts of the generated 5 models and the PDB structure, because 13C chemical shifts are very sensitive to small differences in bond lengths and bond angles [16].

Figure 6 shows the rmsd values between the observed and computed 13Cα chemical shifts obtained for each of the 5 new models (light-grey bars) and the SAR structure (black-filled bar). The ca-rmsd, computed from the ensemble of 5 new models, is shown as a horizontal solid line in Fig. 6. The ca-rmsd (2.36 ppm) is lower than the value for the SAR structure (2.74 ppm) or for any of the new models. These results obtained for ubiquitin demonstrate that consideration of an ensemble of 5 conformations, derived from the regularized experimental X-ray (1UBQregular) structure, leads to better agreement with the observed 13Cα chemical shifts than does a single conformation (the SAR structure).

Fig. 6

Figure adapted from [41] (with permission of the International Union of Crystallography)

Bar diagram of the rmsd (ppm) between the computed and observed 13Cα chemical shifts of ubiquitin. Black-filled bar (2.74 ppm) represents the results from the SAR structure. Grey-filled bars represent the rmsd for each of the generated 5 new models; the horizontal black line represents the ca-rmsd (2.36 ppm) computed from the ensemble of 5 new models.

The above conclusion is in line with the crystallographers' suggestion that “…a more suitable representation of a macromolecular crystal structure would be an ensemble of models...” [121]. Analysis of NMR-determined ensembles of conformations also led to a similar conclusion, i.e., use of the ca-rmsd value led to closer agreement with the observed 13Cα chemical shifts in solution than when the individual, or the mean, rmsd is used [33]. In other words, proteins in solution are conformationally labile, as indicated by both the ca-rmsd and the theoretical minimal-rmsd model analyses, and this must be taken into account to predict the 13Cα chemical shifts most accurately.
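The difference between the per-model rmsd and the ca-rmsd can be made concrete with a small sketch. It assumes that the ca-rmsd is the rmsd obtained after replacing the computed shift of each residue by its simple arithmetic mean over the ensemble; any weighting of conformers used in the original work is not reproduced here. The function names and the three-residue data set are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math


def rmsd(computed, observed):
    """Root-mean-square deviation (ppm) between computed and observed shifts."""
    n = len(observed)
    return math.sqrt(sum((c - o) ** 2 for c, o in zip(computed, observed)) / n)


def ca_rmsd(ensemble, observed):
    """rmsd of the ensemble-averaged computed shifts (assumed equal weights)."""
    n_models = len(ensemble)
    averaged = [sum(model[i] for model in ensemble) / n_models
                for i in range(len(observed))]
    return rmsd(averaged, observed)


# Hypothetical observed 13C-alpha shifts (ppm) for three residues.
observed = [56.0, 60.0, 52.0]

# Two hypothetical models whose errors have opposite signs at every residue.
ensemble = [
    [57.0, 59.0, 53.0],
    [55.0, 61.0, 51.0],
]

per_model = [rmsd(model, observed) for model in ensemble]  # 1.0 ppm each
averaged_value = ca_rmsd(ensemble, observed)               # 0.0 ppm
```

In this constructed example each model deviates by 1 ppm at every residue, yet the averaged shifts match the observed ones exactly, which illustrates how a ca-rmsd can fall below the rmsd of every individual conformer when the errors of different conformers partially cancel.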

4.2.2 Determination of β-Sheet Structures

Evidence obtained from the probability-based secondary structure identification method of Wang and Jardetzky [122] suggests that the reliability with which an α-helix can be distinguished from a statistical coil based on chemical shift information follows, for the heavy nuclei only, the ranking: 13Cα > 13C′ > 13Cβ > 15N, whereas a different trend (13Cβ > 13Cα ~ 13C′ ~ 15N) was found for the corresponding reliability with which a β-strand conformation can be distinguished from a statistical coil. This trend raises the question of whether a mainly 13Cα-driven methodology can be used to predict predominantly β-sheet structures and, if so, how good the corresponding 13Cβ chemical shift predictions would be.

To answer this question, our recently-introduced physics-based protocol (see Fig. 1) was applied to determine the structure of a 20-residue peptide capable of forming a three-stranded antiparallel β-sheet in aqueous solution, i.e., the BS2 peptide with the sequence: TWIQNDPGTKWYQNDPGTKIYT, for which both a complete set of 13Cα chemical shifts and a reduced number of NOEs were reported. The experimental structure determination of small proteins and peptides, which are able to fold as monomers and do not contain disulfide bonds, is very valuable because such determinations can provide important information for force-field development and evaluation or improvement of search algorithms aimed at an efficient exploration of the conformational space [123–126].

The results obtained indicate that an accurate all β-sheet structure can be determined by simply identifying a set of conformations which simultaneously satisfy a set of constraints including 13Cα-dynamically-derived torsional angle constraints for all amino acid residues in the sequence and a fixed set of NOE-derived distance constraints [29]. Among the thousands of conformations generated by the VTF approach, i.e., during step 1 of the protein structure determination protocol shown in Fig. 1, 25 (see Fig. 7a) were selected by using a clustering procedure. This small set of conformations was used to determine the theoretical minimal-rmsd model that provides us with a set of ϕ, ψ, and χ torsional angle constraints for all the residues in the sequence, not just for those in α-helix or β-sheet regions. Using this set of torsional angle constraints (ϕ, ψ, χ), combined with different numbers of NOE-derived constraints, 2 sets of conformations of the BS2 peptide were determined after step 2 of the protocol. One set of 20 conformations (shown in Fig. 7b) was obtained by using 118 NOE-derived distance constraints, while the other set of 10 conformations (shown in Fig. 7c) was obtained by using 130 NOE-derived distance constraints. Regardless of the number of NOE-derived distance constraints used, addition of the 13Cα-derived torsional constraints led to noticeably lower ca-rmsd values (2.2 and 3.5 ppm, for the sets of 20 and 10 conformations, respectively) compared to the 20 models obtained by Santiveri et al. [127], who used a full set of 130 NOE-derived distance constraints but no 13Cα chemical shift information (4.6 ppm). In line with this finding, graphical inspection of the results shown in Fig. 7b–c also indicated that use of 13Cα-derived torsional constraints led to sets of conformations with less side-chain torsional angle spreading, as can be seen from comparison of Fig. 7b and c against Fig. 7d, with the latter obtained by Santiveri et al. [127].
In addition, the correlation coefficient, R, between the observed and computed 13Cβ chemical shifts was somewhat better for the two sets obtained using the 13Cα-based determination protocol (shown in Fig. 1). Thus, R is 0.99 and 0.98 for the 20- and 10-conformation sets, respectively, while R is 0.97 for the set of conformations derived by Santiveri et al. [127].

Fig. 7

(a) Superposition of 25 NMR-derived conformations of BS2 peptide (represented by ribbon diagrams) obtained in Step 1 after the VTF procedure (see Flow-chart in Fig. 1); (b) Superposition of 20 NMR-derived conformations of BS2 obtained after the conformational search in Step 2 (see Flow-chart in Fig. 1); 118 out of 130 NOE-derived distance constraints were used; (c) Same as (b) for 10 NMR-derived conformations; 130 NOE-derived distance constraints were used; (d) Superposition of 20 NMR-derived conformations obtained by Santiveri et al. [127] using traditional methods

Overall, analysis of the ca-rmsd, the NOE-derived distance violations, the 13Cβ chemical shifts, and some stereochemical quality factors for these sets, as a measure of the closeness with which the calculations reproduce the structure in solution, indicates that our self-consistent physics-based method is able to produce a more accurate set of conformations (shown in Fig. 7b and c) than that obtained with the traditional methods [127] (shown in Fig. 7d). Our results also suggest that for a flexible molecule in solution, like BS2, it may not be possible to determine a single structure that would satisfy all the constraints simultaneously. This is a consequence of the well-known fact that NMR parameters, such as the observed NOE-derived distances and the 13Cα chemical shifts, correspond to a dynamic ensemble of conformations and, therefore, may not be reproduced exactly by a limited set of static structures [44, 128].

Characterization of the structural flexibility of molecules in solution is of fundamental importance for the study of biological function, stability and folding [129, 130]. Therefore, additional analysis of the per-residue average 13Cα conformational shifts was carried out and the results indicated that the third, C-terminal, strand in the β-sheet of the BS2 peptide is the most flexible strand, although less flexible than the turns. In addition, 20 ns molecular dynamics (MD) simulations using the AMBER 8.0 package [131] were performed. The MD runs yielded a plausible atomic description of the motion of the BS2 peptide in solution, as revealed by both the pattern of hydrogen bonds and the generalized Lindemann parameter [132]. The MD results were in line with the per-residue average 13Cα conformational shifts analysis, providing additional evidence of greater flexibility of the C-terminal strand.

The fact that the observed 13Cα chemical shifts, supplemented only by NOE-derived distance constraints, provide accurate information for validation and refinement of protein structures, as well as site-specific information about the flexibility of a molecule in solution, may be very useful for NMR spectroscopists and theoreticians interested in analysis of the stability and protein-folding mechanism.

4.2.3 A Blind Test to Determine an α-Helical Structure

The solution NMR structures of both the full-length (residues 1–77) and truncated (residues 1–46) forms of the YnzC protein (PDB id 2JVD) from Bacillus subtilis [133], which is part of the small yneA SOS response operon that regulates cell division in this organism [134], have been determined recently [135]. The corresponding X-ray crystal structure (PDB id 3BHP) was solved by Kuzin et al. [133] at 2.0 Å resolution. The unique two-helix monomeric structure of YnzC, with no disulfide bonds, makes it an attractive subject for testing our physics-based methodology for protein structure determination.

The goal of this application is two-fold. First, as a blind test, we attempted to determine whether it is possible to obtain an ensemble of conformations for which each individual conformer simultaneously satisfies the NOE-derived distance constraints and the 13Cα-derived torsional constraints for the YnzC protein in solution [136]. Although the solution NMR structure [135] of this protein had been solved at the time of this blind test, the only information provided was a full set of both the observed 13Cα chemical shifts and the NOE-derived distance constraints. In particular, no information about the coordinates of the solved structures of the YnzC protein [135] or the heteronuclear 15N-1H NOE data was provided at the moment of the test.

Our second goal was to carry out a cross-validation test of high-quality sets of conformations obtained for the YnzC protein in solution by using alternative determination methods, namely, the solution NMR set of conformations (PDB id 2JVD) obtained by using NOE-derived distance constraints, dihedral-angle constraints and hydrogen-bond constraints [135], and the 2.0-Å X-ray crystal structure (PDB id 3BHP) (Kuzin et al. [133]). For this second goal, several validation scores were used [136], including: (i) Recall, Precision, F-measure (RPF) analysis [6]; (ii) several global quality score indicators provided by Verify3D [10], ProsaII [137], Procheck [8], and MolProbity [5]; (iii) the ca-rmsd and rmsd between observed 13Cα chemical shifts and those computed at the DFT level; and (iv) the backbone rmsd between these refined structures and the mathematical average coordinates of the ensemble of NMR structures of YnzC(1–48) deposited in the PDB.

By carrying out a blind test we demonstrated [136] that an accurate all α-helical set of protein structures can be determined by simply identifying conformations which simultaneously satisfy a set of constraints, including 13Cα-dynamically-derived torsional angle constraints for all amino acid residues in the sequence and a fixed set of 1022 NOE-derived distance constraints. The protein structure determination was carried out as follows. After generation of thousands of conformations using the VTF procedure (step 1), 10 of them, shown in Fig. 8b, were selected, i.e., those possessing a maximum NOE-derived distance violation lower than some fixed cutoff value. Only one of these 10 conformations was then selected and used as the starting point of a conformational search carried out with two types of constraints: the original fixed, limited NOE-derived distance constraints and the set of ϕ, ψ, χ torsional angles derived from step 1. The resulting new set of 10 conformations is shown in Fig. 8c. Repetition of step 2, with a tighter tolerance range for the torsional angle constraints than in the previous iteration, enabled us to determine the final set of 10 conformations shown in Fig. 8d, i.e., the so-called Set-NOE-CS.
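The selection criterion used at the end of step 1 can be sketched as a simple filter on the largest NOE-derived distance violation. The data layout (a dictionary of atom coordinates per conformation and a list of upper-bound constraints), the atom labels, the function names and the 0.5 Å cutoff are all assumptions made for illustration; the actual cutoff value and data structures of the protocol are not given in the text.

```python
import math


def max_noe_violation(coords, constraints):
    """Largest amount (in Å) by which any upper-bound constraint is exceeded.

    coords      : dict mapping an atom label to its (x, y, z) position in Å
    constraints : list of (atom_a, atom_b, upper_bound) tuples
    """
    worst = 0.0
    for atom_a, atom_b, upper_bound in constraints:
        distance = math.dist(coords[atom_a], coords[atom_b])
        worst = max(worst, distance - upper_bound)
    return worst


def select_conformations(conformations, constraints, cutoff):
    """Keep the conformations whose maximum violation is below the cutoff."""
    return [c for c in conformations
            if max_noe_violation(c, constraints) < cutoff]


# Hypothetical two-atom conformations and a single 5 Å upper-bound constraint.
conf_ok = {"HA_5": (0.0, 0.0, 0.0), "HN_9": (3.0, 0.0, 0.0)}
conf_bad = {"HA_5": (0.0, 0.0, 0.0), "HN_9": (6.0, 0.0, 0.0)}
constraints = [("HA_5", "HN_9", 5.0)]

selected = select_conformations([conf_ok, conf_bad], constraints, cutoff=0.5)
```

Here the second conformation exceeds the upper bound by 1 Å and is discarded, while the first one satisfies the constraint and is kept.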

Fig. 8

Figure a adapted from [136] (with permission of PNAS)

Results for the 77-residue YnzC protein from Bacillus subtilis. (a) Bar diagram indicating the rmsd (ppm) between the computed and observed 13Cα chemical shifts for each of the 10 conformations from Set-NOE-CS (red bars), for the 20 conformations from 2JVD (yellow bars), and for each of the three chains in the 2.0 Å crystal structure of YnzC protein, PDB id 3BHP, namely chain a, b and c (black, cyan and green bars). Black (1.54 ppm) and red (1.38 ppm) horizontal lines show the ca-rmsd values computed for the residues 1–46 of 2JVD and Set-NOE-CS, respectively; (b) Superposition of 10 NMR-derived conformations of YnzC (represented by ribbon diagrams) obtained after the VTF procedure, in Step 1 (see Flow-chart in Fig. 1); (c) Same as (b) after the conformational search in Step 2; (d) Same as (c) after repeating the conformational search in Step 2 (Set-NOE-CS), i.e., this time by using a new set of torsional angles (φ, ψ, χ) derived from the set of conformations shown in panel (c); (e) Superposition of 20 NMR-derived conformations (PDB id 2JVD) of YnzC protein obtained by Aramini et al. [135]; and (f) Graphic representation of the X-ray determined structure of YnzC protein (PDB id 3BHP); the asymmetric unit contains 3 similar, but not identical, copies of the YnzC protein molecule, namely chain a, b and c.

A comparative analysis of the rmsd between the computed and observed 13Cα chemical-shift values for the residues 1–46 is shown in Fig. 8a as a bar diagram for all three sets of conformations, viz., Set-NOE-CS (shown in Fig. 8d), 2JVD (shown in Fig. 8e) and the three chains of the X-ray structure 3BHP (shown in Fig. 8f). The results shown in Fig. 8a reveal that the two NMR-derived ensembles of structures (2JVD and Set-NOE-CS) are a better representation of the observed 13Cα chemical shifts in solution, in terms of the ca-rmsd (solid horizontal black and red lines in Fig. 8a), than any single conformer (red or yellow bars in Fig. 8a) or any single chain of the X-ray structure (black, cyan and green bars in Fig. 8a). This result is in line with previous calculations for 10 NMR-derived conformations (PDB id 1D3Z) and the X-ray structure (PDB id 1UBQ) of ubiquitin.
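
The following minimal Python sketch illustrates why an ensemble can fit the observed shifts better than any of its members. It assumes that the ca-rmsd is the rmsd between the observed shifts and the computed shifts averaged, residue by residue, over all conformers of the ensemble; the function names and the toy shift values are hypothetical and are not taken from the YnzC data.

```python
import math


def rmsd(computed, observed):
    """Root-mean-square deviation (ppm) between two equal-length shift lists."""
    if len(computed) != len(observed):
        raise ValueError("shift lists must have the same length")
    squared = sum((c - o) ** 2 for c, o in zip(computed, observed))
    return math.sqrt(squared / len(observed))


def ca_rmsd(ensemble, observed):
    """rmsd between observed shifts and the per-residue ensemble average."""
    n_conformers = len(ensemble)
    averaged = [
        sum(conformer[i] for conformer in ensemble) / n_conformers
        for i in range(len(observed))
    ]
    return rmsd(averaged, observed)


# Toy data (hypothetical): three residues, two conformers.
observed = [56.0, 58.0, 54.0]
ensemble = [
    [57.0, 57.0, 55.0],  # each shift is off by 1 ppm
    [55.0, 59.0, 53.0],  # each shift is off by 1 ppm in the opposite direction
]

single_conformer_rmsds = [rmsd(conformer, observed) for conformer in ensemble]
ensemble_value = ca_rmsd(ensemble, observed)
print(single_conformer_rmsds, ensemble_value)
```

In this contrived case each conformer deviates by 1 ppm at every residue, yet the deviations cancel on averaging, so the ca-rmsd of the ensemble is lower than the rmsd of either conformer.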

Since the ca-rmsd analysis might be biased by the fact that the 10 conformations of Set-NOE-CS were computed using a 13Cα-based method while the others were not, a cross-validation quality test was also carried out. These structures consistently show good values for the RPF and DP-scores as well as for global structure quality indicators. This analysis reveals that all three sets of structures analyzed here display very good agreement with the experimental NOE data, as well as dihedral angle distributions and atomic clash scores typical of good quality protein structures. Taken together, these results indicate that the 20 conformations from the 2JVD set, the DFT-computed 10 conformations from Set-NOE-CS, and each of the three chains of the X-ray structure are highly accurate sets of conformations which represent the YnzC protein in solution.

4.3 Protein Structure Validation

The PDB is the most important archive of experimental protein structures solved by X-ray crystallography and NMR spectroscopy. The large number of structures deposited in the PDB constitutes an extraordinary source of information that has been, and continues to be, used for a wide range of applications in structural drug design, molecular modeling, force-field parameterization, molecular biology, etc. Some deposited protein structures, showing a few or a large number of flaws, are formally withdrawn from the database and, hence, considered obsolete, even though their coordinates remain available in the PDB. In most cases, a successor (or superseding) structure replaces the obsolete one. The large number of obsolete structures indicates that the development of accurate validation protocols remains an important task.

4.3.1 A Chemical-Shift-Based Server

An ideal validation method should meet two requirements. First, it should be strong rather than weak. A validation method is considered ‘strong’ if it is able to assess how well a structure, or an ensemble of structures, predicts experimental data not used in the structure-determination process; otherwise it should be considered ‘weak’, since it is limited to reproducing the observed experimental data used in the determination of the protein models [138]. Second, it should be able to detect the existence of structural flaws quickly and accurately at the residue level. With these goals in mind, a new server (CheShift) has been developed recently to predict 13Cα chemical shifts of protein structures. It is based on a database of chemical shifts computed for 696,916 conformations as a function of the ϕ, ψ, ω, χ1 and χ2 torsional angles for all 20 naturally occurring amino acids. The 13Cα chemical shifts were computed at the DFT level of theory using the methodology described in Sect. 2.1. Because of the large number of conformations, the computed shielding values were obtained using a small basis set (6-31G/3-21G) and later extrapolated to a large basis set [6-311+G(2d,p)/3-21G], as described in the Methods section.

An analysis of the accuracy and sensitivity of the CheShift predictions, in terms of the correlation coefficient R between the observed and predicted 13Cα chemical shifts, was carried out on 36 X-ray-derived protein structures solved at 2.3 Å, or better, resolution. Results indicate that for all the proteins the R values obtained using the CheShift, SHIFTX [24], SPARTA [25], SHIFTS [38, 39], and PROSHIFT [23] servers were comparable, although the CheShift values were systematically the lowest. This raises the following question: do these servers provide a more sensitive validation than CheShift? To answer this question we chose protein 1RGE, solved at 1.15 Å resolution [139]. The corresponding crystal structure of this protein contains two chemically identical but crystallographically independent molecules in the asymmetric unit, named here A and B [139]. The main structural difference between molecules A and B (with an all-heavy-atom rmsd of 1.1 Å) is due to differences in side-chain conformations, especially those occupying different rotameric states. For this test, which does not require a comparison with the observed 13Cα chemical shifts, we computed the correlation coefficient R between the 13Cα chemical-shift predictions obtained for molecules A and B, respectively, by using the five servers listed above. The results of this test give the following R values: 0.96, 1.00, 1.00, 0.98, and 1.00 for CheShift, SHIFTX, SPARTA, SHIFTS, and PROSHIFT, respectively. Except for CheShift (0.96) and SHIFTS (0.98), none of the servers is able to discriminate, beyond doubt, between molecules A and B. From a statistical point of view, the R values obtained from the SHIFTX (1.00), SPARTA (1.00) and PROSHIFT (1.00) servers indicate that molecules A and B are practically indistinguishable protein models. Therefore, a lower R value between the predicted and observed 13Cα chemical shifts does not necessarily mean poorer accuracy; it could instead mean higher sensitivity to subtle structural differences. This conclusion can be confirmed by a similar analysis carried out at a higher level of accuracy, for example, by using a larger basis set and the actual geometry of chains A and B, i.e., without the need for any torsional angle interpolations as with the CheShift server. In this case, the R value (0.93) computed with the larger basis set was significantly lower than the R value obtained with CheShift (0.96), or any other server, namely, 1.00, 1.00, 0.98, and 1.00 for SHIFTX, SPARTA, SHIFTS, and PROSHIFT, respectively.
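
The A-versus-B test can be illustrated with a minimal Python sketch that computes the Pearson correlation coefficient R between two sets of predicted shifts. The shift values below are hypothetical toy numbers, not 1RGE data or the output of any server, and the two "predictors" are only caricatures: one returns identical values for both molecules, the other responds to the side-chain differences.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy predictions (ppm) for five residues of molecule A.
shifts_a = [52.1, 58.4, 61.0, 55.3, 63.2]

# A predictor that is insensitive to the side-chain differences between
# A and B returns the same values for B.
shifts_b_insensitive = list(shifts_a)

# A predictor that is sensitive to those differences returns slightly
# different values for B.
shifts_b_sensitive = [52.9, 57.6, 61.8, 54.1, 63.9]

r_insensitive = pearson_r(shifts_a, shifts_b_insensitive)
r_sensitive = pearson_r(shifts_a, shifts_b_sensitive)
print(round(r_insensitive, 2), round(r_sensitive, 2))
```

An R value of 1.00 between the two sets of predictions therefore signals that the predictor does not distinguish the two molecules, whereas a lower value indicates that it does.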

So far, we have shown that the QM basis of the CheShift server enables us to predict the 13Cα chemical shifts with reasonable accuracy in seconds. Our results suggest that CheShift can provide a standard with which to evaluate the quality of protein structures solved by either X-ray crystallography or NMR-spectroscopy, if the experimentally observed 13Cα chemical shifts are available.

4.3.2 CheShift-2: A Picture Is Worth a Thousand Words

Differences between the observed and CheShift-predicted 13Cα chemical shifts can be used as a sensitive probe with which to detect possible local flaws in NMR-determined protein structures; hence, a graphical user interface has been added to the CheShift-2 server [49] to render such flaws easily visible. CheShift was originally developed to return a list of predicted 13Cα chemical-shift values, one for each amino acid in the sequence of a protein, except for the first and last residues [28, 33]. The validation process, i.e., the comparison between the predicted and the observed 13Cα chemical-shift values, is left to the user of the server, who can use the provided information to determine the quality of the NMR structure as a whole, e.g., by computing the ca-rmsd [33]. However, it is a highly desirable goal of any accurate validation method [11, 34] to identify the existence of local flaws in the sequence rather than only the global quality. Therefore, we added a graphical user interface (GUI) to the CheShift server. As a result, the validation process is facilitated by displaying the differences between the observed and computed 13Cα chemical shifts by means of a three-color code mapped onto a 3D protein model. This graphic validation method, far from being only an aesthetic improvement, enables users of CheShift-2 to detect local flaws in proteins on a per-residue basis quickly and accurately, without the need for the user to carry out the extensive DFT calculations on which the server is based.

The CheShift-2 server [49] makes use of the following sequential steps: (i) for each amino acid residue i the average difference between the observed and predicted 13Cα chemical shifts, Δi, is computed by using Eq. (2); (ii) the Δi value is smoothed by averaging it over the values of the two nearest-neighbor residues (<Δi>); (iii) the resulting nearest-neighbor averaged value, <Δi>, is discretized, i.e., it is assigned an integer value of 1, 0 or −1, depending on the magnitude of <Δi>; and (iv) these discrete values are mapped onto the 3D protein model and color coded as blue, white and red, respectively. This color-code assignment is based on the assumption that <Δi> values within ~1.7 ppm (blue) are considered small; those within ~3.4 ppm (white), medium; and those beyond 3.4 ppm (red), large. Differences corresponding to blue and white colors are considered acceptable, while the red color indicates possible flaws in the structure. In addition, the yellow color was adopted to specify the absence of observed or computed 13Cα chemical shifts [49].
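
Steps (i) to (iv) can be sketched in Python as follows. This is an illustration under stated assumptions, not the server's implementation: Eq. (2) is not reproduced in this section, so Δi is assumed here to be the absolute difference between the observed shift and the predicted shift averaged over all models; the smoothing is assumed to be the mean over residue i and its two immediate neighbors, with missing neighbors skipped; all function names and shift values are hypothetical. Only the 1.7 and 3.4 ppm thresholds and the color scheme come from the text.

```python
SMALL_PPM = 1.7   # upper limit for "small" differences (blue)
MEDIUM_PPM = 3.4  # upper limit for "medium" differences (white)
CODE_TO_COLOR = {1: "blue", 0: "white", -1: "red", None: "yellow"}


def deltas(observed, predicted_models):
    """Step (i): |observed - mean predicted| per residue; None if data are missing."""
    result = []
    for i, obs in enumerate(observed):
        preds = [model[i] for model in predicted_models if model[i] is not None]
        if obs is None or not preds:
            result.append(None)
        else:
            result.append(abs(obs - sum(preds) / len(preds)))
    return result


def smooth(values):
    """Step (ii): average each value with its available nearest neighbors."""
    result = []
    for i, value in enumerate(values):
        if value is None:
            result.append(None)
            continue
        window = [v for v in values[max(0, i - 1):i + 2] if v is not None]
        result.append(sum(window) / len(window))
    return result


def discretize(smoothed):
    """Step (iii): map a smoothed difference to 1, 0, -1 (or None if missing)."""
    if smoothed is None:
        return None
    if smoothed <= SMALL_PPM:
        return 1
    if smoothed <= MEDIUM_PPM:
        return 0
    return -1


# Toy data (hypothetical): six residues, one model, one missing observed shift.
observed = [55.0, 56.0, 58.0, 60.0, None, 52.0]
predicted_models = [[54.5, 55.5, 57.0, 54.0, 50.0, 47.0]]

codes = [discretize(v) for v in smooth(deltas(observed, predicted_models))]
colors = [CODE_TO_COLOR[c] for c in codes]  # step (iv)
print(colors)
```

Note how the smoothing spreads the large difference at the fourth residue onto its neighbor, which is colored white rather than blue although its own Δi is only 1.0 ppm.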

When more than one protein model exists, the averaged Δi values are computed considering all the deposited conformations, although the colored representation is illustrated by using only the first model. This situation is illustrated in Fig. 9 for the 20 NMR-determined conformations (see Fig. 9a) of a membrane-associated protein from Bacillus cereus, PDB id 2K5Q. The large dispersion of conformations in the loops and at the N- and C-termini shown in Fig. 9a, rather than being a poor representation of the protein, reflects the flexibility of these segments of the molecule in solution, as is clearly shown by the CheShift-2 validation of 2K5Q (see Fig. 9b).

Fig. 9

a Superposition of 20 NMR-derived conformations of a membrane-associated protein from Bacillus cereus, PDB id 2K5Q; b Protein 2K5Q colored according to CheShift-2. The BMRB accession number, from which the observed 13Cα chemical shifts were obtained, is 15846.

4.3.3 Global Versus Local Validation of Proteins

The NMR-determined ensembles of the dynein light chain 2A protein, PDB ids 1TGQ and 2B95, show different folds: 1TGQ (now obsolete) has a wrong fold, while 2B95 (which replaced 1TGQ in the PDB) shows the correct fold. This difference is a result of the oligomeric state assumed during the protein-structure determination, namely a monomer for 1TGQ and a homodimer for 2B95, as pointed out by Nabuurs et al. [11].

Validation of both protein ensembles, as a whole, shows that 2B95 is a slightly better representation of the observed 13Cα chemical shifts, in terms of the ca-rmsd [34], than 1TGQ, viz., ca-rmsd = 2.08 and 2.35 ppm for 2B95 and 1TGQ, respectively. However, the ca-rmsd difference between these two ensembles (~0.30 ppm) is not large enough to establish, unambiguously, that the 1TGQ ensemble needs further refinement. In fact, a similar difference in terms of rmsd, i.e., within a range of ~0.30 ppm, was found among 5 new models of the protein ubiquitin (see grey bars in Fig. 6), all of which fit the X-ray diffraction data with R and Rfree factors similar to those for the deposited X-ray structure, PDB id 1UBQ, solved at 1.8 Å resolution [41]. Certainly, these 5 new models can be considered to be of comparable structural quality. Consequently, variations in ca-rmsd of ~0.30 ppm cannot be used as a universal criterion to determine unequivocally whether a protein, such as 1TGQ, needs further refinement.

Analysis of the dynein light chain 2A protein illustrates that validation of a protein as a whole (global validation), e.g., with the ca-rmsd, may not enable us to determine unambiguously whether one protein model is of better quality than another model of the same protein, while validation on a per-residue basis (local validation), e.g., as with the CheShift-2 server, does (see Fig. 10). To further test the ability of the CheShift-2 server to detect small differences between protein models, a small set of 15 obsolete/successor pairs of proteins was also considered (see Supplementary Data of [49]). The results indicate that the CheShift-2 server constitutes a fast and accurate validation tool with which to determine, on a per-residue basis, the existence of local flaws in protein models, even for conformations that differ in small details, as for the obsolete and successor models of Membrane-bound Lytic Murein Transglycosylase D (fragment Lysm Domain) (see Fig. 11).

Fig. 10

Figure adapted from [49] (with permission of Oxford University Press)

Two models of the dynein light chain 2A protein: a 1TGQ (obsolete) and b 2B95 (successor). Both models are shown as ribbons and colored according to CheShift-2. The BMRB accession number, from which the observed 13Cα chemical shifts were obtained, is 6527.

Fig. 11

Figure adapted from [49] (with permission of Oxford University Press)

Two models of Membrane-bound Lytic Murein Transglycosylase D (fragment Lysm Domain): a PDB id 1E01 (obsolete) and b 1E0G (successor). The BMRB accession number, from which the observed 13Cα chemical shifts were obtained, is 4680.

In general, pairs of obsolete and successor proteins present in the PDB can be used as a benchmark set with which to test validation methods. These obsolete/successor pairs of proteins are very appealing because their members possess different topologies and numbers of residues, and complete sets of 13Cα chemical shifts are available for a large number of them from the Biological Magnetic Resonance Data Bank (BMRB) [117].

5 Conclusions and Future Directions

In this chapter we have illustrated how the information encoded in the 13C chemical shifts can be used for a variety of applications, ranging from protein structure prediction to the accurate detection of structural flaws, at the residue level, in NMR-determined protein models.

The ability to detect and accurately characterize the mobility of the surface side chains by computing 13Cα chemical shifts constitutes one of the strengths of the current methodology. Hence, we are planning to focus our research on the development of new physics-based algorithms for a fast and accurate determination and validation of side-chain conformations, with the goal of improving the quality of NMR-determined protein models. Since NMR spectroscopy provides chemical shifts for several other nuclei besides 13Cα, the feasibility of their DFT computation and the benefits of including the information encoded in these data in structure-determination protocols are currently under investigation in our group. In general, new developments in the field of NMR spectroscopy are needed in order to develop protocols for high-throughput NMR determination of high-quality protein structures in solution.