The NMR restraints grid at BMRB for 5,266 protein and nucleic acid PDB entries
Several pilot experiments have indicated that improvements in older NMR structures can be expected by applying modern software and new protocols (Nabuurs et al. in Proteins 55:483–486, 2004; Nederveen et al. in Proteins 59:662–672, 2005; Saccenti and Rosato in J Biomol NMR 40:251–261, 2008). A recent large-scale X-ray study has also shown that modern software can significantly improve the quality of X-ray structures that were deposited more than a few years ago (Joosten et al. in J Appl Crystallogr 42:376–384, 2009; Sanderson in Nature 459:1038–1039, 2009). Recalculation of three-dimensional coordinates requires that the original experimental data are available and complete, and are semantically and syntactically correct, or are at least correct enough to be reconstructed. For multiple reasons, including a lack of standards, the heterogeneity of the experimental data and the many NMR experiment types, it has not been practical to parse a large proportion of the originally deposited NMR experimental data files related to protein NMR structures. This has made the automatic recalculation, and thus improvement, of the three-dimensional coordinates of these structures impractical. We here describe a large-scale international collaborative effort to make all deposited experimental NMR data semantically and syntactically homogeneous, and thus useful for further research. A total of 4,014 out of 5,266 entries were ‘cleaned’ in this process. For 1,387 entries, human intervention was needed. Continuing efforts to automate the parsing of both old and newly deposited files are steadily decreasing this fraction. The cleaned data files are available from the NMR restraints grid at http://restraintsgrid.bmrb.wisc.edu.
Keywords: Biomolecular structure; BMRB; Restraints; Database; Nuclear magnetic resonance; PDB
Abbreviations and symbols
CCPN: Collaborative Computing Project for NMR
DB545: Set of 545 PDB entries
DB97: Set of 97 PDB entries
DOCR: Database of converted restraints
FRED: Filtered REstraints database
NRG: NMR restraints grid
PDB: Protein Data Bank
PDBx: Protein Data Bank eXchange dictionary
PDBe: Protein Data Bank in Europe
RDC: Residual dipolar coupling
Rms: Root mean square
wwPDB: Worldwide Protein Data Bank
The first macromolecular X-ray structure (myoglobin) was solved in 1958 (Kendrew 1958). Thirteen years later, in 1971, the PDB was launched as a central repository for these data (Protein Data Bank 1971; Berman 2007). The idea behind the PDB was a central data warehouse in which all structures would be deposited and from which researchers all over the world could freely access these valuable data. The first NMR-derived protein structures, BUSI IIa (Williamson et al. 1985) and the lac-headpiece (Kaptein et al. 1985), were published in 1985, and in 1988 the PDB accepted the first NMR structure ensemble (Driscoll et al. 1989). In the early nineties, most journals agreed that macromolecular structure data had to be deposited before the corresponding article could be published. The first X-ray reflection files were deposited in 1976 (PDB entry 155C), and X-ray reflection deposition became an obligatory aspect of the data deposition process in 2000 (Commission on Biological Macromolecules 2000). The first experimental NMR data, deposited in 1991, consisted almost exclusively of NOE distance and dihedral angle restraints.
Experimental NMR data files are considerably more complex than X-ray reflection files in terms of semantics and associated syntax. In addition, NMR data assigned to specific atoms can be highly valuable even in the absence of a three-dimensional structure. It was proposed that a data bank organized by NMR data experts be instituted to collect and archive such information (Ulrich et al. 1989). The BMRB was launched in 1991 and has evolved into the recognized worldwide database for experimental NMR data (Seavey et al. 1991; Ulrich et al. 2008). In 2006, BMRB became a member of the Worldwide Protein Data Bank (wwPDB). The wwPDB Advisory Committee recommended in 2007 that depositions of NMR structures should be accompanied by structural restraints, and in 2008 it additionally recommended depositing the assigned chemical shifts. The deposition of structural restraints became mandatory on February 1, 2008 (Markley et al. 2008), and the mandatory deposition of chemical shifts will be announced in 2009. By tradition, the coordinates of NMR structures, along with the raw restraints underlying the structures, have been deposited in the PDB, and the assigned chemical shifts and other experimental data have been deposited in the BMRB. Upon becoming a member of the wwPDB, the BMRB, along with the European branch of the PDB (PDBe), assumed the task of curating the structural restraint data and recruited collaborators for this effort.
Experimental NMR data are highly heterogeneous, and both the data types in use and the value attached to each of them change from year to year as the NMR research field develops. Although NOE distance restraints were the basis for the first NMR structures, currently a wide range of experimental data are used: coupling constants, chemical shifts, residual dipolar couplings, cross hydrogen bond couplings, and paramagnetic relaxation effects. As a consequence of this evolution, and owing to the lack of ontologies or common practices, the deposited data are now hard to parse with a single computer program. Additionally, the lack of data validation possibilities in the early years of NMR allowed a massive number of errors in the deposited restraints to slip into the database. The concept of how best to represent NMR-derived structures has also evolved over the years. An initial idea, starting around 1986, held that averaging an NMR ensemble into a single structure would lead to a useful single molecular representation. However, following the introduction of validation software, such as PROCHECK and WHAT_CHECK, it was found that averaged structures often have extensive problems (Clore et al. 1986; Hooft et al. 1996; Laskowski et al. 1993; Nilges et al. 1988). Now, most structures are characterized by a family of conformers that represent both the inherent dynamics of the structure and the lack of structural restraints.
In light of these facts, we decided to take a three-step approach toward remediating all experimental NMR data files. In the first step (parsing), we ensure that the data are syntactically correct. In the second step (conversion), we ensure that restraints belong to atoms that exist. In the final step (filtering), we enforce semantic correctness, which includes at least some possibility of proximity for atoms that have been syntactically connected by an NOE. The results of the second step have been stored in the Database Of Converted Restraints (DOCR), while the results of the third step have been stored in the Filtered REstraints Database (FRED). DOCR and FRED are freely available from the NMR restraints grid (NRG) at http://restraintsgrid.bmrb.wisc.edu. The initial version of the NRG included data from only 97 PDB entries (a database named “DB97”) (Doreleijers et al. 1998); in 2003 we had 545 entries (Doreleijers et al. 2003), and the previous version of the NRG included data from 1,400 entries (Doreleijers et al. 2005). Here we present the completion of the effort to include all 5,266 entries.
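The three steps can be pictured as a small pipeline. The Python sketch below is illustrative only: the one-line restraint format, the function names `parse_restraints`, `convert` and `filter_restraints`, and the use of a single coordinate model with a 2 Å tolerance are assumptions made for this example, and are not taken from the software used in this project.

```python
from math import dist  # Python 3.8+


def parse_restraints(text):
    """Step 1 (parsing): keep only syntactically valid lines.

    Assumed line format: 'resnum atom resnum atom upper_bound'.
    """
    restraints = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # wrong number of fields: not a restraint in this sketch
        try:
            atom_a = (int(fields[0]), fields[1])
            atom_b = (int(fields[2]), fields[3])
            upper = float(fields[4])
        except ValueError:
            continue  # non-numeric residue number or bound
        restraints.append((atom_a, atom_b, upper))
    return restraints


def convert(restraints, coords):
    """Step 2 (conversion): keep restraints whose atoms both exist."""
    return [r for r in restraints if r[0] in coords and r[1] in coords]


def filter_restraints(restraints, coords, tolerance=2.0):
    """Step 3 (filtering): drop restraints whose atoms are implausibly far apart.

    The tolerance (in angstrom) is an assumption for this example.
    """
    return [
        r for r in restraints
        if dist(coords[r[0]], coords[r[1]]) <= r[2] + tolerance
    ]


# Toy coordinates (angstrom) keyed by (residue number, atom name).
coords = {
    (1, "HA"): (0.0, 0.0, 0.0),
    (2, "HN"): (3.0, 0.0, 0.0),
    (5, "HB2"): (20.0, 0.0, 0.0),
}
raw = "1 HA 2 HN 4.0\n1 HA 5 HB2 5.0\n1 HA 9 HN 5.0\nnot a restraint\n"

parsed = parse_restraints(raw)            # syntactically valid lines
linked = convert(parsed, coords)          # atoms found in the coordinates
kept = filter_restraints(linked, coords)  # plausible given the coordinates
```

In this toy input, one line fails parsing, one restraint refers to an atom that is absent from the coordinates, and one connects atoms that are far too distant, so each step removes exactly one item.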
Results from these remediation efforts, in NMR-STAR, CCPN, CYANA, and CNS data formats, are available from the DOCR and FRED databases in the NRG (Fig. 1). The vast majority of restraints (those from distance, dihedral angle and RDC measurements) are processed; those based on other types of information are not processed, because they have proved much more difficult to parse. Entries that could not be fully processed because of a variety of issues are tracked on the Google Code web site in a constantly updated spreadsheet: http://code.google.com/p/nmrrestrntsgrid/source/browse/trunk/nmrrestrntsgrid/data/problemEntryList.csv?r=161. At the time of writing (revision 161), 221 entries were linked to 14 issues. The most common issue by far (issue 25), which is active for 154 entries, arises from incomplete parsing of AMBER data by the Wattos software. This issue leads to incomplete conversion of parsed restraints to the NMR-STAR format, with the consequence that restraints could not be linked to the coordinate data. The authors of this paper are continuing to resolve these issues, and, as a result, the list of problematic entries is highly dynamic.
Conversion and data linking
The FormatConverter software (Vranken 2007; Vranken et al. 2005) imports an NMR-STAR file into the CCPN framework (Fogh et al. 2005) and subsequently links the restraint information to the coordinate data. Although the number of entries increased by nearly a factor of ten, from the 545 monomeric proteins entered in DOCR and FRED (Doreleijers et al. 2005) to the current 5,266, the number of entries (1,387) that needed a manual setting for the linking only increased by a factor of about two. Two corrections were commonly required: (a) sequence matching for proteins that contained one or more coordinated metals such as zinc or cadmium, and (b) atom name matching such as H2′/H2′′/HO2′ and thymine methyl H7s for nucleic acids. Improvements to the automated part of the workflow included: (i) better automatic matching between the atom information from the experimental data file and the molecular system description from the mmCIF file, both by code improvements and by better reference data, and (ii) more informative output about the conversion process for quicker manual curation (if required). In addition, many smaller fixes were made in the code, leading to a more dependable and consistent outcome of the conversion step. The code to export NMR-STAR files was completely rewritten to produce valid and complete version 3.1 files.
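Atom name matching of the kind mentioned under (b) can be sketched as a lookup with a small alias table. The Python example below is a simplified illustration and not the FormatConverter implementation: the function `match_atom_name` and the table `ILLUSTRATIVE_ALIASES` are hypothetical, and the aliases shown (older asterisk-style sugar names and `H5M`-style thymine methyl names) are assumed examples of alternative nomenclature rather than the reference data used in this project.

```python
# Hypothetical alias table (assumption for this example): alternative atom
# names mapped to the names expected in the coordinate file.
ILLUSTRATIVE_ALIASES = {
    "1H2*": "H2'",
    "2H2*": "H2''",
    "2HO*": "HO2'",
    "1H5M": "H71",
    "2H5M": "H72",
    "3H5M": "H73",
}


def match_atom_name(name, residue_atoms, aliases=ILLUSTRATIVE_ALIASES):
    """Return the coordinate-file atom name that matches `name`, or None.

    Tries, in order: the name as given, the name with '*' written as a
    prime, and any alias listed for either spelling.
    """
    candidates = [name, name.replace("*", "'")]
    candidates += [aliases[c] for c in list(candidates) if c in aliases]
    for candidate in candidates:
        if candidate in residue_atoms:
            return candidate
    return None  # left for manual curation


# Atom names present for one thymidine residue in a toy coordinate file.
thymidine_atoms = {"H1'", "H2'", "H2''", "H6", "H71", "H72", "H73"}

direct = match_atom_name("H2*", thymidine_atoms)     # prime substitution
aliased = match_atom_name("2H2*", thymidine_atoms)   # via alias table
methyl = match_atom_name("1H5M", thymidine_atoms)    # via alias table
unknown = match_atom_name("HX", thymidine_atoms)     # no match
```

Names that cannot be resolved return `None`, which corresponds to the cases where a manual setting was needed.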
Distance restraints (DRs) with violations over 2 Å (up to a maximum of three per entry) were categorized as ‘Typos’ and left out of the FRED database as outliers. Although DRs identified as typos are sometimes real, leaving them out is expected to have a minimal impact on the overall structure. Often these restraints are errant violations that were not observed at the time of structure calculation but arose as a consequence of correcting other problems, such as typographical errors that led to a restraint being accidentally uncommented or to the incorrect mapping of one or two atom names.
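The ‘Typos’ rule can be written down as a short function. The sketch below is a minimal illustration under stated assumptions: the function name `split_typos` and the input layout are hypothetical, and the choice to remove the largest violations first when more than three restraints exceed 2 Å is an assumption, since the text does not specify which restraints are selected in that case.

```python
def split_typos(restraints, threshold=2.0, max_typos=3):
    """Separate likely typos from the restraints that are kept.

    restraints: list of (restraint_id, violation_in_angstrom) tuples.
    Returns (kept, typos). At most `max_typos` restraints with a violation
    above `threshold` are removed; largest violations go first (assumption).
    """
    over = sorted(
        (r for r in restraints if r[1] > threshold),
        key=lambda r: r[1],
        reverse=True,
    )
    typos = over[:max_typos]
    typo_ids = {r[0] for r in typos}
    kept = [r for r in restraints if r[0] not in typo_ids]
    return kept, typos


example = [("a", 0.1), ("b", 2.5), ("c", 3.0), ("d", 2.1), ("e", 4.0), ("f", 1.9)]
kept, typos = split_typos(example)
```

Here four restraints exceed 2 Å, so three are set aside as typos and the fourth remains in the kept set, where it still counts as a violation above 2 Å.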
In April 2006, we began to contact authors when our processing identified deposited data that led to high violations or were suspected of being incomplete. We received many positive responses, and this type of direct communication has led to improvements in processing by annotators at BMRB and to improved data sets available at the wwPDB. This procedure has also caught an estimated 100 cases in which incomplete or incorrect data were submitted since 2006.
A large collaborative project such as this inevitably requires the identification and remediation of issues with the software developed and the procedures used. Initially, the problems were identified and shared in a spreadsheet. In March 2008, the issues were moved to a Google Code repository at http://code.google.com/p/nmrrestrntsgrid, which is used to track these issues and to link them to code in the NRG project. Currently, almost all of the ~200 issues listed have been addressed. The documentation is available in Wiki pages at the same site. In addition, weekly video conferences and several in-person visits from JFD, WFV, and CJP to the BMRB in Madison have helped to keep this project organized, which is deemed essential for keeping the databases both up to date and reliable.
NRG database overall composition
Table: Sets of PDB entries in relation to set selection criteria
Set of entries
- NMR with or without restraints
- NMR with restraints
- With parsed restraints
- With parsed DRs
- Set 1: < 80% restraints linked
- Set 2: < 33% restraints left after filtering
- Set 3: maximum DR violation > 2 Å
- Set 4: Rms DR violation > 0.25 Å
- Set union of 1–4 (union 1–2: 475, 3–4: 417)
- ‘Good’ set (with parsable restraints minus set union of 1–4)
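The four exclusion criteria (Sets 1 to 4) and the resulting ‘Good’ set can be expressed compactly in code. The following Python sketch is an illustration only: the `EntryStats` fields and the function names are hypothetical, and it assumes that both percentages are taken relative to the number of parsed restraints, which the table does not state explicitly.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class EntryStats:
    entry_id: str
    fraction_linked: float  # restraints linked to coordinates / parsed
    fraction_kept: float    # restraints left after filtering / parsed
    dr_violations: list = field(default_factory=list)  # in angstrom


def rms(values):
    """Root mean square of a list of values (0.0 for an empty list)."""
    return sqrt(sum(v * v for v in values) / len(values)) if values else 0.0


def problem_sets(entry):
    """Return the numbers of the exclusion sets (1-4) the entry falls into."""
    flagged = set()
    if entry.fraction_linked < 0.80:
        flagged.add(1)
    if entry.fraction_kept < 0.33:
        flagged.add(2)
    if max(entry.dr_violations, default=0.0) > 2.0:
        flagged.add(3)
    if rms(entry.dr_violations) > 0.25:
        flagged.add(4)
    return flagged


def good_set(entries):
    """Entries with parsable restraints that fall into none of Sets 1-4."""
    return [e.entry_id for e in entries if not problem_sets(e)]


entries = [
    EntryStats("entry_a", 0.95, 0.60, [0.0, 0.1, 0.2]),
    EntryStats("entry_b", 0.70, 0.60, [0.0, 0.1]),
    EntryStats("entry_c", 0.95, 0.20, [0.0, 2.5]),
]
flags = {e.entry_id: problem_sets(e) for e in entries}
good = good_set(entries)
```

An entry can belong to several exclusion sets at once, which is why the union of Sets 1 to 4 is smaller than the sum of the individual set sizes.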
The BMRB, in collaboration with the NMR community and the Collaborative Computing Project for NMR (CCPN) (Vranken et al. 2005), is developing the next version of the NMR-STAR data dictionary (http://www.bmrb.wisc.edu/dictionary/htmldocs/nmr_star/dictionary.html). Many programs use the NMR-STAR format for exchanging experimental NMR data. All three databases available from the NRG user interface (parsed data sets, DOCR, and FRED) adhere to the “developmental predecessor of NMR-STAR version 3” and will be updated to the final version 3 data dictionary when it is released.
Stereospecificity and surplus
Selection criteria for the ‘Good’ set
We have presented the completion of the NRG effort to include all 5,266 PDB entries with NMR restraints. The vast majority of entries (4,014) were found to fulfill reasonable criteria on consistency and agreement between restraint and coordinate data. For a significant number of entries flagged as ‘suspect’ during validation, we have contacted the authors. This has led to improvements in our processing and, more importantly, to more complete and correct data sets that are conveniently available to all NMR spectroscopists.
This effort also provides an important stepping stone for new longitudinal analyses (studies over many entries) (Vranken 2007) and for validation with the CING software (Vuister et al. to be published, http://nmr.cmbi.ru.nl/cing and http://nmr.cmbi.ru.nl/NRG-CING). It also provides comparison datasets for structure recalculation efforts such as the recent competition with blind targets in an eNMR workshop (http://www.enmr.eu/softwareworkshop). The effort resulted in the setup of a continuing initiative, the Critical Assessment of automated Structure Determination from NMR data (CASD-NMR) (Rosato et al. 2009).
A number of clear improvements remain to be made. (a) The parsers for the AMBER-formatted restraints need an extensive overhaul so that they can fully process this class of restraints (Google Code issue 25). (b) The NRG setup needs to be able to support the NMR-STAR and CCPN data formats directly as input, because these two formats are becoming more common (issue 209). (c) NRG processing should be integrated with deposition systems such as ADIT-NMR in order to have more efficient communication with the authors at the time of deposition (issue 210). (d) RDC restraint violations need to be calculated (issue 211). (e) Many of the dihedral angle restraint violations should be eliminated by correcting for Phe/Tyr sidechain rotation (issue 212). (f) Last but not least, the NRG data should be integrated with the main BMRB data on chemical shifts that will soon be mandatory for PDB submission.
We acknowledge the many authors who contributed the results of their scientific investigations to the PDB and BMRB. They created the true resource that this secondary database relies upon. Financial support for this work came from the Netherlands Organisation for Scientific Research; NWO 700.55.443, Netherlands Bioinformatics Centre (NBIC) and EU FP6 EMBRACE grant LHSG-CT-2004-512092, EU FP6 STREP Extend-NMR grant LSHG-CT-2005-018988 (Nijmegen), BBSRC grant BB/E007511/1 (Hinxton), and the US National Library of Medicine (grant P41 LM05799) (Madison).
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- Commission on Biological Macromolecules (2000) Guidelines for the deposition and release of macromolecular coordinate and experimental data. Acta Cryst D56:2
- Driscoll PC, Gronenborn AM, Beress L, Clore GM (1989) Determination of the three-dimensional solution structure of the antihypertensive and antiviral protein BDS-I from the sea anemone Anemonia sulcata: a study using nuclear magnetic resonance and hybrid distance geometry-dynamical simulated annealing. Biochemistry 28:2188–2198
- Dyer R (2008) MySQL in a nutshell, 2nd edn. O’Reilly, Cambridge
- Henrick K, Feng Z, Bluhm WF, Dimitropoulos D, Doreleijers JF, Dutta S, Flippen-Anderson JL, Ionides J, Kamada C, Krissinel E, Lawson CL, Markley JL, Nakamura H, Newman R, Shimizu Y, Swaminathan J, Velankar S, Ory J, Ulrich EL, Vranken WF, Westbrook J, Yamashita R, Yang H, Young J, Yousufuddin M, Berman HM (2008) Remediation of the Protein Data Bank archive. Nucleic Acids Res 36:D426–D433
- Joosten RP, Salzemann J, Bloch V, Stockinger H, Berglund A, Blanchet C, Bongcam-Rudloff E, Combet C, Costa ALD, Deleage G, Diarena M, Fabbretti R, Fettahi G, Flegel V, Gisel A, Kasam V, Kervinen T, Korpelainen E, Mattila K, Pagni M, Reichstadt M, Breton V, Tickle IJ, Vriend G (2009) PDB_REDO: automated re-refinement of X-ray structure models in the PDB. J Appl Crystallogr 42:376–384
- Markley JL, Bax A, Arata Y, Hilbers CW, Kaptein R, Sykes BD, Wright PE, Wüthrich K (1998) Recommendations for the presentation of NMR structures of proteins and nucleic acids. IUPAC-IUBMB-IUPAB inter-union task group on the standardization of data bases of protein and nucleic acid structures determined by NMR spectroscopy. J Biomol NMR 12:1–23
- Nederveen AJ, Doreleijers JF, Vranken WF, Miller Z, Spronk CA, Nabuurs SB, Güntert P, Livny M, Markley JL, Nilges M, Ulrich EL, Kaptein R, Bonvin AM (2005) RECOORD: a recalculated coordinate database of 500+ proteins from the PDB using restraints from the BioMagResBank. Proteins 59:662–672
- Protein Data Bank (1971) Protein Data Bank. Nature New Biol 233:223
- Rosato A, Bagaria A, Baker D, Bardiaux B, Cavalli A, Doreleijers JF, Giachetti A, Guerry P, Güntert P, Herrmann T, Huang YJ, Jonker H, Mao B, Malliavin TE, Montelione GT, Nilges M, Raman S, van der Schot G, Vranken WF, Vuister GW, Bonvin AM (2009) CASD-NMR: a rolling experiment for the critical assessment of automated structure determination from NMR data. Nat Meth 6(9):625–626
- Ulrich EL, Markley JL, Kyogoku Y (1989) Creation of a nuclear magnetic resonance data repository and literature database. Protein Seq Data Anal 2:23–37