SimShiftDB; local conformational restraints derived from chemical shift similarity searches on a large synthetic database
We present SimShiftDB, a new program to extract conformational data from protein chemical shifts using structural alignments. The alignments are obtained in searches of a large database containing 13,000 structures and corresponding back-calculated chemical shifts. SimShiftDB makes use of chemical shift data to provide accurate results even in the case of low sequence similarity, and with even coverage of the conformational search space. We compare SimShiftDB to HHsearch, a state-of-the-art sequence-based search tool, and to TALOS, the current standard tool for the task. We show that for a significant fraction of the predicted similarities, SimShiftDB outperforms the other two methods. In particular, the high coverage afforded by the larger database often allows predictions to be made for residues not involved in canonical secondary structure, where TALOS predictions are both less frequent and more error-prone. Thus SimShiftDB can be seen as a complement to currently available methods.
Keywords: Chemical shift, Homology modeling, Alignment, Database search
Chemical shifts are now routinely used as a source of local conformational restraints in the structure determination of proteins by NMR, due mostly to the widespread use of programs such as TALOS (Cornilescu et al. 1999) and SHIFTOR/PREDITOR (Neal et al. 2006; Berjanskii et al. 2006). These programs share a common approach and output similar data; both search a database that correlates local patterns of chemical shifts with local conformation, and both provide backbone dihedral angle restraints for individual residues. This approach has been very successful, but has some limitations in the stringent criteria needed for selecting proteins or protein fragments to populate the database, i.e. only those with both highly reliable chemical shift and structural data can be included. This restricts current databases to less than a few hundred proteins. Although this may seem adequate—for example, the TALOS database contains 186 proteins subdivided into over 24,000 tripeptide fragments (http://spin.niddk.nih.gov/NMRPipe/talos/)—the sequence/conformation search space is very large, and the database coverage is unevenly distributed. As a result, rare combinations of amino-acid and conformation may be under-represented in the database, leading to significant under-prediction and even to errors outside of the heavily populated regions of the Ramachandran map.
We have adopted an alternative approach to extracting structural data from chemical shifts based on our SimShiftDB algorithm. The original SimShift was designed to test for structural similarities between proteins in a pairwise manner using chemical shifts to supplement sequence data. Experimental query shifts were compared to those back-calculated from the target. SimShift showed improved ability to detect distant structural relationships when compared to state-of-the-art methods based on the sequence alone. A natural further development of pairwise comparison was to adapt the SimShift algorithm for database searching, resulting in SimShiftDB (Ginzinger et al. 2007b). Given a target sequence and shifts, SimShiftDB provides a list of matching proteins in the database, scored by a measure of statistical significance. In effect it searches a synthetic chemical shift database of 13,000 proteins based on the Astral library (Chandonia et al. 2004). The matching sequence can be of any length, and structurally similar regions can be found ranging from small, locally similar fragments up to full domains.
In principle, any structural alignment method can also be used to make predictions of local conformation by extracting torsion angles from matching regions of the target proteins, and it is this implementation of the SimShiftDB algorithm we present here. We benchmark the program against TALOS as a standard for current methods and HHpred, a sequence search method based on hidden Markov models (Söding et al. 2005), as a standard for purely sequence-based methods. We show that SimShiftDB can significantly increase the amount of information that can be derived from chemical shifts. We combine SimShiftDB with our CheckShift (Ginzinger et al. 2007a) routine for standardizing chemical shift referencing to produce a pipeline for analysis of chemical shift data.
Step 1: Local similarities are found by looking for high scoring combinations of parts of the target protein sequence (s) with parts of the template protein sequence (t). Fig. 1 shows a depiction of a set of local similarities. For example, block b in the figure shows that the chemical shifts of the target protein sequence from index Xmin to Xmax are similar to the chemical shifts of the template protein sequence from index Ymin to Ymax. The similarity is calculated by summing the pairwise scores of the residues in the similar region in analogy to a pairwise sequence alignment. The pairwise similarity scores are given by the so-called Chemical Shift Substitution Matrices, which give a score for each combination of two residues with associated chemical shifts (for more details see Ginzinger 2008).
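The summation of pairwise scores over a contiguous region can be illustrated with a minimal Python sketch. It is not the SimShiftDB implementation: the function names, the tuple layout of a block, and the choice of reporting only the single best ungapped segment per diagonal are assumptions made for illustration. The `pair_score` callback stands in for a lookup in the Chemical Shift Substitution Matrices, whose actual form is described in Ginzinger (2008).

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical block layout: (x_min, x_max, y_min, y_max, score), 0-based, inclusive.
Block = Tuple[int, int, int, int, float]


def best_block_on_diagonal(
    target_len: int,
    template_len: int,
    offset: int,
    pair_score: Callable[[int, int], float],
) -> Optional[Block]:
    """Return the highest-scoring ungapped segment on the diagonal y = x + offset."""
    best: Optional[Block] = None
    run_score = 0.0
    run_start = 0
    x_first = max(0, -offset)
    x_stop = min(target_len, template_len - offset)
    for x in range(x_first, x_stop):
        # A non-positive running sum can only hurt; restart the segment here.
        if run_score <= 0.0:
            run_score, run_start = 0.0, x
        run_score += pair_score(x, x + offset)
        if best is None or run_score > best[4]:
            best = (run_start, x, run_start + offset, x + offset, run_score)
    return best


def find_blocks(
    target_len: int,
    template_len: int,
    pair_score: Callable[[int, int], float],
    min_score: float,
) -> List[Block]:
    """Collect the best segment of every diagonal that reaches min_score."""
    blocks: List[Block] = []
    for offset in range(-(target_len - 1), template_len):
        block = best_block_on_diagonal(target_len, template_len, offset, pair_score)
        if block is not None and block[4] >= min_score:
            blocks.append(block)
    return blocks
```

Each returned tuple corresponds to one block as described above: target indices Xmin to Xmax paired with template indices Ymin to Ymax, together with the summed score.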
Step 2: The set of local similarities from Step 1 is taken as an input for Step 2, where the most significant combination of blocks is identified, according to a statistical model of alignment scores (Karlin and Altschul 1993). Additionally, two blocks have to fulfill two constraints for their combination to be considered:
Blocks may not overlap in the target or in the template protein; this would otherwise result in an ambiguous alignment.
As the three-dimensional structure of the template protein is known, we further require that the Euclidean distance between the end of the first block and the beginning of the second block can be bridged (according to chemical restraints) by the relevant sequence of amino acids in the target protein.
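The two constraints can be sketched as a single compatibility test. The following Python fragment is illustrative only and rests on several assumptions that the text does not specify: blocks are taken to appear in the same order in target and template, distances are measured between Cα atoms, and the "chemical restraint" is approximated by a maximum span of about 3.8 Å per peptide link. The class and parameter names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical block: inclusive residue index ranges in target (x) and template (y)."""
    x_min: int
    x_max: int
    y_min: int
    y_max: int


def compatible(
    first: Block,
    second: Block,
    template_ca: Sequence[Tuple[float, float, float]],
    max_ca_ca: float = 3.8,  # assumed maximum C-alpha to C-alpha span per peptide link, in angstroms
) -> bool:
    """Check whether `second` may follow `first` in a combined alignment."""
    # Constraint 1: no overlap in either sequence (here: strict ordering in both).
    if not (first.x_max < second.x_min and first.y_max < second.y_min):
        return False
    # Constraint 2: the target residues between the blocks must be able to span
    # the distance between the corresponding template positions.
    n_links = second.x_min - first.x_max
    gap = math.dist(template_ca[first.y_max], template_ca[second.y_min])
    return gap <= n_links * max_ca_ca
```

A pair of blocks that fails either test would not be combined; how SimShiftDB evaluates the bridging distance in detail is not stated here, so the per-link limit should be read as a placeholder.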
Finally, we calculate an e-value for the optimal combination of blocks. This e-value represents the number of alignments of equal or better quality, which are expected to occur by chance, given the distribution of the amino acids with associated chemical shifts in the target protein and the template database. Additionally, the e-value takes the size of the template database into account. According to the following evaluation, an e-value of <10⁻³ guarantees a high-quality alignment.
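For orientation, the general shape of a Karlin-Altschul e-value can be written in a few lines of Python. This sketch uses the familiar single-segment form E = K·m·n·exp(−λS), whereas SimShiftDB scores a combination of blocks, so it is a simplification rather than the program's actual formula. The parameters `lam` and `k` depend on the scoring system and would have to be estimated for the chemical shift substitution matrices; the values in the usage note are placeholders.

```python
import math


def e_value(score: float, target_len: int, database_len: int, lam: float, k: float) -> float:
    """Expected number of chance alignments scoring at least `score`.

    Single-segment Karlin-Altschul form; `lam` and `k` are scoring-system
    dependent and are assumptions here, not values from SimShiftDB.
    """
    return k * target_len * database_len * math.exp(-lam * score)


def is_significant(evalue: float, cutoff: float = 1e-3) -> bool:
    """Apply the e-value cutoff of 10^-3 quoted in the text."""
    return evalue < cutoff
```

With placeholder values such as `e_value(80.0, 120, 2_000_000, lam=0.3, k=0.1)`, the result falls well below the cutoff, and it grows linearly with the database size, which is how the size of the template database enters the significance estimate.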
The benchmark set
A 100% sequence match to an ASTRAL entry.
At least 100 residues with associated chemical shifts (to exclude very short protein fragments; e.g. single helices).
To identify protein structures corresponding to the respective BMRB entries, a BLAST-search (Altschul et al. 1990) against the sequences from the ASTRAL database is conducted for each BMRB entry. If the full BMRB sequence can be matched without gaps against an ASTRAL sequence, the corresponding ASTRAL structure is assigned to the BMRB entry. As some entries in BMRB match more than one sequence in ASTRAL, one representative structure has to be chosen. This is accomplished by using the AEROSPACI score (Chandonia et al. 2004) provided for each ASTRAL entry, thereby selecting the structure with the best resolution. Through this procedure a benchmark set containing 144 entries was derived.
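The selection procedure can be sketched as a simple filter followed by a per-entry maximum. The record layout below is hypothetical (it does not correspond to a BLAST or ASTRAL file format), and the sketch assumes that a higher AEROSPACI score indicates a more reliable structure.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Hit:
    """Hypothetical summary of one BLAST hit of a BMRB entry against ASTRAL."""
    bmrb_id: str
    astral_id: str
    identity: float               # fraction of identical residues, 0.0 to 1.0
    gapless_full_length: bool     # full BMRB sequence matched without gaps
    residues_with_shifts: int     # residues in the BMRB entry that have shifts
    aerospaci: float              # assumed: higher means better structure quality


def select_representatives(hits: Iterable[Hit], min_residues: int = 100) -> Dict[str, Hit]:
    """Keep one ASTRAL structure per BMRB entry, following the criteria in the text."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.identity < 1.0 or not hit.gapless_full_length:
            continue  # require a 100% gapless match of the full sequence
        if hit.residues_with_shifts < min_residues:
            continue  # exclude very short fragments
        current = best.get(hit.bmrb_id)
        if current is None or hit.aerospaci > current.aerospaci:
            best[hit.bmrb_id] = hit
    return best
```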
Evaluation of prediction accuracy
When calculating the similarity score for two residues, SimShiftDB is restricted to at most three chemical shifts from the following list: 1Hα, 1HN, 15N, 13Cα, 13Cβ, and 13C′. Thus it is important to select a combination of shifts to extract maximum information, and a priority for replacing missing shifts. To identify the most successful strategy, we tested all possible priorities for the six atom types, resulting in 6! = 720 evaluations. The most successful priority was: 13Cα > 13C′ > 1HN > 13Cβ > 1Hα > 15N. This is the default priority, and is used in the following analysis.
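The priority scheme amounts to walking the ordered list and keeping the first three atom types for which shifts are available. A minimal Python sketch follows; the atom labels and function names are chosen here for illustration, and `available` is assumed to hold the atom types for which a shift exists in both residues being compared.

```python
from itertools import permutations
from typing import Collection, List, Sequence

# Illustrative labels for 1H-alpha, 1HN, 15N, 13C-alpha, 13C-beta and 13C'.
ATOM_TYPES = ("HA", "HN", "N", "CA", "CB", "C")

# Default priority reported in the text: CA > C' > HN > CB > HA > N.
DEFAULT_PRIORITY = ("CA", "C", "HN", "CB", "HA", "N")


def select_shifts(
    available: Collection[str],
    priority: Sequence[str] = DEFAULT_PRIORITY,
    max_shifts: int = 3,
) -> List[str]:
    """Pick at most `max_shifts` atom types, falling back down the priority list."""
    return [atom for atom in priority if atom in available][:max_shifts]


# Every ordering of the six atom types: the 6! = 720 candidate priorities.
ALL_PRIORITIES = list(permutations(ATOM_TYPES))
```

For a residue pair lacking 13C′ and 1HN shifts, `select_shifts({"CA", "CB", "N", "HA"})` falls back to `["CA", "CB", "HA"]`.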
Comparison to HHsearch
To show empirically that SimShiftDB uses the information in the chemical shift data to yield more sensitive alignments, especially in the case of low sequence similarity, we compare SimShiftDB to HHsearch (Söding 2005). HHsearch, a sensitive search tool based on hidden Markov models, calculates alignments between proteins using the primary sequence complemented by sequence-based predictions of secondary structure. HHpred (Söding et al. 2005), a protein structure prediction method based on HHsearch alignments, ranked second best in the CASP7 (Battey et al. 2007) experiment. Additionally, it is freely available for download and allows the user to define arbitrary template databases. It is therefore well suited to serve as a reference for purely sequence-based methods.
A SimShiftDB prediction is called better if it has an error of ≤30° and the corresponding HHsearch prediction has an error which is worse by more than 5°.
Two predictions are called equal if both have an error of ≤30° and the difference between the errors is less than 5°, or both predictions have an error >30°.
Missing predictions are treated as predictions with an error >30°.
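These criteria can be expressed as a small classification function. The Python sketch below makes three assumptions beyond the text: a "worse" category is defined symmetrically to "better", a difference of exactly 5° counts as equal, and any remaining case not covered by the stated rules (for example 28° versus 32°) is also counted as equal. Errors are given in degrees, and a missing prediction is passed as `None`.

```python
import math
from typing import Optional


def classify(
    simshift_error: Optional[float],
    other_error: Optional[float],
    threshold: float = 30.0,
    margin: float = 5.0,
) -> str:
    """Compare two torsion angle prediction errors (degrees) for one residue."""
    # Missing predictions are treated as predictions with an error above the threshold.
    sim = math.inf if simshift_error is None else simshift_error
    other = math.inf if other_error is None else other_error

    if sim <= threshold and other - sim > margin:
        return "better"
    # Assumed mirror image of the rule above.
    if other <= threshold and sim - other > margin:
        return "worse"
    return "equal"
```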
Comparison to TALOS
We have presented SimShiftDB and shown that the program is able to sensitively extract structural information from chemical shift data. This information is to a certain extent complementary to that from currently available tools. On one hand we have compared SimShiftDB to a sequence-based method. SimShiftDB shows its strength especially in cases of low sequence similarity, which underlines the advantage of including chemical shift information in the alignment algorithm. On the other hand, we were able to show that one-third of the predictions by SimShiftDB clearly have a higher quality than the corresponding TALOS predictions, and this is largely independent of sequence similarity.
The main advantage of SimShiftDB is derived from its superior coverage of the search space, due to the large and quickly adaptable template database. SimShiftDB outperforms TALOS especially in those cases where TALOS finds no predictions classified as “Good” according to its selection criteria. SimShiftDB and TALOS are therefore complementary, and can be used in parallel to increase the number of available predictions.
This example highlights the major difference in the SimShiftDB and TALOS approaches, i.e. the length of the template structures found by SimShiftDB when compared to the tripeptides used to make TALOS predictions. The second difference is the use of the e-value as a continuous measure of quality, rather than a discrete selection criterion based on a consensus of the ten best hits. In some cases there may be only one or two templates found for any region of the protein, but low e-value scores can nevertheless allow predictions with high confidence.
We have established an accuracy of above 85% for SimShiftDB predictions, based on our benchmark set of proteins. This may at first glance compare poorly to TALOS, where an accuracy of 97–98% is reported. However, it must be considered that this value is based on single SimShiftDB predictions, rather than the consensus of 10 predictions. Also, it is worth noting that TALOS is very accurate within secondary structure, and therefore the 2–3% of errors must be concentrated in the smaller fraction of other predictions. In our experience, these errors often result from predictions made out of structural context; e.g. for a residue in a β-turn based on tripeptides from a helix. The wider context provided by SimShiftDB results should therefore add both to the confidence of its predictions and those from TALOS.
The optimum chemical shift priority found for SimShiftDB searches is somewhat surprising in that it contains 1HN, which is not generally regarded as containing much structural information. Perhaps the information from 1HN is complementary to that from the other shifts. It is worth noting, though, that the difference between the best priorities is small, and it may be worth testing a range of priorities. This is easily done: although the program searches a database of 13,000 protein structures, an average SimShiftDB run takes only 30 seconds on a standard laptop (Intel T2500, 2.0 GHz, 1 GB RAM). The different results can be compared using the calculated e-values, enabling the user to select the most promising result.
We thank Prof. Horst Kessler and his group at the Technical University of Munich for access to data on Ph1500C and other test cases and Johannes Söding for helpful discussions and the integration of the SimShiftDB server in the MPI Bioinformatics Toolkit. Additionally the first author would like to thank Robert Konrat and Manfred J. Sippl for first introducing him to the problem of searching similarities in chemical shift sequences.
SimShiftDB is available via a web server (http://simshiftdb.services.came.sbg.ac.at); see Fig. 8 for a sample screenshot. This server also provides a variety of functions for analyzing the results of a SimShiftDB search interactively. Additionally, SimShiftDB will be included in the MPI Bioinformatics Toolkit (http://toolkit.tuebingen.mpg.de).
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410
- Berjanskii M, Neal S, Wishart D (2006) PREDITOR: a web server for predicting protein torsion angle restraints. Nucleic Acids Res 34(Web Server issue):W63
- Ginzinger SW (2008) Bioinformatics methods for NMR chemical shift data. PhD thesis, Ludwig-Maximilians Universität München URL: http://edoc.ub.uni-muenchen.de/8077/1/Ginzinger_Simon_Wolfgang.pdf
- Ginzinger SW, Gräupl T, Heun V (2007b) SimShiftDB: Chemical-shift-based homology modeling. In: Proceedings of the First Conference on Bioinformatics Research and Development, Lecture Notes in Bioinformatics, vol 4414, pp 357–370