Protein Secondary Structure Prediction in 2018
Protein secondary structure prediction aims to predict secondary structure at the residue level from sequence information alone. The states commonly predicted are alpha-helices and beta-strands, i.e., the most prevalent regular secondary structure segments. In contrast to regular secondary structure, irregular or disordered regions are often referred to as loops, random coils, or disorder.
Fifteen years ago, science took a leap by laying out almost the entire blueprint for human life. Now that the parts are known, can this blueprint be used as a manual to understand how the machine works? “Like with every proper manual, usually we do not find the information we need and in the rare cases that we do, we do not understand the answer,” joked Anna Tramontano (La Sapienza, Rome, 1957–2017). Every year since, surprising new findings have given glimpses of how incomplete the knowledge is. Despite immense advances in molecular biology over the last 15 years, substantial experimental information for around 15% of human proteins remains missing (Baker et al. 2017). Structural biology has also leaped over the last 15 years: 90% of the experimental high-resolution structures known today have been determined after 2000. In parallel with this increase, the number of proteins of known sequence has exploded. In September 2018, about 144,000 experimental protein structures were in the Protein Data Bank (PDB, www.pdb.org; Berman et al. 2000; Rose et al. 2017) as opposed to about 125 million protein sequences in UniProtKB (The UniProt Consortium 2017).
Secondary structure is arguably the simplest meaningful aspect of protein structure. It is derived from the amino acid sequence (misleadingly dubbed “primary structure” in the past) and provides the building blocks that form the full three-dimensional (3D) or tertiary structure. In terms of information content, secondary structure is essentially one dimensional as it can be mapped onto a string of letters that assign a secondary structure state to each residue.
Protein Structure Classification Based on Secondary Structure
Michael Levitt proposed to classify proteins according to these main constituents into alpha, beta, and alpha+beta (Levitt and Greer 1977). He thereby introduced the first step that is still at the heart of today’s two major protein structure classifications, namely, of CATH (www.cathdb.info; Sillitoe et al. 2015) and SCOP (scop2.mrc-lmb.cam.ac.uk/scop; Andreeva et al. 2014). A more recent addition to structure classification is TopSearch (topsearch.services.came.sbg.ac.at; Wiederstein et al. 2014) that importantly deviates from this concept by not using derivatives of the original Levitt classes for any major classification step. However, it still uses the concept of secondary structure.
There are different types of helices and strands, and there are several methods that automatically assign those types from 3D coordinates as deposited in the PDB. Most often used is DSSP (Kabsch and Sander 1983). DSSP identifies hydrogen bonds through simple electrostatic energy and then assembles regular patterns of hydrogen bonds into eight classes. About 37% of residues for which experimental structures are available can be classified as alpha-helix and 22% as beta-strand. All other residues are in regions that appear less regular. Often, they are referred to as loops or even more misleadingly as random coils.
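The eight DSSP classes are usually reduced to three states before predictions are compared with observations. The following minimal Python sketch assumes one common reduction (H, G, I to helix; E, B to strand; everything else to other). The text above does not prescribe a mapping, other conventions exist, and the name `reduce_dssp` is purely illustrative.

```python
# One common (assumed) reduction of DSSP's eight classes to three states:
#   H (alpha-helix), G (3-10 helix), I (pi-helix) -> "H"
#   E (extended strand), B (isolated bridge)      -> "E"
#   everything else (T, S, blank or "-")          -> "L"
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp_string):
    """Map a string of DSSP codes to a three-state string (H/E/L)."""
    return "".join(DSSP_TO_3.get(code, "L") for code in dssp_string)


print(reduce_dssp("-HHHHGGTT-EEEEB-S"))  # prints LHHHHHHLLLEEEEELL
```

Which reduction is chosen affects the measured accuracy, so comparisons between methods are only meaningful when the same mapping is used.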
Application of Secondary Structure Prediction
Secondary structure helps in many contexts, for example, when aligning proteins in the twilight and midnight zones of sequence comparisons (Rost 1999) in which evolutionary sequence relations are hardly recognizable. It can be used to verify comparative models, to predict protein structure, and to support guesses about protein function and evolution. Furthermore, it serves as potentially the most important input feature used by higher-level prediction methods that address questions beyond structure such as predictions of protein binding, functional classes, and subcellular localization. Secondary structure predictions are also one of the most important input features used by methods that predict the effect of non-synonymous SNPs, i.e., single amino acid changes, or more generally of methods that explore dynamical features of proteins. Thus, starting with good predictions of secondary structure is always a good idea.
Prediction Continuously Improved to over 80% Q3
Before the first protein structure was solved, A.G. Szent-Györgyi predicted secondary structure based on the intrinsic biophysical features of proline residues, which are known to break helices (Szent-Györgyi and Cohen 1957). Even though it is often true, this simple rule on its own makes a very poor prediction method. In fact, the following statement is overwhelmingly valid: Simple rules that may suggest a pseudo-understanding of protein structure formation do not suffice to predict structure from sequence.
First-Generation Methods. A major step forward in secondary structure prediction condensed around the deep mind of Barry Robson. The first steps in the mid-1970s compiled simple statistics that measured the preference of a certain amino acid for particular types of secondary structure. Jean Garnier and David Osguthorpe complemented this and shaped the first successful approach among these methods, called GOR (Garnier et al. 1978). This first type of method reached levels of Q3 around 55% (Rost et al. 1993).
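The Q3 values quoted throughout this entry are the percentage of residues for which the predicted three-state label matches the observed one. A minimal sketch of this per-residue measure, with toy strings invented for illustration, might look as follows.

```python
def q3(observed, predicted):
    """Percentage of residues whose three-state label (H/E/L) is predicted correctly."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("cannot compute Q3 for an empty protein")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Toy example (not a real protein): 15 of 17 residues agree.
observed = "LHHHHHHLLLEEEEELL"
predicted = "LHHHHHLLLLEEEELLL"
print(round(q3(observed, predicted), 1))  # prints 88.2
```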
Second-Generation Methods. The next big step forward expanded the simple single residue statistics (i.e., what are the odds of a proline residue to be in a helix) to the level of segments of several consecutive residues (i.e., what are the odds of a proline residue to be in a helix when flanked by these specific residues on either side). GOR3 (Garnier et al. 1996; Gibrat et al. 1987) was one of the most successful representatives of these methods that reached Q3 levels around 62% (Rost et al. 1993).
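The single-residue statistics that these methods expanded upon can be illustrated with a few lines of Python. This is only a sketch of the idea of counting how often each amino acid occurs in each state; it is not the GOR algorithm, and the sequences and structure strings are toy data invented for illustration.

```python
from collections import Counter


def state_frequencies(sequences, structures):
    """For each amino acid, return the fraction of its occurrences in each state."""
    counts = {}
    for seq, sec in zip(sequences, structures):
        if len(seq) != len(sec):
            raise ValueError("sequence and structure strings must have equal length")
        for aa, state in zip(seq, sec):
            counts.setdefault(aa, Counter())[state] += 1
    return {
        aa: {state: n / sum(c.values()) for state, n in c.items()}
        for aa, c in counts.items()
    }


# Toy data (not real proteins): proline never occurs in a helix here.
sequences = ["APAAP", "AAPVA"]
structures = ["HLHHL", "HHLEL"]
freqs = state_frequencies(sequences, structures)
print(freqs["P"])
print(freqs["A"])
```

Second-generation methods condition the same kind of count on the neighboring residues as well, which requires far more data to estimate reliably.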
Secondary structure formation is determined by local interactions within a fragment of residues with length N, as well as global interactions involving partners from outside a fragment of N residues. Second-generation methods predicted the secondary structure of the residue at the center of a fragment of N consecutive residues, with N ranging from 5 to 50 and thereby addressing only local interactions. One could hypothesize that only 62% of the interactions are determined locally, and therefore, prediction accuracy is capped at 62%.
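The fragment-based input described above can be sketched as a sliding window over the sequence. The padding character and the function name `windows` are assumptions made for this illustration; actual methods differ in how they treat the chain termini.

```python
def windows(sequence, n=11, pad="-"):
    """Yield (central_residue, window) for each position of the sequence.

    n is the fragment length N and must be odd so that the window has a
    central residue; positions beyond the termini are filled with `pad`.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd")
    half = n // 2
    padded = pad * half + sequence + pad * half
    for i, residue in enumerate(sequence):
        yield residue, padded[i:i + n]


for residue, window in windows("MKTAYIAK", n=5):
    print(residue, window)
```

A predictor of this kind sees only the N residues in the window, which is why it can capture local but not global interactions.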
One way to address this hypothesis would be to use larger fragments. The empirical finding was that N = 11 was better than N = 9, which in turn was better than N = 8, and so forth. However, there is a limit: N = 50 is not better than N = 11. This observation could mean that global interactions do not matter much. On the other hand, this finding can also be explained differently. It is probable that global interactions are important but that longer fragments are dominated in terms of their information content by noise, i.e., increasing N decreases the signal-to-noise ratio. The smaller the dataset in question, the harder it is to detect the signal.
Third-Generation Methods. This challenge was addressed by the third-generation methods that use evolutionary information to embed global information and to increase the information density (signal-to-noise ratio). To achieve this, users first have to build a multiple sequence alignment using proteins related to the query sequence for which they seek the prediction. When introduced, these methods used relatively simple pairwise alignments against databases that were small by today’s numbers. Nevertheless, performance was immediately boosted to levels above 72%. The first method that surpassed a sustained level of Q3 > 72% was PHDsec (Rost 1996). Several other improvements specifically addressed issues such as improved prediction for beta-strands. PROFsec (Rost 1996) further improved the approach by extending the sequence profiles and by combining predictions of solvent accessibility and secondary structure. Better database search methods and larger databases brought the performance to levels of 78% Q3 (Jones 1999; Przybylski and Rost 2002). Due to the ongoing search for improvements and the increase of available data, today’s methods such as s2D (Sormanni et al. 2015), RaptorX Property (Wang et al. 2016), SPIDER3 (Heffernan et al. 2015, 2017), or ReProf (Yachdav et al. 2014) are citing Q3 values from 80% to 85%.
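The evolutionary information used by third-generation methods is typically fed to the predictor as a profile, i.e., per-position amino acid frequencies derived from the alignment. The sketch below computes plain relative frequencies from a toy alignment; real methods additionally apply sequence weighting and pseudocounts, which are omitted here, and the alignment shown is invented for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(alignment):
    """Return one dict per alignment column with relative amino acid frequencies.

    Gap characters and other non-standard symbols are ignored; a column
    without any standard amino acid yields all-zero frequencies.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


# Toy alignment (not real proteins); the first row plays the role of the query.
alignment = ["MKVL", "MRVL", "MK-I"]
profile = column_profile(alignment)
print(profile[0]["M"], profile[1]["K"], profile[1]["R"])
```

Conserved positions give sharply peaked columns, while variable positions spread the frequencies over several residues; this is the additional signal that single sequences lack.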
Prediction Based on Known Structures. All the above numbers hold for proteins that have no significant similarity to proteins for which experimental structures are already available. However, the structural coverage, i.e., the percentage of proteins that can be modeled based on the known experimental structures, is increasing steadily (Kiefer et al. 2009). Therefore, most proteins known today have some local region for which some aspects of structure can be modeled. Is secondary structure inferred from these models better than de novo secondary structure prediction methods when Q3 is measured?
The answer depends on the quality of the model, i.e., ultimately on the similarity between the protein under investigation and the protein with a similar sequence for which an experimental structure is available. For high similarity, reading secondary structure predictions off the comparative models is better than using expert prediction methods. For low levels of similarity, prediction methods are better (Marti-Renom et al. 2002; Eyrich et al. 2003; Faraggi et al. 2012). However, the increased information stored in data banks has given rise to new methods that rely more heavily on available structural data (Zhang et al. 2011).
Hints for Users
Find the right method. Good and mediocre secondary structure prediction methods differ substantially in their usefulness. Newer methods are sometimes better than older ones, but the newest methods are not necessarily the best. More readily available methods are also not necessarily better. Thus, the first advice is to spend some time on identifying a few of the good methods.
Compare methods but rely more on reliability indices than on consensus. Assume you applied the “top methods” for your protein. Which one of these predictions should you use? Many users tend to believe more in residues that are predicted the same way by different methods, i.e., that would exploit some consensus between the methods (Zhang et al. 2011). There are many good reasons for expecting that such consensus or averages help. However, in a field in which tools are as diligently crafted as the top secondary structure prediction methods, what is usually advisable may become a mistake. The best methods provide estimates for the reliability of each prediction (Fig. 3), and if the developers have done this well, such estimates are much more relevant than any type of simple consensus (Eyrich et al. 2003). Although some publications demonstrate that consensus methods can fare better (Zhang et al. 2011), finding such a method will be difficult.
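The two views contrasted in this hint can be placed side by side in a short sketch. It assumes that a method reports one integer reliability value per residue, with higher values meaning more reliable; the scale, the threshold of 5, and the function names are assumptions for illustration, and the predictions are toy data.

```python
def mask_unreliable(prediction, reliability, threshold=5):
    """Keep labels whose reliability is at least `threshold`; mark the rest '?'."""
    if len(prediction) != len(reliability):
        raise ValueError("prediction and reliability must have the same length")
    return "".join(
        label if score >= threshold else "?"
        for label, score in zip(prediction, reliability)
    )


def agreement(predictions):
    """Keep labels on which all methods agree; mark the rest '?'."""
    return "".join(
        column[0] if len(set(column)) == 1 else "?"
        for column in zip(*predictions)
    )


# Toy predictions from three hypothetical methods for the same protein.
method_a = "HHHHLLEEEL"
method_b = "HHHLLLEEEL"
method_c = "HHHHLLEELL"
reliability_a = [9, 8, 7, 3, 2, 6, 8, 9, 4, 7]  # assumed scale, higher is better

print(mask_unreliable(method_a, reliability_a))  # prints HHH??LEE?L
print(agreement([method_a, method_b, method_c]))  # prints HHH?LLEE?L
```

The two masks need not coincide: a residue on which all methods agree can still carry a low reliability value, and that is the information a simple consensus discards.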
There may or may not be a best method today. The previous two rules suggest that this entry would be most helpful if it suggested a list of top methods. There are many reasons why this entry does not provide such a list. The simplest is that the objective comparisons of methods that automatically assessed the state of the art in the field until a few years ago are no longer running (Eyrich et al. 2003; Rychlewski and Fischer 2005). Conclusions about the state of the art from the literature alone might be very misleading. Furthermore, any noncontinuous assessment method will at best provide a correct view up to a certain time. This encyclopedia will hopefully help you long after that time.
Better alignments, better predictions. The major source of improvement of prediction methods over the last two decades originated from growing databases and from adequately integrating this increased information. The single most important way to improve predictions is by improving the extraction of information contained in the multiple sequence alignment utilized to get the prediction. For the particular case of secondary structure prediction of water-soluble proteins, this largely boils down to a simple rule: the more diverse the family members you include in the alignment, the better (even at the price of including some remote nonmembers by mistake).
Typically, hidden Markov models (HMMs) are better at identifying distant relations than PSI-BLAST (Altschul 1997), and typically profile-profile alignment methods are the best. For instance, one recent method (ReProf, http://www.predictprotein.org) increased secondary structure prediction accuracy by almost an entire percentage point simply by switching from PSI-BLAST to HHblits (http://toolkit.lmb.uni-muenchen.de/hhblits; Remmert et al. 2012).
Predictions have 20% mistakes: find them! Levels of about 80% accuracy imply that 20% of the residues are predicted incorrectly. Roughly a fourth of these are “bad” mistakes of the type “helix predicted where strand is observed” and vice versa (Rost 2005; Zhang et al. 2011). Most mistakes for helices and strands tend to be on the protein surface. In contrast, few mistakes tend to characterize active and binding sites. However, all of these figures are averages: predictions are substantially worse for some proteins and much better for others (see Fig. 2). Thus, the error rate ranges from almost entirely correct in some proteins to levels close to random predictions for others. Reliability indices can provide some clue whether your protein is more likely to be an average performer or to belong to either of those two extremes: the more residues are predicted at unusually high levels of reliability, the higher the accuracy for the protein. Taking a closer look at residues with low reliability often reveals an important story.
After several decades of research in the field, it is still not possible to identify what types of proteins fall more often into which extreme of these classes. There are trivial correlations of the type: Since most methods predict strand less accurately than helix, proteins with more helix content tend to be predicted more accurately. However, no correlation claimed between the success in secondary structure prediction and functional traits has withstood the winds of time. Extreme examples are orthologous enzymes, i.e., enzymes that perform an identical or similar function in different organisms. These may differ strongly in terms of prediction accuracy, although no such trends can reliably be established for the difference in performance between different organisms, in general.
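When an experimental structure later becomes available, the error types described in this hint can be counted directly. The following sketch separates all per-residue errors from the “bad” helix/strand confusions; the three-state strings are toy data, and the function name is illustrative.

```python
def error_breakdown(observed, predicted):
    """Count all per-residue errors and the helix/strand confusions among them."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    errors = [(o, p) for o, p in zip(observed, predicted) if o != p]
    # A "bad" mistake: helix predicted where strand is observed, or vice versa.
    confusions = [pair for pair in errors if set(pair) == {"H", "E"}]
    return {
        "residues": len(observed),
        "errors": len(errors),
        "helix_strand_confusions": len(confusions),
    }


# Toy example (not a real protein).
observed = "HHHHHLLEEEEL"
predicted = "HHHEELLEEELL"
print(error_breakdown(observed, predicted))
```

For this toy pair, three of twelve residues are wrong, and two of those three are helix/strand confusions.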
Secondary structure predictions help to predict protein disorder. The study of protein structures reveals again and again how the intricate details of structures determine function. However, many proteins, in particular in eukaryotes, have long regions of what is often referred to as intrinsically unstructured or disordered, i.e., regions that are dynamic and fluctuate in conformational space (Dunker and Obradovic 2001; Dunker et al. 2008). Differently put, if one shone a light at them at different time points, one would not observe the same pattern. Disorder occurs in regions that need flexibility to bind to many different substrates or to impose access to a large space to sense intruders into this region (like light sensors in airports) or to buffer intrusion (like filling material in packages). Estimates about how much disorder can be observed in human proteins differ widely according to the choice of parameters and methods. One simplification is that 15–30% of all human proteins have at least 1 region with at least 50 disordered residues. These numbers increase almost twofold when reducing the criterion to a minimum of 30 consecutive residues (Schlessinger et al. 2011). Some disorder is strongly enriched in non-regular secondary structure and others in contact-deprived helices. Furthermore, other disorder has a high propensity for secondary structure switching. Consequently, secondary structure prediction methods help in the identification of protein disorder.
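The length criteria mentioned above (at least 50, or at least 30, consecutive disordered residues) can be checked with a short run-length computation. The sketch assumes a per-residue string in which "D" marks a residue predicted as disordered and any other character an ordered one; this encoding and the function names are assumptions for illustration.

```python
from itertools import groupby


def longest_disordered_run(flags, disordered="D"):
    """Length of the longest stretch of consecutive disordered residues."""
    return max(
        (sum(1 for _ in group) for label, group in groupby(flags) if label == disordered),
        default=0,
    )


def has_long_disorder(flags, min_length=50):
    """True if at least one disordered region reaches `min_length` residues."""
    return longest_disordered_run(flags) >= min_length


# Toy per-residue flags (not a real protein): runs of 35 and 12 disordered residues.
flags = "-" * 10 + "D" * 35 + "-" * 5 + "D" * 12
print(longest_disordered_run(flags))            # prints 35
print(has_long_disorder(flags, min_length=50))  # prints False
print(has_long_disorder(flags, min_length=30))  # prints True
```

The same protein is counted as disordered under the 30-residue criterion but not under the 50-residue one, which illustrates why the estimates depend so strongly on the chosen threshold.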
Important new information for understanding function and evolution. The pursuit of understanding protein function and evolution typically begins with the study of sequence data. Often, analyses also end there, wasting the wealth of detail available from protein structures that usually is needed to discriminate between alternative hypotheses. Since models based on experimental 3D structures are commonly available for fewer than 30–40% of all residues, important details are often missing. Fortunately, secondary structure already captures some of these details in many cases. Therefore, predictions of secondary structure often help more to identify evolutionary and functional similarities than comparisons of secondary structure derived from experimental 3D structures (Przybylski and Rost 2004).
Examples for Methods
In the following, a collection of secondary structure prediction methods is presented with respect to their quality and availability. As with all methods in computational biology, for many reasons the most readily available prediction methods are often not the best ones. In fact, some of the easiest-to-reach methods perform substantially worse than a method any advanced student can develop in a short summer.
PHDsec/PROFsec/ReProf, PredictProtein. ReProf is the latest improvement in this series of methods and increases performance substantially by replacing PSI-BLAST with HHblits. ReProf is available as web service, as package, and as a web server through the first Internet server in the field, namely, PredictProtein (www.predictprotein.org, Yachdav et al. 2014).
PSIPRED, developed by the group of David Jones. PSIPRED (Jones 1999) is clearly one of the top performers (Eyrich et al. 2003). The important step introduced by PSIPRED was the move from the simple BLAST-like pairwise alignments to using further reaching PSI-BLAST-based alignments. The group has continuously updated the service. It is available as a stand-alone version and as a web server together with a suite of other prediction methods (http://bioinf.cs.ucl.ac.uk/psipred/, Buchan et al. 2013).
SABLE. SABLE initially improved over PSIPRED by combining PSI-BLAST profiles and neural networks and predicted solvent accessibility to improve the prediction of secondary structure (Adamczak et al. 2005). A web server and a stand-alone version are available (http://sable.cchmc.org/).
PORTER, Distill. PORTER defines one important point in the ongoing struggle for advances (Mirabello and Pollastri 2013). Improvements come through using more advanced machine learning devices than those exploited in ReProf and PSIPRED, in addressing particular shortcomings such as finding an optimal path between comparative modeling and de novo prediction and in the particular way in which output from different structure prediction methods is combined to predict secondary structure. The web server Distill (distill.ucd.ie/distill) combines various structure prediction methods from the group (Bau et al. 2006).
S2D. This method uses multiple layers of neural networks for the prediction (Sormanni et al. 2015). It also differs from the other methods in this list through its usage of a training set built from chemical shift analyses of NMR structures which allows it to better distinguish between ordered secondary structure and disordered regions.
SPIDER3. This recent deep learning-based approach predicts several protein structural features simultaneously, including secondary structure and solvent accessibility, by training multiple bidirectional recurrent neural networks and then using all predicted outputs from the previous iterations as input in the following ones (Heffernan et al. 2017).
This entry introduces the prediction of protein secondary structure, namely, of alpha-helix, beta-strand, and others. Since secondary structure is the simplest, yet meaningful aspect of protein structure, it is utilized in a large variety of applications: in protein structure prediction, when aligning proteins with little sequence similarity, and in the prediction of protein binding and subcellular localization among others. In general, predicting secondary structure is always a good idea to start any attempt at analyzing a protein sequence.
The history of secondary structure prediction is described, including important aspects and the ongoing improvement of the prediction methods. Furthermore, useful hints for users are provided, and a number of today’s secondary structure prediction methods are described.
- Baker MS, Ahn SB, Mohamedali A, Islam MT, Cantor D, Verhaert PD, Fanayan S, Sharma S, Nice EC, Connor M, Ranganathan S (2017) Accelerating the search for the missing proteins in the human proteome. Nat Commun 8:14271. https://doi.org/10.1038/ncomms14271
- Faraggi E, Zhang T, Yang Y, Kurgan L, Zhou Y (2012) SPINE X: improving protein secondary structure prediction by multistep learning coupled with prediction of solvent accessible surface area and backbone torsion angles. J Comput Chem 33(3):259–267. https://doi.org/10.1002/jcc.21968
- Heffernan R, Paliwal K, Lyons J, Dehzangi A, Sharma A, Wang J, Sattar A, Yang Y, Zhou Y (2015) Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning. Sci Rep 5:11476. https://doi.org/10.1038/srep11476
- Heffernan R, Yang Y, Paliwal K, Zhou Y (2017) Capturing non-local interactions by long short-term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure, backbone angles, contact numbers and solvent accessibility. Bioinformatics 33(18):2842–2849. https://doi.org/10.1093/bioinformatics/btx218
- Rose PW, Prlić A, Altunkaya A, Bi C, Bradley AR, Christie CH, Di Costanzo L, Duarte JM, Dutta S, Feng Z, Green RK, Goodsell DS, Hudson B, Kalro T, Lowe R, Peisach E, Randle C, Rose AS, Shao C, Tao YP, Valasatava Y, Voigt M, Westbrook JD, Woo J, Yang H, Young JY, Zardecki C, Berman HM, Burley SK (2017) The RCSB protein data bank: integrative view of protein, gene and 3D structural information. Nucleic Acids Res 45(D1):D271–D281. https://doi.org/10.1093/nar/gkw1000
- Sillitoe I, Lewis TE, Cuff A, Das S, Ashford P, Dawson NL, Furnham N, Laskowski RA, Lee D, Lees JG, Lehtinen S, Studer RA, Thornton J, Orengo CA (2015) CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res 43(D1):D376–D381. https://doi.org/10.1093/nar/gku947
- Yachdav G, Kloppmann E, Kajan L, Hecht M, Goldberg T, Hamp T, Hönigschmid P, Schafferhans A, Roos M, Bernhofer M, Richter L, Ashkenazy H, Punta M, Schlessinger A, Bromberg Y, Schneider R, Vriend G, Sander C, Ben-Tal N, Rost B (2014) PredictProtein-an open resource for online prediction of protein structural and functional features. Nucleic Acids Res 42(Web Server issue):W337–W343. https://doi.org/10.1093/nar/gku366