TOPS++FATCAT: Fast flexible structural alignment using constraints derived from TOPS+ Strings Model
Protein structure analysis and comparison are major challenges in structural bioinformatics. Despite the existence of many tools and algorithms, very few of them have managed to capture the intuitive understanding of protein structures developed in structural biology, especially in the context of rapid database searches. Such intuitions could help speed up similarity searches and make it easier to understand the results of such analyses.
We developed the TOPS++FATCAT algorithm, which uses an intuitive description of protein structures, as captured in the popular TOPS diagrams, to limit the search space of aligned fragment pairs (AFPs) in the flexible alignment of protein structures performed by the FATCAT algorithm. The TOPS++FATCAT algorithm is faster than FATCAT by more than an order of magnitude, with a minimal cost in classification and alignment accuracy. For beta-rich proteins its accuracy is better than that of FATCAT, because the TOPS+ strings models contain important information about the parallel and anti-parallel hydrogen-bond patterns between the beta-strand SSEs (Secondary Structural Elements). We show that the TOPS++FATCAT errors, rare as they are, can be clearly linked to oversimplifications of the TOPS diagrams and can be corrected by developing more precise secondary structure element definitions.
The benchmark analysis results and the compressed archive of the TOPS++FATCAT program for the Linux platform can be downloaded from the following web site: http://fatcat.burnham.org/TOPS/
TOPS++FATCAT provides FATCAT accuracy and insights into protein structural changes at a speed comparable to sequence alignments, opening up the possibility of interactive protein structure similarity searches.
Keywords: Receiver Operating Characteristic Curve, String Model, Alpha Helix, Longest Common Subsequence
Structural biology is one of the most successful fields of modern biology. Over 50,000 solved protein structures illustrate details of many specific biological processes. The same data also provide us with information about the global features of protein structure space and can be studied to discover the evolutionary, physical, and mathematical rules governing them. How many fundamentally different protein shapes (folds) are there? How do protein structures evolve? How do new structural features appear, and if they are coupled with changes in function, how does this process occur? Such questions can be studied by classifying, comparing and analyzing known protein structures. Two different but synergistic strategies are typically used for this purpose. In classification systems such as SCOP or CATH, human intuition is used to simplify the description of protein structures to a manageable size, and a human eye, sometimes supported by automated analysis, can recognize patterns and types of structures. In the second approach, specialized comparison algorithms, such as DALI, CE, or FATCAT, can be used to calculate a distance-like metric in the protein structure space. This in turn can be used to cluster proteins into groups. Many such algorithms have been developed over the past few decades and have been mostly used for the classification of protein structures into families.
An exact solution of an alignment between two structures is formally equivalent to a threading problem and is therefore NP-complete. However, a practical solution can be obtained by heuristics that reduce the problem to a manageable size. In human classification systems, the protein is usually reduced to a set of several structural elements, a reduction that inevitably involves many arbitrary thresholds. Automated algorithms have the same problem and also suffer from inconsistencies between different numerical measures of protein structure similarity. Interestingly, despite these problems, the results of different approaches are broadly similar. They all identify approximately a few hundred general classes of protein structures, usually called folds or topologies, distinguished by how the main chain of the protein folds around itself in three-dimensional space. At the same time, the comparison of different approaches, both between and within the two classes, shows that fold/topology (or cluster) definitions are somewhat fuzzy, with some proteins being occasionally difficult to classify and joining different groups depending on various assumptions. This has led some to question the concept of the fold, but practical application of protein structure comparison leaves little doubt that protein structure space has some natural granularity that overlaps well with the traditional fold classification.
Flexible structure alignment method FATCAT
FATCAT, like most other protein structure comparison programs, is very slow when compared to sequence alignments. The computing time of FATCAT is determined by the size of the collection of AFPs detected between the two structures being compared. FATCAT is available from a server (http://fatcat.burnham.org) with an option to search the SCOP or PDB databases for similar structures. Such a search typically takes between 8 and 16 hours of CPU time, and this is the main obstacle to broader use of this option. FATCAT has also been used to construct the Flexible Structure Neighborhood (FSN) database, which contains pre-computed results of structure similarity searches; updating the FSN database takes several weeks of CPU time. Other protein structure comparison resources, such as DALI or CE, have very similar problems.
TOPS cartoons and TOPS graph models
As discussed in the Background, TOPS cartoons capture a simplified, fold-level description of protein structure and at the same time can be generated automatically. The TOPS algorithm uses structural features such as hydrogen bonds and the chirality of the beta strands to provide a scoring function for optimizing the cartoon (see Figure 1(b)). In TOPS, the secondary structural elements (SSEs) are derived from the DSSP program. Based on TOPS cartoons, a formal graph model, graph-based definitions of protein topology, and pattern discovery and comparison methods were developed [26, 27]. The TOPS database and the comparison, pattern discovery and matching programs are accessible from http://www.tops.leeds.ac.uk.
Novel TOPS+ and TOPS+ strings models
In detail, each node (SSE segment) of the TOPS+ strings is described by its type, orientation, PDB start number, segment length, total number of incoming (InArc) and outgoing (OutArc) arcs (edges), total number of ArcTypes, and total number of ligand arcs (LigArc). The type of the segment (SSEType) is one of [E, e, H, h, U, u], where "E" and "e" represent the "up"- and "down"-oriented beta strands; "H" and "h" indicate the "up"- and "down"-oriented alpha helices; and "U" and "u" represent ligand-bound and ligand-free loops. The InArcType is one of [R, L, P, A], where "R" and "L" represent right and left chiralities, and "P" and "A" represent parallel and anti-parallel hydrogen bonds, respectively. The OutArcType is represented in a similar manner by [R', L', P', A']. Ligand arcs are indicated by LT = AA, where LT is the ligand type and AA is the PDB number. For example, Figure 3(a) and 3(b) contain visual representations of the TOPS+ and TOPS+ strings models, respectively, for the protein domain d1fnb_1. Here the triangles represent the beta strands; the red curve represents the alpha helix; gray ellipsoids indicate loops; and green arcs indicate hydrogen bonds between two beta strands that form an anti-parallel beta sheet. The length of a TOPS+ strings model is defined as its number of SSEs; thus, the length of d1fnb_1 is 19. For further details, see .
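The node description above maps naturally onto a small record type. The following Python sketch is illustrative only: the class name TopsPlusNode, its field names, the decision to store arcs as lists of type codes, and the reading of "total number of ArcTypes" as the number of distinct arc types are all assumptions made here, not the authors' data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# SSE type codes and arc type codes as listed in the text.
SSE_TYPES = {"E", "e", "H", "h", "U", "u"}
ARC_TYPES = {"R", "L", "P", "A"}


@dataclass
class TopsPlusNode:
    """One SSE segment of a TOPS+ strings model (hypothetical layout)."""
    sse_type: str                 # one of E, e, H, h, U, u
    pdb_start: int                # PDB start number of the segment
    length: int                   # segment length in residues
    in_arcs: List[str] = field(default_factory=list)      # incoming arc types
    out_arcs: List[str] = field(default_factory=list)     # outgoing arc types
    ligand_arcs: List[str] = field(default_factory=list)  # e.g. "LT=AA" labels

    def __post_init__(self) -> None:
        if self.sse_type not in SSE_TYPES:
            raise ValueError(f"unknown SSE type: {self.sse_type!r}")
        for arc in self.in_arcs + self.out_arcs:
            if arc not in ARC_TYPES:
                raise ValueError(f"unknown arc type: {arc!r}")

    @property
    def orientation(self) -> Optional[str]:
        """'up'/'down' for strands and helices; None for loops (U, u)."""
        if self.sse_type in ("E", "H"):
            return "up"
        if self.sse_type in ("e", "h"):
            return "down"
        return None

    @property
    def in_arc_count(self) -> int:
        return len(self.in_arcs)

    @property
    def out_arc_count(self) -> int:
        return len(self.out_arcs)

    @property
    def arc_type_count(self) -> int:
        # Assumption: counted as the number of distinct arc types on the node.
        return len(set(self.in_arcs) | set(self.out_arcs))


def model_length(nodes: List[TopsPlusNode]) -> int:
    """Length of a TOPS+ strings model is its number of SSEs."""
    return len(nodes)
```

With this layout, a strand written as "E" with one anti-parallel incoming arc would be built as TopsPlusNode("E", 12, 6, in_arcs=["A"]), and a model is simply a list of such nodes in chain order.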
TOPS+ strings comparison method
The TOPS+ strings comparison method computes a distance between the TOPS+ strings models of two proteins using a dynamic programming approach and identifies the longest common subsequence (LCS), which consists of the list of topologically equivalent SSEs between the two proteins. For example, Figure 3(c) shows the TOPS+ strings alignment between the dihydropteridine reductase proteins from rat (1dhr) and human (1hdr). The TOPS+ strings models for 1dhr and 1hdr are represented by a linear string model, where yellow triangles and red curves indicate the beta strands and alpha helices, respectively, in their "up" or "down" orientations. The grey lines and purple stubs represent the loop regions and the NAD ligand interactions, respectively. Note that the ligand-interaction information is optional, and we have not used it in this work. The incoming and outgoing arcs are depicted on the SSEs (top and bottom of the beta strands), where red and green arcs represent the parallel and anti-parallel hydrogen-bond interactions that carry the beta-sheet information, while yellow and blue arcs indicate the right and left chirality relationships between the SSEs. A pink arrow between the TOPS+ strings elements indicates a conserved SSE. The dotted arrows indicate the conserved alpha helices and beta strands, while the plain arrows indicate the conserved loop regions.
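To make the LCS step concrete, the sketch below runs the textbook dynamic programming recurrence over two sequences of SSE symbols and traces back the matched positions. It is a simplified stand-in, not the TOPS+ strings scoring: the function name lcs_alignment is hypothetical, and the default match test compares only the SSE type code, whereas the method described above also uses arc and orientation information.

```python
from typing import Callable, List, Sequence, Tuple


def lcs_alignment(
    a: Sequence[str],
    b: Sequence[str],
    match: Callable[[str, str], bool] = lambda x, y: x == y,
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if match(a[i - 1], b[j - 1]):
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        if match(a[i - 1], b[j - 1]):
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


# Two toy SSE strings: strand, loop, helix, loop, strand vs. strand, loop, strand.
equivalent = lcs_alignment(list("EuHuE"), list("EuE"))
```

The returned pairs play the role of the "topologically equivalent SSEs": each pair names one SSE in the first protein and its counterpart in the second, in chain order.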
Table: SCOP superfamily-level homolog vs. non-homolog protein domain pairs statistics (rows include the all-alpha and all-beta classes; values not preserved).
We performed Receiver Operating Characteristic (ROC) curve and AUC (Area Under the ROC Curve) analyses to compare the performance of the TOPS++FATCAT method with the original FATCAT method, using the SCOP classification at the superfamily level as the standard of comparison.
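As a minimal illustration of the AUC computation, the sketch below scores a list of domain pairs by p-value and uses a same-superfamily flag as the label. It relies on the standard rank interpretation of the AUC (the probability that a randomly chosen homolog pair scores better than a randomly chosen non-homolog pair, with ties counted as one half). The function name and the toy numbers are assumptions for illustration, not benchmark data.

```python
from typing import Sequence


def auc_from_pvalues(pvalues: Sequence[float], is_homolog: Sequence[bool]) -> float:
    """AUC where a lower p-value means a stronger similarity prediction."""
    pos = [p for p, h in zip(pvalues, is_homolog) if h]
    neg = [p for p, h in zip(pvalues, is_homolog) if not h]
    if not pos or not neg:
        raise ValueError("need at least one homolog and one non-homolog pair")

    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos < p_neg:       # homolog pair ranked ahead of non-homolog
                wins += 1.0
            elif p_pos == p_neg:    # ties count as half
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two homolog pairs and two non-homolog pairs.
toy_auc = auc_from_pvalues([0.001, 0.3, 0.2, 0.5], [True, True, False, False])
```

The pairwise loop is quadratic in the number of pairs, which is fine for a sketch; a sort-based rank sum gives the same value for large benchmarks.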
ROC and AUC Analyses
Table 2. AUC values based on p-values from the FATCAT and TOPS++FATCAT methods (rows include the all-alpha and all-beta classes; values not preserved).
For all protein classes, the rigid FATCAT performs best, usually followed by the flexible FATCAT, the rigid TOPS++FATCAT, and the flexible TOPS++FATCAT. The performance of all four methods is best for all-alpha and all-beta proteins, and all four perform markedly worse (but similarly to each other) for alpha/beta proteins. Only alpha+beta proteins show a clear difference between the FATCAT and TOPS++FATCAT methods. It is important to note that the TOPS+ strings models encode the parallel and anti-parallel properties of the beta sheets as the total number of incoming and outgoing arcs together with their ArcTypes. As a result, the TOPS++FATCAT method discriminates beta-rich protein domain pairs more efficiently than the original FATCAT method. For example, for the all-beta protein domain pairs, both the flexible and the rigid TOPS++FATCAT methods perform well. The flexible TOPS++FATCAT method covers nearly 84% of the true positives at 0% false positives, whereas the flexible and rigid FATCAT methods cover only 76% and 49% of the true positives, respectively, at 0% false positives. A zoomed-in version of the ROC curves, up to 10% false positives, for the all-beta protein families is shown in Figure 5(f), where the rigid (green) and flexible (red) TOPS++FATCAT methods reach coverage rates of 82% and 84% of true positives, respectively, at 0% false positives. The overall results for all protein classes show that TOPS++FATCAT performance is only slightly lower than that of FATCAT (3%–7% difference in AUC value; see Table 2), while providing a significant, more than 10-fold speedup (see next section).
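The "coverage at 0% false positives" quoted above can be read as the fraction of homolog pairs that score better than every non-homolog pair. The short sketch below computes that quantity from p-values under this reading; the function name and the toy values are assumptions for illustration, and the strict inequality is one possible tie-handling choice.

```python
from typing import Sequence


def coverage_at_zero_fp(pvalues: Sequence[float], is_homolog: Sequence[bool]) -> float:
    """Fraction of homolog pairs with a p-value below the best non-homolog p-value."""
    pos = [p for p, h in zip(pvalues, is_homolog) if h]
    neg = [p for p, h in zip(pvalues, is_homolog) if not h]
    if not pos or not neg:
        raise ValueError("need at least one homolog and one non-homolog pair")

    # Any threshold at or above this value would admit a false positive.
    best_negative = min(neg)
    covered = sum(1 for p in pos if p < best_negative)
    return covered / len(pos)


# Toy example: one of the two homolog pairs beats every non-homolog pair.
toy_coverage = coverage_at_zero_fp([0.001, 0.3, 0.2, 0.5], [True, True, False, False])
```

This is the left-most point of the ROC curve, which is why it can differ sharply between methods even when their overall AUC values are close.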
AFP and Runtime Analyses
Table: AFP and runtime from FATCAT and TOPS++FATCAT (columns include average runtime in seconds; values not preserved).
Flexible and rigid FATCAT and TOPS++FATCAT comparison results for d2trxa_ and d1kte_
Discussion and conclusion
The overall results for all protein classes show that TOPS++FATCAT performance is only slightly lower than that of FATCAT (3%–7% difference in AUC value), while providing a significant, more than 10-fold speedup. The main reason for the discrepancies is that TOPS+ strings alignments occasionally misalign the secondary structure elements, and the subsequent FATCAT alignment, constrained by the TOPS+ strings alignment, cannot overcome these earlier errors. There is a clear trade-off between runtime and accuracy: limiting the pool of fragments being compared speeds up the algorithm but results in (slightly) lower accuracy. At the same time, these results offer clear suggestions for future development. Using a more advanced version of the TOPS+ strings comparison method would remove some of the false positives, but possibly at the cost of significantly slowing the overall performance of the TOPS++FATCAT method.
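The trade-off described above comes from pruning the AFP pool before the FATCAT chaining step. The sketch below shows one plausible way such a constraint could work: an AFP is kept only if both of its fragments overlap the residue ranges of one pair of SSEs that the string alignment marked as equivalent. Everything here is hypothetical (the AFP and SSEPair records, the overlap rule, and the pad parameter); the source states only that the TOPS+ strings alignment limits the AFP search space, not how.

```python
from typing import List, NamedTuple, Tuple


class AFP(NamedTuple):
    """Aligned fragment pair: start residue in each structure and fragment length."""
    start1: int
    start2: int
    length: int


class SSEPair(NamedTuple):
    """Residue ranges (inclusive) of two SSEs judged topologically equivalent."""
    range1: Tuple[int, int]
    range2: Tuple[int, int]


def _overlaps(start: int, length: int, rng: Tuple[int, int], pad: int) -> bool:
    """True if the fragment [start, start+length-1] overlaps rng widened by pad."""
    lo, hi = rng
    return start <= hi + pad and start + length - 1 >= lo - pad


def filter_afps(afps: List[AFP], sse_pairs: List[SSEPair], pad: int = 0) -> List[AFP]:
    """Keep AFPs whose two fragments both fall on the same equivalent SSE pair."""
    return [
        afp
        for afp in afps
        if any(
            _overlaps(afp.start1, afp.length, pair.range1, pad)
            and _overlaps(afp.start2, afp.length, pair.range2, pad)
            for pair in sse_pairs
        )
    ]


# Toy input: the middle AFP pairs regions that the SSE alignment does not relate.
afps = [AFP(5, 7, 8), AFP(5, 60, 8), AFP(40, 42, 8)]
sse_pairs = [SSEPair((1, 12), (3, 14)), SSEPair((38, 50), (40, 52))]
kept = filter_afps(afps, sse_pairs)
```

A larger pad keeps more AFPs near SSE boundaries, trading speed for robustness; it also illustrates why a wrong SSE pairing cannot be repaired later, since the correct AFPs are never passed on to the chaining step.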
This research was supported by NIH grant P20 GM076221 (Joint Center for Molecular Modeling). We would like to thank the TOPS project for the TOPS+ resources.
- 15. Singh AP, Brutlag DL: Hierarchical Protein Structure Superposition Using Both Secondary Structure and Atomic Representations. Proc Int Conf Intell Syst Mol Biol 1997, 5: 284–293.
- 18. Gilbert D, Westhead D, Viksna J, Thornton J: Topology-based protein structure comparison using a pattern discovery technique. Edited by: Martin A, Corne D. The Society for the Study of Artificial Intelligence and the Simulation of Behaviour; 2000:11–17.
- 19. Viksna J, Gilbert D: Pattern matching and pattern discovery algorithms for protein topologies. LNCS 2149. Springer-Verlag; 2001:98–111.
- 20. Krissinel E, Henrick K: Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr D Biol Crystallogr 2004, 60: 2256–2268. 10.1107/S0907444904026460
- 23. Gusfield D: Algorithms on strings, trees and sequences: Computer science and computational biology. 2nd edition. New York: Cambridge University Press; 1999.
- 28. Veeramalai M: A novel method for comparing topological models of protein structures enhanced with ligand information. PhD thesis. Department of Computing Science, University of Glasgow; 2005.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.