Topology independent protein structural alignment
 6.4k Downloads
 18 Citations
Abstract
Background
Identifying structurally similar proteins with different chain topologies can aid studies in homology modeling, protein folding, protein design, and protein evolution. These include circular permuted protein structures, and the more general cases of noncyclic permutations between similar structures, which are related by nontopological rearrangement beyond circular permutation. We present a method based on an approximation algorithm that finds sequenceorder independent structural alignments that are close to optimal. We formulate the structural alignment problem as a special case of the maximumweight independent set problem, and solve this computationally intensive problem approximately by iteratively solving relaxations of a corresponding integer programming problem. The resulting structural alignment is sequence order independent. Our method is also insensitive to insertions, deletions, and gaps.
Results
Using a novel similarity score and a statistical model for significance pvalue, we are able to discover previously unknown circular permuted proteins between nucleoplasmincore protein and auxin binding protein, between aspartate rasemase and 3dehydrogenate dehydralase, as well as between migration inhibition factor and arginine repressor which involves an additional strandswapping. We also report the finding of noncyclic permuted protein structures existing in nature between AML1/core binding factor and ribofalvin synthase. Our method can be used for large scale alignment of protein structures regardless of the topology.
Conclusion
The approximation algorithm introduced in this work can find good solutions for the problem of protein structure alignment. Furthermore, this algorithm can detect topological differences between two spatially similar protein structures. The alignment between MIF and the arginine repressor demonstrates our algorithm's ability to detect structural similarities even when spatial rearrangement of structural units has occurred. The effectiveness of our method is also demonstrated by the discovery of previously unknown circular permutations. In addition, we report in this study the finding of a naturally occurring noncyclic permuted protein between AML1/Core Binding Factor chain F and riboflavin synthase chain A.
Keywords
Similarity Score Structural Alignment Conflict Graph Secondary Structure Content Circular PermutationBackground
The classification of protein structures often rely on the topology of secondary structural elements. For example, the Structural Classification of Proteins (SCOP) system classifies proteins structures into common folds using the topological arrangement of secondary structural units [1]. Most protein structural alignment methods can reliably classify proteins into similar folds given the structural units from each protein are in the same sequential order. However, the evolutionary possibility of proteins with different structural topology but with similar spatial arrangement of their secondary structures pose a problem. One such possibility is the circular permutation.
Permutation by duplication [4, 5] is a widely accepted model where a gene first duplicates and fuses. After fusion, a new start codon is inserted into one gene copy while a new stop codon is inserted into the second copy. Peisajovich et al. demonstrated the evolutionary feasibility of permutation via duplication by creating functional intermediates at each step of the permutation by duplication model for DNA methyltransferases [6]. Identifying structurally similar proteins with different chain topologies, including circular permutation, can aid studies in homology modeling, protein folding, and protein design. An algorithm that can structurally align two proteins independent of their backbone topologies would be an important tool.
The biological implications of thermodynamically stable and biologically functional circular permutations, both natural and artificial, has resulted in much interest in detecting circular permutations in proteins [3, 7, 8, 9, 10, 11]. The more general problem of detecting nontopological structural similarities beyond circular permutation has received less attention. We refer to these as noncyclic permutations from now on. Tabtiang et al. were able to create a thermodynamically stable and biologically functional noncyclic permutation, indicating that noncyclic permutations may be as important as circular permutations [12]. In this study, we present a novel method that detects spatially similar structures that can identify structures related by circular and more complex noncyclic permutations. Detection of noncyclic permutation is possible by our algorithm by virtue of a recursive combination of a localratio approach with a global linearprogramming formulation. This paper is organized as follows. We first show that our algorithm is capable of finding known circular permutations with sensitivity and specificity. We then report the discovery of three new circular permutations and one example of a noncyclic permutation that to our knowledge have not been reported in literature. We conclude with remarks and discussions.
Results and discussion
For availability of our alignment software please see [13].
Detection of known circular permutations
Known circular permutation results
Protein 1  Protein 2  Us  DaliLite  K2  

PDB(Length)  PDB(Length)  N  RMSD  p – value  N  RMSD  N  RMSD 
1rinA(180)  2cna_(237)  152*  0.875  10^{6}  106  1.7  60  0.92 
1glh_(214)  1cpn_(208)  192*  1.163  10^{5}  156  0.4  156  0.41 
1exg_(110)  1tul_(102)  74*  1.485  10^{4}  63  4.0  34  2.26 
1rhgA(145)  1bcfA(158)  118*  1.500  10^{4}  94  2.3  81  1.51 
1ihwA(52)  1sso_(62)  46*  0.502  10^{3}  45  2.9  28  1.93 
Known circular permutation results
Protein 1  Protein 2  Us  MASS  OPAAS  SAMO  Topofit  

PDB(Length)  PDB(Length)  N  R  p  N  R  N  R  N  R  N  R 
1rinA(180)  2cna_(237)  152*  0.875  10^{6}  164*  1.2  167*  1.48  174*  1.581  152*  1.09 
1glh_(214)  1cpn_(208)  192*  1.163  10^{5}  206*  0.49  No  solution  170*  3.283  206*  0.49 
1exg_(110)  1tul_(102)  74*  1.485  10^{4}  60*  1.9  No  solution  93*  2.88  52*  1.79 
1rhgA(145)  1bcfA(158)  118*  1.500  10^{4}  106*  1.7  63*  2.12  126*  2.309  109*  1.4 
1ihwA(52)  1sso_(62)  46*  0.502  10^{3}  39*  1.7  No  solution  48*  2.713  35*  1.47 
In Table 1 we compare results against DaliLite and K2. As expected, DaliLite returned the largest sequential alignment. K2 did not find circular permutations even when the option to ignore sequence order constraints was selected.
In Table 2 we compare our alignment results to the methods of MASS [14], OPAAS [15], SAMO [7], and Topofit [16]. Each method is able to detect circular permutations. However, Table 2 shows that our method normally finds more equivalent residues with a lower RMSD. Compared with SAMO our method found less aligned residues in 4 of the 5 shown alignments. However, our cRMSD values are considerably better. At the time of this writing, SAMO only outputs the cRMSD and the number of equivalent residues (N) of the alignment, without specifying the residue equivalence relationships between the two aligned protein structures. This makes it difficult to compare the quality of the alignments. Table 2 shows that our method finds better alignments in terms of cRMSD than other structural alignment methods when the two proteins are related by a circular permutation.
The GANSTA method by Kolbeck et al [17] can also align similar structures independent of the connectivity. The approach is somewhat similar to the Blast method in sequence alignment, where a set of seeds of highsimilarity pairs of secondary structural elements (SSE) are first identified, and are then aligned through a genetic algorithm, regardless of the connectivity.
The SCALI method by Yuan and Bystroff [18] assembles from a library of gapless alignment of fragments of local sequencestructure hierarchically, enforcing compactness and conserved contacts, but disregard the sequence ordering of the fragments. The aligned local fragments are then incremented by adding a new fragment pair. This process is organized as a tree, where nodes corresponds to the addition of new fragments. A breadthfirst tree search method was then carried out, with a number of heuristic conditions to limit the search space.
Instead of only aligning regular SSE fragments, our method differs from GANSTA and has no restriction on spatial patterns belonging to a regular SSE, and therefore is also applicable to loop regions. Our method differs from SCALI in that our fragments are not prebuilt, but are exhaustive fragments ranging from size 4–7. Compared to both methods, our method provides a guaranteed optimal ratio of aligned structures, whereas the heuristics employed by GANSTA and SCALI cannot guarantee that a good alignment can be found, and when an alignment is found, there is no guarantee that it will be within a certain ratio of the best possible alignment. In practice, we find that GANSTA often requires 3–5 hours for aligning a pair of proteins, and sometimes no results are returned. In contrast, our method usually terminates between 30 seconds – 5 minutes. The SCALI website consists of precomputed results of aligned structures and does not allow user input for a customized alignment, therefore it is difficult to compare performance of our method with SCALI on the examples reported in Table 2.
Discovery of novel circular permutations and a novel noncyclic permutation
The effectiveness of our method is also demonstrated by the discovery of previously unknown circular permutations. In an attempt to test our algorithm's ability to discover new circular permutations, we structurally aligned a subset of 3,336 structures from PDBSELECT 90% [19]. We first selected proteins from PDBSELECT90 (sequences have less than 90% identities) whose N and C termini were no further than 30Å apart. From this subset of 3,336 proteins, we aligned two proteins if they met the following conditions: the difference in their lengths was no more than 75 residues, and they had approximately the same secondary structure content. To compare secondary structure content, we determined the percentage of the residues labeled as helix, strand, and other for each structure. Two structures were considered to have the same secondary structure content if the difference between each secondary structure label was less than 10%. Within the approximately 200,000 alignments, we found 426 candidate circular permutations. Of these circular permutations, 312 were symmetric proteins that can be aligned with or without a circular permutation. Of the 114 nonsymmetric circular permutations, 112 were already known in literature, and 3 are novel. We describe these three novel circular permutations as well as a novel noncyclic permutation in some details.
Nucleoplasmincore and auxin binding protein
Aspartate racemase and type II 3dehydrogenate dehyrdalase
Migration inhibition factor and arginine repressor
Beyond circular permutation
Algorithm comparison
Alignment quality
Proteins  HOMSTRAD  FAST  US  

PDB(PDB)  PDB(PDB)  N  RMSD  N  M%  RMSD  N  M%  RMSD 
1dfaA  1qceA  57  2.5  55  55%  1.2  45  72%  1.1 
1hx8A  1hg5A  258  1.1  255  99%  1.1  247  98%  1.0 
2ahjA  1rieA  192  4.3  187  89%  2.0  168  99%  1.3 
1h7sA  1b63A  105  2.2  98  99%  2.0  96  100%  1.9 
1ed9A  1ew2A  403  5.6  343  98%  1.7  252  100%  1.2 
1oyc_  2tmdA  330  3.6  284  97%  2.3  193  94%  1.4 
1fjnA  1ica_  33  4.7  28  100%  1.9  33  100%  1.8 
1tpn_  1fbr_  43  2.4  40  93%  2.2  39  97%  2.2 
1e12A  1c3wA  220  1.7  214  97%  1.5  170  100%  0.9 
1af6A  1a0tP  377  4.6  323  97%  1.8  281  97%  1.5 
1hc1_  1lla_  582  2.3  546  97%  1.7  380  100%  1.4 
Conclusion
The approximation algorithm introduced in this work can find good solutions for the problem of protein structure alignment. Furthermore, this algorithm can detect topological differences between two spatially similar protein structures. The alignment between MIF and the arginine repressor demonstrates our algorithm's ability to detect structural similarities even when spatial rearrangement of structural units has occurred. In addition, we report in this study the finding of a naturally occurring noncyclic permuted protein between AML1/Core Binding Factor chain F and riboflavin synthase chain A.
In our method, the scoring function plays a pivotal role in detecting substructure similarity of proteins. We expect future experimentation on optimizing the parameters used in our similarity scoring system can improve detection of topologically independent structural alignment. In this study, we were able to fit our scoring system to an Extreme Value Distribution (EVD), which allowed us to perform an automated search for circular permuted proteins. Although the pvalue obtained from our EVD fit is sufficient for determining the biological significance of a structural alignment, the structural change between the microphage migration inhibition factor and the Cterminal domain of arginine repressor (Figure 3) indicates a need for a similarity score that does not bias heavily towards cRMSD measure for scoring circular permutations.
Whether naturally occurring circular permutations are frequent events in the evolution of protein genes is currently an open question. Lindqvist et al. (1997) pointed out that when the primary sequences have diverged beyond recognition, circular permutations may still be found using structural methods [3]. In this study, we discovered three examples of novel circularly permuted protein structures and a noncyclic permutation among 200,000 protein structural alignments for a set of nonredundant 3,336 proteins. This is an incomplete study, as we restricted our studies to proteins whose N and C termini distance were less than 30Å. We plan to relax the N to C distance and include more proteins in future work to expand the scope of the investigation.
Methods
Approach
In this study, we describe a new algorithm that can align two protein structures or substructures independent of the connectivity of their secondary structure elements. We first exhaustively fragment the two proteins separately. An approximation algorithm based on a fractional version of the localratio approach for scheduling splitinterval graphs [32] is then used to search for the combination of peptide fragments from both structures that will optimize the global alignment of the two structures.
Constraints
Interval clique inequalities:  (2) 

${y}_{{\chi}_{5,{\lambda}_{a}}}\le 1$  Line sweep at a_{ t }= 1 
${y}_{{\chi}_{1,{\lambda}_{a}}}\le 1$  Line sweep at a_{ t }= 5 
${y}_{{\chi}_{4,{\lambda}_{a}}}\le 1$  Line sweep at a_{ t }= 9 
${y}_{{\chi}_{3,{\lambda}_{a}}}+{y}_{{\chi}_{4,{\lambda}_{a}}}\le 1$  Line sweep at a_{ t }= 10 
${y}_{{\chi}_{3,{\lambda}_{a}}}\le 1$  Line sweep at a_{ t }= 12 
${y}_{{\chi}_{3,{\lambda}_{a}}}+{y}_{{\chi}_{2,{\lambda}_{a}}}\le 1$  Line sweep at a_{ t }= 14 
${y}_{{\chi}_{2,{\lambda}_{a}}}\le 1$  Line sweep at a_{ t }= 16 
Interval clique inequalities:  (3) 
${y}_{{\chi}_{1,{\lambda}_{b}}}\le 1$  Line sweep at b_{ t }= 1 
${y}_{{\chi}_{4,{\lambda}_{b}}}\le 1$  Line sweep at b_{ t }= 6 
${y}_{{\chi}_{4,{\lambda}_{b}}}+{y}_{{\chi}_{3,{\lambda}_{b}}}\le 1$  Line sweep at b_{ t }= 7 
${y}_{{\chi}_{3,{\lambda}_{b}}}\le 1$  Line sweep at b_{ t }= 9 
${y}_{{\chi}_{2,{\lambda}_{b}}}+{y}_{{\chi}_{3,{\lambda}_{b}}}\le 1$  Line sweep at b_{ t }= 12 
${y}_{{\chi}_{2,{\lambda}_{b}}}\le 1$  Line sweep at b_{ t }= 13 
${y}_{{\chi}_{5,{\lambda}_{b}}}\le 1$  Line sweep at b_{ t }= 14 
Consistency inequalities:  (4,5) 
${y}_{{\chi}_{1,{\lambda}_{a}}}{x}_{{\chi}_{1}}\ge 0$  ${y}_{{\chi}_{1,{\lambda}_{b}}}{x}_{{\chi}_{1}}\ge 0$ 
${y}_{{\chi}_{2,{\lambda}_{a}}}{x}_{{\chi}_{2}}\ge 0$  ${y}_{{\chi}_{2,{\lambda}_{b}}}{x}_{{\chi}_{2}}\ge 0$ 
${y}_{{\chi}_{3,{\lambda}_{a}}}{x}_{{\chi}_{3}}\ge 0$  ${y}_{{\chi}_{3,{\lambda}_{b}}}{x}_{{\chi}_{3}}\ge 0$ 
${y}_{{\chi}_{4,{\lambda}_{a}}}{x}_{{\chi}_{4}}\ge 0$  ${y}_{{\chi}_{4,{\lambda}_{b}}}{x}_{{\chi}_{4}}\ge 0$ 
${y}_{{\chi}_{5,{\lambda}_{a}}}{x}_{{\chi}_{5}}\ge 0$  ${y}_{{\chi}_{5,{\lambda}_{b}}}{x}_{{\chi}_{5}}\ge 0$ 
Basic Definitions and Notations
The following definitions/notations are used uniformly throughout the paper unless otherwise stated:

Protein structures are denoted by S_{ a }, S_{ b },....

A substructure ${\lambda}_{i,k}^{a}$ of a protein structure S_{ a }is a continuous fragment ${\lambda}_{i,k}^{a}$, where i is the residue index of the beginning of the substructure and k is the length (number of residues) of the substructure. We will denote such a substructure simply by λ^{ a }if i and k are clear from the context or irrelevant.

A residue a_{ t }∈ S_{ a }is a part of a substructure ${\lambda}_{i,k}^{a}$ if i ≤ t ≤ i + k  1.

Λ_{ a }is the set of all continuous substructures or fragments of protein structure S_{ a }that is under consideration in our algorithm.

χ_{ i, j, k }(or simply χ when the other parameters are understood from the context) denotes an ordered pair (${\lambda}_{i,k}^{a},{\lambda}_{j,k}^{b}$) of equal length substructures of two protein structures S_{ a }and S_{ b }.

Two ordered pairs of substructures (${\lambda}_{i,k}^{a},{\lambda}_{j,k}^{b}$) and (${\lambda}_{{i}^{\prime},{k}^{\prime}}^{a},{\lambda}_{{j}^{\prime},{k}^{\prime}}^{b}$) are called inconsistent if and only if at least one of the pairs of substructures {${\lambda}_{i,k}^{a},{\lambda}_{{i}^{\prime},{k}^{\prime}}^{a}$} and {${\lambda}_{j,k}^{a},{\lambda}_{{j}^{\prime},{k}^{\prime}}^{a}$} are not disjoint.
We are now ready to formalize our substructure similarity identification problem as below:
Problem name: Basic Substructure Similarity Identification (BSSI_{Λ, σ}).
Instance: a set Λ = {χ_{ i, j, k }i, j, k ∈ ℕ} ⊂ Λ_{ a }× Λ_{ b }of ordered pairs of equal length substructures of S_{ a }and S_{ b }and a similarity function σ : Λ ↦ ℝ^{+} mapping each pair of substructures to a positive similarity value.
Valid Solutions: a set of substructure pairs {${\chi}_{{i}_{1},{j}_{1},{k}_{1}},{\chi}_{{i}_{2},{j}_{2},{k}_{2}},\mathrm{...}{\chi}_{{i}_{t},{j}_{t},{k}_{t}}$} that are mutually consistent.
Objective: maximize the total similarity of the selection ${\sum}_{\ell =1}^{t}\sigma ({\chi}_{{i}_{\ell},{j}_{\ell},{k}_{\ell}})$.
An Algorithm Based on the LocalRatio Approach
The BSSI_{Λ, σ}problem is a special case of the wellknown maximum weight independent set problem in graph theory [33]. In fact, BSSI_{Λ, σ}itself is MAXSNPhard (i. e., there is a constant 0 <ε < 1 such that no polynomialtime algorithm can return a solution with a value of the objective function that is within 1  ε times the optimum [34] unless P = NP) even when all the substructures are restricted to have lengths at most 2 [32, Theorem 2.1]. Our approach is to adopt the approximation algorithm for scheduling splitinterval graphs [32] which itself is based on a fractional version of the localratio approach. For ease in description of our algorithm, we introduce the following definitions.
Definition 1 For any subset, Δ ⊆ Λ the conflict graph G_{Δ} = (V_{Δ}, E_{Δ}) is the graph in which V_{Δ} = {χχ ∈ Δ} and E_{Δ} = {{χ, χ'}χ, χ', ∈ Δ and the pair {χ, χ'} is not consistent}
Definition 2 The closed neighborhood Nbr_{Δ} [χ] of a vertex χ of G_{Δ} is {χ'  {χ, χ' } ∈ E_{Δ}} ∪ {χ}.
 1.
Solve a linear programming (LP) formulation of BSSI_{Λ,σ}by relaxing a corresponding integer programming version of the BSSI_{Λ,σ}problem.
 2.For every vertex χ ∈ V_{Δ} of G_{Δ}, compute its local conflict number ${\alpha}_{\chi}={\displaystyle {\sum}_{{\chi}^{\prime}\in {\text{Nbr}}_{\Delta}[\chi ]}{x}_{{\chi}^{\prime}}}$. Let χ_{ min }be the vertex with the minimum local conflict number. Define a new similarity function σ_{ new }from σ as follows:${\sigma}_{new}(\chi )=\{\begin{array}{ll}\sigma (\chi )\hfill & \text{if}\chi \notin {\text{Nbr}}_{\Delta}[{\chi}_{min}]\hfill \\ \sigma (\chi )\sigma ({\chi}_{min})\hfill & \text{otherwise}\hfill \end{array}$
 3.
Create Δ_{ new }⊆ Δ by removing from Δ every substructure pair χ such that σ_{ new }(χ) ≤ 0. Push each removed substructure on to a stack in arbitrary order.
 4.
If Δ_{ new }≠ ∅ then repeat from step 1 setting Δ = Δ_{ new }and σ = σ_{ new }. Otherwise, continue to step 5.
 5.
Repeatedly pop the stack, adding the substructure pair to the alignment as long as the following conditions are met:
 (a)
The substructure pair is consistent with all other substructure pairs that already exist in the selection.
 (b)
The cRMSD of the alignment does not change by a threshold. This condition bridges the gap between optimizing a local similarity between substructures and optimizing the tertiary similarity of the alignment by guaranteeing that each substructure from a substructure pair is in the same spatial arrangement in the global alignment.
A brief intuitive explanation of the various inequalities in the LP formulation as described above in terms of their original integer programming formulation is as follows:

The "interval clique" inequalities in Equation (2) (resp. Equation (3)) ensure that the various substructures of S_{ a }(resp. S_{ b }) in the selected substructure pairs from Δ are mutually disjoint.

Inequalities in Equation 4 and Equation 5 ensure consistencies between the indicator variable for each substructure pair χ and its two substructures λ_{ a }and λ_{ b }.

Inequalities in Equation 6 relax the 0–1 values of the indicator variables to any fractional value between 0 and 1.
In implementation, the graph G_{Δ} is considered implicitly via intersecting intervals. The interval clique inequalities can be generated via a sweepline approach (see Figure 6c). The running time depends on the number of iterations needed to solve the LP formulations. Let LP(n, m) denote the time taken to solve a linear programming problem on n variables and m inequalities. Then the worst case running time of the above algorithm is O(Λ·LP(3Λ, 5Λ + Λ_{ a } + Λ_{ b })). However, the worstcase time complexity happens under the excessive pessimistic assumption that each iteration removes exactly one vertex of G_{Λ}, namely χ_{ min }only, from consideration, which is unlikely to occur in practice as our computational results show. A theoretical pessimistic estimate of the performance ratio of our algorithm can be obtained as follows. Let α be the maximum of all the ${\alpha}_{{\chi}_{min}}$ 's over all iterations. Proofs in [32] translate to the fact that the algorithm returns a solution whose total similarity is at least $\frac{1}{\alpha}$ times that of the optimum and, if Step 5(b) is omitted from the algorithm, then α ≤ 4. The value of α even with Step 5(b) is much smaller than 4 in practice (e.g. α = 2.89).
Simple example
We present a simplified example to illustrate the first iteration of our algorithmic approach for two protein structures S_{ a }and S_{ b }(Figure 6a) selected for alignment. Here S_{ b }is the structure to be aligned to the reference structure S_{ a }. We systematically cut S_{ b }into fragments of length 4–7 and exhaustively compute a similarity score of each fragment from S_{ b }to all possible fragments of equal length in S_{ a }. Each fragment pair can be thought of as a vertex in a graph (Figure 6b). Abusing notations slightly for ease of understanding, let the vertices be denoted by vertex corresponds to a rectangle in Figure 6c. Suppose we have the following similarity scores for aligned substructures: σ (χ_{1}) = 8, σ (χ_{2}) = 5, σ(χ_{3}) = 7, σ (χ_{4}) = 3 and σ(χ_{5}) = 6. Then, our objective function is to maximize $8{x}_{{\chi}_{1}}+5{x}_{{\chi}_{2}}+7{x}_{{\chi}_{3}}+3{x}_{{\chi}_{4}}+6{x}_{{\chi}_{5}}$. Figure 6b shows the conflict graph for the set of fragments. A sweep line (shown as dashed lines in Figure 6c) is implicitly constructed (O (n) time after sorting) to determine which vertices of fragment pairs overlap. A conflict is shown in Figure 6b as edges between vertices. χ_{1} and χ_{5} do not conflict with any other substructure pairs, while χ_{2} and χ_{4} conflict with χ_{3}. For this graph, the constraints in the linear programming formulation are shown in Table 4. The linear programming problem is solved using the BPMPD package [35].
Similarity Score σ
The similarity score σ(χ_{ i, j, k }) between two aligned substructures ${\lambda}_{i,k}^{a}$ and ${\lambda}_{j,k}^{b}$ is a weighted sum of a shape similarity measure derived from the cRMSD value, which is then modified for the secondary structure content of the aligned substructure pairs, and a sequence composition score (SCS). Here cRMSD values are the coordinate root mean square distance, which are the square root of the mean of squares of Euclidean distances of coordinates of corresponding C_{ α }atoms. Formally, for two sets of n points v and w, the cRMSD is defined as $\sqrt{{\displaystyle {\sum}_{1}^{n}{\Vert {v}_{i}{w}_{i}\Vert}^{2}}}$.
cRMSD scaling by secondary structure content
to modify the cRMSD of the aligned substructure pairs to remove bias due to different secondary structure content. We use DSSP [36] to assign secondary structure to the residues of each protein.
Sequence composition
where A_{ a, i }and A_{ b, i }are the amino acid residue types at aligned position i. B(A_{ a, i }, A_{ b, i }) is the similarity score between A_{ a, i }and A_{ b, i }based on a modified BLOSUM50 matrix, in which a constant is added to all entries such that the smallest entry is 1.0.
Combined similarity score
In current implementation, the values of α and C are empirically set to 100 and 2, respectively.
Similarity score for aligned molecules
The output of the above algorithm is a set of aligned substructure pairs X = {χ_{1}, χ_{2}, ... χ_{ m }} that maximize Equation (1).
In this case k = N_{ X }, where N_{ X }is the total number of aligned residues.
Notes
Acknowledgements
A shorter preliminary version of this paper was presented at the 7^{th} Workshop on Algorithms in Bioinformatics (WABI) during September 2007. Bhaskar DasGupta was supported by NSF grants IIS0346973, IIS0612044 and DBI0543365. Joe Dundas was partially supported by NSF grant IIS0612044. Jie Liang was supported by NSF grant DBI0133856 and NIH grants GM68958 and GM079804.
Supplementary material
References
 1.Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structure. J Mol Biol 1995, 247: 536–540. 10.1006/jmbi.1995.0159PubMedGoogle Scholar
 2.Binkowski TA, DasGupta B, Liang J: Order independent structural alignment of circularly permuted proteins. Conf Proc IEEE Eng Med Biol Soc 2004, 4: 2781–2784.PubMedGoogle Scholar
 3.Lindqvist Y, Schneider G: Circular permutations of natural protein sequences: structural evidence. Curr Opinions Struct Biol 1997, 7: 422–427. 10.1016/S0959440X(97)800619CrossRefGoogle Scholar
 4.Ponting CP, Russell RB: Swaposins: circular permutations within genes encoding saposin homologues. Trends Biochem Sci 1995, 20: 179–180. 10.1016/S09680004(00)890039CrossRefPubMedGoogle Scholar
 5.Jeltsch A: Circular permutations in the molecular evolution of DNA methyltransferase. J Mol Evol 1999, 49: 161–164. 10.1007/PL00006529CrossRefPubMedGoogle Scholar
 6.Peisajovich SG, Rockah L, Tawfik DS: Evolution of new protein topologies through multistep gene rearrangements. Nature Genetics 2006, 38: 168–173. 10.1038/ng1717CrossRefPubMedGoogle Scholar
 7.Chen L, Wu LY, Wang Y, Zhang S, Zhang XS: Revealing divergent evolution, identifying circular permutations and detecting activesites by protein structure comparison. BMC Struct Biol 2006, 6: 18. 10.1186/14726807618PubMedCentralCrossRefPubMedGoogle Scholar
 8.Chen L, Zhou T, Tang Y: Protein structure alignment by deterministic annealing. Bioinformatics 2005, 21: 51–62. 10.1093/bioinformatics/bth467CrossRefPubMedGoogle Scholar
 9.Szustakowski JD, Weng Z: Protein structure alignment using a genetic algorithm. Proteins: Structure, Function, and Genetics 2000, 38: 428–440. Publisher Full Text 10.1002/(SICI)10970134(20000301)38:4%3C428::AIDPROT8%3E3.0.CO;2NCrossRefGoogle Scholar
 10.Jung J, Lee B: Protein structure alignment using environmental profiles. Prot Eng 2000, 13: 535–543. 10.1093/protein/13.8.535CrossRefGoogle Scholar
 11.Uliel S, Fliess A, Amir A, Unger R: A simple algorithm for detecting circular permutations in proteins. Bioinformatics 1999, 15: 930–936. 10.1093/bioinformatics/15.11.930CrossRefPubMedGoogle Scholar
 12.Tabtiang RK, Cezairliyan BO, Grant RA, Cochrane JC, Sauer RT: Consolidating critical binding determinants by noncyclic rearrangement of protein secondary structure. PNAS 2004, 7: 2305–2309.Google Scholar
 13.CPalign: Software for topology independent protein structural alignment[http://gila.bioengr.uic.edu/lab/]
 14.Dror O, Benyamini H, Nussinov R, Wolfson HJ: MASS: multiple structural alignment by secondary structures. Bioinformatics 2003, 19: i95i104. 10.1093/bioinformatics/btg1012CrossRefPubMedGoogle Scholar
 15.Shih ES, Hwang MJ: Alternative alignments from comparison of protein structures. Proteins 2004, 56: 519–527. 10.1002/prot.20124CrossRefPubMedGoogle Scholar
 16.Ilyin VA, Abyzov A, Leslin CM: Structural alignment of proteins by a novel TOPOFIT method, as a superimposition of common volumes at a topomax point. Protein Science 2004, 13: 1865–1874. 10.1110/ps.04672604PubMedCentralCrossRefPubMedGoogle Scholar
 17.Kolbeck B, May P, SchmidtGoenner T, Steinke T, Knapp EW: Connectivity independent proteinstructure alignment: a hierarchical approach. BMC Bioinformatics 2006, 7: 510. 10.1186/147121057510PubMedCentralCrossRefPubMedGoogle Scholar
 18.Yuan X, Bystroff C: Nonsequential structurebased alignments reveal topology independent core packing arrangements in proteins. Bioinformatics 2005, 21(7):1010–1019. 10.1093/bioinformatics/bti128CrossRefPubMedGoogle Scholar
 19.Hobohm U, Sander C: Enlarged representative set of protein structures. Protein Science 1994, 3: 522.PubMedCentralCrossRefPubMedGoogle Scholar
 20.Dutta S, Akey IV, Dingwall C, Hartman KL, Laue T, Nolte RT, Head JF, Akey CW: The crystal structure of nucleoplasmincore implication for histone binding and nucleosome assembly. Mol Cell 2001, 8: 841–853. 10.1016/S10972765(01)003549CrossRefPubMedGoogle Scholar
 21.Woo EJ, Marshall J, Bauly J, Chen JG, Venis M, Napier RM, Pickersgill RW: Crystal structure of the auxinbinding protein 1 in complex with auxin. EMBO J 2002, 21: 2877–2885. 10.1093/emboj/cdf291PubMedCentralCrossRefPubMedGoogle Scholar
 22.Liu L, Iwata K, Yohda M, Miki K: Structural insight into gene duplication, gene fusion and domain swapping in the evolution of PLPindependent amino acid racemases. FEBS LETT 2002, 528: 114–118. 10.1016/S00145793(02)032647CrossRefPubMedGoogle Scholar
 23.Hermoso JA, Monterroso B, Albert A, Galan B, Ahrazem O, Garcia P, MartinezRipoll M, Garcia JL, Menendez M: Structural basis for selective recognition of pneumococcal cell wall by modular endolysin from phage Cp1. Structure 2003, 11: 1239. 10.1016/j.str.2003.09.005CrossRefPubMedGoogle Scholar
 24.Suzuki M, Takamura Y, Maeno M, Tochinai S, Iyaguchi D, Tanaka I, Nishihira J, Ishibashi T: Xenopus laevis macrophage migration inhibitory factor is essential for axis formation and neural development. J Biol Chem 2004, 279: 21406–21414. 10.1074/jbc.M311416200CrossRefPubMedGoogle Scholar
 25.Van Duyne GD, Ghosh G, Maas WK, Sigler PB: Structure of the oligomerization and Larginine binding domain fo the arginine repressor of Escherichia Coli. J Mol Biol 1996, 256: 377–391. 10.1006/jmbi.1996.0093CrossRefPubMedGoogle Scholar
 26.Alexandrov NN, Fischer D: Analysis of topological and nontopological structural similarities in the PDB: New examples with old structures. Proteins 1996, 25: 354–365. Publisher Full Text 10.1002/(SICI)10970134(199607)25:3%3C354::AIDPROT7%3E3.3.CO;2WCrossRefPubMedGoogle Scholar
 27.Warren AJ, Bravo J, Williams RL, Rabbitts TH: Structural basis for the heterodimeric interaction betweeen the acute leukemiaassociated transcription factors AML1 and CBFbeta. EMBO J 2000, 19: 3004–3015. 10.1093/emboj/19.12.3004PubMedCentralCrossRefPubMedGoogle Scholar
 28.Meining W, Eberhardt S, Bacher A, Ladenstein R: The structure of the Nterminal domain of riboflavin synthase in complex with riboflavin at 2.6A resolution. J Mol Biol 2003, 331: 1053–1063. 10.1016/S00222836(03)008441CrossRefPubMedGoogle Scholar
 29.Zhu J, Weng Z: FAST: A novel protein structure alignment algorithm. PROTEINS: Structure, Function, and Bioinformatics 2005, 58: 618–627. 10.1002/prot.20331CrossRefGoogle Scholar
 30.Mizuguchi K, Deane CM, Blundell TL, Overington JP: HOMSTRAD: a database of protein structure alignments for homologous families. Protein Science 1998, 7: 2469–2471.PubMedCentralCrossRefPubMedGoogle Scholar
 31.Holm L, Park J: DaliLite workbench for protein structure comparison. Bioinformatics 2000, 16: 566–567. 10.1093/bioinformatics/16.6.566CrossRefPubMedGoogle Scholar
 32.BarYehuda R, Halldorsson MM, Naor J, Shacknai H, Shapira I: Scheduling split intervals. 14th ACMSIAM Symposium on Discrete Algorithms 2002, 732–741.Google Scholar
 33.Alon N, Feige U, Wigderson A, Zuckerman D: Derandomized graph products. Computational Complexity 1995, 5: 60–75. 10.1007/BF01277956CrossRefGoogle Scholar
 34.Arora S, Lund C, Motwani R, Sudan M, Szegedy M: Proof verification and hardness of approximation problems. Journal of ACM 1998, 45: 501–555. 10.1145/278298.278306CrossRefGoogle Scholar
 35.Meszaros C: Fast Cholesky factorization for interior point methods of linear programming. Comp Math Appl 1996, 31: 49–51. 10.1016/08981221(95)002154CrossRefGoogle Scholar
 36.Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogenbonded and geometrical features. Biopolymers 1983, 22: 2577–2637. 10.1002/bip.360221211CrossRefPubMedGoogle Scholar
 37.Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389PubMedCentralCrossRefPubMedGoogle Scholar
 38.Binkowski TA, Adamian L, Liang J: Inferring functional relationship of proteins from local sequence and spatial surface patterns. J Mol Biol 2003, 332: 505–526. 10.1016/S00222836(03)008829CrossRefPubMedGoogle Scholar
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.