MotifCluster: an interactive online tool for clustering and visualizing sequences using shared motifs
- 8.9k Downloads
MotifCluster finds related motifs in a set of sequences, and clusters the sequences into families using the motifs they contain. MotifCluster, at http://bmf.colorado.edu/motifcluster, lets users test whether proteins are related, cluster sequences by shared conserved motifs, and visualize motifs mapped onto trees, sequences and three-dimensional structures. We demonstrate MotifCluster's accuracy using gold-standard protein superfamilies; using recommended settings, families were assigned to the correct superfamilies with 0.17% false positive and no false negative assignments.
KeywordsGibbs Sampler Pairwise Sequence Identity Domain Shuffling Shared Motif Single Connected Component
cytochrome maturation protein
Longest common substring
Multiple EM for motif elicitation
Multiple sequence comparison by log-expectation
Protein Data Bank
Unweighted pair group method with arithmetic mean.
Detection of evolutionary relationships between very distantly related protein families is important for efforts to assign functions to newly identified proteins, as well as to understand the evolutionary mechanisms by which new functions have emerged. Pairwise sequence identities between proteins in distantly related families are often statistically insignificant. Algorithms such as COMPASS  that evaluate relationships between profiles representative of protein families are perhaps the most powerful method for identification of distant sequence relationships, although iterative BLAST approaches, such as PSI-BLAST  and SHOTGUN , are also valuable. Identification of evolutionary relationships between protein families and superfamilies sets the stage for analysis of the sequence changes that led to the distinctive structural and functional characteristics of protein families. In many enzyme superfamilies, the ability to catalyze an ancestral catalytic step has been retained, while additional steps have been added before or after the ancestral step. For example, in the enolase superfamily, abstraction of a proton from a position alpha to a carbonyl is the conserved catalytic step; the fate of the resulting enolate intermediate varies in different families according to the disposition of catalytic groups in the active site .
Identification of short, highly conserved sequences, known as motifs, in proteins provides important insights into the regions of proteins that have been conserved within a superfamily or suprafamily, as well as those that have diverged in specific families. Consideration of these motifs in conjunction with mechanistic and structural information can provide a picture of the sequence changes that led to acquisition of new catalytic capabilities. Two motif-finding algorithms, MEME (Multiple EM for motif elicitation)  and the Gibbs Sampler , are in widespread use. MEME identifies motifs by searching for a set of short, conserved sequences (motifs) in a set of longer, less conserved sequences. MEME assumes that each of the sequences in the input set contains at least one motif. The Gibbs Sampler works by searching for a predefined number of motifs with minimum and maximum lengths. A background probability model for chance matches based on amino acid occurrences is determined from the input set of sequences. Motifs are discovered by searching for regions in the set of sequences that do not fit this background probability model. Both algorithms report motifs present in subsets of a user-provided set of sequences, along with statistical information regarding the significance of each motif in the entire set as well as within a particular sequence.
A drawback of these algorithms is that the results depend on the order of sequences provided in the input set in an unpredictable way. Clustering sets of sequences based upon visual analysis of motifs, as in , is both subjective and time-consuming, as it requires re-ordering of the input set. Furthermore, motifs are presented solely in terms of primary sequence; mapping of motifs onto structures, which is critical for recognizing the roles played by specific motifs, requires additional manipulation.
In this paper, we present a new online tool, MotifCluster, that clusters input sequences according to the presence or absence of user-supplied motifs. MotifCluster uses any of six different distance metrics. Some of these metrics group sequences that contain the same motifs in the same order, and others look solely at which motifs are shared. The ability to take order into account is critical in some cases, because motifs may need to be in the context of a specific structural context to have biological activity. However, domain shuffling and circular permutation of sequences are not uncommon, so it can also be important to recognize the occurrence of shared motifs in an unusual order. Longer or more highly significant motifs can be given more weight than shorter or less significant motifs, or all motifs can be treated equally. Sequences can also be labeled with user-defined designations, such as family assignment, and the associations between families and motifs can then be used to explore functional relationships. In addition to clustering input sequences according to the motifs they contain, MotifCluster automatically maps motifs onto the structures of all proteins in the set for which structural information is available, providing an immediate visual assessment of the location of each motif. MotifCluster can be used online from the URL provided in the abstract, which also links to documentation and a downloadable version.
Summary of key features of MotifCluster and a selection of other programs that perform clustering of motifs or remote homology detection
Overview of program
Clustering proteins by motifs they contain
Takes aligned or unaligned protein and nucleotide sequences and a MEME file showing motifs; allows clustering of the sequences according to the motifs they contain, and visualization of the motifs on the aligned and unaligned sequences and three-dimensional structures
Clustering of transcription factor binding sites (in DNA)
Takes list of transcription factor binding sites as input: uses hidden Markov models to find cis-regulatory modules in DNA
Takes list of transcription factor binding sites as input: uses Forward algorithm and expected uniform distribution to find motif co-occurrence in DNA
Takes list of transcription factor binding sites as input: uses r-scan algorithm and sweep over parameter values to visualize significant clusters as peaks on the DNA sequence
Calculates significance of collection of position-specific score matrices that appear in order: can apply to DNA or protein, in principle
Calculates significance of collection of transcription factor binding sites that appear at specified distance from transcription start site or other feature in the DNA
Aligns all pairs of motifs that appear significant in different promoters, then groups these into clusters using the CAST algorithm. DNA-specific
Identifies groups of DNA motifs that co-occur significantly within a defined distance using both order-dependent and order-independent models
Uses Bayesian method to find clusters of evolutionarily conserved DNA motifs that appear in different promoters.
Clusters genes based on microarray analysis: feeds promoters to Gibbs sampler to find DNA motifs overrepresented in each cluster
Identifying kernels for SVMs*
Introduces kernels based on k-word occurrences and best BLAST hit for SVM clustering: does not focus on conserved motifs
WCM (word correlation matrices)
Introduces k-word kernel for SVM clustering based on correlations in appearance of pairs of k-words: does not focus on conserved motifs.
ODH (oligomer distance histograms)
Introduces new kernel for SVM clustering based on histograms of distances between all words in protein: does not focus on conserved motifs
BLAST-based approach for identifying remote homologs by iterative searches: not motif-based
Among other features, can perform BLAST and PSI-BLAST versions of Shotgun and choose representative sequences of each group: not motif-based
Performs iterative steps of PSI-BLAST, otherwise like Shotgun: not motif-based.
Performs graph-based connection of proteins based on pairwise sequence similarity: not motif based
Clusters proteins based on shared segments of overall sequence, not by motifs already known to be significant
Performs profile-profile alignments for remote homology detection: assesses statistical significance matches in the profiles overall, rather than specifically using shared motifs
Clustering of motifs
Aligns motifs with one another so that relationships among motifs can be detected; performs many other tasks for promoter characterization, but specific to promoters
Performs many functions for cis-regulatory analysis: is able to cluster DNA motifs with one another
Aligns and clusters DNA motifs with one another to improve transcription factor binding site searches
Identification of functions in labeled structures
Takes set of three-dimensional structures with annotated functions; identifies three-dimensional motif fragments that are common to the structures with each function.
Several reports are generated after uploading a set of aligned or unaligned sequences, motif information, and, optionally, mappings that relate sequence identifiers (IDs) to known gene families. In these reports, each motif is assigned a unique style (color and font display, for example, bold or italic) that is used consistently throughout the displays. Each report format can be selected from a drop down list on the search results page.
Displaying motifs on trees
The first four report formats display motifs, either including or excluding the rest of the sequence, on a tree based either on the sequences or the motifs. These reports are important for establishing whether a particular motif fits the overall phylogenetic pattern, or has evolved convergently in different lineages, and can be especially useful for establishing relationships between sequences that are too diverse for construction of phylogenetic trees using standard methods. They are also important for visualizing where the motifs occur in the sequences, which can be important for detecting domain shuffling. These four report formats are described below.
Motifs on motif-based tree
This tree is built using a matrix of distances calculated from the motif-based alignment. The metric used to build this tree reflects similarities only among the motifs (and not in the rest of the sequences). The color-coded motifs and location information are displayed, along with links to available Protein Data Bank (PDB) structures. The PDB links allow the user to view the motifs found in a particular sequence on the corresponding structure using PyMol . This format is especially useful for visually evaluating whether the clustering method chosen groups the motifs together in an intuitively reasonable way, and for checking whether motifs are shuffled or circularly permuted.
Sequences on motif-based tree
Similar to the 'Motifs on motif-based tree' format above, except that the full sequences are shown (rather than just the sequences of the motifs). This format is useful for deciding whether there are extended regions of conservation around the motifs in specific groups of sequences, and like the format above, for identifying domain shuffling or circular permutation of motifs.
Motifs on sequence-based tree (full-length)
Similar to the 'Sequences on motif-based tree' format above, except that a phylogenetic tree is generated using MUSCLE (Multiple sequence comparison by log-expectation; with default parameters) based on similarities among the full-length sequences. Motifs are highlighted and displayed as for the 'Motifs on motif-based tree' format. This display is useful for testing whether the pattern of motif conservation follows the overall conservation of the sequences: if motifs have not evolved convergently and the sequences are sufficiently closely related to retain phylogenetic signal, the tree built using the motifs will be approximately the same as the tree built using the sequences, and, in both cases, large clades of sequences containing the motifs should be observed. Alternatively, if the sequences are so highly diverged that phylogenetic reconstruction is unreliable (the so-called 'twilight zone' below 30% conservation ), the motif tree may cluster family members together when the sequence tree is uninformative.
Motifs on sequence-based tree (motif regions only)
Similar to 'Sequences on motif-based tree' and 'Motifs on sequence-based tree (full-length)' formats above: displays only the sequences of the motifs on the tree built using the full-length sequences.
Identifying which sequences contain particular motifs
The next group of reports shows which sequences contain each motif. These reports are useful for evaluating which motifs are meaningful, and which tend to occur together in the same sequences.
Statistics by sequence
Displays a table of motifs and associated P-values, grouped by sequence. For each sequence, motifs are displayed in order of decreasing significance (ascending P-value). A display underneath the sequence indicates conservation: positions annotated with asterisks match the motif consensus at positions that are not highly conserved (according to a user-defined threshold, which is set to 90% by default); positions annotated with gray shaded + sighs match the consensus at positions that are highly conserved, and red positions are mismatches at positions that are highly conserved. This format is especially useful for finding systematic differences that may be functionally important within motifs. For example, single amino acid changes in a motif conserved in a superfamily may be related to divergence in function in a particular family.
Statistics by motif
Displays a table of motifs and associated P-values, grouped by motif. For each motif, an alignment of the motif regions is displayed, with the majority consensus of the motif displayed above the alignment. Highly conserved columns (determined by the 'conservation threshold' parameter) in the consensus motif are colored. Positions within individual motifs are highlighted in grey if they match the consensus sequence. Like the 'Statistics by sequence' format, this format is useful for finding sequence changes that are potentially associated with functional changes.
The final group of reports provides tools for exploratory analysis.
Displays an interactive form allowing the user to select specific motifs to highlight in the alignment. This format is useful for assisting in decisions about which motifs are likely to be real and which are false positives, and reduces the visual complexity in the motif- and sequence-based tree formats by allowing the user to focus on specific motifs of interest.
Displays all sequences in a network representation, with each connected component drawn as a separate network. The connected components are determined by the 'edge threshold' parameter. For example, if this parameter is 1, each connected component consists of all the sequences that share at least one motif with any other sequence in the connected component. If the parameter is 2, all sequences in a connected component must share at least two motifs with at least one other sequence in the same connected component. The list of IDs for the sequences in each connected component is displayed, and the actual sequences in each connected component can be downloaded as a FASTA file.
Supported motif formats
We currently support motifs generated using either MEME  or the Gibbs sampler  so that users can easily compare the two methods. We plan to add support for other motif definitions, including user-supplied weight matrices, and for other motif finding algorithms.
Distance calculations and clustering
Measuring distance between sequences using motifs
Common fraction score
The Common fraction score method (Figure 1a) calculates the fraction of motifs shared between each pair of sequences, ignoring the order in which the motifs occur in the sequence and the number of times each motif occurs.
Longest common substring score
The Longest common substring (LCS) score method (Figure 1b) finds the longest common substring of motifs that occurs in both sequences. Instead of using the actual motif sequences, the substring is constructed by assigning a unique character to each motif, and then using suffix trees to calculate the longest pattern of motifs that occurs in the same order in both sequences. This method does not account for differences in spacing between the motifs. The LCS score is a measure of similarity, which is converted into a distance metric in two different ways. LCS (max-actual) scores the distance as the difference between the best LCS score for any pair of sequences in the set and the LCS score for the pair of sequences under consideration. LCS (1-(actual/max)) scores the distance based on the ratio between the LCS score for the pair of sequences under consideration and the best LCS score for any pair of sequences in the set.
The Needleman-Wunsch (NW) score method (Figure 1c) uses the NW global pairwise alignment algorithm  to align the two motif strings, converted from the raw sequences to unique characters as described for LCS scores above. The method can either be unweighted (all motifs are treated equally), or weighted (highly significant motifs count for more than less significant motifs). Like the LCS score, the raw NW score is a measure of similarity. It is converted into a distance metric using the same methodology (either (max-actual) or (1-(actual/max))). Like the LCS score, the NW score takes into account the order, but not the spacing, between the motifs. Unlike the LCS score, the NW score is robust to insertions and deletions that disrupt what is otherwise a long, shared sequence of motifs.
The delta-delta score (Figure 1d) measures the distances between aligned motifs in each pair of sequences, and sums the differences in distances between each pair of motifs in the aligned pair of sequences. Aligned motifs with equal spacing in both sequences have a delta-delta score of zero. When motif spacing is unequal, the delta-delta score is > 0. The delta-delta score is thus a distance metric, and does not need to be converted from a similarity metric as do the LCS and NW scores.
The UPGMA (Unweighted pair group method with arithmetic mean) clustering algorithm  uses a distance matrix to find successive nested clusters by identifying the nearest neighbors at each step, then merging these neighbors. When performing motif-based clustering, we generate the distance matrix using one of the user-specified distance measures, and use this distance matrix as input into the UPGMA routine, yielding a tree that clusters the sequences according to the motifs they contain. To compare the motif-based clustering with traditional sequence-based clustering, we also generate trees using MUSCLE . MUSCLE groups the sequences using the fraction of words of a specific length that are shared between the sequences, and thus estimates the overall distance between the entire sequences rather than just between the motifs.
Graphs and connected components
We generate a weighted graph showing the relationships between all sequences in terms of the motifs they share. Vertices in the graph represent a sequence in the input set, and each edge in the graph represents a relationship in which two sequences share one or more motifs. Thus, each sequence is connected to every other sequence with which it shares at least one common motif. The weight of each edge in the graph is calculated as the number of motifs shared by each pair of sequences. Once the full weighted graph has been generated, edges whose weight is less than the 'edge threshold' are removed from the graph. For example, if the edge threshold is 2 (the default), the connection is broken between any two sequences that share only a single motif.
The display shows a thumbnail of each connected component, suppressing figures for connected components that consist of only one sequence. These thumbnails can be expanded into larger figures, including EPS output for printing or publication. The graphs are visualized using the random layout option in NetworkX , which we found to be both the fastest and most readable option for the highly connected graphs produced by MotifCluster.
Most code described here was written in Python 2.4 and tested on MacOSX and Linux. The exceptions are the NW algorithm , which we implemented in C for performance reasons, the MUSCLE , MEME  and Gibbs Sampler  programs, and the libstree suffix tree library , for which we used the published implementations. The web interface uses Apache and mod_python (Apache Software Foundation). Motif clustering jobs are submitted to our Beowulf cluster using PBS/TORQUE. NetworkX  is used in the graph calculations. PyMol  is used for visualization of protein structures. Calculation and display code have been contributed to the PyCogent project . A standalone version of the program is available for download at the MotifCluster web site.
We describe the capabilities of MotifCluster using four cases as examples. First, we use the case of the two convergently evolved families of ribose 5-phosphate isomerases to show that structurally distinct proteins that have the same function can be correctly clustered. Second, we use a set of curated superfamilies that contains 4,887 sequences in 91 families divided among 5 superfamilies . These 'gold-standard' families and superfamilies allow us to test how well we can recapture known relationships within and between superfamilies. Third, we use two families within the haloacid dehalogenase superfamily  to illustrate the utility of the clustering and mapping features of MotifCluster. Finally, we show how clustering of motifs in a set of proteins from the highly divergent thioredoxin-fold suprafamily  captures evolutionary relationships between proteins of different functions when standard phylogenetic analyses fail.
MotifCluster distinguishes between unrelated families when the edge threshold is 2 or greater: RpiA/RpiB as a case study
Ribose 5-phosphate isomerases catalyze the interconversion of ribulose 5-phosphate and ribose 5-phosphate. Two structurally distinct families of ribose 5-phosphate isomerases have been identified, exemplified by RpiA from Escherichia coli and RpiB from E. coli [18, 19]. This is one of many cases of convergent evolution of the same catalytic activity in the context of different structural folds. It is unlikely that the same motifs would evolve in different structural contexts. Thus, a potential application of MotifCluster is the identification of unrelated families in sets of proteins that have a common function. In such cases, clustering of sequences into two or more families does not constitute evidence of convergent evolution in the absence of structural information, but it raises a possibility that can be further investigated.
Analysis of the gold-standard superfamilies shows that the sensitivity and specificity of MotifCluster are excellent
We tested MotifCluster on the gold-standard set of mechanistically diverse superfamilies described by Brown et al. , which contains 4,887 sequences belonging to 91 families, distributed among 5 superfamilies. This set of sequences has been carefully curated to provide a reliably clustered set for testing computational algorithms. Every sequence assigned to a gold-standard family has either an experimentally determined function, or is closely related to a protein of known function (BLAST e-value ≤ 10-175). We tested MotifCluster using different edge threshold settings (that is, the number of motifs required for a shared connection), and using the Gibbs sampler and MEME to find the underlying motifs, in order to test how well it was able to cluster family members within the same superfamily. Specifically, we expect sequences from the same superfamily to be connected by multiple motifs, but we do not expect members of different superfamilies to be bridged in this manner.
Figure 3b shows a graph of a single connected component that contains sequences from two different families belonging to the amidohydrolase superfamily. Haloacid dehalogenase (blue) and β-phosphoglucomutase (red) are divergent members of the haloacid dehalogenase superfamily (Figure 3c). The maximum pairwise identity between members of the two families is 45.5%. When an edge threshold of 3 or less is used, all of the sequences are grouped into a single connected component.
The false negative rate in this analysis was essentially zero (data not shown). No false negatives were found at all except when an edge threshold of three was used for motifs found by MEME. In that case, the algorithm failed to find a link between the deoxy-D-mannose-octulosonate 8-phosphate phosphatase and P-type ATPase families in the haloacid dehalogenase superfamily. Thus, the presence of at least two shared MEME motifs is a strong indicator of shared superfamily membership, whereas absence of at least two shared MEME motifs is a strong indicator of lack of shared superfamily membership. However, failure to assign all sequences to a single cluster (for members of the same superfamily) or to two distinct clusters (for members of different superfamilies) is frequent (Figure 4b). The average fraction of unassigned sequences ranged from 5.2% (Gibbs sampler, edge threshold of 1, two superfamilies) to 30.4% (MEME, edge threshold of 3, one superfamily), showing that not all family members share motifs (at least, as defined by MEME or the Gibbs sampler), even when homology exists at the primary sequence level. Thus, the presence or absence of shared motifs at the whole family level is informative, but the fact that individual sequences lack motifs shared by the other sequences in the set does not indicate that they are not homologous. The error rates were robust to variation in the degree of divergence between the sequences (average pairwise identities between the families ranged from 38.4-57.9%), the number of sequences in each family (which ranged from 5-366), and differences between the sample size in the two families (the fraction of sequences represented by one of the two families ranged from 0.014-1). No significant correlations were observed between these variables and false positive rate, false discovery rate, false negative rate, sensitivity, or specificity (data not shown).
MotifCluster facilitates identification of conserved and variable residues in active sites of mechanistically divergent families
The various clustering methods available in MotifCluster facilitate analysis of extremely distantly related families
The Trx-fold suprafamily encompasses an extremely divergent set of proteins with a wide range of functions. All members of the suprafamily share the canonical Trx-fold structure, but the ancestral function (reduction of disulfide bonds using a pair of active site cysteines) has been modified in some superfamilies. For example, in the peroxiredoxin family, a cysteine corresponding to the more buried cysteine in Trxs is involved in reduction of peroxides, but the other cysteine has been changed to a threonine . In the glutathione transferase superfamily, both cysteines have been lost, and these enzymes catalyze a completely different reaction: nucleophilic attack of glutathione upon an electrophilic substrate to form a glutathione conjugate. Analysis of sequence relationships among such highly divergent proteins is difficult because the overall pairwise sequence identities fall within the twilight zone. In such cases, identification of shared motifs in proteins that share a common structural fold can provide good evidence for a very distant evolutionary relationship.
An analysis of the relationship between thioredoxins (Trxs) and peroxiredoxin (Prxs) was reported in 2004: no significant pairwise identity could be demonstrated between sequences in these two families, but the Shotgun algorithm  identified a family of proteins, the cytochrome maturation proteins (CMPs), that bridges the Trxs and Prxs . Motifs found in subsets of these three families were identified by MEME. The results were clustered manually, a time-consuming and qualitative process. MotifCluster performs a comparable analysis in a few minutes. Here we demonstrate that the use of different distance metrics produces different clustering results. Notably, clustering using motifs rather than whole sequences produces biologically meaningful results even when standard phylogenetic clustering methods fail due to the extremely divergent set of proteins in the analysis.
Implications for motif analyses
Motif identification informs functional, mechanistic and evolutionary analyses in several ways. First, the patterns of motifs observed in subsets of the input set can be used to cluster the proteins into families, a useful tool for prediction of function for unannotated proteins. Second, motifs indicate regions of proteins that have been conserved for reasons of structure and/or function. Changes in a region of a protein family, either resulting in a different motif or in subtle, family-specific changes within a motif, suggest the changes that have led to emergence of new functions in an ancestral scaffold.
MotifCluster takes input from motif-finding algorithms such as MEME or the Gibbs sampler, and the results are therefore dependent upon the choice of the input set because the characteristics of the input set have a strong effect upon the motifs that are found. A crucial limitation of existing techniques is that motif-finding algorithms typically assume that each sequence is drawn independently from a background distribution, and thus can be biased by the presence of closely related sequences. We previously addressed this problem with our software DivergentSet , which allows the user to rapidly select an unbiased sample of divergent sequences from the starting population. In several sequence families, our analysis of different divergent sets drawn repeatedly from each family showed that the motifs found by MEME were highly dependent on the particular set chosen: the number of motifs, and the lengths and locations of the motifs, varied from run to run . Although highly conserved motifs that are critical for function are reliably recovered, the robustness of any motif analysis is enhanced when multiple analyses can be carried out using divergent sets generated randomly from a larger set of homologous sequences.
Analysis of the importance of sequence motifs is greatly enhanced when motifs can be mapped onto the structure of representative proteins. Such mapping allows visualization of motifs that are found in the core of the protein and may be responsible for maintaining the overall structural fold, motifs that are on the surface and may be involved in interactions with ligands or other proteins, and motifs that are found in clefts or crevices that may harbor active sites. The automation of the clustering of sequences and the mapping of motifs onto structures substantially reduces the time required to carry out such analyses, and will provide many new insights into the relationships among highly divergent protein families.
This work was funded by a grant from the Air Force Office of Scientific Research (FA9550-06-0014) to SC and RK. MH was supported by the National Institutes of Health/University of Colorado at Boulder Molecular Biophysics Training Program (T32GM065103), NSF EAPSI fellowship OISE0812861 (MH), and by a gift from the Butcher Foundation.
- 9.PyMOL Home Page. [http://pymol.sourceforge.net/]
- 11.Sokal RR, Sneath PHA: Numerical Taxonomy: the Principles and Practice of Numerical Classification. 1973, San Franscisco: WH Freeman & CoGoogle Scholar
- 13.NetworkX. [https://networkx.lanl.gov/]
- 14.libstree - A generic suffix tree library. [http://www.icir.org/christian/libstree/]
- 15.Knight R, Maxwell P, Birmingham A, Carnes J, Caporaso JG, Easton BC, Eaton M, Hamady M, Lindsay H, Liu Z, Lozupone C, McDonald D, Robeson M, Sammut R, Smit S, Wakefield MJ, Widmann J, Wikman S, Wilson S, Ying H, Huttley GA: PyCogent: a toolkit for making sense from sequence. Genome Biol. 2007, 8: R171-10.1186/gb-2007-8-8-r171.PubMedPubMedCentralCrossRefGoogle Scholar
- 19.Zhang RG, Andersson CE, Skarina T, Evdokimova E, Edwards AM, Joachimiak A, Savchenko A, Mowbray SL: The 2.2 A resolution structure of RpiB/AlsB from Escherichia coli illustrates a new approach to the ribose-5-phosphate isomerase reaction. J Mol Biol. 2003, 332: 1083-1094. 10.1016/j.jmb.2003.08.009.PubMedPubMedCentralCrossRefGoogle Scholar
- 39.Ausiello G, Gherardini PF, Marcatili P, Tramontano A, Via A, Helmer-Citterich M: FunClust: a web server for the identification of structural motifs in a set of non-homologous protein structures. BMC Bioinformatics. 2008, 9 (Suppl 2): S2-10.1186/1471-2105-9-S2-S2.PubMedPubMedCentralCrossRefGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.