UFO: a web server for ultra-fast functional profiling of whole genome protein sequences
Functional profiling is a key technique to characterize and compare the functional potential of entire genomes. The estimation of profiles according to an assignment of sequences to functional categories is a computationally expensive task because it requires the comparison of all protein sequences from a genome with a usually large database of annotated sequences or sequence families.
Based on machine learning techniques for Pfam domain detection, the UFO web server for ultra-fast functional profiling allows researchers to process large protein sequence collections instantaneously. Besides the frequencies of Pfam and GO categories, the user also obtains the sequence specific assignments to Pfam domain families. In addition, a comparison with existing genomes provides dissimilarity scores with respect to 821 reference proteomes. Considering the underlying UFO domain detection, the results on 206 test genomes indicate a high sensitivity of the approach. In comparison with current state-of-the-art HMMs, the runtime measurements show a considerable speed up in the range of four orders of magnitude. For an average size prokaryotic genome, the computation of a functional profile together with its comparison typically requires about 10 seconds of processing time.
For the first time, the UFO web server makes it possible to get a quick overview of the functional inventory of newly sequenced organisms. The genome scale comparison with a large number of precomputed profiles allows a first guess about functionally related organisms. The service is freely available and does not require user registration or specification of a valid email address.
Keywords: Domain Family, Profile Divergence, Pfam Domain, Functional Profile, Adjusted Rand Index
The assignment of genes to certain functional categories is a central task in genome annotation. The distribution of assignments, i.e. the functional profile, provides a highly informative summary of a genome. Functional profiling plays a key role in comparative genomics for studying aspects of systems biology on a genome-wide scale. Without the restriction of DNA sequencing to culturable organisms, metagenomics allows the study of the genomic potential of whole microbial communities. Functional profiling of metagenomes is an essential tool for comparative analysis of microbial ecosystems. In the context of functional genomics, gene clusters and protein domains are widely used for homology-based annotation. Both approaches cover different aspects of the annotation and are often used in parallel to obtain a comprehensive description. While gene clusters as used for COGs or within the SEED framework provide a valuable resource for functional annotation based on the identification of homologous genes, the domain-based approach is focussed on modelling and detection of functional modules which usually involve only parts of a gene. At the level of functional modules, the Pfam domain family database currently provides the highest coverage. State-of-the-art methods for protein domain detection, like HMMER, are computationally expensive, and several approximation techniques have been suggested to accelerate the model-based prediction of protein domains. With a slight loss of sensitivity, fast prefiltering methods can achieve speed ups of about two orders of magnitude as compared with HMMER. Computational speed is of particular importance for the design of web-based sequence analysis tools. Due to computational expense, most web servers for protein domain search only provide a single-sequence submission interface [8, 9]. In addition to single sequence submission, the Pfam web server also offers a batch option which allows the user to submit small multiple fasta files. 
These files are restricted to a maximum of 1000 protein sequences with a maximum sequence length of 2000 residues.
Using machine learning techniques for feature-based protein sequence classification [10, 11, 12], the UFO web server for ultra-fast functional profiling provides an instantaneous estimation of Pfam profiles, i.e. frequencies of Pfam domains, for large sets of protein sequences. With a speed up of four orders of magnitude, UFO is well prepared to cope with the rapidly growing amount of genomic and metagenomic sequence data.
Construction and content
The UFO web server has been built around an efficient implementation of machine learning techniques for protein sequence classification which have been described in [10, 11, 12]. Fast feature-based techniques for protein sequence representation have been combined with a multi-class multi-label approach to assign protein sequences to Pfam domain families. While our previous model was obtained from training with about 1.5 × 10^5 sequences from the Pfam A release 22 seed alignments, UFO is based on training with the complete Pfam A release 23 full alignments which comprise more than 6 × 10^6 domain sequences. As an important difference, our previous publication only considers a prefiltering method that uses the family-specific scores from feature space discriminants to produce a ranking of domain models, which in turn can be used to reduce the set of HMMER models in subsequent searches. UFO also uses a high-dimensional word-based feature space according to a word length of 20 amino acids, but in addition the discriminant scores of the five highest scoring domain families are passed to a small neural network to decide whether a score actually indicates a valid match. The neural network architecture and its training have been described previously for the case of metagenomic gene prediction. UFO uses a network with five hidden units and three inputs, which correspond to the particular discriminant score and the mean and maximum score over all models. The output corresponds to an estimated posterior probability of a true match. Currently, domain families with a probability above 0.5 are reported as valid matches. In comparison with profile hidden Markov models, the feature-based machine learning approach does not provide a localization of protein domains but merely an indication of the presence or absence of a certain domain within a protein sequence. This also implies that neither the order nor the repetition of domains can be predicted by this approach. 
However, for the purpose of functional profiling this kind of "pure" domain detection is usually not a limitation. In fact, it has been shown that the prediction of protein function can be realized fairly well without considering domain repetitions or the ordering of domains. For reasons of speed, another restriction as compared with Pfam/HMMER arises from the maximum number of domain families which can be detected within a single sequence. Currently, a protein sequence can be assigned to at most five different families. Only in rare cases did we observe that this number was exceeded in the existing annotations.
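The match decision described above can be illustrated with a small sketch. It is not the UFO implementation: the tanh hidden units, the logistic output, the function names and the layout of the weight parameters are assumptions made for illustration. Only the three inputs, the five hidden units, the restriction to the five top-scoring families and the 0.5 threshold are taken from the text.

```python
import math

def match_probability(score, mean_score, max_score, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a small network: 3 inputs, one hidden layer, 1 output.

    The tanh hidden activation and logistic output are assumptions;
    the weights would come from training and are not given in the text.
    """
    x = (score, mean_score, max_score)
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-z))  # estimated probability of a true match

def detect_domains(scores, params, top_k=5, threshold=0.5):
    """Report families among the top_k scores whose probability exceeds threshold.

    scores: dict mapping family ID -> discriminant score (one per model).
    params: tuple (w_hidden, b_hidden, w_out, b_out) of hypothetical weights.
    """
    values = list(scores.values())
    mean_score = sum(values) / len(values)   # mean over all models
    max_score = max(values)                  # maximum over all models
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    hits = []
    for family, score in top:
        p = match_probability(score, mean_score, max_score, *params)
        if p > threshold:
            hits.append((family, p))
    return hits
```

Because only the five highest scoring families are evaluated, the sketch can never report more than five families per sequence, which mirrors the restriction mentioned in the text.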
The probabilities are estimated from the corresponding domain frequencies using a pseudocount parameter c. A suitable value of c was determined by hierarchical cluster analysis based on the above divergence measure. For that purpose, a complete linkage clustering was applied to a collection of 1017 prokaryotic profiles from 21 different phyla. To cope with the typical database bias towards particular culturable organisms, from all profiles that correspond to the same genus only the medoid profile, which by definition yields the minimal sum of divergences to the members of that genus, was selected for clustering. For a varying pseudocount parameter with 101 logarithmically spaced values in the interval [10^-8, 10^2] and each partition in the range between 10 and 50 clusters, the agreement of the clustering with the given taxonomic groups on phylum level was measured by the adjusted Rand index. The best agreement was obtained for a pseudocount c = 0.01 with 22 clusters, which resulted in a maximal adjusted Rand index of 0.517. For that partition the maximal within-cluster divergence was d_c = 3.53. This value is actually used by the UFO server to scale the profile divergence by D(P, Q)/d_c to a more meaningful range, where values clearly below 1 usually correspond to phylogenetically and functionally related organisms.
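As a rough illustration of the profile comparison and the medoid selection, the following sketch computes a scaled divergence between two count profiles. Two points are assumptions, since the definition of the divergence is not reproduced in this section: the divergence is taken to be the symmetric Kullback-Leibler (Jeffreys) divergence, and the pseudocount is added to every family count before normalization. All function names are hypothetical. Only c = 0.01 and the scaling constant 3.53 come from the text.

```python
import math

D_C = 3.53  # scaling constant reported in the text

def smoothed_probabilities(counts, families, c=0.01):
    """Turn domain counts into probabilities with an additive pseudocount c.

    The additive form (count + c) / (total + c * K) is an assumption.
    """
    total = sum(counts.get(f, 0) for f in families) + c * len(families)
    return [(counts.get(f, 0) + c) / total for f in families]

def jeffreys_divergence(p, q):
    """Symmetric Kullback-Leibler divergence (assumed divergence measure)."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

def scaled_divergence(counts_a, counts_b, families, c=0.01, d_c=D_C):
    """Profile divergence D(P, Q) divided by d_c."""
    p = smoothed_probabilities(counts_a, families, c)
    q = smoothed_probabilities(counts_b, families, c)
    return jeffreys_divergence(p, q) / d_c

def medoid(profiles, families, c=0.01):
    """Name of the profile with the minimal sum of divergences to all others."""
    names = list(profiles)

    def total(name):
        return sum(scaled_divergence(profiles[name], profiles[other], families, c)
                   for other in names if other != name)

    return min(names, key=total)
```

With a strictly positive pseudocount every probability is non-zero, so the logarithm is always defined even for families that are absent from one of the two profiles.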
Utility and discussion
In the "Downloads" section of the main results page, several output files are available in plain ASCII format. In addition, a Perl script "ufo2hmmer" for postprocessing of the UFO assignments by means of selected HMMER/Pfam searches can be obtained. This script requires local HMMER and Perl installations and can be used to further increase the specificity of the domain detection. Postprocessing of UFO matches with HMMER also provides additional information about sequence positions, repetitions, and the order of the domains.
The first output file contains the complete Pfam profile in terms of domain specific detection counts sorted in descending order. The second file contains the corresponding GO profile which shows the assignment frequencies with respect to Gene Ontology categories. The GO counts result from applying the Pfam2GO mapping to the Pfam profile and again the frequencies are shown in descending order. The third file contains the sequence specific assignments to Pfam domain families together with the corresponding GO annotation and the match probability score of the neural network. This file may also be used for further processing, e.g. for Pfam/HMMER search of the UFO detected domains using the provided "ufo2hmmer" script.
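How the first two output files relate to the third can be sketched as follows. This is only an illustration under stated assumptions: the input structure, the function names and the identifiers used in any example are hypothetical, and the actual file formats of the server are not described here. The sketch assumes that a GO term mapped from a family receives the full count of that family.

```python
from collections import Counter

def pfam_profile(assignments):
    """Count how many sequences were assigned to each Pfam family.

    assignments: dict mapping sequence ID -> list of Pfam family IDs
    (hypothetical in-memory form of the sequence specific assignments).
    Returns (family, count) pairs sorted by descending count.
    """
    counts = Counter(fam for fams in assignments.values() for fam in fams)
    return counts.most_common()

def go_profile(pfam_counts, pfam2go):
    """Apply a Pfam-to-GO mapping to a Pfam profile.

    pfam2go: dict mapping Pfam family ID -> iterable of GO term IDs.
    Families without a mapping contribute nothing to the GO profile.
    """
    go_counts = Counter()
    for family, n in pfam_counts:
        for term in pfam2go.get(family, ()):
            go_counts[term] += n
    return go_counts.most_common()
```

A call such as `go_profile(pfam_profile(assignments), pfam2go)` yields the GO counts in descending order, analogous to the second output file.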
Runtime comparison for five small genomes between HMMER, RPS-BLAST, UFO and batch search at the Pfam web site (March 2009).
19 h 37 m    4 m 11 s    9 h 28 m
17 h 58 m    3 m 13 s    9 h 19 m
13 h 54 m    3 m 46 s    7 h 38 m
14 h 36 m    3 m 42 s    8 h 03 m
16 h 49 m    3 m 40 s    8 h 42 m
Example application and discussion
For an example application, the proteome of a novel Clostridium thermocellum strain (DSM 4150, Integr8 ID: 32332) from the above collection of 206 test genomes was used to demonstrate the server's capabilities. Specifying the multiple fasta file of protein sequences on the UFO job submission page (see Figure 1) and pressing the "Start UFO" button initiates uploading and subsequent analysis of that file. After the processing of all 2916 sequences, which takes about 7 seconds, the results page is generated and displayed. The results are based on 3021 assignments to Pfam domains which have been found in a total number of 2087 sequences. This implies that no domains have been found for 829 sequences. Besides the "assignments" page (see Figure 3) and the output files (hyperlinks) which allow a more detailed analysis of the profile properties, the "top ten" lists provide a brief summary of the most prevalent features. In the example (see Figure 2), the most abundant Pfam family is the "Dockerin type I repeat" PF00404 which has been found in 67 sequences. Clicking the identifier shows the corresponding Pfam description of the family, which indicates a key role of that domain in cellulose metabolism. Among the top 10 Pfams, the "Cellulose binding domain" PF00942 found in 18 sequences also indicates the importance of cellulose metabolism. The first entry of the nearest species list (see Figure 4) corresponds to another strain of C. thermocellum (ATCC 27405/DSM 1237) with a slightly bigger proteome set including 3102 proteins. According to the corresponding HAMAP description (hyperlink), C. thermocellum is a gram-positive, anaerobic, and thermophilic organism capable of cellulose degradation. The remaining species in the list also belong to the class of Clostridia, and most of them are thermophilic. The closest five species are all able to ferment organic substrates.
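The summary numbers of this example can be checked with a few lines of arithmetic. The variable names are hypothetical; the three input values are the ones reported above, and the derived ratios are simple consequences of them rather than figures stated by the server.

```python
# Values reported for the C. thermocellum example
total_sequences = 2916
sequences_with_domains = 2087
assignments = 3021

# Sequences for which no Pfam domain was detected
without_domains = total_sequences - sequences_with_domains

# Fraction of the proteome with at least one detected domain (about 72%)
coverage = sequences_with_domains / total_sequences

# Mean number of assigned families per annotated sequence (about 1.45)
mean_per_annotated = assignments / sequences_with_domains

print(without_domains)
print(f"{coverage:.1%}")
print(f"{mean_per_annotated:.2f}")
```

The mean of roughly 1.45 families per annotated sequence lies well below the limit of five families per sequence.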
As indicated by the application example above, the strength of UFO is its capability to produce a quick overview of the functional inventory of whole genomes in terms of the most abundant protein domains and in terms of the closest organisms with the most similar profiles. In comparison to full annotation servers like RAST, the UFO server only covers a particular aspect of genome annotation. It merely provides a first step of a functional analysis which can nevertheless be of great utility for addressing many biological questions and problems. It is not restricted to the analysis of prokaryotic genomes, and it can be applied to eukaryotes as well, if gene predictions are available. The runtimes for complete eukaryotic proteomes are usually above the average runtime for prokaryotes. For example, the proteome set of D. melanogaster (15410 sequences) takes 64 seconds of processing time, and the C. elegans proteome (22984 sequences) requires 80 seconds. In addition, UFO can also be used to annotate large collections of (translated) expressed sequence tags. For prokaryotes the UFO domain detection can be used as a basis for the prediction of operons or regulons. Furthermore, the server supports researchers in the identification of functionally related species that can be used for annotation. For the analysis of microbial communities, gene prediction tools specialized for short anonymous DNA fragments [13, 23, 24] or a simple six-frame translation can be used to apply UFO in functional metagenomics. In comparison to the more comprehensive MG-RAST server, UFO provides an easy-to-use interface with immediate response. For example, UFO profiling of the first of ten depth-specific data sets from the hypersaline microbial mat metagenome, which contains 12218 sequencing reads with an average length of 700 bp, requires 75 seconds for processing of the six-frame translated reads. The processing of all ten data sets takes about 15 minutes. 
Inspection of the top ten Pfams shows a remarkable count for sulfatase (PF00884) assignments in lower layers with a maximum of 135 assignments in the fifth layer (4-5 mm depth), which is in accordance with the results of the original study. In general, the profile divergence with respect to the reference genomes is of limited use for metagenome analysis because a metagenomic profile actually corresponds to a mixture of several different species. However, if the habitat is dominated by a few closely related species, the UFO "top 10 nearest species" list may nevertheless be informative. In the case of the hypersaline microbial mat, the UFO results indicate a dominant role of Cyanobacteria in the two uppermost layers (0-1 mm and 1-2 mm) with 9 and 6 out of 10 nearest species, respectively. This observation is in agreement with the analysis in the original study, which indicates that Cyanobacteria together with Alphaproteobacteria are the most abundant phyla in these layers. Especially in metagenomics, the GO profile may facilitate the analysis because Pfam assignments are accumulated in categories which directly relate to the biological questions. For example, considering the frequencies of the "chemotaxis" term (GO:0006935) for the hypersaline microbial mat data, the maximum count (55 assignments) is found in the third layer (2-3 mm) at the oxic-anoxic boundary, which agrees well with the original study.
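The simple six-frame translation mentioned above for preparing metagenomic reads can be sketched as follows. This is a minimal sketch, not the preprocessing used for the reported results: it assumes the standard genetic code, all names are hypothetical, and how stop codons ("*") are handled before submission is left open because the text does not specify it.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translation(read):
    """Return the three forward and three reverse-strand translations."""
    read = read.upper()
    strands = (read, reverse_complement(read))
    return [translate(strand[offset:])
            for strand in strands for offset in range(3)]
```

Each read yields six protein sequences, so a data set of reads grows sixfold in the number of sequences to be profiled.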
With a considerable speed up of protein domain detection, UFO shows a new perspective in web-based large scale analysis of protein sequence data. As a consequence of its speed it can be used for instantaneous profiling of genome scale protein sequence files. The processing time roughly corresponds to the duration of a single sequence analysis as provided by current protein database servers. On the scale of prokaryotic genomes, UFO can process thousands of whole genome protein sets a day. In that way UFO is well prepared for next generation sequencing technologies like single cell sequencing, which allows the extraction of whole genomes from highly diverse metagenomic samples.
Availability and requirements
The UFO web service is freely available at http://ufo.gobics.de.
I would like to thank Thomas Lingner for fruitful discussions and lots of technical support, Rasmus Steinkamp for web server and cluster support, Katharina Hoff for proof reading and Fabian Schreiber for Perl scripting.
- 2. Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM, Furlan M, Desnues C, Haynes M, Li L, McDaniel L, Moran MA, Nelson KE, Nilsson C, Olson R, Paul J, Brito BR, Ruan Y, Swan BK, Stevens R, Valentine DL, Thurber RV, Wegley L, White BA, Rohwer F: Functional metagenomic profiling of nine biomes. Nature. 2008, 452: 629-632. 10.1038/nature06810.
- 3. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 2001, 29: 22-28. 10.1093/nar/29.1.22.
- 4. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank ED, Gerdes S, Glass EM, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy AC, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V, Pusch GD, Rodionov DA, Ruckert C, Steiner J, Stevens R, Thiele I, Vassieva O, Ye Y, Zagnitko O, Vonstein V: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005, 33 (17): 5691-702. 10.1093/nar/gki866.
- 7. Portugaly E, Ninio M: HMMERHEAD - Accelerating HMM Searches On Large Databases. Proc. Eighth Ann. Int'l Conf. Computational Molecular Biology (RECOMB) - Poster Abstracts. 2004, 250-251.
- 12. Lingner T, Meinicke P: Fast Target Set Reduction for Large-Scale Protein Function Prediction: A Multi-class Multi-label Machine Learning Approach. Algorithms in Bioinformatics, 8th International Workshop, WABI, Lecture Notes in Computer Science. 2008, 5251: 198-209. 10.1007/978-3-540-87361-7.
- 15. Lima T, Auchincloss AH, Coudert E, Keller G, Michoud K, Rivoire C, Bulliard V, de Castro E, Lachaize C, Baratin D, Phan I, Bougueleret L, Bairoch A: HAMAP: a database of completely sequenced microbial proteome sets and manually curated microbial protein families in UniProtKB/Swiss-Prot. Nucleic Acids Res. 2009, 37 (suppl_1): D471-8. 10.1093/nar/gkn661.
- 16. Jeffreys H: Theory of Probability. 1961, Oxford: Clarendon Press, third edition.
- 19. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet. 2000, 25: 25-29. 10.1038/75556.
- 20. Kersey P, Bower L, Morris L, Horne A, Petryszak R, Kanz C, Kanapin A, Das U, Michoud K, Phan I, Gattiker A, Kulikova T, Faruque N, Duggan K, Mclaren P, Reimholz B, Duret L, Penel S, Reuter I, Apweiler R: Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucleic Acids Res. 2005, 33 (suppl_1): D297-302.
- 21. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Buillard V, Cerutti L, Copley R, Courcelle E, Das U, Daugherty L, Dibley M, Finn R, Fleischmann W, Gough J, Haft D, Hulo N, Hunter S, Kahn D, Kanapin A, Kejariwal A, Labarga A, Langendijk-Genevaux PS, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J, McAnulla C, McDowall J, Mistry J, Mitchell A, Nikolskaya AN, Orchard S, Orengo C, Petryszak R, Selengut JD, Sigrist CJA, Thomas PD, Valentin F, Wilson D, Wu CH, Yeats C: New developments in the InterPro database. Nucleic Acids Res. 2007, 35 (suppl_1): D224-8. 10.1093/nar/gkl841.
- 22. Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM, Kubal M, Meyer F, Olsen GJ, Olson R, Osterman AL, Overbeek RA, McNeil LK, Paarmann D, Paczian T, Parrello B, Pusch GD, Reich C, Stevens R, Vassieva O, Vonstein V, Wilke A, Zagnitko O: The RAST Server: rapid annotations using subsystems technology. BMC Genomics. 2008, 9: 75. 10.1186/1471-2164-9-75.
- 25. Meyer F, Paarmann D, D'Souza M, Olson R, Glass E, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, Wilkening J, Edwards R: The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008, 9: 386. 10.1186/1471-2105-9-386.
- 26. Kunin V, Raes J, Harris J, Spear J, Walker J, Ivanova N, von Mering C, Bebout B, Pace N, Bork P, Hugenholtz P: Millimeter-scale genetic gradients and community level molecular convergence in a hypersaline microbial mat. Mol Syst Biol. 2008, 4: 198. 10.1038/msb.2008.35.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.