FastAnnotator- an efficient transcript annotation web tool
Recent developments in high-throughput sequencing (HTS) technologies have made it feasible to sequence the complete transcriptomes of non-model organisms or metatranscriptomes from environmental samples. The challenge after generating hundreds of millions of sequences is to annotate these transcripts and classify the transcripts based on their putative functions. Because many biological scientists lack the knowledge to install Linux-based software packages or maintain databases used for transcript annotation, we developed an automatic annotation tool with an easy-to-use interface.
To elucidate the potential functions of gene transcripts, we integrated well-established annotation tools: Blast2GO, PRIAM and RPS BLAST in a web-based service, FastAnnotator, which can assign Gene Ontology (GO) terms, Enzyme Commission numbers (EC numbers) and functional domains to query sequences.
Using six transcriptome sequence datasets as examples, we demonstrated the ability of FastAnnotator to assign functional annotations. FastAnnotator annotated 88.1% and 81.3% of the transcripts from the well-studied organisms Caenorhabditis elegans and Streptococcus parasanguinis, respectively. Furthermore, FastAnnotator annotated 62.9%, 20.4%, 53.1% and 42.0% of the sequences from the transcriptomes of sweet potato, clam, amoeba, and Trichomonas vaginalis, respectively, which lack reference genomes. We demonstrated that FastAnnotator can complete the annotation process in a reasonable amount of time and is suitable for the annotation of transcriptomes from model organisms or organisms for which annotated reference genomes are not available.
The sequencing process no longer represents the bottleneck in the study of genomics, and automatic annotation tools have become invaluable as the annotation procedure has become the limiting step. We present FastAnnotator, an automated annotation web tool designed to efficiently annotate sequences with their gene functions, enzyme functions or domains. FastAnnotator is useful in transcriptome studies, especially those focusing on non-model organisms or metatranscriptomes. FastAnnotator does not require local installation and is freely available at http://fastannotator.cgu.edu.tw.
Keywords: Gene Ontology; Functional Annotation; Domain Identification; Annotation Tool; Annotation Pipeline
As sequencing technologies have improved, transcriptome sequencing (RNA-Seq) and whole-genome sequencing have become faster and cheaper than ever before. In addition, sequencing projects addressing non-model organisms or environmental samples (metagenomics) have now become feasible and affordable. For example, the sequencing of relatively unexplored organisms, such as the human gut microbiome or the transcriptome of clams or butterflies, has been accomplished [1, 2, 3]. However, sequencing is only the first step in the process of understanding the biology underlying the obtained sequences. Bioinformatics analyses performed after sequencing, including assembly, functional annotation and classification, are becoming increasingly important. Several annotation pipelines have been developed, such as CycADS, PRIAM and Blast2GO [4, 5, 6], which can provide putative functions for a transcript based on sequence similarity to known genes. Although these tools are useful for understanding newly explored sequences, they usually involve many manual steps and are often difficult to use for biologists who are unfamiliar with the command line. Web tools have been developed for the annotation of expressed sequence tags (ESTs), such as ESTAnnotator, ESTpass, and ESTExplorer [7, 8, 9, 10]. These pipelines are specifically designed for EST analyses that include the cleaning, assembly, clustering and functional annotation of ESTs. However, we found that the online Uniform Resource Locators (URLs) for ESTpass and ESTAnnotator are no longer functional, and ESTExplorer cannot simultaneously process more than ten thousand contigs. Thus, the development of a web server that can provide an automatic annotation pipeline for contigs derived from RNA-Seq data with a user-friendly interface would be beneficial to the research community.
Functional annotation strategies are typically based on previously identified protein functions. Many protein function databases, such as BRENDA, Enzyme and Amigo, have been established as collections based on the functions of enzymes or genes [11, 12, 13]. These databases categorize protein functions into structured groups as either Enzyme Commission numbers (EC numbers) or Gene Ontology (GO) terms. EC numbers are assigned to enzymes according to the chemical reactions they catalyze, whereas GO terms include information on the molecular function, cellular component and biological processes of genes and can be used to describe the biological function of proteins. These protein function or enzyme function databases document well-studied and well-annotated biological functions and provide resources for the annotation of newly sequenced genes. The majority of established annotation methods utilize these functional annotation systems and transfer functional annotations among sequences based on sequence similarity or pattern searches. For example, the CycADS pipeline integrates multiple annotation tools and databases; PRIAM identifies enzyme functions based on profiles constructed from known enzymes; and Blast2GO annotates gene functions based on a combination of various annotation methods. The strategy of annotation transfer has been shown to work well and is frequently employed.
Herein, we present an automatic, efficient and easy-to-use web-based annotation tool named FastAnnotator. Given that many tools have been proposed to annotate sequences based on previously established knowledge, we chose to integrate several well-established and popular tools to construct this service. FastAnnotator utilizes Blast2GO and PRIAM to identify GO terms and EC numbers for transcript sequences. Blast2GO has a high annotation accuracy (65-70%), and PRIAM provides an enzyme profile database that was updated in October 2011. As certain assembled sequences identified using RNA-Seq may not cover full-length coding regions, we also included a domain search in FastAnnotator. In addition to making it possible to identify domains in partial transcript sequences, domain searching enables FastAnnotator to identify possible functions for sequences that show a significant level of divergence. This feature was included because sequences may evolve to become dissimilar while their functional domain regions are likely to remain conserved under functional constraints. It has also been shown that proteins having the same domain composition are homologs and likely have the same function [17, 18, 19]. Therefore, domain annotation in FastAnnotator can not only provide an annotation for a partially sequenced transcript but also improve the annotation performance for sequences that are highly divergent from existing database sequences. This property is particularly useful for metatranscriptomic analyses or samples from organisms that are distantly related to model organisms.
In summary, we have developed the FastAnnotator pipeline to provide automatic annotation of nucleotide sequences via a web interface. Users can begin their annotation by uploading their sequences to the FastAnnotator website. FastAnnotator then assigns possible functions and identifies domains in the input query sequences based on sequence similarity and pattern searches. The output of FastAnnotator includes the best hits in the NCBI non-redundant database, GO terms, EC numbers, and domain identities [20, 21]. Users can explore the output on the website, access other external functional databases (BRENDA, Amigo or Pfam) for detailed functional descriptions and download the output for further analysis. As a web service specifically developed for ease of use, FastAnnotator does not demand any knowledge of the command line. Moreover, FastAnnotator is free, efficient and user-friendly. Relative to existing annotation pipelines, FastAnnotator is more convenient because it does not require installation or tedious manual steps and allows the user to analyze a large number of sequences with a single click. FastAnnotator is now available at http://fastannotator.cgu.edu.tw/.
Flowchart of the FastAnnotator pipeline
The identification of GO terms
The alignment result generated by LAST is then used as an input to Blast2GO. The final assignments of GO terms are extracted and presented in a table on the website, which is made available for download.
The identification of domains
We downloaded the standalone BLAST+ (v2.2.25) program from NCBI and used the 13,672 domain models (Pfam v26) from the Conserved Domain Database (CDD) as the search database [19, 20]. FastAnnotator applies rpstblastn to identify domains in the query nucleotide sequences by searching against the preformatted domain database with mostly default parameters. The exceptions are that the expectation value (e-value) of a hit must be less than 0.01 and the aligned length of a hit must be longer than 50% of the domain position-specific scoring matrix (PSSM). After the domains of the query sequences are identified, FastAnnotator calculates the length coverage and presents the percentage of coverage for each domain in the report table.
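The two hit filters and the coverage percentage described above can be illustrated with a short Python sketch. This is not the FastAnnotator source code: the `DomainHit` record, its field names, and the example hits are hypothetical, and parsing of actual rpstblastn output is left out. Only the two thresholds (e-value below 0.01, aligned length above 50% of the domain model) come from the text.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 0.01       # from the text: e-value must be less than 0.01
MIN_DOMAIN_FRACTION = 0.5   # from the text: aligned length > 50% of the domain PSSM


@dataclass
class DomainHit:
    """Hypothetical record for one domain hit (field names are assumptions)."""
    query_id: str
    domain_id: str
    evalue: float
    aligned_length: int   # number of domain positions covered by the alignment
    domain_length: int    # full length of the domain model (PSSM)


def coverage_percent(hit):
    """Percentage of the domain model covered by the aligned region."""
    return round(100.0 * hit.aligned_length / hit.domain_length, 1)


def filter_domain_hits(hits):
    """Keep hits passing both thresholds; return (hit, coverage %) pairs."""
    kept = []
    for hit in hits:
        fraction = hit.aligned_length / hit.domain_length
        if hit.evalue < E_VALUE_CUTOFF and fraction > MIN_DOMAIN_FRACTION:
            kept.append((hit, coverage_percent(hit)))
    return kept


# Toy hits, invented for illustration only.
example_hits = [
    DomainHit("contig_1", "domain_A", 1e-20, 200, 250),  # passes both filters
    DomainHit("contig_2", "domain_B", 0.5, 150, 160),    # e-value too high
    DomainHit("contig_3", "domain_C", 1e-5, 40, 260),    # covers too little of the domain
]
kept_hits = filter_domain_hits(example_hits)
```

In this toy input only `contig_1` survives, with 80.0% of its domain model covered.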
The identification of EC numbers
FastAnnotator utilizes PRIAM to identify potential enzyme functions. PRIAM can detect specific enzyme patterns and annotate these enzymes with EC numbers. The latest version of the enzyme profiles (released on 19 Oct, 2011) was downloaded from the PRIAM website. The nucleotide sequence inputs are translated into protein sequences in six frames using Transeq, a tool that is part of the European Molecular Biology Open Software Suite (EMBOSS). These translated protein sequences are then used as inputs to search against the enzyme profile database. FastAnnotator identifies transcripts that may act as enzymes and presents the transcripts together with EC numbers in the output table.
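For readers unfamiliar with six-frame translation, the idea can be sketched in a few lines of Python. This is an illustration of the concept only, not a reimplementation of Transeq: it assumes the standard genetic code, writes unknown codons as `X`, and uses frame labels (`+1` to `+3`, `-1` to `-3`) that are our own and may not match Transeq's numbering of reverse frames.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(seq):
    """Return a dict mapping a frame label to its protein translation."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
# frames["+1"] is "MA*" (Met, Ala, stop)
```

Each of the six protein strings would then be searched against the enzyme profiles, so that a coding region is found regardless of strand or reading frame.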
Results and discussion
To demonstrate the performance and speed of FastAnnotator, we ran it on several transcriptome sequence datasets from different organisms. We were especially interested in how FastAnnotator could assist in the annotation of non-model organisms. The unpublished RNA-Seq dataset from a clam (Meretrix meretrix) was used as an example. This transcriptome included 22,129,105 cDNA sequence reads generated using an Illumina Genome Analyzer, and the average read length was 80 base pairs. A reference genome for the clam is currently not available, and thus, the reads were assembled de novo into 101,795 contigs using the software package CLC Genomics Workbench (CLC bio, Denmark). Using these contigs as the input, FastAnnotator required approximately 16 hours to finish the annotation. The statistical output showed that the N50 of these contigs was 390 nucleotides and the majority of these assembled contigs (73,878 contigs) ranged from 200 to 399 base pairs. This large number of short contigs may have resulted from a low level of coverage, and we believe that many of these contigs were actually partial transcript sequences. Of these contigs, 24,919 were found to be similar to sequences in the NCBI non-redundant protein database; 15,112 were assigned at least one Gene Ontology term; 13,015 were found to contain at least one domain; and 585 were assigned an enzyme annotation. Among all of the contigs, 20.4% were annotated, and the annotation rate was even higher (26.4%) if we only considered contigs longer than 250 base pairs. This contig annotation rate was slightly higher than that reported in previous studies [1, 29], the majority of which were based on domain identification. A further comparison of our annotation results with a recently published clam transcriptomic analysis by Huan et al. showed that our annotation results for the clam transcriptome were similar.
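The summary statistics mentioned above (N50 and the annotation rate with an optional minimum contig length) can be computed as in the following sketch. The function names and the toy contig list are hypothetical and are not taken from FastAnnotator; the N50 definition used here is the common one, the contig length at which the cumulative length of contigs sorted from longest to shortest first reaches half of the total assembly length.

```python
def n50(lengths):
    """Contig length at which the running total (longest first) reaches half the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 is undefined for an empty list of contigs")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def annotation_rate(contigs, longer_than=0):
    """Percentage of annotated contigs among those longer than `longer_than` bp.

    `contigs` is a list of (length, is_annotated) pairs.
    """
    selected = [annotated for length, annotated in contigs if length > longer_than]
    if not selected:
        return 0.0
    return round(100.0 * sum(selected) / len(selected), 1)


# Toy data, invented for illustration only: (contig length, annotated?)
toy_contigs = [(210, False), (240, False), (300, True), (450, False), (800, True)]

toy_n50 = n50([length for length, _ in toy_contigs])
rate_all = annotation_rate(toy_contigs)
rate_long = annotation_rate(toy_contigs, longer_than=250)
```

With the toy data, dropping the two shortest contigs raises the annotation rate, which mirrors the direction of the effect reported for the clam contigs longer than 250 base pairs.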
Table: FastAnnotator results for five different organisms. Columns: # of entries (total bases); % of sequences with best hit; % of annotated sequences. Surviving row labels: C. elegans; Trichomonas vaginalis. [Table values not recoverable from the source.]
GO annotation is the most commonly used and well-established functional annotation scheme. To identify GO terms for the input sequences, FastAnnotator incorporates the Blast2GO pipeline, which is one of the most widely used annotation tools and claims to have an annotation accuracy of 65-70%. However, we made a small modification to the Blast2GO pipeline by replacing BLASTX with LAST, as LAST is significantly faster and is capable of detecting evolutionarily conserved regions. We also used the example file provided by the B2G4Pipe package to benchmark the computation time and compare the output of BLASTX and LAST on the same machine (IBM X3850 with four Xeon E7540 2.0 GHz CPUs and 128 GB RAM). We found that for this file of 10 nucleotide sequences, BLASTX took 5 times longer than LAST to search against the non-redundant database. The resulting 30 annotations reported by each tool differed in only 2 cases. A closer inspection of the results revealed that the differences were minor and could be explained by parent-child relationships within the GO directed acyclic graph. Therefore, we concluded that replacing BLASTX with LAST resulted in a minimal difference in the annotation while providing a substantial improvement in the computational speed.
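The kind of comparison described above, in which differing annotations are checked for a parent-child relationship in the GO graph, can be sketched as follows. This is a hypothetical illustration and not the procedure actually used in the benchmark: the term names (`T1` to `T5`), the `parents` mapping, and the two annotation sets are invented, and a real comparison would load the relationships from the Gene Ontology itself.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_related(term_a, term_b, parents):
    """True if one term is an ancestor of the other."""
    return term_a in ancestors(term_b, parents) or term_b in ancestors(term_a, parents)


def compare_annotations(ann_a, ann_b, parents):
    """Count shared terms, and differing terms that are or are not ancestor-related."""
    report = {"identical": 0, "related": 0, "unrelated": 0}
    for seq_id in ann_a.keys() | ann_b.keys():
        terms_a = ann_a.get(seq_id, set())
        terms_b = ann_b.get(seq_id, set())
        report["identical"] += len(terms_a & terms_b)
        for mine, theirs in ((terms_a - terms_b, terms_b), (terms_b - terms_a, terms_a)):
            for term in mine:
                if any(is_related(term, other, parents) for other in theirs):
                    report["related"] += 1
                else:
                    report["unrelated"] += 1
    return report


# Toy graph and annotations, invented for illustration only.
toy_parents = {"T3": ["T2"], "T2": ["T1"], "T5": ["T4"]}
from_tool_a = {"s1": {"T3"}, "s2": {"T4"}}
from_tool_b = {"s1": {"T2"}, "s2": {"T4"}, "s3": {"T5"}}

toy_report = compare_annotations(from_tool_a, from_tool_b, toy_parents)
```

Here the two tools disagree on `s1`, but `T2` is the parent of `T3`, so both differing terms are counted as related rather than as genuine conflicts.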
Regarding the detection of enzymes, FastAnnotator annotates enzyme functions based on the database of 2,844 EC numbers included in the most recent PRIAM release. According to the Integrated relational Enzyme database (IntEnz) release 76, there were 4,812 active EC numbers. Furthermore, it is well known that more than one-third of the enzyme activities with EC numbers are so-called orphan enzyme activities, which are not associated with a protein or gene sequence [32, 33, 34]. Approximately 50% of the orphan enzyme activities are limited to only one species or closely related organisms, which implies that orphan enzyme activities may be limited to certain organisms that remain to be fully explored. Due to these limitations, current annotation tools that are based on sequence similarity searches are restricted to detecting only certain enzyme activities. Consequently, FastAnnotator may overlook certain enzyme functional annotations, especially if those functions are restricted to particular organisms. In our clam transcriptome, for example, only 585 contigs were identified as enzymes. Because it has been estimated that approximately 18-29% of genes encode enzymes in eukaryotes, it is very likely that this level of enzyme annotation was an underestimate. This may have occurred because the majority of the contigs in our clam example are only partial transcripts and because the enzyme database used in FastAnnotator was incomplete. FastAnnotator may provide improved annotations in the future after additional enzyme functions and associated sequences are identified and included in the enzyme databases.
As sequencing technologies improve and decrease in cost, an increasing number of sequences can be generated. However, the annotation of these sequences, especially those generated from unfamiliar biological samples, becomes an important issue. In this project, we developed FastAnnotator, an automatic annotation web tool that integrates several well-developed annotation tools to provide annotations for query sequences. FastAnnotator allows users to assign protein functions, cellular locations, enzyme activities, and functional domains to query sequences through an easy-to-use interface. By adopting a different sequence search program, LAST, and including domain identification, it is capable of efficiently annotating sequences and is suitable for the annotation of sequences derived from less well-studied organisms or environmental samples. In summary, we present a web-based annotation tool, FastAnnotator, which should be helpful in transcriptome studies, particularly metatranscriptome or non-model organism studies.
Availability and requirements
Project name: FastAnnotator
Project home page: http://fastannotator.cgu.edu.tw
Hardware specifications: IBM X3850 with four Xeon E7540 2.0 GHz CPUs and 128 GB RAM
Operating system(s): CentOS Release 5.7 with Linux kernel 2.6.18
Programming language: Python and Perl for pipeline building, PHP for website interface.
This work is supported by a grant (CMRPD190142) from the Chang Gung Memorial Hospital to PT. We thank Dr. Frith for his useful suggestion on parsing the output data from LAST.
This article has been published as part of BMC Genomics Volume 13 Supplement 7, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S7.
- 4. Vellozo AF, Veron AS, Baa-Puyoulet P, Huerta-Cepas J, Cottret L, Febvay G, Calevro F, Rahbe Y, Douglas AE, Gabaldon T, et al: CycADS: an annotation database system to ease the development and update of BioCyc databases. Database: the journal of biological databases and curation. 2011, 2011: bar008.
- 7. Nagaraj SH, Deshpande N, Gasser RB, Ranganathan S: ESTExplorer: an expressed sequence tag (EST) assembly and annotation platform. Nucleic Acids Res. 2007, 35 (Web Server): W143-147.
- 9. Lee B, Hong T, Byun SJ, Woo T, Choi YJ: ESTpass: a web-based server for processing and annotating expressed sequence tag (EST) sequences. Nucleic Acids Res. 2007, 35 (Web Server): W159-162.
- 12. Scheer M, Grote A, Chang A, Schomburg I, Munaretto C, Rother M, Sohngen C, Stelzer M, Thiele J, Schomburg D: BRENDA, the enzyme information system in 2011. Nucleic Acids Res. 2011, 39 (Database): D670-676.
- 14. Webb EC: Enzyme nomenclature 1992: recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the nomenclature and classification of enzymes. 1992, San Diego: Published for the International Union of Biochemistry and Molecular Biology by Academic Press.
- 19. Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR: CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic Acids Res. 2011, 39 (Database): D225-229.
- 20. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J: The Pfam protein families database. Nucleic Acids Res. 2012, 40 (Database): D290-301.
- 21. Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007, 35 (Database): D61-65.
- 31. Fleischmann A, Darsow M, Degtyarenko K, Fleischmann W, Boyce S, Axelsen KB, Bairoch A, Schomburg D, Tipton KF, Apweiler R: IntEnz, the integrated relational enzyme database. Nucleic Acids Res. 2004, 32 (Database): D434-437.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.