Background

The "secretome" refers to the collection of proteins that contain a signal peptide and are processed via the endoplasmic reticulum and Golgi apparatus before secretion [1]. In organisms from bacteria to humans, secretory proteins are common and perform diverse functions. These functions include roles in the immune system [2], roles as neurotransmitters in the nervous system [3], roles as hormones/pheromones [4], acquisition of nutrients [5-7], building and remodeling of cell walls [8], signaling and environmental sensing [9], and competition with other organisms [10-13]. Some secretory proteins in pathogens function as effectors, which carry special signatures and manipulate and/or destroy host cells. In Plasmodium and Phytophthora species, effectors carry the RXLX [EDQ] or RXLR motifs as host targeting signals [11-13].

With the aid of advanced genome sequencing technologies [14], the rapid increase of sequenced fungal genomes offers many opportunities to study the function and evolution of secretory proteins at the genome level [15, 16]. The Comparative Fungal Genomics Platform (CFGP; http://cfgp.snu.ac.kr/) [16] now archives 235 genomes from 120 fungal/oomycete species. The accurate prediction of secretory proteins in sequenced genomes is the key to realizing such opportunities.

The widely used SignalP 3.0 program [17] detected 89.81% of the 2,512 experimentally verified sequences in SPdb [18], a database containing proteins with signal peptides. To improve the accuracy of prediction, we built a hierarchical identification pipeline based on nine prediction programs (Table 1). Through this pipeline, putative secretory proteins, including pathogen effectors, encoded by 158 fungal and oomycete genomes were identified. The Fungal Secretome Database (FSD; http://fsd.snu.ac.kr/) was established to support not only the archiving of fungal secretory proteins but also the management and use of the resulting data. The FSD also has a user-friendly web interface and offers several data analysis functions via Favorite, a personalized data repository implemented in the CFGP (http://cfgp.snu.ac.kr/)[16].

Table 1 List of prediction programs used in FSD

Construction and content

Evaluation of the pipeline for predicting secretory proteins

To evaluate the capabilities of four programs (SignalP 3.0 [17], SigCleave [19], SigPred [20], and RPSP [21]) for predicting signal peptides, we analyzed the secretory proteins collected in SPdb [18]. SignalP 3.0 identified 89.81% of the 2,512 proteins; the other three programs, used in combination, identified 87.50% of the proteins that SignalP 3.0 had missed. The remaining proteins (1.31% of the 2,512 proteins) were investigated using two programs that predict subcellular localization: PSort II [22] and TargetP 1.1b [23]. Of these, 34.38% were predicted to be extracellular proteins, increasing the coverage to 99.16%. For the 1,093 characterized fungal/oomycete secretory proteins (Table 2), the combined pipeline raised the prediction coverage from 75.30% with SignalP 3.0 alone to 84.17%. In addition, the pipeline predicted 98.14% of the 24,921 experimentally unverified sequences in SPdb as secretory proteins, whereas SignalP 3.0 alone identified 80.22% of them. To assess the robustness of the pipeline against non-secretory proteins, we prepared yeast proteins localized in the cytosol, endoplasmic reticulum, nucleus, or mitochondrion [24]. When these 1,955 proteins were subjected to the FSD pipeline and SignalP 3.0, the numbers of false positives were almost the same (84 and 82, respectively). Together, these results suggest that this ensemble approach could compensate for some of the weaknesses of individual programs, resulting in more robust predictions. Additionally, SecretomeP 1.0f [25], which can predict non-classical secretory proteins, was integrated into the FSD.
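The stepwise gain in coverage on the SPdb set can be retraced with a few lines of arithmetic. The sketch below is illustrative only: the intermediate protein counts are reconstructed from the rounded percentages quoted above (an assumption) and may differ slightly from the authors' exact tallies.

```python
# Retrace the cumulative coverage of the three pipeline steps on SPdb.
# Counts are reconstructed from the rounded percentages in the text
# (an assumption); the exact tallies are not given in the paper.
total = 2512

signalp_hits = round(total * 0.8981)       # step 1: SignalP 3.0
missed = total - signalp_hits              # not predicted by SignalP 3.0
sp3_hits = round(missed * 0.8750)          # step 2: SigCleave / SigPred / RPSP
remaining = missed - sp3_hits              # passed on to localization predictors
sl_hits = round(remaining * 0.3438)        # step 3: PSort II / TargetP 1.1b

coverage = (signalp_hits + sp3_hits + sl_hits) / total
print(signalp_hits, sp3_hits, sl_hits)
print(f"cumulative coverage: {coverage:.2%}")
```

Under these reconstructed counts the cumulative coverage agrees with the 99.16% reported above.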

Table 2 List of references and annotation results of characterized fungal secretory proteins

The FSD contains an identification pipeline that sequentially analyzes proteomes of interest using i) SignalP 3.0; ii) a combination of SigCleave, SigPred, and RPSP to screen those proteins not considered positive by SignalP 3.0; and iii) PSort II and TargetP 1.1b to analyze the negatives from the previous step. Additionally, SecretomeP 1.0f was integrated to provide information related to non-classical secretory proteins. To eliminate potential false positives, we filtered out proteins that i) contain more than one transmembrane helix predicted by TMHMM 2.0c [26] and/or ii) carry the endoplasmic reticulum retention signal ([KRHQSA]-[DENQ]-E-L; classified as false-positive; Figure 1A) [27]. In addition, iii) nuclear proteins predicted by both predictNLS [28] and PSort II [22] and iv) mitochondrial proteins predicted by both PSort II [22] and TargetP 1.1b [23] were eliminated because these two subcellular localizations are not associated with secretory proteins.
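The four false-positive filters can be expressed as a single predicate. The following is a minimal sketch rather than the FSD's own Perl code: the `Predictions` record and its field names are hypothetical, and anchoring the retention signal at the C-terminus is an assumption, since the text gives only the pattern.

```python
import re
from dataclasses import dataclass

# ER retention signal [KRHQSA]-[DENQ]-E-L from the text.
# Anchoring it to the C-terminus ("$") is an assumption.
ER_RETENTION = re.compile(r"[KRHQSA][DENQ]EL$")


@dataclass
class Predictions:
    """Hypothetical per-protein summary of upstream program outputs."""
    sequence: str
    tm_helices: int        # number of helices reported by TMHMM 2.0c
    nls_predictnls: bool   # nuclear according to predictNLS
    nuclear_psort: bool    # nuclear according to PSort II
    mito_psort: bool       # mitochondrial according to PSort II
    mito_targetp: bool     # mitochondrial according to TargetP 1.1b


def is_false_positive(p: Predictions) -> bool:
    """Return True if any of the four filters described above applies."""
    if p.tm_helices > 1:                        # i) more than one TM helix
        return True
    if ER_RETENTION.search(p.sequence):         # ii) ER retention signal
        return True
    if p.nls_predictnls and p.nuclear_psort:    # iii) nuclear, both programs agree
        return True
    if p.mito_psort and p.mito_targetp:         # iv) mitochondrial, both programs agree
        return True
    return False
```

Note that filters iii and iv require agreement between two programs, so a protein flagged as nuclear by predictNLS alone is retained.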

Figure 1

FSD class definitions and the FSD pipeline. (A) Definitions of four FSD classes. The gray round rectangle indicates the total set of proteins, and the light blue arrows going outside the rectangle show the filtering out processes of the pipeline. The black rectangles show the names of the classes, the yellow arrows indicate expansion of the putative secretome boundary, and the white-bordered blue cross indicates additional information on the putative secretome. (B) Structure of the FSD pipeline. The two parallelograms are input data for the FSD pipeline. The rectangle in the middle indicates the process for identifying putative secretory proteins. The round rectangles indicate the four FSD classes. The gray square on the right represents the thirteen different analysis functions in Favorite.

Following analysis via the pipeline and removal of potential false positives, the resulting putative secretory proteins are divided into four classes: i) SP contains all proteins predicted by SignalP 3.0; ii) SP3 contains the proteins predicted by SigPred, SigCleave, or RPSP but not by SignalP 3.0; iii) SL contains the proteins predicted by PSort II and/or TargetP 1.1b but not by the first two steps; and iv) NS contains the proteins predicted by SecretomeP 1.0f but not by SignalP 3.0 (Figure 1A; Table 3).
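The class definitions amount to a short decision rule. The sketch below is one possible reading, not the FSD implementation: the function name and boolean arguments are hypothetical, and reporting NS as a separate flag alongside the primary class is an assumption based on definition iv, which excludes only SignalP 3.0 positives.

```python
from typing import Optional, Tuple


def assign_class(
    signalp: bool,
    sigcleave: bool,
    sigpred: bool,
    rpsp: bool,
    psort_extracellular: bool,
    targetp_secretory: bool,
    secretomep: bool,
) -> Tuple[Optional[str], bool]:
    """Return (primary class, NS flag) for a protein that passed the filters.

    The primary class is "SP", "SP3", "SL", or None; the NS flag marks
    SecretomeP 1.0f positives that SignalP 3.0 did not predict.
    """
    if signalp:
        primary = "SP"                    # i) predicted by SignalP 3.0
    elif sigcleave or sigpred or rpsp:
        primary = "SP3"                   # ii) any of the three other programs
    elif psort_extracellular or targetp_secretory:
        primary = "SL"                    # iii) localization predictors only
    else:
        primary = None                    # not in classes SP, SP3, or SL

    is_ns = secretomep and not signalp    # iv) assumption: independent flag
    return primary, is_ns
```

For example, `assign_class(False, True, False, False, False, False, True)` yields `("SP3", True)` under this reading.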

Table 3 Class definitions used in FSD

System structure of the FSD

To improve the expandability and flexibility of the FSD, we adopted a three-layer structure (i.e., data warehouse, analysis pipeline, and user interface) in its design. The data warehouse was established using the standardized genome warehouse managed by the CFGP (http://cfgp.snu.ac.kr/) [16], which has been used in various bioinformatics systems [15, 29-35]. The pipeline layer was built with a series of Perl programs.

In addition to the prediction programs described above, ChloroP 1.1 as well as hydropathy plots [36] were included in the FSD to provide additional information on secretory proteins. Whenever new fungal genomes become available, the automated pipeline classifies them based on the predictions of nine programs, thus keeping the FSD current (Figure 1B).
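A hydropathy plot is a sliding-window average of per-residue hydropathy values along the sequence. The sketch below assumes the Kyte-Doolittle scale, which the text does not name explicitly; the function name is hypothetical, and the window sizes of 11 and 19 follow the Figure 7 caption.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the text cites only [36]).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 11) -> list:
    """Mean hydropathy for each full window centred on a residue."""
    if window % 2 == 0 or window > len(sequence):
        raise ValueError("window must be odd and no longer than the sequence")
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    half = window // 2
    return [
        sum(scores[i - half:i + half + 1]) / window
        for i in range(half, len(sequence) - half)
    ]


# Two profiles per protein, matching the window sizes named in Figure 7.
example = "MKFLALLALLAVASAAPTEEKRQDDE"
profiles = {w: hydropathy_profile(example, w) for w in (11, 19)}
```

Each profile has `len(sequence) - window + 1` points, so the wider window gives a smoother but shorter curve.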

MySQL 5.0.67 and PHP 5.2.9 were used to maintain the database and to develop web-based user interfaces that present complex information intuitively. Web pages were served through Apache 2.2.11. Favorite, a personal data repository used in the CFGP (http://cfgp.snu.ac.kr/) [16], was integrated to provide thirteen functions for further analyses.

Utility and Discussion

Discussion

Secretory proteins in 158 fungal/oomycete genomes

To survey the genome-wide distribution of secretory proteins in fungi and oomycetes, we used the pipeline to analyze all predicted proteins encoded by 158 fungal/oomycete genomes. Of the 1,373,444 open reading frames (ORFs) analyzed, 92,926 (6.77%), 103,224 (7.52%), and 12,733 (0.93%) proteins belonged to classes SP, SP3, and SL, respectively (Tables 4, 5, and 6). In total, 208,883 ORFs (15.21%) were denoted putative secretory proteins. The proteins belonging to class NS were not included in the putative secretome because they represented more than 40% of the whole proteome.
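The class proportions follow directly from the ORF counts given above, as this short check illustrates; the variable names are arbitrary.

```python
# ORF counts per class, as reported in the text.
total_orfs = 1_373_444
class_counts = {"SP": 92_926, "SP3": 103_224, "SL": 12_733}

# Percentage of all ORFs falling into each class, rounded to two decimals.
class_percent = {
    name: round(100 * count / total_orfs, 2)
    for name, count in class_counts.items()
}

# The putative secretome is the union of the three (disjoint) classes.
secretome_total = sum(class_counts.values())
secretome_percent = round(100 * secretome_total / total_orfs, 2)

print(class_percent)
print(secretome_total, secretome_percent)
```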

Table 4 List and distribution of secretion-associated proteins of the fungal genomes belonging to the subphylum Pezizomycotina archived in FSD
Table 5 List and distribution of secretion-associated proteins of the fungal genomes belonging to the subphylum Saccharomycotina and Taphrinomycotina archived in FSD
Table 6 List and distribution of secretion-associated proteins of the fungal genomes belonging to the phyla Basidiomycota, Chytridiomycota, and Microsporidia, the subphylum Mucoromycotina, and the phylum Peronosporomycota (oomycetes) archived in FSD

To determine the phylum-level distribution of classes SP, SP3, and SL within fungi, we investigated the proportions of the three classes among subphyla (Figure 2). Class SP3 was the largest, class SP was slightly smaller, and class SL was much smaller; this was consistent across every subphylum. Only in Plasmodium species, oomycetes, and the kingdom Metazoa was class SP dominant. Class SL did not exceed 2.10% of the whole genome, except in Plasmodium species (4.52%). Plasmodium species also showed the lowest variance among the three classes, which may reflect signal peptide-independent types of secretory proteins such as those carrying vacuolar transport signals (VTSs) [12]. These results may be partially affected by the composition of the training data for each prediction program and by inherent features of each algorithm.

Figure 2

Distribution of three classes at the phylum/subphylum level. The average ratios of the classes to the total ORFs at the subphylum and phylum levels are described. The orange circular arc represents the fungal kingdom, and the four light blue round boxes represent phyla or kingdoms. Inside the chart, the blue line represents the ratio of class SP; the red line, class SP3; and the green line, class SL.

The phylum Basidiomycota had a larger proportion of secretory proteins (17.90%) than other fungal taxa such as the subphylum Mucoromycotina (11.99%) and the phyla Ascomycota (12.87%) and Microsporidia (15.10%). Within the phylum Ascomycota, the subphylum Pezizomycotina showed a higher proportion of class SP (7.82%) than the subphyla Saccharomycotina and Taphrinomycotina (4.57% and 3.74%, respectively). Considering that the subphylum Pezizomycotina contains many pathogenic fungi (47 of 59) compared with the subphylum Saccharomycotina (11 of 65), the abundance of secretory proteins in the subphylum Pezizomycotina suggests that pathogens may have larger secretomes than saprophytes in general. In fact, Magnaporthe oryzae and Neurospora crassa, a closely related pathogen and non-pathogen pair supported by recent phylogenomic studies [37-39], contain 22.31% and 16.93% of secretory proteins, respectively. Moreover, the same tendency was found across the 158 fungal/oomycete genomes archived in the FSD (14.06% for pathogens and 11.70% for saprophytes).

Effectors encoded by fungal/oomycete and Plasmodium genomes

Phytophthora species, a group that includes many important plant pathogens, use an RXLR signal to secrete effectors into host cells [40]. RXLR effectors were tightly co-located with signal peptides predicted by SignalP 3.0 with high confidence values (HMM and NN scores of 0.93 and 0.65, respectively) [41]. Using the same conditions, we identified 734 putative RXLR effectors from three Phytophthora species, similar to a previous study [42]. In contrast, in the 153 fungal genomes only 0.04% of the total proteome contained this motif, suggesting that the use of RXLR for secretion is oomycete-specific.
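The screening condition described here, a confident SignalP 3.0 signal peptide combined with an RXLR motif, can be sketched as follows. Treating 0.93 and 0.65 as minimum cutoffs is an assumption, as is the absence of any positional constraint on the motif; the function name and score arguments are hypothetical and do not reflect the SignalP output format.

```python
import re

# RXLR: arginine, any residue, leucine, arginine.
RXLR = re.compile(r"R.LR")

# Treated here as minimum score cutoffs (an assumption about the text's
# "HMM and NN ... 0.93 and 0.65").
HMM_MIN = 0.93
NN_MIN = 0.65


def is_putative_rxlr_effector(sequence: str, hmm_score: float, nn_score: float) -> bool:
    """True if both scores pass the cutoffs and the sequence contains RXLR."""
    confident_signal = hmm_score >= HMM_MIN and nn_score >= NN_MIN
    return confident_signal and RXLR.search(sequence) is not None
```

For instance, `is_putative_rxlr_effector("MKLSRFLRAA", 0.95, 0.70)` is true under these assumptions, whereas lowering the HMM score to 0.90 makes it false.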

The search for the RXLR pattern in oomycetes was motivated by the RXLX [EDQ] motif of the VTS in the malaria pathogen, Plasmodium falciparum. Once P. falciparum invades the human erythrocyte, it secretes proteins that carry the pentameric VTS of the RXLX [EDQ] motif from the parasitophorous vacuole to the host cytoplasm [12, 13]. To determine how many VTSs could be detected by our pipeline, we investigated 217 proteins of P. falciparum [13]. Of these, 115 proteins (53.00%) were classified as secretory proteins, defined in the FSD by the RXLX [EDQ] motif. Compared with SignalP 3.0 alone (41 of 217), our pipeline detected substantially more of the proteins containing VTSs.

In class SP, the proportions of proteins possessing the RXLX [EDQ] but not the RXLR motif were 96.75%, 56.18%, and 93.21% in fungi, oomycetes, and Plasmodium species, respectively (Figure 3A). The proportions of the RXLX [EDQ] motif in classes SP3 and SL were similar across the three groups (Figure 3B and 3C). Taken together, these data show that the RXLR motif, with signal peptides predicted by SignalP 3.0, is oomycete-specific [41]. Interestingly, fungal genomes have significantly higher numbers of the RXLX [EDQ] motif than Plasmodium species (t-test based on amino acid frequency in each genome; P = 2.2e-16), suggesting that the RXLX [EDQ] motif may be one of the fungal-specific signatures of effectors.
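The motif composition summarized in Figure 3 can be tallied per protein with two regular expressions. This is a minimal sketch with hypothetical function names and toy sequences; it scans the whole sequence, which is an assumption because the text does not state where in the protein the motifs must occur.

```python
import re
from collections import Counter

RXLR = re.compile(r"R.LR")            # oomycete host-targeting motif
RXLX_EDQ = re.compile(r"R.L.[EDQ]")   # Plasmodium VTS motif, RXLX[EDQ]


def motif_category(sequence: str) -> str:
    """Classify a protein by which host-targeting motifs it contains."""
    if RXLR.search(sequence):
        return "RXLR"
    if RXLX_EDQ.search(sequence):
        return "RXLX[EDQ] only"        # RXLX[EDQ] present, RXLR absent
    return "neither"


def motif_composition(sequences) -> Counter:
    """Count proteins per motif category, e.g. for one FSD class."""
    return Counter(motif_category(seq) for seq in sequences)


# Toy sequences, invented for illustration only.
toy_class_sp = ["AARALAEAA", "AARFLRAA", "AAAA", "MKRSLTQGG"]
print(motif_composition(toy_class_sp))
```

Dividing the "RXLX[EDQ] only" count by the number of motif-carrying proteins gives a proportion of the kind reported above.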

Figure 3

Composition of RXLR/RXLX [EDQ] pattern in fungi, oomycetes, and Plasmodium species. Composition of the RXLX [EDQ] (blue) and the RXLR (red) under class SP (A), class SP3 (B), and class SL (C) with the relative ratio in fungi, oomycetes, and Plasmodium species, respectively.

Utility

FSD web interfaces

To support browsing of the global patterns in the archived data, the FSD provides diverse charts and tables. For example, intersections of prediction results are summarized in a chart for each genome (Figure 4). Despite the many programs involved, all prediction results for each protein are displayed on one page, allowing users to browse them easily (Figure 5).

Figure 4

Screenshot of genome-level analysis functions for an example fungal genome. This screenshot shows the ORF numbers and ratios of each class in the pie chart on the left and the table on the right. The numbers in the table provide links to the list of putative secretory proteins belonging to each group. This figure shows the result from M. oryzae.

Figure 5

One-page summary for a protein. The web page shows a one-page summary of the amino acid sequence, exon structure, and genome context via the SNUGB [15], along with 12 predictions, including signal peptides and subcellular localization.

The SNUGB interface (http://genomebrowser.snu.ac.kr/) [15] provides several fields: i) signal peptides predicted by four different programs; ii) effector patterns, such as RXLR and RXLX [EDQ]; iii) nuclear localization signals predicted by predictNLS; iv) transmembrane helices predicted by TMHMM 2.0c; and v) hydropathy plots (Figure 6). Users can readily compare secretome-related information with diverse genomic contexts.

Figure 6

SNU Genome Browser implemented in the FSD. The SNUGB (http://genomebrowser.snu.ac.kr/) [15] displays i) four types of signal peptides predicted by SignalP 3.0, SigCleave, SigPred, and RPSP, ii) amino acid patterns, iii) nuclear localization signals predicted by predictNLS, iv) transmembrane helices predicted by TMHMM 2.0c, and v) hydropathy plots.

The personalized virtual space, Favorite, supports in-depth analyses in the FSD

The FSD allows users to collect proteins of interest and save them in Favorite, which provides thirteen functions: i) class distribution of proteins; ii) comparisons of predicted signal peptides generated by the four programs; iii) distributions and lists of proteins with predicted signal peptide cleavage sites; iv) compositions of amino acids near the cleavage sites; v) analyses of subcellular localization predictions; vi) lists and ratios of proteins that have chloroplast transit peptides, as determined by ChloroP 1.1; vii) analyses of proteins detected by SecretomeP 1.0f; viii) lists and distribution charts of proteins with transmembrane helices, as predicted by TMHMM 2.0c; ix) hydropathy plots for proteins; x) analyses of proteins believed to be targeted to the nucleus of a host cell, as predicted by predictNLS; xi) distributions and lists of proteins with specific amino acid patterns; xii) lists of functional domains predicted by InterPro Scan; and xiii) domain architecture from InterPro Scan (Figure 7). From these result pages, users can again collect and store proteins in Favorite for further analyses. Additionally, Favorites created in the FSD can be shared with the CFGP (http://cfgp.snu.ac.kr/) [16], permitting users to use the 22 bioinformatics tools provided on the CFGP web site.

Figure 7

Thirteen analysis functions in the Favorite browser. Six different pages of analyses, connected to the Favorite browser, are displayed. "Prediction distribution" provides a list of predicted secretory proteins with their proportion to all proteins. "Class distribution" shows the composition of the classes, with the protein numbers belonging to each class. "Frequency/Position distribution" gives a bar or pie graph and numerical values linking to proteins listed for each item. "Hydropathy plots" draws the two graphs with window sizes of 11 and 19. "Amino acid distribution" presents consensus amino acids around the cleavage sites. "Functional domain distribution" lists the domains and their architecture diagrams based on InterPro terms.

Conclusions

Given the availability of a large number of fungal genomes and diverse prediction programs for secretory proteins, a three-layer classification rule was established and implemented in a web-based database, the FSD. With the aid of an automated pipeline, the FSD classifies putative secretory proteins from 158 fungal/oomycete genomes into four different classes, three of which are defined as the putative secretome. The proportion of fungal secretory proteins and host targeting signals varies considerably by species. Interestingly, fungal genomes have high proportions of the RXLX [EDQ] motif, which is characterized as a host targeting signal in Plasmodium species. Summaries of the complex prediction results from twelve programs help users readily access the information provided by the FSD. Favorite, a personalized virtual space in the CFGP, provides thirteen different analysis tools for further in-depth analyses. Moreover, the 22 bioinformatics tools provided by the CFGP can be used via Favorite. Given these features, the FSD can serve as an integrated environment for studying secretory proteins in the fungal kingdom.

Availability and requirements

All data and functions described in this paper can be freely accessed through the FSD web site at http://fsd.snu.ac.kr/.