1 Introduction

Despite advances in technologies to combat infections, infectious diseases continue to challenge humans. This may be attributed to the rise in drug-resistant strains of pathogens such as Mycobacterium tuberculosis and to newly emerging infectious pathogens such as SARS coronavirus and influenza virus. A key step in the establishment of infectious disease is microbial virulence, which has been described as an emergent property of host–microbe interaction [1]. At the molecular level, entities like proteins, carbohydrates, or lipids enable the pathogens to establish themselves in a susceptible host. These molecules form an inherent part of the pathogen's cellular system and are collectively termed “virulence factors” [2, 3]. Virulence factors in various pathogens play diverse roles in the establishment of disease. These include colonization of the host, evasion of host defense mechanisms, immunosuppression, acquisition of nutrients from the host cell, mediation of entry into and exit from the host cell in intracellular pathogens, and sensing of changes in the environment [4, 5]. These factors enable colonization of the host niche and eventually cause damage to host tissues [2, 4, 6].

It was therefore realized that targeting these microbial molecules, by characterizing their immunogenicity and using them in vaccine formulations, could serve as an efficient anti-infective strategy. Vaccinologists are therefore preparing vaccine formulations with these molecules to prime the immune system so that their activity is neutralized in the event of host–pathogen contact [5, 7].

A diverse array of molecules is involved during host–pathogen interaction and the prominent players vary between the pathogens. These include adhesins, toxins, enzymes, and capsules (polysaccharides or polypeptides).

Adhesins have attracted interest from an immunological perspective because they are located on the cell surface and are likely to be accessible to the molecules of the immune system [8]. In the subsequent sections we provide an overview of these molecules and describe their prediction using bioinformatics.

1.1 Adhesins

Adhesins enable adherence of the pathogen to host cells, which constitutes the initial major step in the process of infection. This role qualifies adhesins as vaccine candidates, since targeting them could arrest infection at the initial stage [8]. Even though adhesins exhibit sequence polymorphisms, their conserved regions, especially those containing the receptor-binding domain, may serve as potential vaccine components [9]. Recently, a potent combination of adhesins of Plasmodium falciparum has been identified, which could transcend strain variations [10].

Examples include the FimH adhesin of uropathogenic Escherichia coli. Vaccination with this protein proved effective against urinary tract infection caused by E. coli in both mice and nonhuman primates [11]. The filamentous hemagglutinin (FHA) and pertactin adhesins of the gram-negative bacterium Bordetella pertussis elicit a long-lasting cell-mediated respiratory immune response [12]. These adhesins are components of the licensed acellular pertussis vaccine [13]. Another adhesin, Neisseria meningitidis adhesin A (NadA), is part of a multicomponent meningococcal serogroup B vaccine named Bexsero, which is capable of eliciting a robust immune response. This vaccine has cleared all clinical trials and is awaiting license for use [14].

2 Materials and Methods

2.1 Bioinformatics Approaches of Adhesin Characterization

The advent of genomics technologies has revolutionized biological research. The complete genome sequence of a pathogen provides an abundance of opportunities to identify putative virulence factors through sequence analysis. These investigations are being aided by the development of new computational algorithms in this area. In the sections below, we discuss and outline the methods used in several investigations:

  1.

    Sequence Similarity Search: Sequence similarity search is very popular and is among the first methods to be applied in sequence analysis. The goal here is to obtain orthologous sequences corresponding to a given query. This approach has been used to identify orthologues of known adhesins characterized in other pathogens (see Note 1). The best-known algorithm is the Basic Local Alignment Search Tool (BLAST) [15]. Examples include the application of BLAST in screening for potential adhesins in Mycoplasma agalactiae, Escherichia coli, Mycoplasma conjunctivae, Mycoplasma pneumoniae, and Rickettsial species [16–21]. In addition, BLAST can be used to identify orthologues of pathogen enzymes involved in virulence: hyaluronidase, neuraminidase, phospholipases, proteases, collagenase, kinase, coagulase, leukocidins, and hemolysins.

  2.

    Sequence Motif Search: A sequence motif is a particular arrangement or pattern of amino acids within a protein sequence, or of nucleotides within a DNA sequence, that is characteristic of a specific biochemical function [22]. In particular, the majority of protein sequence motifs provide unique, detectable sequence features for a set of protein sequences and thus act as signatures of protein families. Such motifs indicate similar functional roles.

    For example, in fungi, many glycosylphosphatidylinositol (GPI)-modified proteins linked to the plasma membrane via a preformed GPI anchor play a role in adhesion and virulence [23, 24]. These proteins have a C-terminal GPI-motif described as follows: “[GNSDAC]-[GASVIETKDLF]-[GASV]-X(4,19)-[FILMVAGPSTCYWN](10)>” in Prosite format, where “>” indicates the C-terminal end of the protein [26]. An algorithm that identifies sequences having this C-terminal, fungus-specific consensus sequence for GPI modification (GPI-motif) helps screen a set of potential fungal adhesins [25]. Table 1 lists the motifs identified in several adhesins.

    Table 1 Motifs in adhesins and other virulence factors

  3.

    Signal Peptide: A signal peptide (SP) is a short stretch of sequence at the N-terminus of a protein that directs it to the secretory pathway [31]. Adhesins, being membrane-attached proteins, usually possess an N-terminal signal peptide for translocation across the membrane of the endoplasmic reticulum [32, 33]. Therefore, algorithms that use this information to screen for proteins having an N-terminal signal peptide may help identify potential adhesins (see Note 2). However, there are adhesins called “anchorless adhesins,” which have neither a signal peptide nor a transmembrane domain. These “anchorless adhesins” cannot be identified through these approaches.

  4.

    Transmembrane Domain: Transmembrane domains are the regions of membrane proteins that traverse the membrane, looping in and out of it. They are characteristic of integral membrane proteins. Since adhesins are mostly membrane proteins, the prediction of proteins having a transmembrane domain would contribute to the set of putative adhesins (see Note 3). However, this approach would lead to a large number of false positives, as not all proteins possessing transmembrane domains are adhesins.

  5.

    Domain Search: Domains are conserved, autonomously folding functional units of a protein [34]. The domains of a protein together define its function. The domain information of an unannotated protein sequence can therefore be used to predict its function (see Note 4).

Some adhesin domains are known. Examples include the GLEYA adhesin domain, PA14 domain, and ALS_N domain in fungal species, and the YadA adhesin protein domain, fibrinogen-binding domain, and Gingipain adhesin domains (forming part of the cleaved adhesin domain) in bacterial species [35–40]. Sequence analysis for the presence of such adhesin-related domains in the query protein sequence may help predict potential adhesins.
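The motif search described in item 2 can be illustrated with a minimal Python sketch that translates the fungal GPI-motif from Prosite format into a regular expression and tests whether a sequence ends with it. The function names (prosite_to_regex, has_gpi_motif) are hypothetical, the converter handles only the subset of Prosite syntax that appears in this pattern, and the test sequences are made up; it is not the implementation used by any of the tools discussed here.

```python
import re

# GPI-motif as quoted in the text (Prosite format; ">" marks the C-terminus).
GPI_PROSITE = "[GNSDAC]-[GASVIETKDLF]-[GASV]-X(4,19)-[FILMVAGPSTCYWN](10)>"


def prosite_to_regex(pattern: str) -> str:
    """Translate a small subset of Prosite syntax into a Python regex.

    Supported: residue classes like [GAS], the wildcard X, repeat counts
    such as (10) or (4,19), and a trailing ">" C-terminal anchor.
    """
    anchored_end = pattern.endswith(">")
    body = pattern.rstrip(">")
    parts = []
    for element in body.split("-"):
        match = re.fullmatch(
            r"(\[[A-Z]+\]|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?", element
        )
        if match is None:
            raise ValueError(f"unsupported Prosite element: {element!r}")
        residues, low, high = match.groups()
        if residues in ("X", "x"):
            residues = "[A-Z]"  # wildcard: any residue letter
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(residues + repeat)
    return "".join(parts) + ("$" if anchored_end else "")


GPI_REGEX = re.compile(prosite_to_regex(GPI_PROSITE))


def has_gpi_motif(sequence: str) -> bool:
    """Return True if the protein sequence ends with the GPI-motif."""
    return GPI_REGEX.search(sequence.strip().upper()) is not None


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    print(has_gpi_motif("MKTGASKKKK" + "L" * 10))        # motif at the C-terminus
    print(has_gpi_motif("MKTGASKKKK" + "L" * 10 + "K"))  # motif not at the very end
```

As noted in the text, a positive match only flags a candidate: not all GPI-anchored proteins are adhesins, so hits from such a screen still need further evidence.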

2.2 Challenges in Bioinformatics Characterization of Adhesins

Although the computational methods described in the preceding section permit the identification of potential adhesins, they are limited in scope. Unlike many families of proteins, adhesins lack a well-defined common sequence pattern or signature, which makes their identification by general signature sequence search or unique motif search difficult. This is mainly because adhesins include diverse proteins. Even adhesins belonging to the same species include diverse molecular types and lack a common specific sequence pattern. For example, the following adhesins lack significant similarity to one another: M proteins in Streptococcus pyogenes, Gal/GalNAc lectin in Entamoeba histolytica, fimbrial adhesins in Escherichia coli, blood group antigen binding adhesin (BabA) in Helicobacter pylori, and YadA collagen-binding adhesin in Yersinia enterocolitica [41–45].

In certain cases, however, such as in fungal species where many adhesins possess the fungal-specific GPI-motif, a sequence motif search algorithm can be used to screen for potential fungal adhesins. Even so, identification methods based solely on motif searches, such as GPI-anchor searches, could return several false positives because not all GPI-anchored proteins are adhesins. Similar concerns apply to other identification methods such as signal peptide search. The basic principles and limitations of the various bioinformatics approaches used to characterize adhesins are summarized in Fig. 1.

Fig. 1 Advantages and limitations of different sequence- and motif-based approaches for prediction of potential virulence factors

These limitations formed the foundation for developing a non-homology group of algorithms, which use a large number of compositional properties.

2.3 Specialized Algorithms for Adhesin Prediction

2.3.1 SPAAN: A Software Program for Prediction of Adhesins and Adhesin-Like Proteins

SPAAN is an adhesin prediction tool developed using an artificial neural network trained on compositional properties of known adhesins and non-adhesins. The algorithm is trained to predict adhesins and adhesin-like proteins solely from sequence data. It is a non-homology method. SPAAN was trained using 105 compositional properties: 20 amino acid frequencies, 20 selected dipeptide frequencies, 20 multiplet frequencies, 20 charge compositions, and 25 hydrophobic compositions. It showed an optimal sensitivity of 89 % and specificity of 100 % on a defined test set and could identify 97.4 % of known adhesins at high Pad value from a wide range of bacteria. Though SPAAN was trained on datasets dominated by bacterial adhesins, it can be used as a general-purpose tool to identify adhesins from a wide spectrum of species belonging to diverse phyla. Many novel adhesins in diverse species have been characterized using SPAAN [46]. It is one of the most widely used adhesin prediction tools available. The standalone software package of SPAAN can be downloaded from http://sourceforge.net/projects/adhesin/files/.

System Requirement: Red Hat Linux version 7.3 or above.

Other requirements: C compiler

Instruction for usage

  1.

    SPAAN is provided as a tar-gzipped file. After download, it should be unzipped and untarred with the command “tar xvzf SPAAN.tar.gz”.

  2.

    The query sequences should be in FASTA format. Multiple sequences can be present in the input file.

  3.

    The input file should be named “query.dat”.

  4.

    The command to run SPAAN is “./askquery”.

  5.

    The output data are stored in “query.out”.

  6.

    If the existing binary files are not compatible with the system, the C source codes provided need to be recompiled using the following example command: “gcc -lm standard.c -o standard.o”.

List of C source codes to be compiled: standard.c, filter.c, annotate.c, and finalp1.c in the main SPAAN directory; recognize.c, AAcompo.c, hdr.c, multiplets.c, querydipep.c, and charge.c in their respective directories (AAcompo, hdr, multiplets, dipep, and charge). Note that recognize.c needs to be compiled individually in each of the five directories mentioned.

Figure 2 shows an example of the SPAAN output result file “query.out.”

Fig. 2 An example of the SPAAN output result file “query.out.” The results are reported under three column heads: Serial No. (SN), Probability of adhesin (Pad-value), and Protein name (Annotation)
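Downstream filtering of SPAAN results can be sketched in a few lines of Python. The sketch below assumes that each data line of “query.out” holds the three columns named in Fig. 2 (serial number, Pad-value, annotation) separated by whitespace; the exact file layout is an assumption and should be checked against a real output file. The names SpaanHit, parse_spaan_output, and above_threshold are hypothetical, and the Pad cutoff is left as a user-supplied parameter.

```python
from typing import List, NamedTuple


class SpaanHit(NamedTuple):
    serial: int       # Serial No. (SN)
    pad: float        # Probability of adhesin (Pad-value)
    annotation: str   # Protein name (Annotation)


def parse_spaan_output(text: str) -> List[SpaanHit]:
    """Parse SPAAN-style output, assuming 'SN  Pad-value  Annotation' lines."""
    hits = []
    for line in text.splitlines():
        fields = line.split(None, 2)  # split on whitespace, keep annotation whole
        if len(fields) < 2:
            continue
        try:
            serial, pad = int(fields[0]), float(fields[1])
        except ValueError:
            continue  # header or other non-data line
        annotation = fields[2].strip() if len(fields) > 2 else ""
        hits.append(SpaanHit(serial, pad, annotation))
    return hits


def above_threshold(hits: List[SpaanHit], min_pad: float) -> List[SpaanHit]:
    """Keep only the proteins whose Pad-value is at least min_pad."""
    return [hit for hit in hits if hit.pad >= min_pad]


if __name__ == "__main__":
    with open("query.out") as handle:
        all_hits = parse_spaan_output(handle.read())
    # 0.5 is an illustrative cutoff, not a recommended value.
    for hit in above_threshold(all_hits, min_pad=0.5):
        print(hit.serial, hit.pad, hit.annotation)
```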

2.3.2 MAAP: Malarial Adhesin and Adhesin-Like Proteins Predictor

MAAP was developed using a Support Vector Machine (SVM) trained on compositional properties to classify malarial adhesins and adhesin-like proteins [47]. The SVMlight package [48] was used for this purpose. A total of 420 compositional properties, comprising 20 amino acid frequencies and 400 dipeptide frequencies, were used to characterize the sequences of known adhesins and non-adhesins of Plasmodium species. MAAP runs on complete proteomes of Plasmodium species showed that in Plasmodium falciparum, at Pmaap scores above 0.0, a sensitivity of 100 % was observed with two false positives. In P. vivax and P. yoelii, a threshold Pmaap score of 0.7 was found optimal, with very few false positives (up to 5). The MAAP Web server provides users with an interface where they can paste or upload their query sequences and predict whether a protein sequence is an adhesin (see Note 5). Users can set their own threshold cutoff value. The result can be exported as a tab-delimited text file. The standalone version can be downloaded from the “Download” tab of the MAAP Web server or from http://sourceforge.net/projects/adhesin/files/.
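To make the 420 compositional properties concrete, the following Python sketch computes 20 amino acid frequencies followed by 400 dipeptide frequencies for a single sequence. It is only an illustration of this kind of feature vector: the ordering of features, the normalization (fractions of counted residues and of counted overlapping pairs), and the handling of non-standard residues are assumptions, and the function name composition_vector is hypothetical rather than part of MAAP.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs: AA, AC, AD, ..., YY.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def composition_vector(sequence: str) -> List[float]:
    """Return 20 amino acid frequencies followed by 400 dipeptide frequencies."""
    seq = sequence.strip().upper()
    residues = [c for c in seq if c in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    aa_freq = [residues.count(a) / len(residues) for a in AMINO_ACIDS]

    # Overlapping adjacent pairs; pairs touching a non-standard residue
    # (for example X) are skipped rather than guessed.
    pairs = [
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in AMINO_ACIDS and seq[i + 1] in AMINO_ACIDS
    ]
    pair_total = len(pairs) or 1  # avoid division by zero for 1-residue input
    dipeptide_freq = [pairs.count(d) / pair_total for d in DIPEPTIDES]
    return aa_freq + dipeptide_freq


if __name__ == "__main__":
    vector = composition_vector("MKTAYIAKQRQISFVKSHFSRQ")  # made-up sequence
    print(len(vector))  # 20 + 400 features
```

A vector of this kind, computed for every known adhesin and non-adhesin, is the sort of input an SVM package would be trained on; the trained model then assigns a score to each query sequence, which is compared with the chosen threshold.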

Figure 3 describes the output result obtained using MAAP Web server.

Fig. 3 Screenshot of output result obtained using the MAAP Web server. The protein sequences scoring above the threshold are highlighted in green, whereas those scoring below the threshold are highlighted in red. The result can be saved as a tab-delimited plain text file by clicking on the purple link (encircled)

2.3.3 FungalRV Adhesin Predictor

In pathogenic fungi, adhesins play major roles as virulence factors mediating the interaction of the pathogens with a variety of host cell types. In addition, adhesins in fungi aid in biofilm formation, contributing to increased drug resistance and persistence of infections [49]. It has been established that differences in adhesion are responsible for the greater virulence of one fungal strain compared to another [50]. The fungal pathogens represent a diverse group of species.

FungalRV adhesin predictor was developed using a Support Vector Machine (SVM) trained on compositional properties to classify adhesins and adhesin-like proteins of human pathogenic fungi [51]. The tool was built with the SVMlight package and trained on 3,945 compositional properties: 20 amino acid frequencies, 247 selected dipeptide frequencies, 3,653 selected tripeptide frequencies, 20 amino acid multiplet frequencies, the frequency of hydrophobic amino acids, and four moments of the hydrophobic amino acid distribution (orders 2–5). It is a non-homology-based prediction tool. We obtained an overall MCC value of 0.8702 across all eight pathogens, namely, Candida albicans, Candida glabrata, Aspergillus fumigatus, Coccidioides immitis, Coccidioides posadasii, Histoplasma capsulatum, Blastomyces dermatitidis, and Paracoccidioides brasiliensis, indicating high sensitivity and specificity at a threshold of 0.511. In the case of P. brasiliensis, the algorithm achieved a sensitivity of 66.67 %. The tool is available as the FungalRV Web server at http://fungalrv.igib.res.in. The “Adhesin Predictor” tab of the FungalRV Web server provides users with an interface where they can paste or upload their query sequences and predict whether a protein sequence is a fungal adhesin (see Note 6). Users can set their own threshold cutoff value, which allows them to optimize the threshold for other fungi on which “FungalRV adhesin predictor” was not trained. The result can be exported as a tab-delimited text file. The server also offers a search for the fungal-specific GPI pattern in the predicted adhesins and adhesin-like proteins using the fuzzpro program of EMBOSS, as well as a BLAST search against human reference proteins (see Note 7). The standalone version can be downloaded from the “Download” tab of the FungalRV Web server or from http://sourceforge.net/projects/adhesin/files/. Figure 4 shows the adhesin prediction results obtained using the FungalRV Web server.
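The performance measures quoted above (sensitivity, specificity, and the Matthews correlation coefficient, MCC) follow from the confusion counts obtained at a chosen threshold. The Python sketch below uses the standard definitions; the scores and labels in the example are made up for illustration, and treating a score equal to the threshold as a positive prediction is an assumption.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) when score >= threshold means 'adhesin'."""
    tp = fp = tn = fn = 0
    for score, is_adhesin in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_adhesin:
            tp += 1
        elif predicted and not is_adhesin:
            fp += 1
        elif not predicted and not is_adhesin:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true adhesins that are predicted as adhesins."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-adhesins that are predicted as non-adhesins."""
    return tn / (tn + fp)


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 if any marginal total is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


if __name__ == "__main__":
    # Made-up SVM scores and true labels, for illustration only.
    scores = [0.9, 0.6, 0.4, 0.2, 0.1, 0.7]
    labels = [True, True, True, False, False, False]
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold=0.511)
    print(sensitivity(tp, fn), specificity(tn, fp), mcc(tp, fp, tn, fn))
```

Repeating this calculation over a range of thresholds on sequences of known status is one way a user could tune the cutoff for a fungus that was not part of the training set.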

Fig. 4 Screenshot of output result obtained using the FungalRV Web server. The protein sequences scoring above the threshold are highlighted in green, whereas those scoring below the threshold are highlighted in red. The result can be saved as a tab-delimited plain text file by clicking on the purple link (encircled). Additional data on BLAST with Href proteins and GPI patterns are also displayed

2.3.4 Faapred

In addition to FungalRV, another Support Vector Machine (SVM)-based algorithm, named Faapred, is available for the prediction of fungal adhesins and adhesin-like proteins [52]. The SVM models for Faapred were trained with compositional features (amino acid, dipeptide, and multiplet fractions, and charge and hydrophobic compositions) as well as PSI-BLAST-derived PSSM matrices. The best classifiers were selected on the basis of high MCC and accuracy. The amino acid composition model (ACHM), PSSM-a, and PSSM-b emerged as the best classifiers, with ACHM providing the highest MCC value of 0.610. Thus, Faapred predictions use classifiers based on compositional properties as well as PSSM. Faapred provides an overall accuracy of 86 %. The prediction method is freely available as a Web server at http://bioinfo.icgeb.res.in/faap.

3 Notes

  1.

    The BLAST algorithm is widely used to fetch orthologues. The Reciprocal Best Hits (RBH) method has shown good efficiency in identifying orthologues. RBH is based on the principle that two genes from different genomes are orthologous if each finds the other as its best hit in a BLAST search against the other genome. Here BLASTP is usually carried out at a maximum E-value threshold of 1 × 10⁻⁶, with the Smith–Waterman algorithm and soft filtering enabled.

  2.

    Various bioinformatics algorithms are available to aid the identification of signal peptides. The SignalP algorithm, available at http://www.cbs.dtu.dk/services/SignalP/, is widely used. Query sequences in FASTA format can be submitted to predict the presence of signal peptides.

  3.

    Transmembrane prediction algorithms, for example TMHMM available at http://www.cbs.dtu.dk/services/TMHMM/, are generally used to predict the presence of transmembrane regions.

  4.

    Conserved domains can be predicted using domain prediction algorithms, for example CDD search available at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. The presence of known adhesin-related domains in the query sequences can thus be predicted.

  5.

    Query proteins in FASTA format can be uploaded to the MAAP Web server. The server can be used to analyze a whole genome in one run.

  6.

    Query protein sequences in FASTA format can be uploaded to the FungalRV Web server. This Web server can be used to analyze a whole genome.

  7.

    An adhesin vaccine should ideally not have similarity to human reference proteins, to avoid cross-reactivity. The facility to conduct a BLAST search against human reference proteins has therefore been provided in the FungalRV Web server. The cutoff E-value used here is 0.01, which borders on the limits of threshold similarity.
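The Reciprocal Best Hits principle in Note 1 can be sketched in Python as follows. The sketch assumes that the two BLASTP searches (genome A against genome B, and B against A) have been saved in the standard 12-column tabular format, in which columns 1, 2, 11, and 12 hold the query ID, subject ID, E-value, and bit score. Choosing the best hit by highest bit score is an assumption, and the function names best_hits and reciprocal_best_hits are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def best_hits(tabular_lines: Iterable[str], max_evalue: float = 1e-6) -> Dict[str, str]:
    """Map each query to its best-scoring subject among hits passing max_evalue."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        columns = line.rstrip("\n").split("\t")
        query, subject = columns[0], columns[1]
        evalue, bitscore = float(columns[10]), float(columns[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(
    a_vs_b: Dict[str, str], b_vs_a: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (a, b) pairs in which a's best hit is b and b's best hit is a."""
    return sorted((a, b) for a, b in a_vs_b.items() if b_vs_a.get(b) == a)


if __name__ == "__main__":
    # File names are placeholders for the two tabular BLASTP result files.
    with open("A_vs_B.tsv") as forward, open("B_vs_A.tsv") as reverse:
        pairs = reciprocal_best_hits(best_hits(forward), best_hits(reverse))
    for a_id, b_id in pairs:
        print(a_id, b_id, sep="\t")
```

Pairs returned by such a procedure are candidate orthologues; when one member is a characterized adhesin, the other becomes a putative adhesin to be examined further.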