Prediction of Virulence Factors Using Bioinformatics Approaches
Virulence factors produced by a pathogen are essential for causing disease in the host. They enable the pathogen to establish itself within the host, thus enhancing its potential to cause disease, and in some instances underlie evasion of host defense mechanisms. Identifying these molecules, especially those of immunological interest, and using them in vaccine development are attractive goals and are among the initial steps of reverse vaccinology. Surface-localized virulence factors such as adhesins serve as excellent immunogenic candidates in this regard. In this chapter we describe bioinformatics approaches for adhesin prediction, including specific adhesin prediction algorithms.
Key words: Virulence factors, Host, Pathogen, Immunogenic, Adhesins, Vaccine
Despite advances in technologies to combat infections, infectious diseases continue to challenge humans. This may be attributed to the rise of drug-resistant strains of pathogens such as Mycobacterium tuberculosis and to newly emerging infectious pathogens such as SARS coronavirus and influenza virus. A key step in the establishment of infectious disease is microbial virulence, which has been described as an emergent property of host–microbe interaction. At the molecular level, entities such as proteins, carbohydrates, or lipids enable pathogens to establish themselves in a susceptible host. These molecules form an inherent part of the pathogen's cellular system and are collectively termed “virulence factors” [2, 3]. Virulence factors in various pathogens play diverse roles in the establishment of disease. These include colonization of the host, evasion of host defense mechanisms, immunosuppression, acquisition of nutrients from the host cell, mediation of entry into and exit from the host cell in intracellular pathogens, and sensing of changes in the environment [4, 5]. These factors enable colonization of the host niche and eventually cause damage to host tissues [2, 4, 6].
It was therefore realized that targeting these microbial molecules, by characterizing their immunogenicity and using them in vaccine formulations, could serve as an efficient anti-infective strategy. Vaccinologists are therefore preparing vaccine formulations with these molecules to prime the immune system so that their activity is neutralized in the event of host–pathogen contact [5, 7].
A diverse array of molecules is involved during host–pathogen interaction and the prominent players vary between the pathogens. These include adhesins, toxins, enzymes, and capsules (polysaccharides or polypeptides).
Adhesins have attracted interest from an immunological perspective because they are located on the cell surface and are likely to be accessible to the molecules of the immune system. In the subsequent sections we provide an overview of these molecules and describe their prediction using bioinformatics.
Adhesins enable adherence of the pathogen to host cells, which constitutes the initial major step in the process of infection. This role qualifies adhesins as vaccine candidates, since targeting them could arrest infection at the initial stage. Even though adhesins exhibit sequence polymorphisms, their conserved regions, especially those containing the receptor-binding domain, may serve as potential vaccine components. Recently, a potent combination of Plasmodium falciparum adhesins that could transcend strain variations has been identified.
Examples include the FimH adhesin of uropathogenic Escherichia coli. Vaccination with this protein proved effective against urinary tract infection caused by E. coli in both mice and nonhuman primates. The filamentous hemagglutinin (FHA) and pertactin adhesins of the gram-negative bacterium Bordetella pertussis elicit a long-lasting cell-mediated respiratory immune response. These adhesins are components of the licensed acellular pertussis vaccine. Another adhesin, Neisseria meningitidis adhesin A (NadA), is part of a multicomponent meningococcal serogroup B vaccine named Bexsero, which is capable of eliciting a robust immune response. This vaccine has cleared all clinical trials and is awaiting license for use.
2 Materials and Methods
2.1 Bioinformatics Approaches of Adhesin Characterization
Sequence Similarity Search: Sequence similarity search is very popular and is among the first methods applied in sequence analysis. The goal here is to obtain orthologous sequences corresponding to a given query. This approach has been used to identify orthologues of known adhesins characterized in other pathogens (see Note 1). The best-known algorithm is the Basic Local Alignment Search Tool (BLAST). Examples include the application of BLAST in screening for potential adhesins in Mycoplasma agalactiae, Escherichia coli, Mycoplasma conjunctivae, Mycoplasma pneumoniae, and rickettsial species [16, 17, 18, 19, 20, 21]. In addition, BLAST can be used to identify orthologues of pathogen enzymes involved in virulence: hyaluronidase, neuraminidase, phospholipases, proteases, collagenase, kinase, coagulase, leukocidins, and hemolysins.
Sequence Motif Search: A sequence motif is a particular arrangement or pattern of amino acids within a protein sequence, or of nucleotides within a DNA sequence, that is characteristic of a specific biochemical function. In particular, the majority of protein sequence motifs provide unique detectable sequence features for a set of protein sequences and thus act as signatures of protein families. Such motifs indicate similar functional roles. For example, in fungi, many glycosylphosphatidylinositol (GPI)-modified proteins linked to the plasma membrane via a preformed GPI anchor play a role in adhesion and virulence [23, 24]. These proteins have a C-terminal GPI-motif described as follows: “[GNSDAC]-[GASVIETKDLF]-[GASV]-X(4,19)-[FILMVAGPSTCYWN](10)>” in Prosite format, where “>” indicates the C-terminal end of the protein. An algorithm that identifies sequences having this C-terminal, fungus-specific consensus sequence for GPI modification (GPI-motif) helps screen for a set of potential fungal adhesins. Table 1 lists the motifs identified in several adhesins.
Table 1
Motifs in adhesins and other virulence factors
Parallel beta-helix motifs: Right-handed parallel beta-helix supersecondary structural motifs detected in primary amino acid sequences. They are present in toxins, virulence factors, adhesins, and surface proteins of Chlamydia, Helicobacter, Bordetella, Leishmania, Borrelia, Rickettsia, Neisseria, and Bacillus anthracis.
FxxN and GGA(I, L, V) motifs: Tetrapeptide motifs present in the polymorphic membrane protein (Pmp) family of Chlamydia pneumoniae. They are required in duplicate copies for adhesion to host cells.
RGD and SGXG motifs: Arginine-glycine-aspartic acid (RGD) and glycosaminoglycan binding site (SGXG) motifs present in autotransporter family proteins of Bordetella pertussis: pertactin (Prn), Bordetella resistance to killing (BrkA), and Bordetella autotransporter protein-C (BapC). The arrangement of the motifs confers on BapC the ability to adhere to binding sites on macrophages and epithelial cells.
PARF motif (A/T/E)xYLxx(LYF)N: An (A/T/E)XYLXXLN amino acid sequence motif referred to as PARF (peptide associated with rheumatic fever). It is located in the N-terminal hypervariable region of the collagen-binding M protein type 3 of Streptococcus pyogenes and Streptococcus dysgalactiae ssp. equisimilis (SDSE).
HExxH-containing metalloprotease adhesins: HExxH is the zinc-binding sequence motif His-Glu-Xaa-Xaa-His. It is present in certain adhesins such as the Treponema pallidum extracellular matrix binding adhesin Tp0751.
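As a minimal sketch of the motif search described above, the C-terminal GPI-motif quoted in Prosite format can be translated into a regular expression and applied to protein sequences. The function names `prosite_to_regex` and `has_gpi_motif` are hypothetical, and the converter handles only the simple Prosite elements that occur in patterns of this kind (residue classes, `x`, repeat counts, and the `<` and `>` anchors); it is an illustration, not the algorithm used in the cited work.

```python
import re

# GPI-motif as quoted in the text (Prosite format, ">" marks the C-terminus).
GPI_MOTIF = "[GNSDAC]-[GASVIETKDLF]-[GASV]-X(4,19)-[FILMVAGPSTCYWN](10)>"


def prosite_to_regex(pattern):
    """Translate a simple Prosite-style pattern into a Python regular expression."""
    anchored_start = pattern.startswith("<")
    anchored_end = pattern.endswith(">")
    core = pattern.strip("<>")
    parts = []
    for element in core.split("-"):
        # Separate an optional repeat suffix such as (10) or (4,19) from the residues.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        residues, low, high = match.groups()
        if residues in ("x", "X"):
            residues = "."  # any amino acid
        elif residues.startswith("{"):
            residues = "[^" + residues[1:-1] + "]"  # excluded residues
        if low and high:
            residues += "{%s,%s}" % (low, high)
        elif low:
            residues += "{%s}" % low
        parts.append(residues)
    return ("^" if anchored_start else "") + "".join(parts) + ("$" if anchored_end else "")


GPI_REGEX = re.compile(prosite_to_regex(GPI_MOTIF))


def has_gpi_motif(sequence):
    """Return True if the sequence ends with a match to the GPI-motif."""
    return GPI_REGEX.search(sequence.strip().upper()) is not None
```

Sequences that pass such a screen are only candidates; a motif match alone does not establish that a protein is an adhesin.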
Signal Peptide: A signal peptide (SP) is a short stretch of sequence at the N-terminus of a protein that directs it to the secretory pathway. Adhesins, being membrane-attached proteins, usually possess an N-terminal signal peptide for translocation across the membrane of the endoplasmic reticulum [32, 33]. Therefore, algorithms that screen for proteins having an N-terminal signal peptide may help identify potential adhesins (see Note 2). However, there are adhesins called “anchorless adhesins,” which have neither a signal peptide nor a transmembrane domain. These “anchorless adhesins” cannot be identified through these approaches.
Transmembrane Domain: Transmembrane domains are the regions of membrane proteins that traverse the membrane, looping in and out of it. They are characteristic of integral membrane proteins. Since adhesins are mostly membrane proteins, predicting proteins that have a transmembrane domain would contribute to the set of putative adhesins (see Note 3). However, this approach would lead to a large number of false positives, as not all proteins possessing transmembrane domains are adhesins.
Domain Search: Domains are conserved, autonomously folding functional units of a protein. The domains of a protein together define its function. The domain information of an unannotated protein sequence can be used to predict its function (see Note 4).
Some adhesin domains are known. Examples include the GLEYA adhesin domain, PA14 domain, and ALS_N domain in fungal species, and the YadA adhesin protein domain, fibrinogen-binding domain, and Gingipain adhesin domains forming part of the cleaved adhesin domain in bacterial species [35, 36, 37, 38, 39, 40]. Sequence analysis for the presence of such adhesin-related domains in the query protein sequence may help predict potential adhesins.
2.2 Challenges in Bioinformatics Characterization of Adhesins
Although the computational methods described in the preceding section permit identification of potential adhesins, they are limited in scope. Unlike many protein families, adhesins lack a well-defined common sequence pattern or signature, which makes them difficult to identify using general signature sequence searches or unique motif searches. This is mainly because adhesins include diverse proteins. Even adhesins belonging to the same species include diverse molecular types and lack a common specific sequence pattern. For example, the M proteins in Streptococcus pyogenes, Gal/GalNAc lectin in Entamoeba histolytica, fimbrial adhesins in Escherichia coli, blood group antigen binding adhesin (BabA) in Helicobacter pylori, and YadA collagen-binding adhesin in Yersinia enterocolitica [41, 42, 43, 44, 45] lack significant similarity to one another.
These limitations motivated the development of non-homology-based algorithms, which use a large number of compositional properties.
2.3 Specialized Algorithms for Adhesin Prediction
2.3.1 SPAAN: A Software Program for Prediction of Adhesins and Adhesin-Like Proteins
SPAAN is an adhesin prediction tool developed using an artificial neural network trained on compositional properties of known adhesins and non-adhesins. The algorithm is trained to predict adhesins and adhesin-like proteins solely from sequence data. It is a non-homology method. SPAAN was trained using 105 compositional properties: 20 amino acid frequencies, 20 selected dipeptide frequencies, 20 multiplet frequencies, 20 charge compositions, and 25 hydrophobic compositions. It showed an optimal sensitivity of 89 % and specificity of 100 % on a defined test set and could identify 97.4 % of known adhesins at high Pad values from a wide range of bacteria. Though SPAAN was trained on datasets dominated by bacterial adhesins, it can be used as a general-purpose tool to identify adhesins from a wide spectrum of species belonging to diverse phyla. Many novel adhesins in diverse species have been characterized using SPAAN. It is one of the most widely used adhesin prediction tools available. The standalone software package of SPAAN can be downloaded from http://sourceforge.net/projects/adhesin/files/.
System requirement: Red Hat Linux version 7.3 or above.
Other requirements: C compiler.
SPAAN is provided as a tar-gzipped file. After download, it should be unzipped and untarred with the command “tar xvzf SPAAN.tar.gz”.
The query sequences should be in FASTA format. Multiple sequences can be present in the input file.
The input file should be named “query.dat”.
The command to run SPAAN is “./askquery”.
The output data is stored in “query.out”.
If the existing binary files are not compatible with the system, the C source files provided need to be recompiled, for example with the command “gcc -lm standard.c -o standard.o”.
List of C source files to be compiled: standard.c, filter.c, annotate.c, and finalp1.c in the main SPAAN directory; recognize.c together with AAcompo.c, hdr.c, multiplets.c, querydipep.c, and charge.c in their respective directories (AAcompo, hdr, multiplets, dipep, and charge). recognize.c needs to be compiled individually in each of the five directories.
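The run steps above can be wrapped in a short script. The following is a sketch only: the helper names `count_fasta_records` and `run_spaan` are hypothetical, and it assumes that “query.dat” must sit in the main SPAAN directory next to the “askquery” executable, which the text implies but does not state. The format of “query.out” is not described here, so the sketch returns the path without parsing the file.

```python
import shutil
import subprocess
from pathlib import Path


def count_fasta_records(lines):
    """Count FASTA records; reject input with sequence data before the first header."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            count += 1
        elif count == 0:
            raise ValueError("sequence data found before the first FASTA header")
    return count


def run_spaan(fasta_path, spaan_dir):
    """Copy a FASTA file to query.dat, run ./askquery, and return the path of query.out.

    Assumption: spaan_dir is the unpacked main SPAAN directory containing askquery.
    """
    spaan_dir = Path(spaan_dir)
    with open(fasta_path) as handle:
        if count_fasta_records(handle) == 0:
            raise ValueError("no FASTA records found in " + str(fasta_path))
    shutil.copyfile(fasta_path, spaan_dir / "query.dat")
    subprocess.run(["./askquery"], cwd=spaan_dir, check=True)
    return spaan_dir / "query.out"
```

A call such as `run_spaan("proteome.fasta", "SPAAN")` would then leave the predictions in “SPAAN/query.out”; both arguments are placeholder paths.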
2.3.2 MAAP: Malarial Adhesin and Adhesin-Like Proteins Predictor
MAAP was developed using a Support Vector Machine (SVM) trained on compositional properties to classify malarial adhesins and adhesin-like proteins. The SVMlight package was used for this purpose. A total of 420 compositional properties, comprising 20 amino acid frequencies and 400 dipeptide frequencies, were used to characterize the sequences of known adhesins and non-adhesins of Plasmodium species. MAAP runs on complete proteomes of Plasmodium species revealed that in Plasmodium falciparum, at Pmaap scores above 0.0, a sensitivity of 100 % was observed with two false positives. In P. vivax and P. yoelii a threshold Pmaap score of 0.7 was found optimal, with very few false positives (up to 5). The MAAP Web server provides users with an interface where they can paste or upload their query sequences and predict whether a protein sequence is an adhesin (see Note 5). Users can set their own threshold cutoff value. The result can be exported as a tab-delimited text file. The standalone version can be downloaded from the “Download” tab of the MAAP Web server or from http://sourceforge.net/projects/adhesin/files/.
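To make the 420 compositional properties concrete, the sketch below computes 20 amino acid frequencies followed by 400 dipeptide frequencies for a single sequence. The function name `composition_features`, the alphabetical residue ordering, and the normalization (counts divided by the number of standard residues or standard residue pairs) are assumptions made for illustration; the text does not specify how MAAP scales or orders its features.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in the same alphabetical order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def composition_features(sequence):
    """Return a 420-element list: 20 amino acid plus 400 dipeptide frequencies."""
    seq = sequence.strip().upper()
    aa_counts = dict.fromkeys(AMINO_ACIDS, 0)
    dp_counts = dict.fromkeys(DIPEPTIDES, 0)

    for residue in seq:
        if residue in aa_counts:  # skip non-standard symbols such as X
            aa_counts[residue] += 1
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in dp_counts:  # count only pairs of two standard residues
            dp_counts[pair] += 1

    n_residues = sum(aa_counts.values())
    n_pairs = sum(dp_counts.values())
    if n_residues == 0 or n_pairs == 0:
        raise ValueError("sequence too short or without standard residues")

    aa_freq = [aa_counts[aa] / n_residues for aa in AMINO_ACIDS]
    dp_freq = [dp_counts[dp] / n_pairs for dp in DIPEPTIDES]
    return aa_freq + dp_freq
```

A feature vector of this kind, computed for known adhesins and non-adhesins, is the sort of input an SVM classifier is trained on.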
2.3.3 FungalRV Adhesin Predictor
In pathogenic fungi, adhesins play major roles as virulence factors, mediating the interaction of the pathogens with a variety of host cell types. In addition, adhesins in fungi aid in biofilm formation, contributing to increased drug resistance and persistence of infections. It has been established that differences in adhesion are responsible for the greater virulence of one fungal strain compared with another. The fungal pathogens represent a diverse group of species.
In addition to FungalRV, another Support Vector Machine (SVM) based algorithm, named Faapred, is available for prediction of fungal adhesins and adhesin-like proteins. The SVM models for Faapred were trained with compositional features (amino acid, dipeptide, and multiplet fractions, and charge and hydrophobic compositions) as well as PSI-BLAST-derived PSSM matrices. The best classifiers were selected on the basis of high MCC and accuracy. The amino acid composition model (ACHM), PSSM-a, and PSSM-b emerged as the best classifiers, with ACHM providing the highest MCC value of 0.610. Thus Faapred predictions use classifiers based on compositional properties as well as PSSM. Faapred provides an overall accuracy of 86 %. The prediction method is freely available as a web server at http://bioinfo.icgeb.res.in/faap.
Note 1. The BLAST algorithm is widely used to fetch orthologues. The Reciprocal Best Hits (RBH) method has shown good efficiency in identifying orthologues. RBH is based on the principle that two genes from different genomes are orthologous if each finds the other as its best hit in a BLAST search against the other genome. Here BLASTP is usually carried out at a maximum E-value threshold of 1 × 10⁻⁶, with the Smith–Waterman algorithm and soft filtering enabled.
Note 2. Various bioinformatics algorithms are available that aid in identifying signal peptides. The SignalP algorithm, available at http://www.cbs.dtu.dk/services/SignalP/, is widely used. Query sequences in FASTA format can be submitted to predict the presence of signal peptides.
Note 3. Transmembrane prediction algorithms, for example TMHMM available at http://www.cbs.dtu.dk/services/TMHMM/, are generally used to predict the presence of transmembrane regions.
Note 4. Conserved domains can be predicted using domain prediction algorithms, for example CDD search available at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. The presence of known adhesin-related domains in the query sequences can thus be predicted.
Note 5. Query proteins in FASTA format can be uploaded to the MAAP Web server. The server can be used to analyze a whole genome in one run.
Note 6. Query protein sequences in FASTA format can be uploaded to the FungalRV Web server. This Web server can be used to analyze a whole genome.
Note 7. An adhesin vaccine should ideally have no similarity to human reference proteins, to avoid cross-reactivity. A facility to conduct a BLAST search against human reference proteins has therefore been provided in the FungalRV Web server. The cutoff E-value used here is 0.01, which borders on the limits of threshold similarity.
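The reciprocal best hits procedure and the human-similarity check described in these notes can both be expressed over BLAST tabular output. The sketch below assumes the standard 12-column tabular format (as produced, for example, by `blastp -outfmt 6`), in which the E-value is column 11 and the bit score is column 12. The function names are hypothetical, and taking the highest bit score as the “best hit” is an assumption of this sketch; the default cutoffs of 1e-6 and 0.01 are the values quoted in the notes.

```python
def best_hits(tabular_lines, max_evalue=1e-6):
    """Map each query to its highest-scoring subject among hits passing the E-value cutoff."""
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-6):
    """Return sorted (a, b) pairs in which a and b are each other's best hit."""
    forward = best_hits(a_vs_b, max_evalue)
    backward = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


def similar_to_human(candidates_vs_human, max_evalue=0.01):
    """Return the set of candidates with at least one human hit at or below the cutoff."""
    return set(best_hits(candidates_vs_human, max_evalue))
```

Each function accepts any iterable of lines, so an open file handle on a BLAST output file can be passed directly. Candidates returned by `similar_to_human` would be the ones to treat with caution as vaccine components.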
RC thanks The Indian Council of Medical Research for fellowship. This work was funded through grants “GENESIS” BSC0121 to SR from CSIR.
- 6. Rothy A, James L, Ed. (1988) Virulence mechanisms of bacterial pathogens. American Society for Microbiology, ISBN 0-914826-99-9
- 10. Pandey AK, Reddy KS, Sahar T, Gupta S, Singh H, Reddy EJ, Asad M, Siddiqui FA, Gupta P, Singh B, More KR, Mohmmed A, Chitnis CE, Chauhan VS, Gaur D (2013) Identification of a potent combination of key Plasmodium falciparum merozoite antigens that elicit strain-transcending parasite-neutralizing antibodies. Infect Immun 81:441–451
- 11. Langermann S, Mollby R, Burlein JE, Palaszynski SR, Auguste CG, DeFusco A, Strouse R, Schenerman MA, Hultgren SJ, Pinkner JS et al (2000) Vaccination with FimH adhesin protects cynomolgus monkeys from colonization and infection by uropathogenic Escherichia coli. J Infect Dis 181:774–778
- 13. Halperin SA, Scheifele D, Mills E, Guasparini R, Humphreys G, Barreto L, Smith B (2003) Nature, evolution, and appraisal of adverse events and antibody response associated with the fifth consecutive dose of a five-component acellular pertussis-based combination vaccine. Vaccine 21:2298–2306
- 21. Palaniappan RU, Chang YF, Jusuf SS, Artiushin S, Timoney JF, McDonough SP, Barr SC, Divers TJ, Simpson KW, McDonough PL, Mohammed HO (2002) Cloning and molecular characterization of an immunogenic LigA protein of Leptospira interrogans. Infect Immun 70:5924–5930
- 29. Reissmann S, Gillen CM, Fulde M, Bergmann R, Nerlich A, Rajkumari R, Brahmadathan KN, Chhatwal GS, Nitsche-Schmitz DP (2012) Region specific and worldwide distribution of collagen-binding M proteins with PARF motifs among human pathogenic streptococcal isolates. PLoS One 7:e30122
- 32. Lodish H, Berk A, Zipursky SL et al (2000) Molecular cell biology, 4th edn. W. H. Freeman, New York, NY. Section 17.4, translocation of secretory proteins across the ER membrane. http://www.ncbi.nlm.nih.gov/books/NBK21532/
- 48. Joachims T (1999) Making large-scale SVM learning practical. In: Schölkopf B, Burges C, Smola A (eds) Advances in kernel methods: support vector learning. MIT, Cambridge, MA, pp 169–185