Integrative analysis workflow for the structural and functional classification of C-type lectins
It is important to understand the roles of C-type lectins in the immune system because of their ubiquity and diverse range of functions in animal cells. Currently confirmed C-type lectins share a highly conserved domain known as the C-type carbohydrate recognition domain (CRD). Using the sequence profile of the CRD, an increasing number of putative C-type lectins have been identified. Hence, a systematic framework is needed to elucidate their carbohydrate (glycan) recognition function and to discover their physiological and pathological roles.
Presented herein is an integrated workflow for characterizing the sequence and structural features of novel C-type lectins. Given the amino acid sequence of a C-type lectin, our workflow uses web-based queries and available software suites to annotate the features found on it. At the same time, it incorporates modeling and analysis of glycans, a major class of ligands that interact with C-type lectins. The results are then analyzed together with context-specific knowledge to filter out unlikely predictions. This allows researchers to design subsequent experiments that confirm the functions of the C-type lectins in a systematic manner.
The efficacy and usefulness of the proposed immunoinformatics workflow were demonstrated by applying it to a novel C-type lectin, CLEC17A, and we report some of its possible functions that warrant further validation through wet-lab experiments.
Keywords: Homology Modeling, Virtual Screening, Docking Study, Carbohydrate Recognition Domain, Surface Binding Site
C-type lectins are Ca2+-dependent sugar-binding proteins that are involved in several immune-related and other physiological functions. They are ubiquitous in the animal kingdom, and exist mostly as membrane receptors. Indeed, C-type lectins play an important role in pathogen recognition and cell-cell interaction through specific binding with glycans (sugars) found on the surfaces of target cells and glycosylated molecules. The importance of understanding C-type lectins and finding their interacting partners (both glycans and other molecules) is exemplified by applications in immunotherapy and vaccination therapy, where lectins expressed on cells such as dendritic cells (DCs) can be targeted by their natural ligands or by antibodies directed against them. Such ligands are usually conjugated with antigens, which can be presented to T-cells upon ligand binding, leading to subsequent T-cell maturation and development of immunity towards the antigen. C-type lectins also have extensive applications in protein engineering, where mutations can be made at specific sites to modify their specificity towards certain ligands. Such modifications can be made only when we have a better understanding of their structural and functional characteristics.
Presently, 17 groups within the C-type lectin superfamily have been recognized, with more C-type lectins constantly being discovered based on the presence of a conserved 115-130 amino acid domain along their sequences: the C-type carbohydrate recognition domain (CRD). However, for most of the recently identified C-type lectins, their interactions with carbohydrates, intracellular functions and molecular mechanisms remain unclear. These proteins therefore need to be characterized in order to uncover their possible physiological and pathological roles in the immune system. On a similar note, it is also imperative to develop techniques in glycoinformatics to aid the elucidation and analysis of protein-glycan interactions, one of the key processes in the mammalian immune system.
To this end, we propose an integrative analysis workflow that utilizes various techniques and algorithms to systematically discover and annotate the putative functions of novel C-type lectins. Our workflow starts with the amino acid sequences to predict the primary functional units, i.e. domains and motifs. It is followed by homology modeling to determine the molecular structures of the C-type lectins. In tandem with this step is the generation of glycan conformer libraries, with the glycan composition being obtained from various sources and possibly specified in different formats. Finally, computational virtual screening is performed to identify potential protein-glycan interactions.
Integrative workflow for sequence and functional analysis of C-type lectins
It is possible to predict the putative functions of novel C-type lectins by analyzing their amino acid sequences and structures. This is due to the accepted view that protein functions can be ‘inherited through homology’. In general, a protein is composed of independently functioning smaller units, i.e. domains. With the advent of computational methods to identify these domains along a protein sequence, and the growing collection of known domains and their associated functions, e.g. Pfam, PROSITE, SMART, and InterProScan, it becomes evident that the first step in analyzing an unknown C-type lectin is to search its sequence for conserved domains. These domains indicate the possible functions, interactions and cellular locations of the C-type lectin, and also the secondary and tertiary structures it may assume.
Aside from sequence-based analysis, one can also study C-type lectins through their molecular structures, which can be obtained either through computational prediction or by X-ray crystallography. Such physicochemical approaches can aid in understanding the molecular mechanisms of their functions at the atomic level. For instance, van Liempt et al. analyzed the molecular structures of the C-type lectins DC-SIGN and L-SIGN, and identified the residues that were responsible for the differences in their carbohydrate binding profiles. Glazer et al. further improved the prediction of potential Ca2+ binding sites by incorporating molecular dynamics into the analysis of the protein structures. Going forward, docking studies and in silico screening can be performed against virtual libraries of glycans. This is already an integral part of the industrial drug discovery process for other proteins.
There is a plethora of sequence analysis algorithms that can identify domains and motifs within a protein sequence. For instance, PROSITE scans a query protein sequence against an internal database of sequence signature patterns that were curated from the literature. In addition, for each pattern, there is a miniprofile to refine the hits, and the matches are post-processed with contextual information to improve accuracy. Pfam, on the other hand, stores its database of protein domains as hidden Markov models (HMMs) and uses the HMMER3 algorithm to determine the presence of the domains within a query protein sequence. As such, the first step of the analysis is to leverage these existing platforms to gather as much information as possible from a given C-type lectin amino acid sequence.
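The pattern-scanning idea behind PROSITE can be illustrated with a short sketch. It assumes the standard PROSITE pattern notation (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats) and translates it into a Python regular expression. The function names and the toy sequence are hypothetical, and the sketch does not reproduce the miniprofile refinement or the contextual post-processing that PROSITE itself applies.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        repeat = ""
        if lo is not None:
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def scan(sequence: str, pattern: str):
    """Return (1-based start, matched segment) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

# Toy sequence, made up for illustration only (not a real lectin).
toy = "MANGTLSPEEKNPSA"
n_glyc_hits = scan(toy, "N-{P}-[ST]-{P}")   # N-glycosylation sequon pattern
ck2_hits = scan(toy, "[ST]-x(2)-[DE]")      # casein kinase II site pattern
```

The lookahead wrapper lets overlapping hits be reported, which matters for short motifs that can occur in adjacent positions.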
List of servers and algorithms
Type of features
TMHMM 2.0 (http://www.cbs.dtu.dk/services/TMHMM)
Eukaryotic linear motifs
The next step in our workflow is to construct the molecular structure of the C-type lectin. Here, homology modeling can be employed to predict its structure. Generally, homology modeling of C-type lectins follows a series of steps: (i) template selection, (ii) structural alignment, (iii) model construction and constraint satisfaction, and (iv) refinement. For template selection, the sequence of the C-type lectin is first queried against the set of non-redundant proteins in the PDB database using the BLASTp algorithm. Proteins with moderate levels of sequence identity, typically more than 30% over the aligned regions, are then chosen as templates for modeling.
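The identity-based template filter can be sketched as follows. This is a minimal illustration, assuming pairwise alignments are already available as gapped strings (for example, parsed from BLASTp output); the function names, the template identifiers and the choice to compute identity over non-gap columns only are assumptions, not the procedure used by any particular tool.

```python
def percent_identity(aligned_query: str, aligned_template: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    columns = [(q, t) for q, t in zip(aligned_query, aligned_template)
               if q != "-" and t != "-"]
    if not columns:
        return 0.0
    matches = sum(q == t for q, t in columns)
    return 100.0 * matches / len(columns)

def select_templates(alignments: dict, threshold: float = 30.0):
    """Keep templates whose identity exceeds the threshold, best first.

    alignments maps a template id to (aligned_query, aligned_template).
    """
    scored = [(tid, percent_identity(q, t)) for tid, (q, t) in alignments.items()]
    kept = [(tid, pid) for tid, pid in scored if pid > threshold]
    return sorted(kept, key=lambda item: -item[1])

# Hypothetical alignments, for illustration only.
alignments = {
    "tmplA": ("ACDEFGHIKL", "ACDEFGHIKL"),
    "tmplB": ("ACDEFGHIKL", "ACDQQQQQQQ"),
    "tmplC": ("AC-DEFGHIK", "ACWDEYYYYY"),
}
templates = select_templates(alignments)
```

Because several templates may cover different regions of the query, a real selection step would also record which query positions each template aligns to; that bookkeeping is omitted here.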
Note that there can be multiple templates, especially when they are aligned to different regions of the query protein. In addition, it is not always the case that the entire C-type lectin can be modeled. As the CRD is the most highly conserved region of C-type lectins, its homologs can usually be found in the PDB database. Upon selection of the templates, the query sequence and the templates are re-aligned based on a more stringent set of criteria, which include fractional side chain accessibility and secondary structure type. Finally, using the template structures, the model is constructed by initially copying the coordinates of the backbone atoms (C, Cα, N and O) of aligned residues. This is followed by filling the gaps (i.e. loop and gap modeling), adding side chains to the backbone, and adjusting the model to make sure that spatial constraints are not violated. Depending on the level of alignment between the query C-type lectin and the template sequences, an additional refinement step via molecular dynamics simulation may be required. In our workflow, all four steps are performed using the software suite Discovery Studio 2.5 by Accelrys, Inc. This part of the workflow is not yet automated because the selection of templates during model construction requires manual intervention. There are, however, some existing works that have attempted to simplify molecular modeling into a one-step process [21, 22], and these may be incorporated into our workflow later on.
As there is no crystal structure available for most of the novel C-type lectins, the predicted structures can only be validated using algorithms that assess their correctness based on physicochemical properties of the residues such as planarity, chirality and bond length deviations. PROCHECK is one of the software packages performing this function. In our case, we use the Profiles-3D methodology for structure validation. In addition, for each structure being constructed, its Ramachandran diagram is also plotted and analyzed to detect significant violations in the backbone phi-psi angles of the amino acid residues. We select the best scoring model that has no gross physicochemical violations for further analysis and classification. Having obtained the molecular model of the C-type lectins, we can then perform docking studies to identify their putative binding partners.
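The phi and psi angles plotted in a Ramachandran diagram are backbone dihedral angles: phi is defined by the atoms C(i-1), N(i), Cα(i), C(i), and psi by N(i), Cα(i), C(i), N(i+1). The sketch below computes them from Cartesian coordinates in plain Python. The data layout (a list of dictionaries keyed by atom name) and the function names are assumptions made for illustration; validation packages such as PROCHECK perform considerably more checks than this.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)   # normals of the two planes
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def phi_psi(residues):
    """Return (phi, psi) per residue; None where the angle is undefined.

    residues is a list of dicts with 'N', 'CA' and 'C' coordinates, in chain order.
    """
    angles = []
    last = len(residues) - 1
    for i, res in enumerate(residues):
        phi = (dihedral(residues[i - 1]["C"], res["N"], res["CA"], res["C"])
               if i > 0 else None)
        psi = (dihedral(res["N"], res["CA"], res["C"], residues[i + 1]["N"])
               if i < last else None)
        angles.append((phi, psi))
    return angles
```

The first residue has no phi and the last has no psi, since each angle needs an atom from the neighboring residue.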
Glycan conformer generation
For docking simulations, the structures of both the receptors and ligands must be known. In our current setting, C-type lectins are the receptors for glycan molecules. Having obtained their structures through homology modeling, we now require the glycan structures. Despite the availability of small ligand databases such as ZINC, they are not specific to glycans, thus making it difficult to search for the relevant models. Moreover, with the huge diversity of natural and synthetic glycans, it is technically challenging to resolve their structures and store them in databases.
The final step in the functional classification of C-type lectins in our workflow is to screen for plausible interactions with the glycan library through computational docking studies. We use LigandFit, an algorithm that locates possible binding sites by analyzing cavities in the protein structure before trying to dock each glycan from our virtual library. The output from this virtual screening is a list of glycans that have plausible poses in any of the predicted binding sites.
Results and discussion
Sequence Analysis of CLEC17A
From the results, CLEC17A is a Type II transmembrane protein. As a C-type lectin, it is predicted to have a high specificity towards mannose and Ca2+ due to the presence of the EPN motif (position 341) and the WND motif (position 359), respectively. Within the extracellular region, there are two predicted N-linked glycosylation sites (positions 215 and 237), which may play a physiological role in the transport and localization of CLEC17A to the cell surface. We used some of these results to complement the experimental investigation and analysis of N-linked glycosylation sites on CLEC17A (see Additional File 3).
For the cytoplasmic region, there are several domains and motifs of interest. In particular, a number of SH2 and SH3 recognition motifs can be found within a proline-rich region. The same SH2 binding motifs are also predicted to be phosphorylated by proline-directed kinases. A possible candidate would be the mitogen-activated protein kinase (MAPK). This adds to the confidence that SH2-containing proteins such as the adaptor protein Grb2 and Src family proteins can dock to the cytoplasmic tail of CLEC17A. Another possible intracellular signaling mechanism can be inferred from the presence of hemi-ITAM motifs (YxxL). This motif, which is also present in Dectin-1, can recruit and activate the Syk family kinases. Incidentally, Syk also has SH2 domains, supporting the hypothesis that it interacts with CLEC17A.
Casein kinase II (CKII) is predicted to be another kinase that may phosphorylate CLEC17A based on its recognition motif ([ST]xx[DE]). Following the consensus between Prosite and ELM, the possible phosphorylation sites were shortlisted to positions 16, 42, and 68. Furthermore, these regions are enriched with glutamic acid, providing the acidic context for CKII phosphorylation. Other potential kinases for CLEC17A include protein kinase C (PKC) at position 107 and glycogen synthase kinase-3 (GSK3) at position 146, the latter being less reliable as the specificity of GSK3 has not been confirmed. Of note is the presence of a TNF receptor-associated factor 2 (TRAF2) binding motif ([PSAT]x[QE]E). Although TRAF2 is commonly associated with the tumor necrosis factor receptor (TNFR) superfamily, it has been suggested by Geijtenbeek and Gringhuis that the activation of nuclear factor NF-κB by Dectin-1 may involve the recruitment and activation of the TRAF2-TRAF6 complex. Since there are some similarities in the cytoplasmic motifs found in Dectin-1 and CLEC17A, it is possible that this interaction is present in CLEC17A intracellular signaling as well. Nevertheless, confirmation of these features awaits experimental verification.
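Shortlisting sites by consensus between prediction servers amounts to keeping the positions that enough tools agree on. The sketch below is one simple way to express this, assuming each tool's hits have already been reduced to 1-based residue positions. The function name, the `min_tools` parameter and the example positions are hypothetical and are not the CLEC17A predictions.

```python
from collections import Counter

def consensus_sites(predictions: dict, min_tools=None):
    """Return sorted positions reported by at least min_tools tools.

    predictions maps a tool name to an iterable of 1-based positions.
    By default a position must be reported by every tool.
    """
    if min_tools is None:
        min_tools = len(predictions)
    counts = Counter()
    for positions in predictions.values():
        counts.update(set(positions))        # count each tool at most once
    return sorted(pos for pos, n in counts.items() if n >= min_tools)

# Made-up positions, for illustration only.
predictions = {"tool_a": [5, 12, 30, 44], "tool_b": [12, 30, 51]}
shortlisted = consensus_sites(predictions)
```

Requiring exact positional agreement is a strict choice; tools that report the start of a motif rather than the modified residue would need their positions normalized first.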
Several other regulatory motifs were found by the prediction servers. However, the biological context for their functions is not present in CLEC17A, and hence they were not considered further. For instance, the C-terminal binding protein (CtBP) interacting motif (position 121) occurs mostly in DNA-interacting proteins and transcription factors. Since CLEC17A is a transmembrane receptor, this motif was discarded as a false positive.
Structure prediction and docking studies of CLEC17A
Next, we moved on to the virtual screening of the two surface binding sites against the glycan library using the following docking protocols: (i) CDocker, (ii) LibDock and (iii) LigandFit. In order to render the poses from the different protocols comparable, we re-scored them using a set of standard scoring functions: LigScore1,2, piecewise linear potential (PLP1,2), Jain, and potential of mean force (PMF). A consensus score is then generated for each ligand. Finally, the ligand poses are sorted according to the consensus score, and the top 25% of unique ligands for each binding site are selected for further analysis.
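The text does not spell out how the consensus score is computed, so the following is only one plausible scheme, sketched under stated assumptions: each pose earns one vote for every scoring function that ranks it within a top fraction of all poses, each ligand is represented by its best-voted pose, and the top 25% of unique ligands are kept. All function names, parameters and data are hypothetical, and every score is assumed to be oriented so that higher is better.

```python
import math

def consensus_votes(poses, score_names, top_fraction=0.25):
    """Count, per pose, how many scoring functions rank it in the top fraction."""
    n_top = max(1, math.ceil(top_fraction * len(poses)))
    votes = [0] * len(poses)
    for name in score_names:
        ranked = sorted(range(len(poses)),
                        key=lambda i: poses[i][name], reverse=True)
        for i in ranked[:n_top]:
            votes[i] += 1
    return votes

def top_ligands(poses, score_names, top_fraction=0.25, keep_fraction=0.25):
    """Return the best unique ligands, ordered by their best pose's vote count."""
    votes = consensus_votes(poses, score_names, top_fraction)
    best = {}
    for pose, v in zip(poses, votes):
        best[pose["ligand"]] = max(v, best.get(pose["ligand"], 0))
    if not best:
        return []
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(best, key=lambda lig: (-best[lig], lig))
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[:n_keep]

# Made-up scores for four hypothetical glycans, one pose each.
poses = [
    {"ligand": "G1", "s1": 9.0, "s2": 8.0},
    {"ligand": "G2", "s1": 2.0, "s2": 9.0},
    {"ligand": "G3", "s1": 1.0, "s2": 2.0},
    {"ligand": "G4", "s1": 3.0, "s2": 1.0},
]
votes = consensus_votes(poses, ["s1", "s2"], top_fraction=0.5)
selected = top_ligands(poses, ["s1", "s2"], top_fraction=0.5, keep_fraction=0.25)
```

Rank-based voting avoids having to put scoring functions with different units on a common scale, which is the main reason it is used here instead of averaging raw scores.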
As an initial analysis of the global glycan binding profile of CLEC17A, we looked at the terminating monosaccharides of the dockable glycans, since Taylor and Drickamer suggested that the binding specificities of C-type lectins may be due to their interaction with the terminal sugar. Hence, for each type of terminal monosaccharide, we obtained the list of corresponding glycans from the library and computed the proportion that docks to CLEC17A (Figure 5C). The results suggest that CLEC17A, in addition to its specificity towards mannose, may also bind glycans terminating with sugars such as fucose-β, N-glycolylneuraminic acid-α, N-acetylglucosamine-α and N-acetylgalactosamine-β. As this is an initial analysis, a more thorough approach may be required to confirm the possible interactions between CLEC17A and the glycans, as well as the amino acid residues responsible for forming the bonds.
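The per-monosaccharide proportion described above is a grouped ratio: for each terminal sugar, the number of library glycans that docked divided by the number of library glycans carrying that terminal sugar. A minimal sketch follows, assuming each glycan has been assigned a single terminal monosaccharide label (a simplification for branched glycans). The identifiers and labels are invented for illustration.

```python
from collections import defaultdict

def docking_proportions(library: dict, docked_ids: set) -> dict:
    """Fraction of library glycans that docked, grouped by terminal monosaccharide.

    library maps a glycan id to its terminal monosaccharide label.
    docked_ids is the set of glycan ids with at least one accepted pose.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for glycan_id, terminal in library.items():
        totals[terminal] += 1
        if glycan_id in docked_ids:
            hits[terminal] += 1
    return {terminal: hits[terminal] / totals[terminal] for terminal in totals}

# Invented library and docking results, for illustration only.
library = {"g1": "Man", "g2": "Man", "g3": "Fuc", "g4": "Gal"}
proportions = docking_proportions(library, {"g1", "g3"})
```

Normalizing by the library count for each terminal sugar keeps abundant sugar types from dominating the profile simply because more of them were screened.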
In this work, we have collected various methods for analyzing the putative structures and functions of novel C-type lectins and incorporated some of them into an integrative workflow for studying such lectins in a bottom-up manner. Sequence-based motifs and domains are first identified using an integrative metaserver. The structure of the given lectin is then constructed by homology modeling, and its putative functions are assessed through virtual screening against an in silico library of glycans that are found in mammalian cells. Having such a workflow in place should increase the speed and efficiency of identifying the putative roles and functions of novel C-type lectins for further experimental validation. We applied our workflow to elucidate the putative functions of a novel human C-type lectin, CLEC17A, and characterized it as an N-linked glycosylated transmembrane protein with high specificity towards mannose and fucose. Preliminary screening studies have also shown that CLEC17A possibly binds glycans that terminate with a few other monosaccharides such as N-glycolylneuraminic acid and N-acetylglucosamine. Additionally, the presence of motifs that bind to SH2 and SH3 domains, as well as the hemi-ITAM motifs, suggests that CLEC17A is involved in intracellular signaling that could lead to the production of cytokines such as interleukins.
With the development of more algorithms to predict sequence and structural features on C-type lectins, several more possible cellular functions of lectins may be revealed. However, the algorithms will have varying sensitivity and specificity. Although not all of them have been integrated into the workflow yet, we have demonstrated that integrating and interpreting the results together is invaluable both in filtering out improbable predictions and in aiding the design of future experiments for validation. With all the collated results, future work will include probabilistic approaches for accepting or rejecting prediction results.
Moreover, some parts of our workflow still require human supervision. At present, there are some works that aim to achieve the complete automation of homology modeling [21, 22], and these can be integrated within our workflow to make it an entirely automated process in the future. Combining the workflow with systems-level analysis such as pathway information will also shed more light not only on the features of the novel C-type lectins, but also on their molecular mechanisms and functions from a network-centric point of view. In addition, we are currently developing an in-house database system to store information on C-type lectins and their interacting partners, and it will be designed to allow direct entry of information from the prediction results generated via the workflow.
The financial support of the Agency for Science, Technology and Research (A*STAR) is gratefully acknowledged. This work was also supported by the Academic Research Fund (R-279-000-328-112) from the National University of Singapore and a grant from the Next-Generation BioGreen 21 Program (No. PJ008184), Rural Development Administration, Republic of Korea.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 14, 2011: 22nd International Conference on Genome Informatics: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S14.
- 10. Hunter S, Apweiler R, Attwood TK, Bairoch A, Binns ABD, Bork P, Das U, Daugherty L, Duquenne L, Finn RD, Gough J, Haft D, Hulo N, Kahn D, Kelly E, Laugraud A, Letunic I, Lonsdale D, Lopez R, Madera M, Maslen J, McAnulla C, McDowall J, Mistry J, Mitchell A, Mulder N, Natale D, Orengo C, Quinn AF, Selengut JD, Sigrist CJA, Thimma M, Thomas PD, Valentin F, Wilson D, Wu CH, Yeats C: InterPro: the integrative protein signature database. Nucleic Acids Research 2009, 37: D211-D215. 10.1093/nar/gkn785
- 12. van Liempt E, Imberty A, Bank CMC, van Vliet SJ, van Kooyk Y, Geijtenbeek TBH, van Die I: Molecular basis of the differences in binding properties of the highly related C-type lectins DC-SIGN and L-SIGN to Lewis X Trisaccharide and Schistosoma mansoni egg antigens. The Journal of Biological Chemistry 2004, 279(32):33161–33167. 10.1074/jbc.M404988200
- 23. Engh RA, Huber R: Accurate bond and angle parameters for X-ray protein structure refinement. Acta Crystallographica Section A 1991, 47(4):391–400.
- 29. Weininger D: SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences 1988, 28: 31–36. 10.1021/ci00057a005
- 35. Kataoka H, Kume N, Miyamoto S, Minami M, Murase T, Sawamura T, Masaki T, Hashimoto N, Kita T: Biosynthesis and post-translational processing of lectin-like oxidized low density lipoprotein receptor-1 (LOX-1). The Journal of Biological Chemistry 2000, 275(9):6573–6579. 10.1074/jbc.275.9.6573
- 37. Songyang Z, Lu KP, Kwon YT, Tsai LH, Filhol O, Cochet C, Brickey DA, Soderling TR, Bartleson C, Graves DJ, deMaggio AJ, Hoekstra MF, Blenis J, Hunter T, Cantley LC: A structural basis for substrate specificities of protein Ser/Thr kinases: primary sequence preference of Casein kinase I and II, NIMA, phosphorylase kinase, Calmodulin-dependent kinase II, CDK5, and Erk1. Molecular and Cellular Biology 1996, 16(11):6486–6493.
- 43. Gehlhaar DK, Verkhivker GM, Rejto PA, Sherman CJ, Fogel DB, Fogel LJ, Freer ST: Molecular recognition of the inhibitor AG-1343 by HIV-1 protease: conformationally flexible docking by evolutionary programming. Chemistry and Biology 1995, 2: 317–324. 10.1016/1074-5521(95)90050-0
- 45. Muegge I, Martin YC: A general and fast scoring function for protein-ligand interactions: a simplified potential approach. Journal of Medicinal Chemistry 1999, 31: 45–71.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.