Structural genomics target selection for the New York consortium on membrane protein structure
The New York Consortium on Membrane Protein Structure (NYCOMPS), a part of the Protein Structure Initiative (PSI) in the USA, has as its mission to establish a high-throughput pipeline for determination of novel integral membrane protein structures. Here we describe our current target selection protocol, which applies structural genomics approaches informed by the collective experience of our team of investigators. We first extract all annotated proteins from our reagent genomes, i.e. the 96 fully sequenced prokaryotic genomes from which we clone DNA. We filter this initial pool of sequences and obtain a list of valid targets. NYCOMPS defines valid targets as those that, among other features, have at least two predicted transmembrane helices, no predicted long disordered regions and, except for community nominated targets, no significant sequence similarity in the predicted transmembrane region to any known protein structure. Proteins that feed our experimental pipeline are selected by defining a protein seed and searching the set of all valid targets for proteins that are likely to have a transmembrane region structurally similar to that of the seed. We require sequence similarity aligning at least half of the predicted transmembrane region of seed and target. Seeds are selected according to their feasibility and/or biological interest, and they include both centrally selected targets and community nominated targets. As of December 2008, over 6,000 targets have been selected and are currently being processed by the experimental pipeline. We discuss how our target list may impact structural coverage of the membrane protein space.
Keywords: Membrane proteins, Target selection, Structural genomics, Structure determination
αIMP: Alpha helical bundle integral membrane protein
DUF: Domain of unknown function
IMP: Integral membrane protein
NYCOMPS: New York Consortium on Membrane Protein Structure
PDB: Protein Data Bank
PSI: Protein Structure Initiative
RefSeq: NCBI Reference Sequence
UPF: Uncharacterized protein family
Reagent genomes: List of entirely sequenced organisms from which PSI clones its targets
NYCOMPS as part of the PSI
The Protein Structure Initiative (PSI) is the leading structural genomics (SG) initiative in the USA; it is funded by the National Institute of General Medical Sciences (NIGMS) at the National Institutes of Health (NIH). The PSI currently supports four large production centers and six specialized centers as well as other activities [1, 2, 3]. Two of these specialized centers focus on developing new technologies for membrane protein structure determination: the Center for Structures of Membrane Proteins (CSMP) and the New York Consortium on Membrane Protein Structure (NYCOMPS). At NYCOMPS (http://www.nycomps.org/) we have established a high-throughput pipeline beginning with target selection and further including protein purification, protein expression and scale-up. Scaled-up proteins are sent to individual participating labs (within or outside of the consortium) for structure determination trials. While most of our resources are channeled into X-ray crystallography, we also pursue structure determination by NMR, solid-state NMR and cryo-electron microscopy. Here, we describe target selection, the first stage of the NYCOMPS pipeline.
Structure determination of integral membrane proteins is difficult
Integral membrane proteins (IMPs) are usually classified into two structural classes, according to the secondary structure conformation adopted by their membrane spanning segments [5, 6]: alpha helical bundle integral membrane proteins (αIMPs; estimated to constitute ~25% of an average proteome [7, 8]) and beta barrel IMPs (estimated, for example, to account for ~2–3% of proteins in Gram-negative bacteria [9, 10, 11]). These two classes of proteins differ also in their membrane localization: beta barrels are exclusive to the outer membrane of Gram-negative bacteria, atypical Gram-positive bacteria, mitochondria and chloroplasts, while alpha helical IMPs have been observed in all other membranes. Recent structural data demonstrate the existence of at least one additional structural class of IMPs, the alpha barrel. At NYCOMPS, we have so far focused only on αIMPs, because of their abundance in genomes and high biological and medical relevance [6, 14].
αIMPs are among the most difficult proteins for structure determination studies [15, 16, 17]. Consequently, they are extremely under-represented in the set of proteins for which we have high-resolution experimental structures. In particular, fewer than 1% of the proteins in the Protein Data Bank (PDB) are IMPs (as of March 2009 there were a total of 432 IMPs, including both αIMPs and beta barrels, among the 56,217 PDB structures; see http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html for up-to-date IMP statistics), while estimates based on fully sequenced genomes predict that 25% of all annotated proteins are αIMPs [7, 8].
αIMPs present several challenges for successful experimental structure determination. First, high protein yields are often essential for structural studies; unfortunately, αIMPs are generally expressed at naturally low levels and their over-expression is often toxic to the cell. Second, αIMPs are usually insoluble due to the long hydrophobic helices that are needed to span the lipid bilayer of the membrane core (typically around 17 residues long [21, 22]). Detergents are used to disrupt the membrane and prevent nonspecific aggregation. However, the choice of the detergent and the optimization of other buffer components, such as salt and glycerol, are challenging tasks [23, 24]. Finally, solution components that are useful for protein solubilization can interfere with crystallization, and crystallization success is also heavily dependent on the lipid content of the protein–detergent complexes. In essence, each step in the experimental purification and structure determination of an αIMP is extremely demanding; typically, many parameters must be optimized to obtain a high-resolution membrane protein structure, usually working with small amounts of protein.
SG adds large sampling of diversity as a new dimension
Structural genomics (SG) tries to increase the odds of experimentally obtaining high-resolution structures by using a pan-genomic approach that adds homology as a new dimension to the structure determination problem. This approach has been described in numerous publications [25, 26, 27]. Typically, all proteins within a given realm are clustered into pan-genomic sequence families, i.e. families with members found in different genomes, and then a set of those proteins is selected for which DNA templates are available from ‘reagent’ genomes. The set of proteins is tested experimentally for structure determination. An experimental structure for any one member of the family can serve through comparative modeling to inform studies on any other family member. This rationale is behind the tremendous success and impact of the PSI structural leverage. Here, we take a slightly different approach based on the concept of seed sequences. In brief, given a target protein π* (the “seed”) our goal is to find a protein π, the structure of which is predicted to be similar to π* and that surrenders a high-resolution structure using available experimental procedures (whereas the seed may fail). As in the conventional approach, the structure of π can provide a comparative structural model for π*. The advantage of this approach over clustering of the full set of targets at the inception of the project is that we can create our families by expanding promising seeds whenever such seeds become available, instead of having to map these seeds to predefined families. This is likely to increase the ‘centrality’ of the seed with respect to the cluster.
Goals of target selection at NYCOMPS
The NYCOMPS target selection aims at providing the experimental pipeline with αIMPs that are: (1) novel with respect to what is already in the PDB, (2) diverse, with respect to their known sequence, structural and functional features, and (3) most likely to yield a structure. In order to increase αIMP feasibility we apply several computational filters that eliminate candidates less likely to succeed. These filters range from the exclusion of proteins known to constitute individual subunits of hetero-oligomeric complexes to the removal of proteins predicted to have long regions of disorder. Whatever passes those filters will enter the experimental structure determination pipeline at the New York Structural Biology Center. Targets pursued experimentally by NYCOMPS fall mostly into two broad categories: nominated and centrally selected targets. The way these targets are selected and their potential significance for structural coverage of the αIMP sequence and functional space is the subject of this contribution. We also briefly cover several special case targets that do not fall into either of the previous two categories. The Results section is organized into two parts: target selection (Task I) and target analysis (Task II).
Our target selection protocol has constantly been evolving since the start of NYCOMPS in fall 2005. The protocol has been modified to accommodate comments and suggestions coming from our team of investigators, as well as from the NYCOMPS scientific advisory committee, the NIH review panels and novel data that have appeared in the literature. In the first part of Results (target selection), we describe the protocol as of January 2009, although a fraction of the targets analyzed in the second part of Results (target analysis) were selected according to older criteria that did not include all the filtering steps described here.
Materials and methods
Creation of the NYCOMPS98 dataset
RefSeq target sequences
We downloaded 96 prokaryotic genomes in their amino acid sequence translation from the NCBI Reference Sequence collection (RefSeq; ftp://ftp.ncbi.nih.gov/genomes; Table S1). 82 genomes are Bacteria (55 Gram- and 26 Gram+, including 3 Gram+ in the genus Mycoplasma, plus one genome that belongs to the phylum of Cyanobacteria), 14 are Archaea. Note that some of the NYCOMPS targets described here belong to 19 additional genomes that were retired in June 2008 because of strain mismatch between the RefSeq strain and the one provided by the ATCC® (the source of our genomic DNA), early indications of poor cloning/expression performance from our experimental pipeline, or both. Note also that our active list of genomes still comprises seven genomes for which we currently do not have an exact strain match with the ATCC provided genomic DNA or for which a match is uncertain. These genomes were retained based on early indications that, in these cases, strain mismatch was not causing major problems at the cloning level. Overall, the 96 NYCOMPS genomes encode 310,357 protein sequences.
Transmembrane helix predictions
In order to identify αIMPs in the 96 chosen prokaryotic genomes, we ran TMHMM2 on all sequences. TMHMM2 has been reported to be one of the best transmembrane helix (TMH) prediction programs in more than one independent assessment [32, 33], and it is also one of the fastest, i.e. ideal for predicting TMHs on a large number of sequences. TMHMM2 returns the number and location of predicted TMHs in a sequence. In principle, all proteins with at least one predicted TMH are predicted by TMHMM2 to be αIMPs. To remain on the safe side, we considered as valid targets only proteins with two or more predicted TMHs.
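The two-helix cutoff can be illustrated with a minimal Python sketch. It assumes the predictions are available in the tab-separated one-line-per-protein short format of TMHMM2 (with `PredHel=` and `Topology=` fields); the function names and the example record are hypothetical and are not part of the NYCOMPS pipeline.

```python
import re

def parse_tmhmm_short(line):
    """Parse one line of TMHMM2 short-format output (format assumed here)."""
    fields = line.rstrip("\n").split("\t")
    record = {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        record[key] = value
    # Helix spans appear in the topology string as start-end pairs,
    # e.g. "i7-29o44-66i" encodes helices 7-29 and 44-66.
    spans = re.findall(r"(\d+)-(\d+)", record.get("Topology", ""))
    helices = [(int(start), int(end)) for start, end in spans]
    return fields[0], helices

def is_valid_target(helices, min_helices=2):
    """Keep only proteins with two or more predicted TMHs."""
    return len(helices) >= min_helices

# Hypothetical example record
example = "protA\tlen=120\tExpAA=44.1\tFirst60=22.0\tPredHel=2\tTopology=i7-29o44-66i"
seq_id, helices = parse_tmhmm_short(example)
print(seq_id, helices, is_valid_target(helices))
```

Requiring two helices rather than one trades some sensitivity for fewer soluble proteins entering the pipeline through a single spurious helix prediction.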
We ran CD-HIT with a 98% sequence identity threshold (we used parameters: -n 5 -c 0.98 -l 30 -d 30) on all proteins left from the previous step. This ensured that no two proteins in our dataset shared more than 98% sequence identity. When considering proteins sharing more than 98% sequence identity, the decision on which one to retain in our database depended on the genome the sequences belonged to. In particular, we prioritized: (1) sequences from Archaea (following the guess of our team of experimentalists that they may provide more stable proteins) and from a list of best performing genomes (“best” in terms of structure yield for globular soluble proteins) provided to us by the Northeast Structural Genomics Consortium; this list was later substituted by a list of genomes with best expression yield based on preliminary data from our experimental pipeline (data not shown); (2) the longest sequence (i.e. as selected by CD-HIT).
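The choice of a representative within a group of near-identical sequences can be sketched as follows. This is a minimal illustration that assumes the groups have already been computed (for example by CD-HIT); the `Member` record, the genome names and the way the two priorities are combined into a single sort key are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Member:
    seq_id: str
    genome: str
    length: int
    is_archaeal: bool

def choose_representative(cluster, preferred_genomes):
    """Pick one member of a >98% identity group.

    Priority 1: archaeal sequences or sequences from a preferred genome.
    Priority 2: the longest sequence.
    """
    def priority(member):
        prioritized = member.is_archaeal or member.genome in preferred_genomes
        return (prioritized, member.length)
    return max(cluster, key=priority)

# Hypothetical group of three near-identical proteins
cluster = [
    Member("b1", "bacterium_X", 310, False),
    Member("a1", "archaeon_Y", 295, True),
    Member("b2", "bacterium_Z", 305, False),
]
print(choose_representative(cluster, preferred_genomes=set()).seq_id)
```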
Signal peptide predictions
We ran SignalP on all sequences from Bacteria left in our list (note: no SignalP for sequences in Archaea is available), and excluded all sequences predicted to have two TMHs but for which the first predicted TMH started before a predicted cleavage site. For the position of the cleavage site, we took the maximum out of the neural network and the HMM SignalP predictions.
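The exclusion rule above can be written down as a short Python sketch. It assumes that helix spans and the two SignalP cleavage positions have already been parsed into plain numbers; the function name and the use of `None` for "no signal peptide predicted" are assumptions of this example.

```python
def overlaps_signal_peptide(helices, nn_cleavage, hmm_cleavage):
    """Return True if a two-helix protein should be excluded.

    helices: list of (start, end) predicted TMH spans.
    nn_cleavage, hmm_cleavage: predicted cleavage positions from the
    neural network and HMM modes, or None if no site was predicted.
    """
    predicted = [pos for pos in (nn_cleavage, hmm_cleavage) if pos is not None]
    if len(helices) != 2 or not predicted:
        return False
    # Use the larger of the two predicted cleavage positions.
    cleavage_site = max(predicted)
    first_start = min(start for start, _ in helices)
    return first_start < cleavage_site

print(overlaps_signal_peptide([(5, 25), (40, 60)], 22, 27))
print(overlaps_signal_peptide([(30, 50), (70, 90)], 22, 27))
```

The rule is applied only to proteins with exactly two predicted helices, since losing one helix to a signal peptide would leave them below the two-helix cutoff.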
We identified disordered residues in our sequences by running IUPred using the option ‘glob’ that predicts structured domains in a protein. IUPred is one of the best performing programs for prediction of long disordered regions [37, 38] and it is also extremely fast. We discarded all proteins that had more than 15 consecutive residues predicted by IUPred not to be in a structured domain.
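A minimal Python sketch of the disorder filter follows. It assumes the predicted structured domains have already been extracted as 1-based inclusive `(start, end)` ranges; the function names are hypothetical.

```python
def longest_unstructured_run(length, structured_domains):
    """Length of the longest stretch of residues outside all structured domains."""
    in_domain = [False] * length
    for start, end in structured_domains:  # 1-based, inclusive
        for i in range(start - 1, min(end, length)):
            in_domain[i] = True
    longest = current = 0
    for flag in in_domain:
        current = 0 if flag else current + 1
        longest = max(longest, current)
    return longest

def passes_disorder_filter(length, structured_domains, max_run=15):
    """Discard proteins with more than max_run consecutive unstructured residues."""
    return longest_unstructured_run(length, structured_domains) <= max_run

print(passes_disorder_filter(100, [(11, 100)]))   # 10 unstructured residues at the N-terminus
print(passes_disorder_filter(100, [(17, 100)]))   # 16 unstructured residues at the N-terminus
```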
The sequences left after running this protocol constitute what we call the NYCOMPS98 dataset (39,037 sequences total).
Criterion for establishing evolutionary relationships between αIMPs
In order to find homologs of an αIMP query sequence, we used sequence similarity. We ran three iterations of PSI-BLAST: two profile generating iterations of the query against a large database composed of the sum of UniProtKB and PDB, and one final iteration on the αIMP dataset of interest, e.g. TCDB (parameters first two iterations: -j 3 -v 1000 -t 1 -h 1e-10 -e 0.001 -F F; last iteration: -e 1 -t 1 -v 50000 -b 50000 -F F). Note that we input the ‘effective length of database’ of the first iterations into the last iteration (-z option) in order to have an estimate of the alignments’ E values based on a large database. We first selected as ‘homologs’ of the query protein all sequences that aligned to it with E value < 10^-3 in the last iteration. Then, we additionally required all retained sequences to align to the query so that the alignment covered at least 50% of the residues predicted to be in a TMH in both query and subject sequence. All proteins that satisfied these constraints were considered part of the same structural family as the query sequence. Except when otherwise indicated, this is the criterion used to establish similarity between αIMPs throughout the paper.
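The two-part criterion (E value plus transmembrane coverage on both sides) can be sketched in Python as below. The sketch assumes each alignment is summarized by its E value and by its start and end coordinates on the query and on the subject; counting every TMH residue inside the aligned span as covered, without accounting for gaps, is a simplification made for this example.

```python
def tm_coverage(helices, aln_start, aln_end):
    """Fraction of predicted TMH residues that fall inside the aligned span."""
    tm_residues = sum(end - start + 1 for start, end in helices)
    if tm_residues == 0:
        return 0.0
    covered = sum(
        max(0, min(end, aln_end) - max(start, aln_start) + 1)
        for start, end in helices
    )
    return covered / tm_residues

def same_structural_family(evalue, query_helices, query_span,
                           subject_helices, subject_span,
                           max_evalue=1e-3, min_coverage=0.5):
    """Apply the E value threshold and the 50% TM coverage rule to both sequences."""
    return (
        evalue < max_evalue
        and tm_coverage(query_helices, *query_span) >= min_coverage
        and tm_coverage(subject_helices, *subject_span) >= min_coverage
    )

# Hypothetical pair: two helices of 20 residues each in both proteins
helices = [(10, 29), (40, 59)]
print(same_structural_family(1e-8, helices, (1, 35), helices, (5, 60)))
print(same_structural_family(1e-8, helices, (1, 20), helices, (5, 60)))
```

Requiring coverage in both query and subject keeps a hit that matches only a small part of a much larger membrane domain from being counted as a family member.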
Clustering of proteins in the ‘von Heijne list’
We clustered E. coli proteins in the ‘von Heijne list’ (613 proteins total) using a variation of the CLUP algorithm [25, 42]. For establishing similarity between proteins, we used the criterion described above. For clustering, we chose as the initial seed the shortest protein in the list and then seeds of increasing length until no sequences were left [25, 42]. Once the E. coli proteins were grouped into paralogous clusters, we further merged any two clusters that had at least one common member such that the whole region aligned to the seed of the first cluster was also aligned to the seed of the second, or vice versa.
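The seed-by-increasing-length step can be sketched as a simple greedy loop. This is only an illustration of that one step under stated assumptions: it is not the CLUP algorithm itself, it omits the subsequent cluster merging, and the `is_similar` callback stands in for the similarity criterion described above.

```python
def cluster_by_seed(lengths, is_similar):
    """Greedy clustering: shortest unassigned protein becomes the next seed.

    lengths: dict mapping protein id to sequence length.
    is_similar: function (seed_id, other_id) -> bool.
    Returns a dict mapping each seed id to the list of its cluster members.
    """
    remaining = sorted(lengths, key=lambda pid: (lengths[pid], pid))
    clusters = {}
    while remaining:
        seed = remaining.pop(0)
        members = [pid for pid in remaining if is_similar(seed, pid)]
        clusters[seed] = [seed] + members
        remaining = [pid for pid in remaining if pid not in members]
    return clusters

# Hypothetical toy data: a~b and c~d are similar pairs
lengths = {"a": 100, "b": 150, "c": 120, "d": 300}
similar_pairs = {frozenset(("a", "b")), frozenset(("c", "d"))}
clusters = cluster_by_seed(lengths, lambda x, y: frozenset((x, y)) in similar_pairs)
print(clusters)
```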
Post-seed-expansion filtering of targets
Exclusion of protein sequences similar to those in the PDB
We ran three iterations of PSI-BLAST using the same databases and parameters used for establishing evolutionary relationships between αIMPs (see above). The only differences were the E value threshold we used and the transmembrane (TM) coverage we required. Indeed, we discarded all proteins that at any iteration aligned with E value < 1 to a PDB protein and for which the alignment covered at least 25% of the residues predicted to be in a TMH in the target protein. Note that in this case the fraction of a PDB protein TM region aligned to the query was not deemed relevant. Even in cases in which the alignment extended over 100% of the TM region of a PDB protein, the target was not discarded unless the alignment covered more than 25% of the target TM region.
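The asymmetry of this filter (only the target's TM coverage counts) can be shown with a short Python sketch. It assumes PDB hits from all iterations have been pooled into `(evalue, start, end)` tuples in target coordinates, and it uses the "at least 25%" reading of the threshold; both choices are assumptions of this example.

```python
def is_pdb_similar(pdb_hits, target_helices, max_evalue=1.0, min_target_coverage=0.25):
    """Return True if the target should be discarded as similar to a PDB protein.

    pdb_hits: iterable of (evalue, aln_start, aln_end) on the target sequence.
    target_helices: list of (start, end) predicted TMH spans of the target.
    Coverage of the PDB protein's own TM region is deliberately ignored.
    """
    tm_positions = {pos for start, end in target_helices
                    for pos in range(start, end + 1)}
    if not tm_positions:
        return False
    for evalue, aln_start, aln_end in pdb_hits:
        aligned = sum(1 for pos in tm_positions if aln_start <= pos <= aln_end)
        if evalue < max_evalue and aligned / len(tm_positions) >= min_target_coverage:
            return True
    return False

target_helices = [(10, 29), (40, 59)]  # 40 predicted TM residues
print(is_pdb_similar([(0.5, 1, 19)], target_helices))  # 10 of 40 TM residues aligned
print(is_pdb_similar([(0.5, 1, 15)], target_helices))  # 6 of 40 TM residues aligned
```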
Exclusion of individual subunits of hetero-oligomeric complexes
EcoCyc is an annotated database for E. coli strain K-12 MG1655; Swiss-Prot [40, 44] is a general-purpose database including proteins from thousands of different species. To collect information about the possible role of a given protein as constituent subunit of a hetero-oligomeric complex, we queried EcoCyc manually for E. coli proteins and Swiss-Prot automatically for both E. coli and non E. coli proteins. From the Swiss-Prot searches we used information from the “FUNCTION”, “INTERACTIONS”, “SUBUNITS”, “COFACTORS”, “SUBCELLULAR LOCATION”, “DISEASE”, “DOMAIN” and “SIMILARITY” fields. These data were then manually inspected by our team of experimentalists, who were asked to decide whether or not to approve a given seed family.
Exclusion of seed family ‘outliers’
In this manual step, we excluded proteins that were very different with respect to the seed in terms of length and number of TMHs. What ‘very different’ meant depended on the family under consideration but, in general, proteins that had a difference of more than two predicted TMHs or had long (>100 residues) insertions with respect to the seed were prime candidates for exclusion. Also, we generally excluded proteins whose N-terminus we suspected might have been wrongly annotated. To this aim, we built a multiple sequence alignment of the whole family using CLUSTALW and manually inspected all members’ N-terminal regions. If a consensus N-terminus could be identified, sequences that aligned with the consensus but displayed extra N-terminal residues were discarded.
Comparing NYCOMPS targets to Pfam-A
Table 1 Percentage of NYCOMPS selected targets that map to at least one Pfam-A family (rows: E value threshold; columns: TM coverage)
Comparing NYCOMPS targets to TCDB
The Transporter Classification Database (TCDB) is a membrane transport protein database based on the Transporter Classification system, which is analogous to the enzyme commission (EC) system for enzyme classification. Note that TCDB includes both αIMPs and beta barrel integral membrane proteins. For our analysis, we used the version of the database from July 2008. TCDB is organized into 5 levels (each level identified by a number or a letter), representing the transporter class (7 classes in the version that we used), subclass (24), superfamily/family (557), subfamily (1,320) and substrate transported (3,224) for a total of 5,005 sequences. In order to estimate the fraction of TCDB classes, subclasses and superfamilies/families covered by our targets (first 3 levels of TCDB), we proceeded as follows. We ran 1,000 bootstrapping iterations. At each iteration, we picked at random exactly one target for each seed family (to mimic the situation in which we solve only one structure per seed family) and aligned all such targets against TCDB. At the end of each iteration, we calculated the number of different TCDB identifiers covered by our randomly picked targets (i.e. the number of TCDB identifiers corresponding to TCDB proteins that aligned to those targets). Finally, after 1,000 iterations, we calculated the average and standard deviation of the percentage of TCDB numbers covered with respect to the total (again, for the first three levels in the classification). We repeated this operation considering only 25, 50 and 75% of the seed families, selecting the families at random in each iteration, without re-sampling.
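The resampling procedure can be sketched in Python as follows. The sketch assumes that the alignments have been precomputed, so that each target maps to the set of TCDB identifiers (at one chosen level) it aligns to; the data structures, the fixed random seed and the use of the population standard deviation are assumptions of this example.

```python
import random
import statistics

def bootstrap_coverage(families, target_hits, all_ids,
                       iterations=1000, fraction=1.0, seed=0):
    """Mean and standard deviation of the percentage of identifiers covered.

    families: dict mapping seed family name to a list of its targets.
    target_hits: dict mapping target to the set of identifiers it aligns to.
    all_ids: set of all identifiers at the classification level of interest.
    fraction: share of seed families drawn (without replacement) per iteration.
    """
    rng = random.Random(seed)
    names = sorted(families)
    n_families = max(1, round(fraction * len(names)))
    percentages = []
    for _ in range(iterations):
        chosen = rng.sample(names, n_families)
        covered = set()
        for family in chosen:
            target = rng.choice(families[family])  # one target per seed family
            covered |= target_hits.get(target, set())
        percentages.append(100.0 * len(covered & all_ids) / len(all_ids))
    return statistics.mean(percentages), statistics.pstdev(percentages)

# Hypothetical toy data
families = {"f1": ["t1", "t2"], "f2": ["t3"]}
target_hits = {"t1": {"1.A"}, "t2": {"1.A"}, "t3": {"2.A"}}
all_ids = {"1.A", "2.A", "3.A", "4.A"}
print(bootstrap_coverage(families, target_hits, all_ids))
print(bootstrap_coverage(families, target_hits, all_ids, fraction=0.5))
```

Drawing a single target per family in each iteration reflects the scenario in which only one structure is solved per seed family, so the spread across iterations indicates how much the estimate depends on which family member succeeds.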
Comparing NYCOMPS targets to UniProtKB
UniProtKB [44, 49] is a protein repository composed of the manually annotated Swiss-Prot and of the automatically annotated TrEMBL databases. Here, we only considered UniProtKB proteins with at least 2 predicted TMHs (UniProtKB-TMH). Since we wanted to calculate the novel leverage of the UniProtKB-TMH subset provided by NYCOMPS targets, we first had to calculate the leverage on the same subset provided by αIMPs currently in the PDB. In fact, UniProtKB-TMH proteins that show significant similarity to PDB proteins cannot be claimed as “novel leverage” by NYCOMPS targets. To find αIMPs in the PDB we considered all PDB sequences (February 2009) and ran TMHMM2 on the entire set. We discarded all sequences predicted to have fewer than 2 TMHs and reduced redundancy at 98% sequence identity using CD-HIT (as done for the NYCOMPS targets). This left us with 187 proteins. We preferred this definition of αIMPs rather than using the annotations provided by PDB because it more closely reflected the way our target sequences were annotated as αIMPs (i.e. using TMH prediction). Finally, we aligned these sequences against UniProtKB-TMH and labeled all sequence-similar UniProtKB-TMH proteins as ‘not novel’. Note that in this case the criterion for similarity to PDB proteins is as in Fig. 2. When calculating the UniProtKB-TMH leverage of our targets, we excluded all proteins labeled as ‘not novel’ and performed bootstrapping to calculate averages and standard deviations in the same way as described for TCDB. In the same way, we calculated novel leverage for the subsets Swiss-Prot-TMH and UniProtKB-TMH-Human. For comparison, we also calculated ratios between NYCOMPS target leverage and PDB protein leverage by taking average leverage values obtained from bootstrapping for NYCOMPS targets and the leverage of the just described 187-protein subset for PDB proteins.
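The notion of novel leverage reduces to a set difference, sketched below in Python. The sketch assumes that the similarity searches have already been run, so that each target maps to the set of UniProtKB-TMH proteins it aligns to and the ‘not novel’ proteins are available as one set; all identifiers are hypothetical.

```python
def novel_leverage(selected_targets, target_hits, not_novel):
    """UniProtKB-TMH proteins covered by the targets but not by PDB proteins.

    selected_targets: targets picked in one bootstrap iteration.
    target_hits: dict mapping target to the set of similar UniProtKB-TMH proteins.
    not_novel: set of UniProtKB-TMH proteins similar to a PDB protein.
    """
    covered = set()
    for target in selected_targets:
        covered |= target_hits.get(target, set())
    return covered - not_novel

# Hypothetical toy data
target_hits = {"t1": {"U1", "U2"}, "t2": {"U2", "U3"}}
not_novel = {"U2"}
print(sorted(novel_leverage(["t1", "t2"], target_hits, not_novel)))
```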
Results and discussion
Task I: target selection
NYCOMPS98: creating a valid target list of predicted alpha helical membrane proteins (Fig. 1a)
Since most of our targets had no experimental annotation linking them to the membrane, we had to rely on prediction methods. Such methods are estimated to be very accurate [32, 53]; however, they will inevitably make TMH prediction mistakes at times. For target selection, we therefore only included proteins with ≥2 predicted TMHs. After this, we reduced redundancy by filtering out targets with exceedingly similar sequences, guaranteeing that no two proteins in our dataset shared more than 98% pairwise sequence identity. Indeed, while we welcomed redundancy, we wanted to avoid cloning very similar proteins, as very close homologs have been shown to have similar crystallization propensities.
In order to further decrease the chance of introducing water-soluble non-αIMPs into our pipeline, we also excluded sequences with two predicted TMHs for which the position of the most N-terminal TMH overlapped with a predicted signal peptide (the most common mistake of TMH prediction programs is to predict an N-terminal TMH in place of a signal peptide). Finally, we filtered out proteins that were predicted to have more than 15 consecutive disordered residues and hence might be problematic for crystallization. The remaining sequences constitute what we refer to as the NYCOMPS98 dataset (39,037 αIMPs).
Cloning families (Fig. 1b)
The two principal steps in target selection are: (1) the identification of targets that constitute promising candidates for structure determination and/or are of utmost biological interest; we refer to these as the “seeds”; and (2) the expansion of the seeds into families of (usually homologous) proteins likely to have membrane structures similar to the seed; we expand the seeds considering only sequences that are part of the NYCOMPS98 set of valid targets. NYCOMPS seeds are chosen from two distinct tracks that we refer to as “central selection” and “nomination”. Centrally selected seeds (139 proteins) have so far been selected according to prior indications of successful over-expression in our E. coli host (details below). In contrast, nominated seeds (35 proteins) have been hand-picked by participating and adjoined laboratories; most of these seeds are well-studied proteins of known function. Note that seeds do not need to be proteins within NYCOMPS98 or even within our collection of genomes. A protein is a valid seed as long as we can find homologs in the NYCOMPS98 dataset and as long as these homologs pass all the additional filters described below.
Central seed selection
In central seed selection our main concern was to pick membrane proteins that were more likely to readily provide high-resolution structures and that differed substantially in sequence within the TM region from proteins for which structures had already been determined experimentally. Given the small number of membrane protein structures available in the PDB, predictions as to which membrane proteins constitute structure-prone targets are, at the moment, more likely to rest on a misunderstanding of statistics than on sound science. We therefore scaled down our ambitions and started with proteins that were likely to give high yields of expression in our E. coli host (eventually increasing the odds of determining their structure). Such a list of proteins became available at the outset of the project thanks to a published genome-wide expression study performed on E. coli proteins. Although the main goal of the work by von Heijne and coworkers was to determine the localization of the C-terminus of E. coli αIMPs (either inside the cell or in the periplasm), an important side effect was to provide us with a list of proteins that were successfully over-expressed in E. coli. In fact, all 139 NYCOMPS seeds that have so far been centrally selected have been extracted from a list of 613 proteins provided to us by Erik Granseth (Stockholm University, Sweden). NYCOMPS refers to this set of proteins as “the von Heijne list” (www.rostlab.org/punta/vonHeijnelist.txt). All proteins in this list were successfully over-expressed in E. coli in fusion with GFP or phosphatase A. Our hope had therefore been that expanding these proteins into our 96 reagent genomes would provide a longer list of “good expressers”, eventually increasing the odds of obtaining a structure for each seed family.
The initial von Heijne list was not unique at the sequence level; it included paralogs. Hence, we first clustered the proteins in the von Heijne list according to sequence similarity (the clustering procedure was adapted from methods described in references [25, 42], see “Methods” for more details). This resulted in 268 E. coli paralogy groups. The protein with the highest fluorescence level in each group was then selected as seed for the expansion into the NYCOMPS98 dataset (following prior indications that fluorescence levels correlated with expression levels). Several of these initial families were subsequently excluded for one or more of the following reasons: (1) they were similar to membrane proteins of known structure (i.e. in the PDB; note, however, that for a handful of centrally selected seeds there now exists a structure of a homolog in the PDB that was deposited after the seed was selected), (2) they represented isolated subunits of hetero-oligomeric complexes, (3) they had fewer than 5 homologs in the entire UniProtKB [44, 49] (i.e. they provided low structural leverage). Some of the largest families (hundreds of homologs in NYCOMPS98) have also been held back, waiting for a data-driven criterion that could allow us to select only a fraction of homologs among all those available for the family; a criterion that would, for example, allow us to select proteins that we predict to express well under our experimental protocol. To date, 139 seeds from the von Heijne list have been selected for cloning.
Nominated seed selection
Nominated seeds (handpicked by participating groups) are special in many respects. For one, novelty with respect to PDB proteins is not enforced. Instead, we simply report the observed similarities to the nominating group, which then takes the final decision on whether or not to pursue that specific target. Indeed, according to our criterion for novelty (E value < 1, alignment extending over at least 25% of the target TM region, see “Methods” for more details), 14 (or 40%) of our nominated seeds have significant sequence similarity to at least one PDB protein. There are various reasons why nominated targets are sometimes selected disregarding similarity to PDB proteins. Technology development projects often need to work on well-characterized test cases. One such example was the nomination of the KcsA channel by the solid state NMR group. Also, there are cases in which sequence similarity as detected by our automatic protocol may not capture important structural and/or functional differences between the nominated seed and protein(s) already in the PDB. These differences may mean that the seed is a very valuable target, notwithstanding the presence of a homolog in the PDB. Again, this type of evaluation is left to the nominating group. Another procedural difference between nomination and central selection pertains to redundancy: nominated seeds do not need to be non-redundant. The same group can nominate seeds that are so similar in sequence that they expand into the same family. Despite their similarity, they will be processed as separate seeds by our pipeline (with overlapping members being assigned to only one of the resulting seed families). If different groups nominate similar seeds, though, we ask that they reach an agreement on how the resulting targets will be distributed.
Filtering of targets
After a seed is expanded into a family of proteins predicted to have similar membrane cores, all family members are subjected to additional filters. Since these filters depend on information that may quickly change, they are best applied at the moment the targets have to be submitted to the experimental pipeline rather than when creating the NYCOMPS98 dataset.
Novelty: exclusion of PDB homologs
The first filtering step is meant to ensure that target proteins provide novel coverage of the protein universe. Again, this does not apply to nominated targets. We filter out any centrally selected target with significant similarity in the predicted TM region to any protein in the PDB. This excludes all proteins aligning to a PDB protein with PSI-BLAST E values < 1 and for which the alignment extends over more than 25% of the predicted TM region (“Methods”). Note that while our E value cutoff ensures selection of targets whose TM domains are more novel than the average domain selected by other PSI high-throughput consortia, we do allow some overall similarity to PDB proteins to occur. Indeed, while we want to select novel targets, we do not want to exclude sequences simply based on the presence of a soluble domain that has a homolog in the PDB or on the fact that at most 25% of its TM region can be modeled based on a protein in the PDB. Finally, note also that our E value cutoff for avoiding similarity to PDB proteins is stricter (i.e. requires less similarity for exclusion of a protein) than the one we used, for example, for seed expansion (E value < 1 vs. < 10^-3). This reflects our different goals in the two situations. Whereas in seed expansion we try to minimize the number of false positives (proteins not evolutionarily related to the seed), in the comparison with PDB proteins we want to minimize the number of false negatives (in other words, we want to exclude as many targets related to PDB proteins as possible even at the cost of excluding a number of targets that are not in fact related to any PDB protein). We run this PDB filter both at the seed and at the single target level. Seeds with significant similarity to a PDB protein are not considered for further processing. When a seed passes this filter, each individual target in the family into which the seed expands is still subjected to the same filter and discarded if it matches our criteria for exclusion.
Exclusion of isolated subunits of hetero-oligomeric complexes
Occasionally, protein subunits that natively are parts of larger hetero-oligomeric complexes are structurally stable even when expressed in isolation (e.g. homotetrameric A subunit cyclic nucleotide-gated ion channels). However, in general, when expressed in isolation they are expected to be less likely to yield experimental structures. Therefore, our second filter removes all candidates for which we have evidence that they might constitute individual subunits of hetero-oligomeric complexes. Our identification of such subunits mostly relies on information extracted from EcoCyc and Swiss-Prot. Additionally, we seek input from our team of experimentalists. When we find evidence that one or more members of a family may constitute a subunit of a larger hetero-oligomeric complex, we usually discard the entire family. The final decision is taken after consulting with our team of experimental experts. An alternative way to confront such cases might be to clone and co-express them with all components of the complex. However, except for a few special cases (see below), our experimental pipeline is currently not set up to perform such operations in a high-throughput manner.
Removal of outliers
As a final step, we try to correct for “inconsistencies” in our families. We usually discard proteins for which the number of predicted TMHs differs greatly from that of the seed, as well as proteins whose length differs significantly from that of the seed (“Methods”). We also try to exclude proteins that align well with the consensus N-terminus of the family (when any such consensus can be identified) but that feature additional N-terminal residues, because these may be cases of mis-annotated proteins (Fig. S3).
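The two numerical outlier checks can be sketched as below. The actual cut-offs are given in “Methods” and are not reproduced here, so the function takes them as required arguments; the function name, the parameter names and the use of a relative length difference are assumptions made for illustration.

```python
def is_outlier(seed_tmh, seed_length, target_tmh, target_length,
               *, max_tmh_diff, max_rel_length_diff):
    """Flag a family member whose TMH count or length departs from the seed.

    Both thresholds are caller-supplied placeholders, not published values.
    """
    # Difference in the number of predicted transmembrane helices
    tmh_differs = abs(target_tmh - seed_tmh) > max_tmh_diff
    # Length difference relative to the seed length
    length_differs = abs(target_length - seed_length) / seed_length > max_rel_length_diff
    return tmh_differs or length_differs
```

For example, with placeholder thresholds `max_tmh_diff=2` and `max_rel_length_diff=0.3`, a 310-residue target with 6 predicted TMHs would be kept for a 300-residue seed with 6 TMHs, whereas a 600-residue target would be flagged.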
Other NYCOMPS targets
Several NYCOMPS targets were selected according to other criteria. For instance, biological-theme targets are individual proteins handled by participating laboratories in the usual style of hypothesis-driven rather than hypothesis-generating structural biology. Such targets typically do not enter our pipeline at any stage. Another set of examples is constituted by 230 nominated histidine kinase targets that were hand-picked by one of our participating groups based on functional annotations. Some of those have <2 predicted TMHs (either 1 or 0). Additionally, we cloned 18 constructs that consist of co-cistronic (i.e. localized in neighboring regions of the genome) subunits of hetero-oligomeric complexes. This particular set of complexes can enter the existing experimental pipeline without modifications, because we can clone the full DNA stretch spanning the genes of all involved subunits.
Task II: target analysis
NYCOMPS targets diverse in terms of number of TMHs and length
81% of the selected targets map to Pfam-A families, 63% to TCDB proteins
How “relevant” are NYCOMPS targets to today’s biology? We address this point by mapping our target list to two manually curated protein databases that carry functional annotations: the general-purpose Pfam-A domain database and the IMP-specific TCDB, a comprehensive database of membrane transport proteins. Pfam-A families are manually curated from sequence-based alignments and they often do not span entire structural domains [46, 47]. This means that the TM region of a target may align to more than one Pfam-A family and that one or more Pfam-A families may not cover it entirely. On the other hand, some of the TM regions in our targets do not represent single structural domains, e.g. the TM region of 2-hydroxycarboxylate transporters. For these reasons, we evaluate similarity to Pfam-A families in two ways. First, we calculate the fraction of our targets that align (HMMER E value < 10⁻³) to one or more Pfam-A families, collectively covering more than half of the target TM region. This fraction amounts to 81% of all targets (89% of nominated and 79% of centrally selected targets, respectively; see Table 1 for different E value thresholds and different fractions of TM region coverage). Second, we consider only targets for which there exists at least one Pfam-A family that by itself aligns over more than half of the predicted TM region. The percentage of targets satisfying this condition is 70%. Overall, these latter targets map to 142 different Pfam-A families, offering additional evidence of the diversity of NYCOMPS targets at the sequence level. Finally, since not all Pfam-A families carry a functional annotation, we additionally calculate the percentage of matches after excluding DUFs and UPFs. In this case, the previous values become 61% (one or more families, Table S2) and 58% (single family).
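The difference between the two Pfam-A criteria (collective coverage versus coverage by a single family) can be made concrete with a short sketch. It assumes that each match is available as a `(family, e_value, start, end)` tuple in target coordinates and that the predicted TM region is a single inclusive span; both the data layout and the function names are hypothetical.

```python
from collections import defaultdict


def covered_length(intervals):
    """Total number of residues covered by a set of inclusive intervals."""
    covered = 0
    current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            covered += end - start + 1      # new, non-overlapping segment
            current_end = end
        elif end > current_end:
            covered += end - current_end    # only the part not yet counted
            current_end = end
    return covered


def pfam_coverage(matches, tm_region, max_e_value=1e-3, min_fraction=0.5):
    """Return (collective_ok, single_families) for one target.

    collective_ok: all significant families together cover > half the TM span.
    single_families: families that cover > half the TM span on their own.
    """
    tm_start, tm_end = tm_region
    tm_length = tm_end - tm_start + 1
    per_family = defaultdict(list)
    for family, e_value, start, end in matches:
        if e_value >= max_e_value:
            continue  # not significant at the chosen threshold
        clipped_start, clipped_end = max(start, tm_start), min(end, tm_end)
        if clipped_start <= clipped_end:
            per_family[family].append((clipped_start, clipped_end))
    all_segments = [seg for segs in per_family.values() for seg in segs]
    collective_ok = covered_length(all_segments) / tm_length > min_fraction
    single_families = sorted(
        family for family, segs in per_family.items()
        if covered_length(segs) / tm_length > min_fraction
    )
    return collective_ok, single_families
```

A target whose TM span is covered by two families of 30% each satisfies the first criterion but not the second, which is the kind of case that accounts for the gap between 81% and 70%.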
Table: Target leverage: TCDB. Columns: Fraction of target families solved out of 174 | TCDB Level 1 (7) (%) | TCDB Level 2 (24) (%) | TCDB Level 3 (557) (%)
If we solved one structure for each seed family, 12% of TCDB superfamilies/families distributed over 45% of all TCDB subclasses could, on average, be modeled (by our criterion for similarity, i.e. at least 50% of their TM region could be modeled). If instead we solved one structure for every fourth seed family, we could provide models for 3% of TCDB superfamilies/families distributed over 25% of all TCDB subclasses. When considering these numbers, one has to take into account that TCDB also contains beta barrel integral membrane proteins, which we currently do not consider as targets for NYCOMPS, and that not all membrane proteins are transporters (i.e. TCDB does not cover the entire universe of annotated αIMPs).
In conclusion, comparison to Pfam-A (61%) and TCDB (63%) shows that a little less than two-thirds of our targets can at least partially be mapped to functionally annotated proteins found in manually curated public databases. On the other hand, we also see that achieving comprehensive structural coverage of IMP databases such as TCDB will require considerably scaling up the number of selected seeds and targets.
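One way to obtain an “average” coverage when only a fraction of seed families is solved is to repeatedly draw random subsets of families and average the resulting coverage. The sketch below illustrates that idea only; it is not necessarily the resampling procedure used for the figures quoted here. The mapping from seed family to the set of TCDB groups it could model, and all names, are assumptions.

```python
import random


def expected_coverage(family_to_groups, n_solved, total_groups,
                      n_trials=1000, seed=0):
    """Average fraction of database groups modelable from n_solved random families.

    family_to_groups: dict mapping a seed family id to the set of group ids
    (e.g. TCDB families) it could model. total_groups: size of the database
    level being covered.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    families = list(family_to_groups)
    total = 0.0
    for _ in range(n_trials):
        solved = rng.sample(families, n_solved)  # draw without replacement
        covered = set().union(*(family_to_groups[f] for f in solved))
        total += len(covered) / total_groups
    return total / n_trials
```

Calling the function with `n_solved` equal to the number of families gives the full-coverage figure, and calling it with a quarter of that number gives the corresponding reduced estimate.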
UniProtKB novel leverage provided by NYCOMPS targets
One important aspect of any SG effort is novel leverage, i.e. the degree to which each experimental structure enables the generation of comparative models that were not available prior to its determination. PSI has overall performed extremely well by this criterion. In the context of NYCOMPS target selection, we estimated the novel leverage that would result if we obtained structures for all, or a fraction of, our families. Note that here we apply a slightly more restrictive criterion for novel leverage than its original definition: we measure novel leverage not relative to the time the targets were selected but as of February 2009. We defined the leverage by aligning all our targets against UniProtKB but discarding aligned UniProtKB sequences predicted to have <2 TMHs (we call this subset UniProtKB-TMH).
Next, we investigated the leverage with respect to Swiss-Prot-TMH, i.e. the manually annotated subset of UniProtKB-TMH (Fig. 4a, filled diamonds and long dashes). In this case, NYCOMPS targets would allow modeling of between 300 and 4,700 novel Swiss-Prot-TMH proteins, depending on the number of seed families for which we can solve at least one structure. The ratio of novel leverage to what is currently possible using PDB proteins follows a trend very similar to that seen for the entire UniProtKB-TMH (Fig. 4b).
Finally, if we consider only human sequences within UniProtKB-TMH, we find that novel structural information could be obtained for 10–260 proteins (Fig. 4a, crossed squares and short-dash line). This time, the ratio to current leverage is markedly smaller (Fig. 4b). This is not very surprising given that all of our targets come from prokaryotic organisms.
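The leverage bookkeeping described in this section reduces to a few set operations, sketched below. The inputs are assumed to be pre-computed: the set of UniProtKB identifiers that could be modeled from solved targets, the set already modelable from PDB proteins, and a predicted TMH count per identifier. The function name and data layout are hypothetical.

```python
def novel_leverage(target_hits, pdb_hits, predicted_tmh, min_tmh=2):
    """Return (novel_ids, ratio_to_current_leverage).

    target_hits / pdb_hits: sets of sequence ids modelable from targets / PDB.
    predicted_tmh: dict mapping sequence id to its predicted number of TMHs.
    """
    # Restrict to the TMH subset: sequences with at least two predicted TMHs
    tmh_subset = {seq_id for seq_id, n in predicted_tmh.items() if n >= min_tmh}
    from_targets = target_hits & tmh_subset
    from_pdb = pdb_hits & tmh_subset
    # Novel leverage: modelable from targets but not already from the PDB
    novel = from_targets - from_pdb
    ratio = len(novel) / len(from_pdb) if from_pdb else None
    return novel, ratio
```

Restricting `predicted_tmh` to manually annotated or to human entries before the call would give the analogous counts for the Swiss-Prot and human subsets.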
The New York Consortium on Membrane Protein Structure (NYCOMPS) targets alpha helical bundle integral membrane proteins, adopting a strategy that seeks to optimize success while maintaining the commitment to novelty, target relevance and leverage. In this paper, we have shown that the selected targets cover a wide range of protein lengths, TM topologies and functions. We have also demonstrated that the experimental determination of representative structures for these targets would allow transfer of structural information to a large number of known, but structurally uncharacterized, proteins. In the near future, we plan to expand the list of valid targets by introducing new genomes and by targeting eukaryotic proteins, non-co-cistronic complexes and beta barrel integral membrane proteins.
This work was supported by grant U54-GM75026-01 to NYCOMPS from the PSI of the NIH. Thanks to all NYCOMPS collaborators who contribute to making NYCOMPS a wonderful experience, in particular to Ann McDermott, Filippo Mancia, Ming Zhou, Francesca Gubellini (all Columbia), Da-Neng Wang (New York University), Mark Girvin (Albert Einstein), Guy Montelione (Rutgers), Renato Bruni and Brian Kloss (both NYCOMPS). Specific thanks to Jinfeng Liu (Genentech), Jessica Locke (Rutgers) and Rajesh Nair (Food and Drug Administration) for providing preliminary information and programs; to Ta-tsen Soong (Columbia), Henry Bigelow and Dariusz Przybylski (both Broad Institute), Kaz Wrzeszczynski (Cold Spring Harbor Laboratory), Zsuzsanna Dosztanyi (Hungarian Academy of Sciences, Budapest) and Zsolt Zolnai (University of Wisconsin–Madison) for useful discussions; to Guy Yachdav and Laszlo Kajan (both Columbia) for computer assistance and the collection of genome data sets. Last but not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- 4. Stroud RM, Choe S, Holton J, Kaback HR, Kwiatkowski W, Minor DL, Riek R, Sali A, Stahlberg H, Harries W (2009) 2007 Annual progress report synopsis of the center for structures of membrane proteins. J Struct Funct Genomics 10:193–208
- 18. Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Feng Z, Gilliland GL, Iype L, Jain S, Fagan P, Marvin J, Padilla D, Ravichandran V, Schneider B, Thanki N, Weissig H, Westbrook JD, Zardecki C (2002) The protein data bank. Acta Crystallogr D Biol Crystallogr 58:899–907
- 21. Chen CP, Rost B (2002) Long membrane helices and short loops predicted less accurately. Protein Sci 11:2766–2773
- 29. Nair R, Liu J, Soong TT, Acton TB, Everett JK, Kouranov A, Fiser A, Godzik A, Jaroszewski L, Orengo C, Montelione GT, Rost B (2009) Structural genomics is the largest contributor of novel structural leverage. J Struct Funct Genomics 10:181–191
- 38. Schlessinger A, Punta M, Yachdav G, Kajan L, Rost B (2009) Improved disorder prediction by combination of orthogonal approaches. PLoS One 4:e4433
- 43. Keseler IM, Bonavides-Martinez C, Collado-Vides J, Gama-Castro S, Gunsalus RP, Johnson DA, Krummenacker M, Nolan LM, Paley S, Paulsen IT, Peralta-Gil M, Santos-Zavaleta A, Shearer AG, Karp PD (2009) EcoCyc: a comprehensive view of Escherichia coli biology. Nucleic Acids Res 37:D464–D470
- 44. Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O’Donovan C, Redaschi N, Yeh LS (2005) The universal protein resource (UniProt). Nucleic Acids Res 33:D154–D159
- 48. Efron B, Tibshirani RJ (1993) An introduction to the bootstrap. Chapman & Hall/CRC, Boca Raton
- 51. Acton TB, Gunsalus KC, Xiao R, Ma LC, Aramini J, Baran MC, Chiang YW, Climent T, Cooper B, Denissova NG, Douglas SM, Everett JK, Ho CK, Macapagal D, Rajan PK, Shastry R, Shih LY, Swapna GV, Wilson M, Wu M, Gerstein M, Inouye M, Hunt JF, Montelione GT (2005) Robotic cloning and protein production platform of the northeast structural genomics consortium. Methods Enzymol 394:210–243
- 57. Drew D, Slotboom DJ, Friso G, Reda T, Genevaux P, Rapp M, Meindl-Beinker NM, Lambert W, Lerch M, Daley DO, Van Wijk KJ, Hirst J, Kunji E, De Gier JW (2005) A scalable, GFP-based pipeline for membrane protein overexpression screening and purification. Protein Sci 14:2011–2017
- 59. Biel M (2009) Cyclic nucleotide-regulated cation channels. J Biol Chem 284:9017–9021