Modern biomedical research depends on a complete and accurate proteome. With the widespread adoption of new sequencing technologies, genome sequences are generated at a near exponential rate, diminishing the time and effort that can be invested in genome annotation. The resulting gene set contains numerous errors in even the most basic form of annotation: the primary structure of the proteins.
The application of experimental proteomics data to genome annotation, called proteogenomics, can quickly and efficiently discover misannotations, yielding a more accurate and complete genome annotation. We present a comprehensive proteogenomic analysis of the plague bacterium, Yersinia pestis KIM. We discover non-annotated genes, correct protein boundaries, remove spuriously annotated ORFs, and make major advances towards accurate identification of signal peptides. Finally, we apply our data to 21 other Yersinia genomes, correcting and enhancing their annotations.
In total, 141 gene models were altered and have been updated in RefSeq and Genbank, which can be accessed seamlessly through any NCBI tool (e.g. blast) or downloaded directly. Along with the improved gene models we discover new, more accurate means of identifying signal peptides in proteomics data.
Yersinia pestis, a Gram-negative bacterium, is the causative agent of the bubonic and pneumonic plague. The pathogenic lifestyle of this microbe involves two distinct life stages, one in the flea vector, the other in mammalian hosts, primarily rodents . Y. pestis recently speciated from Y. pseudotuberculosis, acquiring two pathogenic plasmids and a chromosomal pathogenicity island. Seven Y. pestis genomes have been sequenced to completion, along with five other Yersinia sequences. Numerous other Yersinia have been sequenced to draft quality.
Genome annotation is often divided into two sequential phases, finding genes and assigning function. Most prokaryotic genome annotation pipelines consist of automated gene finding, corroborated by limited homology comparisons. As such they lack any experimental validation of primary structure. Fundamentally, an accurate primary structure implies finding the correct start/stop of the gene, which may be erroneously predicted for 20% of genes in some bacterial and archaeal genomes [2, 3]. But it also includes recognizing any true frame-shifting events, which must be delineated from sequencing errors or recent degeneration into a pseudogene.
A second benchmark for accurate primary structure is determining the mature protein sequence. Protein cleavage events (e.g. N-terminal export signal peptides, C-terminal LPXTG cell wall anchorage motifs) are particularly valuable clues for protein localization in the prokaryotic cell. Similarly, identifying differences between a mature virulence-associated protein and the nascent pre-protein can add valuable information as to how such a protein assumes its biological role in pathogenesis. Furthermore, modifications to amino acids (e.g., phosphorylation) implicate a protein in distinct and often transient biological processes (e.g. regulation of gene expression). None of these mature protein events are observable in DNA sequencing. They must be observed on the protein level.
A genome's annotation should be a dynamic working hypothesis, improved over time as understanding and knowledge increase [4, 5]. Peptides observed from MS/MS experiments are an orthogonal data type from the DNA-centric evidences commonly used to predict protein-coding sequences (e.g. sequence homology, EST mapping, nucleotide and codon frequency metrics, etc.). Protein prediction using these extrinsic evidences, called proteogenomics, yields a more complete and accurate protein-coding catalog [6, 7]. Specifically, proteogenomics can determine reading frame, translational start and stop sites, and the validity of short ORFs. In a variety of organisms, new insights from proteogenomics have consistently improved genome annotation [8–12].
In this work, we present a comprehensive proteogenomic analysis Yersinia pestis KIM. We discover non-annotated genes, correct the protein coding sequence of several of genes, remove many spuriously annotated ORFs, and make major advances towards accurate identification of signal peptides. Corrections have been updated in the RefSeq annotation of the genome (NC_004088). Through NCBI's peptidome, we have linked our experimental data directly to the genome annotation. Finally, we translate these improvements across other Yersinia genomes, to update annotation for the two other human pathogenic species.
Results and Discussion
Correcting Annotation Errors
By mapping the proteomic data onto the genome, we are able to objectively determine the quality of the genome annotation. When peptides map outside of predicted proteins, there are two categories of annotation improvement. If the open reading frame lacks a predicted protein, the observed peptides are evidence for a novel gene. If the ORF contains a predicted protein and peptides map upstream, they are evidence for a 5' extension and new start site. Both of these situations necessitate an update to the genome annotation. Working with the RefSeq curators at NCBI, all of the instances discussed below have been updated, and can be accessed seamlessly through any NCBI tool (e.g. blast) or downloaded directly.
Two proteins with erroneous start sites are special cases and widely mispredicted in bacterial genomes. Peptide chain release factor II, prfB, often contains a ribosomal +1 frame shift. As with the KIM genome, erroneous annotations of this protein typically contain only the c-terminal ORF but exclude the true n-terminus (Additional File 1, Figure S2). Our proteogenomics pipeline recognized peptides in both ORFs and the two were stitched together. We also found peptides upstream of infC, protein chain initiation factor IF-3, which utilizes an ultra-rare start codon ATT .
Proteolytic cleavage of proteins in vivo can be recognized by proteomics by their atypical peptide endpoints. Spectra used in this report were generated from proteins digested with trypsin. Thus, we expect most identified peptides to be fully tryptic. Previously Gupta and colleagues postulated that signal peptides could be discovered simply by identifying non-tryptic peptides . In their analysis, if the first observed peptide in a protein had a non-tryptic n-terminus and was within 17-55 residues of the start site, then it was a considered evidence of signal peptide cleavage. 202 Yersinia proteins fulfill these two requirements. We extend Gupta's criteria to include critical biological motifs within the signal peptide: the hydrophobic patch and cleavage motif . Filtering out proteins lacking these new requirements, we report 82 proteins with observed signal peptide cleavage (Additional File 2, Table S1). These proteins also contain other common signal peptide features: prevalence of LL doublets and early basic residues. Furthermore, we noticed that the hydrophobic patch had a similar placement within the signal peptide for all validated proteins (Figure 4B). This location is consistent with the patch's structural purpose, i.e. membrane anchoring and exposure of the cleavage motif at an appropriate distance from the membrane. Finally, we report that many of the sequences contain not simply early basic residues but contain Met-Lys as the first two residues in the protein sequence. 34 of the 82 proteins start in this manner. An additional 15 have Met-Lys internally which could be the true start, if this trend is seen as a general pattern.
Twenty proteins contain signal peptides longer than 30 residues, which is atypical. When we compared these sequences to close homologs within the gamma-proteobacteria, 13 of them could be better predicted at an alternative start site downstream of the current annotation. This shorter version of the protein would not only have better agreement with homologous genes, but also the characteristics of signal peptides noted above (not just length). An additional four long signal peptides appear to contain the twin arginine motif, which are characteristically longer than those exported through the Sec-dependent pathway.
The set of secreted proteins within a genome is often computationally predicted. To compare our proteomic observations to such predictions, we ran signalp on all proteins in the genome. We plot the score for proteomically validated signal peptide containing proteins against the background of signalp's score distribution (Figure 4C). We also plot proteomically rejected signal peptide containing proteins (see Methods). The proteomically observed and rejected proteins separate very clearly, with the positive set scoring well above the suggested cutoff. Furthermore, proteomics and signalp generally agree on the exact residue of cleavage.
Improvement of Related Genomes
As a final step in our analysis, we used our proteomics data to improve the annotation of other Yersinia genomes. Published in 2002, KIM was the second annotated Yersinia and served as a source for subsequent genome projects . Thus errors in KIM's annotation are likely to show up in other genomes. We created one-to-one orthology maps for proteins from 21 Yersinia genomes, and used our peptide mappings from the KIM dataset to evaluate gene models. Transferring proteogenomic improvements is complicated by both legitimate differences between strains/species, but also artifacts from sequencing and assembly. Our approach was to be conservative, and only change gene models where the evidence was clear.
We started with the removal of dubious genes, resulting in the deletion of 46 accessions. Remembering that each dubious gene was part of a conflicted locus, we checked to make sure that the true gene was present. At two loci we found that the true gene was missing, excluded from the original annotation by the dubious gene. We created new gene models for each of these. Three additional new genes were created, arising from the transitive annotation of the novel KIM genes. The two cold shock proteins were missing from Y. pseudotuberculosis IP 31758. The major outer membrane lipoprotein, lpp, was missing from Y. pestis Pestoides F.
To correct start sites in other Yersinia genomes, we applied the n-terminal most peptide from a gene in KIM to members of its ortholog set. If an ortholog was too short to include this peptide, we tested whether an extension could be made to accommodate. In 43 instances, this extension was trivial - clear homology up to and including the start codon. For 22, an extension could not be made due to indels or mutations shortening the reading frame. Finally, 10 were difficult cases where we could extend the reading frame to an upstream start codon, but homology was unclear. In these cases, we did not alter the genome annotation. This was particularly problematic in Y. enterocolitica, which is the most divergent genome used in our comparison.
Each primary genome annotation is unique. From parameters and cutoffs, to pragma or data sources, the differences between genome annotation pipelines are non-trivial. When the number of places producing genome annotation is taken into consideration, the disparities become ever greater. When annotations disagree, there is rarely experimental proof as to which is right. Proteogenomics is a fast and effective way of improving genome annotation. By comparing experimental proteomics data with a genome annotation, we root ourselves in the observed proteome. In Y. pestis KIM, we show several instances of misannotation. As an isolated event, a few misannotations here or there may not seem overly significant. But as we have shown, misannotation is multiplied in future genomes.
Given the size of our data, we were somewhat surprised to find only four novel genes and five misannotated start sites. There are three potential explanations for this. First, gel-based proteomics is inherently limited in its sampling depth, as compared to the MudPIT experimental set up. This reason is not entirely satisfactory, as we report coverage for >30% of the proteome. Second, Yersinia is closely related to E. coli and thus benefits from its well curated genome. Finally, and most importantly, we note the tendency for overprediction within the annotation. This is evidenced by the extreme number of overlapping genes (many of which were proven to be dubious) and the frequent occurrence of genes which are longer in KIM than their homologs in other genomes. As proteogenomics relies on underprediction to find corrections (meaning that we identify a region of the genome which is not predicted to be coding but should be), this tendency diminishes our power for annotation improvement. Yet we are still able to significantly improve and enhance the annotation, especially by incorporating our data back to public repositories.
Misannotated start sites are regularly found in proteogenomic surveys. As reported by Salzberg , genomes annotated before 2006 had a ~ 20% error rate for start site assignment. The five cases discovered here highlight necessity of protein-based genome annotation, and the difficulties surrounding purely computational predictions. The most difficult situation to automate is the usage of exceptional or rare start codons. Our example here is IF-3 which utilizes the ATT codon, and has been found misannotated in other proteogenomics reports . We note that this rather fundamental gene is mispredicted in most enterobacterial genomes. The full extent of rare start codon usage is not known, although a recent report proposed several new rare codons in Deinococcus . A second category of difficult genes are those unique to a genus or species, often the more biologically interesting set of proteins. Without broad sequence distribution across taxa, comparative genomics is not effective. A final case is programmed ribosomal frame-shifts, such as prfB. In the Yersinia genus, prfB is annotated in two predominant forms, a full length protein, and one which lacks ~60 amino acids from the n-terminus (Additional File 1, Figure S2). The +1 frame shift required to maintain proper reading frame is simply not considered in most annotations.
Our analysis of the observed proteome reveals the difficulties surrounding the annotation of pseudogenes. Whether shortened by nonsense mutations or disrupted by indels, pseudogene annotation is inconsistent. In our orthology clusters, we observed genes split into multiple ORFs variously annotated as a truncated protein, pseudogene, unannotated, or a protein without comment. Second, we highlight the implication of marking a region as a pseudogene, which gives the impression that this genomic locus should be ignored. However both here and elsewhere  proteomics has revealed that many pseudogenes are 'alive' - translated and present in the cell. The truncated ABC transporter y3734 observed here almost certainly does not have the expected function. How should annotation transparently and concisely express the evidences and caveats for this phenomenon?
As we mapped peptides onto the genome, we noticed that some loci contained multiple gene predictions with substantial overlap. 80% of the time, this was caused by the prediction of a dubious gene, which we have removed from the annotation. Unfortunately, these dubious genes polluted public repositories, acting as a source for future annotations. As we sorted through the conflicted loci, we discovered two additional problems which hamper exclusively computational genome annotations. In 24 loci where one gene was completely overlapped by another (Figure 5A), six times the smaller gene was the true gene. This presents a problem for some gene calling pipelines, where the longest ORF in a region is selected to the exclusion of other overlapping gene possibilities, which in these cases includes the true gene. Second, we observed seven loci where the overlapping genes are both hypothetical. In such situations, proteomics can provide experimental validation to unambiguously determine the correct gene model.
Aside from highlighting annotation errors, proteogenomics can supplement annotation by providing value-added information about the mature functional protein. In our effort to reliably observe proteolytic cleavage events in proteomic data, we have introduced the new requirements for validating signal peptide cleavage. By filtering out proteins lacking a hydrophobic patch and the signal peptidase cleavage motif, we remove many spurious assertions. Unlike previous reports, we do not find a great discordance in the proteins identified by proteomics and computational predictions. Studies in Shewanella found 28% of proteins observed by proteomics are missed by signalp, and 26% of computational predictions are refuted by proteomics . Our results are much more tempered, with 8% false-negatives and < 1% false-positives. There are two likely causes for the differences in our findings: the phylogenetic differences between Yersinia and Shewanella and new filters introduced in this work. The large excess of false-positives in Shewanella (i.e. proteins predicted by signalp but refuted by proteomics) may be attributable to the training set of proteins used for signalp, which is heavily overrepresented by enterobacteria (e.g. E. coli). Yersinia, another enterobacterium, is very closely related to E. coli and is more similar to data in the training set. The excess of false-negatives in Shewanella (i.e. proteins found by proteomics but not by signalp) may be a result of either distant phylogeny or the more liberal filters used to identify signal peptides.
Finally, we leverage comparative genomics and apply our data to 21 neighboring genomes, where we correct over 140 gene models. Instead of heuristics or a democratic voting scheme, we rely on experimentally validated protein sequences allowing us to confidently annotate orthologs across the genus. The heterogeneity of protein lengths in these orthologs highlights some of the difficulties surrounding comparative genomics and pangenomics. Although the sequences of Y. pestis are 90-95% identical, computational artifacts of the annotation process lead to several large differences between the protein calls. Furthermore, we saw diversity in the annotation of proteins split into multiple ORFs as mentioned above. All of these lead to inconsistency which can hamper a comparative genomics pipeline.
Mass Spectrometry Data
Spectra used in this report have been previously published. Briefly, the Pathogen Functional Genomics Resource Center (PFGRC) at JCVI generated ~15 million spectra from Y. pestis KIM 6+. Cell lysates were fractionated into cytoplasm, membrane and periplasmic components, and then run on 2 D gels, with spectrum acquisition on a Thermo LTQ [14, 22, 23]. All spectral data has been uploaded to NCBI's Peptidome http://www.ncbi.nlm.nih.gov/peptidome/.
Our pipeline can be divided into distinct steps, with the overall goal of mapping peptides onto the genome and then inferring genome annotation (Figure 1). The first step is to prepare protein sequence databases by translating the RefSeq DNA into all six frames: genome NC_004088, and plasmids NC_004836, NC_004837, NC_004838. We do not make any restriction of open reading frames (ORFs). For example, we do not require a minimum length or the presence of a start codon. In addition to these Yersinia sequences, we append common contaminants (trypsin and keratin) to the database. To finish database preparation, we create a decoy database by shuffling each sequence .
We use the Inspect software to match spectra to peptide sequences from the protein database . Search parameters: 25 tags/spectrum, cysteine + 57 fixed modification, and 3.0 Da parent mass tolerance. Following Inspect, we rescore peptide/spectrum matches (PSMs) with PepNovo . We assign a p-value to PSMs based on the score distribution of hits to the target and decoy databases and set a 5% local false-discovery rate cutoff . PSMs passing the cutoff are mapped back to their genomic ORF, and subjected to protein-level filters. To leverage the gel-based nature of our experimental set up, we extend the common 2 peptide heuristic to require 2 peptides per protein per gel spot. Manual inspection of the remaining PSMs (now grouped to their ORFs and reported as proteins) revealed a category of questionable results. We noticed PSMs mapping to low sequence-complexity regions, with significant overrepresentation of small amino acids (e.g. glycine or alanine). We believe that such sequences are likely to match noise spectra because the small mass allows for spurious peak matching. We also noticed that none of the PSMs in these objectionable clusters are tryptic. As trypsin was used to generate peptides for the experiment, we reject ORFs that lack any tryptic PSMs.
We use PSMs passing the above filters to improve the RefSeq annotation of NC_004088 in three ways: adding novel gene models, correcting the start site of current models, and deleting dubious protein predictions. The updated KIM genome can be accessed seamlessly through any NCBI tool (e.g. blast) or downloaded directly ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Yersinia_pestis_KIM_10_uid288/. We create a new gene model (novel gene) when at least 2 PSMs uniquely map to an ORF that lacks any protein annotation. We examine start sites when PSMs map upstream of and in frame with a current protein prediction. Through manual inspection, we adopted a simple heuristic, that the peptide supporting an extension must contain at least two amino acids upstream of the currently annotated start. With only a single amino acid upstream, it was possible for the amino acid to have the same mass as common modifications (e.g. glycine and carboxyamidomethyl (CAM) modification have a mass of 57).
We found many peptides which mapped to genomic loci containing multiple protein predictions. A closer inspection revealed that these loci often contained proteins with significant overlap. We marked genomic regions containing two or more genes with a 50 bp overlap as potentially in conflict. In our initial analysis, we divided conflicted loci into three categories defined by the relationship of the overlapping genes: convergent, divergent, unidirectional . This delineation is attractive, because divergent and unidirectional overlaps can potentially be resolved by shortening one or both genes to remove the overlap. However, the extreme length and multiplicity of overlaps (some proteins overlapped two or three others) rendered this classification system awkward. Through visual analysis, we categorized conflicting loci according to the extent of overlap: 100% overlap, > 80% overlap, > 50% overlap, and < 50% overlap. It then became clear that in most instances, one of the overlapping genes was dubious and should simply be deleted.
When both genes had proteomic evidence, we asserted that both genes are real. In these instances, one or both of the genes was always overpredicted, i.e. was longer than homologs from other genomes. After correcting for this, the genes no longer overlapped in their coding regions. When only one gene had proteomic evidence, we tested whether the unobserved protein was a legitimate gene. Equivocal genes were determined using the following criteria. First, their extended sequence overlap (to a gene with proteomic support) is atypical, especially considering that many of them are completely contained within another protein's coding region. Second, they were sporadically predicted in bacterial genomes with poor and uncharacteristic conservation, while the proteomically supported gene had more compelling sequence conservation, was more consistently annotated within the genus/family, and had a generally broader phylogenetic distribution. Third, most of the dubious genes were short (median 95, mean 108). We note that 100 amino acids is the minimum for many genome annotation projects. We finally note that, if present, the computational evidence for asserting the equivocal gene in the original genome submission was weak (e.g. low sequence identity over short regions to phylogenetically distant organisms). Therefore, we term the equivocal gene as dubious, and have deleted it from the annotation.
Mature Protein Annotation
To report a protein as containing a signal peptide, we started with proteins where the first observed peptide was not tryptic on its n-terminus, and was within 15-50 amino acids of the predicted start site . Trypsin cuts after Arg and Lys; if the residue before the peptide is not Arg or Lys, then the peptide has a non-tryptic n-terminus. The region between the initial methionine and the first observed peptide is thus the putative signal peptide. We filtered this set using previously recognized signal peptide characteristics . We required a hydrophobic patch of at least 8 amino acids, using the standard Kyte-Doolittle hydropathy index. We then examined the signal peptide terminus for the expected AxB cleavage motif (where A = [Ile, Val, Leu, Ala, Gly, Ser], B = [Ala, Gly, Ser]). Signal peptides matching these two criteria were considered validated by proteomics.
We ran signalp v3.0 on all proteins in the genome . Proteins listed as containing a computationally predicted signal peptide are those with a D-score above 0.43 as recommended. When comparing signalp against proteomic evidence for signal peptide presence, we simply look gene by gene, whether proteomics or signalp or both list the protein as containing a signal peptide (and at the site of cleavage). Proteins with proteomic validation which have a signalp score below 0.43 were viewed as computational false-negatives. Conversely where peptides localize 5' of the signalp predicted cleave site (at least 4 residues upstream), we viewed these as computational false-positives.
Proteogenomic Improvements in Other Genomes
We created orthology clusters of all genes in the 12 complete Yersinia genomes with the Sybil program . Peptides observed in the KIM dataset were mapped onto the appropriate columns of an aligned cluster, and genome annotation improvements were inferred from this. For example, if a gene model in genome A was shorter than the orthologous member of the KIM genome, and peptide evidence from KIM mapped to the uncalled region, we checked for the ability to move the start site upstream in genome A. When genome improvements were possible (e.g. not precluded by stop codons) they were made and submitted to RefSeq. The same process was performed for the 9 Genbank annotations owned by JCVI. A list of all altered genes is presented in Additional File 3, Table S2. We are actively pursuing the ability to edit other genome annotations in collaboration with the genome owners.
We wish to thank Jason Inman and Erin Beck for their assistance in the ortholog cluster creation. We wish to thank Bill Klimke at NCBI for assistance with RefSeq integration. SHP is funded by an NSF microbial genome sequencing grant (EF-0949047). Proteomic analysis of Yersinia pestis KIM6+ was supported by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, under contract No. N01-AI15447. Additional funding provided by National Institute of Allergy and Infectious Disease contract HHSN266200400038C.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.