Bolt: a New Age Peptide Search Engine for Comprehensive MS/MS Sequencing Through Vast Protein Databases in Minutes

Abstract

Recent increases in mass spectrometry speed, sensitivity, and resolution now permit comprehensive proteomics coverage. However, the results are often hindered by sub-optimal data processing pipelines. In almost all MS/MS peptide search engines, users must limit their search space to a canonical database due to time constraints and q value considerations, but this typically does not reflect the individual genetic variations of the organism being studied. In addition, engines will nearly always assume the presence of only fully tryptic peptides and limit PTMs to a handful. Even on high-performance servers, these search engines are computationally expensive, and most users decide to dial back their search parameters. We present Bolt, a new cloud-based search engine that can search more than 900,000 protein sequences (canonical, isoform, mutations, and contaminants) with 41 post-translational modifications and N-terminal and C-terminal partial tryptic search in minutes on a standard-configuration laptop. Along with increases in speed, Bolt provides the additional benefit of more high-confidence identifications. Sixty-one percent of peptides uniquely identified by Bolt may be validated by strong fragmentation patterns, compared with 13% of peptides uniquely identified by SEQUEST and 6% of peptides uniquely identified by Mascot. Furthermore, 30% of unique Bolt identifications were verified by all three software packages on the longer gradient analysis, compared with only 20% and 27% for SEQUEST and Mascot identifications respectively. Bolt represents, to the best of our knowledge, the first fully scalable, cloud-based quantitative proteomic solution that can be operated within a user-friendly graphical interface. Data are available via ProteomeXchange with identifier PXD012700.

Introduction

Mass spectrometry-based proteomics is one of the most popular and powerful techniques for complex sample analysis [1,2,3]. With constant improvements in mass spectrometer speed and mass accuracy [4] and the recently proposed comprehensive data acquisition strategies like DIA [5], pSMART [6], and BOXCAR [7], it is possible to achieve a greater depth of proteome coverage today than was achievable in the past. The commonly used workflow involves extracting all the proteins from a sample (e.g., human serum or cell lysate), reducing/alkylating the cysteines, cleaving the proteins with trypsin, and then loading the sample on a nanoflow HPLC coupled to a high-resolution tandem mass spectrometer (LC-MS) [3]. After acquisition, the data are processed through a database search engine such as SEQUEST [8], Mascot [9], Andromeda [10], or MS-Amanda [11] that is nearly always housed on a powerful desktop computer. Samples are often precious and limited but still require comprehensive proteome coverage. Advances in sample preparation, chromatography, and instrumentation have been developed primarily to meet these needs, at great expense of both time and money [12]. Furthermore, the overall workflow needs to be run by skilled professionals due to the complexity of the instrument operation and the lack of automated processes, presenting another costly aspect of the workflow [13]. This intensive process is then concluded in most facilities by analysis with SEQUEST or Mascot, two of the most commonly used search engines. If users want their search algorithm to complete the analysis in a reasonable time (less than 30 min), they have to select a canonical human database, select only a couple of the most common post-translational modifications (e.g., oxidation and N-term acetylation), and consider only fully tryptic peptides. 
The restrictive design of these search algorithms limits the protein search space and their ability to handle a large number of post-translational modifications, thus forcing users to make strategic trade-offs when searching their data. These trade-offs are not without biological consequence, as current generation instrumentation can easily identify post-translational modifications and non-tryptic cleavage events present in even the most complex human samples [14]. Even if users had access to larger, more expensive computational resources, they might still need to make limiting choices, as otherwise the search engines might still take hours to process one analysis file.

Several recently developed search engines have tried to resolve some of the issues that classical search engines possess. MSFragger [15] and MetaMorpheus [16, 17] can perform open searches, looking for an exceedingly large list of PTMs. While each of these newly built software suites provides an answer to one of the problems with classical search algorithms, they are still restricted by the overall size of the search space. They are unable to handle, in a timely fashion, the multi-dimensional increase in search space when important biological events such as protein splicing, isoforms, and mutations are added to the analysis.

In this paper, we introduce a modern search engine called Bolt that is capable of processing high-resolution proteomics data against databases containing over nine hundred thousand protein entries in a matter of minutes. The reasons we call Bolt a modern search engine are twofold. Firstly, it has been developed with cloud infrastructure in mind, which has become prevalent in our daily lives: from financial transactions to email and mobile apps, many of the technologies that we interact with today utilize cloud infrastructure. The search engines that are used in a lab today are not scalable to the power of cloud servers (as they were developed primarily for desktop computers), and they also require large data transfers (e.g., gigabytes) between the user computer and cloud server if the search engine is deployed on the cloud. Secondly, the large databases that Bolt searches have become very relevant in the last few years due to the emergence of proteogenomics and personalized medicine. The current search engines are also unable to handle these in an efficient manner. In Bolt’s infrastructure, canonical sequences, reviewed and non-reviewed isoforms, hundreds of thousands of known mutation variants, and a large contaminant database are all collected and searched using a semi-enzymatic digestion with a great number of PTMs, utilizing no greater computational resources for the end user than a standard laptop computer. Bolt consists of two main processing components: a client-side algorithm that processes the analysis file and a high-performance cloud server preconfigured to optimize the proteome search. This architecture, and an algorithm that can take full advantage of it, are the primary reasons for Bolt’s speed advantages. The high-memory cloud server makes possible the creation of a massive persistent in-memory indexed database that speeds up the search tremendously, and it provides a large number of compute cores for the massively parallel algorithm. 
Bolt’s architecture and the new algorithm make these high-performance servers accessible to all users without the need to purchase and maintain them. These are explained in detail in the “Material and Methods” section. We compare Bolt’s results with two of the most commonly used search engines: SEQUEST and Mascot. Along with increased processing speed, Bolt also provides the additional benefit of high-quality results: 61% of the peptides uniquely identified by Bolt had strong fragmentation coverage, compared with 13% of those uniquely identified by SEQUEST and 6% of those uniquely identified by Mascot.

Material and Methods

Sample

All experiments were carried out with the HeLa digest standard and the Pierce Retention Time Calibration Standard (PRTC), both from Thermo Fisher. The PRTC was diluted with 0.1% formic acid to a total concentration of 50 fmol/μL. This solution of PRTC was used to reconstitute and dilute the HeLa digest standard. The HeLa standard was diluted to allow 1 μg of peptide digest on column.

LC/MS Platform

An Orbitrap Fusion 1 system equipped with Tune version 3.0 (installed in January 2018) was used with an EasyNLC 1200 system using 0.1% formic acid as buffer A and 80% acetonitrile with 0.1% formic acid as buffer B. EasySpray 25-cm columns with PepMap C-18 (2-μm particle size) and an equivalent 2-cm precolumn trap were used for all experiments described. Files acquired with two distinct LC gradients were evaluated, with total run times of 60 and 130 min, respectively. The gradients began at 5% B, ramping to 28% B by 75% of their total acquisition time, then to 50% B over the next 10%, followed by a rapid ramp to 98% B. Column equilibration to baseline conditions was performed automatically at the beginning of each run. MS/MS was acquired in the C-trap after HCD, providing high-resolution accurate mass product ions. The MS/MS isolation window was 1.6 Da. All components described are products of Thermo Fisher Scientific. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE [18] partner repository with the dataset identifier PXD012700.
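
Because the gradient is described as fractions of the total run time, the breakpoints for the two runs follow by simple arithmetic. The short sketch below works them out; the function name `gradient_breakpoints` is ours, and the duration of the final ramp to 98% B is not stated in the text, so it is left out.

```python
def gradient_breakpoints(total_min):
    """Return (time in min, % buffer B) breakpoints for the described gradient shape."""
    t28 = 0.75 * total_min        # 28% B is reached at 75% of the run
    t50 = t28 + 0.10 * total_min  # 50% B is reached after a further 10% of the run
    return [(0.0, 5.0), (t28, 28.0), (t50, 50.0)]


for total in (60, 130):
    print(total, gradient_breakpoints(total))
```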

SEQUEST/Proteome Discoverer

Proteome Discoverer 2.3 was used for the comparative searches, using the vendor-provided default workflow for Q Exactive Basic Peptide ID on a quad-core Windows 10 server. This consists of the SequestHT search engine with 10-ppm MS1 tolerance and 0.02-Da MS/MS tolerance. A static modification was set for iodoacetamide alkylation of cysteines, and dynamic modifications were allowed for acetylation on the protein N terminus as well as for oxidation of any methionine. Percolator [19] was used for FDR according to manufacturer default settings. Up to 2 missed cleavages by trypsin were allowed. The files were searched against a UniProt/SwissProt FASTA downloaded in February 2018 that was parsed within the vendor software on the term “sapiens.” An in-house-generated contaminants database consisting of a combination of the MaxQuant common contaminants and the cRAP database (https://www.thegpm.org/crap/) was used to flag contaminants in both the processing and consensus workflows.

MASCOT/Proteome Discoverer

Mascot node for Proteome Discoverer 2.3 was run using the same parameters and database as SEQUEST (mentioned above) followed by Percolator. The server was Windows 10 Xeon CPU E5-2630 v3 @ 2.4 GHz, 32 GB RAM. This server can run 16 threads in parallel.

Pinnacle

In the current study, Bolt was run inside the Pinnacle software v 93 on an i5-3427U @ 1.8 GHz laptop to help visualize and export results. The search tolerance was set to 10 ppm for MS1 and 20 ppm for MS/MS (these are user-defined).
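
As a reminder of what a relative (ppm) tolerance means in absolute terms, the sketch below converts a ppm tolerance into an m/z window and tests whether an observed peak falls inside it. The function names are illustrative and are not part of Bolt or Pinnacle. At m/z 1000, for instance, 20 ppm corresponds to 0.02 Da, the fixed MS/MS tolerance used for the SEQUEST search, while at lower m/z the 20-ppm window is narrower.

```python
def ppm_window(mz, tol_ppm):
    """Return the (low, high) m/z bounds of a +/- tol_ppm window around mz."""
    delta = mz * tol_ppm / 1e6
    return mz - delta, mz + delta


def within_ppm(observed_mz, theoretical_mz, tol_ppm):
    """True if the observed m/z lies within tol_ppm of the theoretical m/z."""
    return abs(observed_mz - theoretical_mz) <= theoretical_mz * tol_ppm / 1e6


low, high = ppm_window(1000.0, 20)   # 20 ppm at m/z 1000 is +/- 0.02
print(low, high)
print(within_ppm(500.004, 500.0, 10))  # 8 ppm away
print(within_ppm(500.006, 500.0, 10))  # 12 ppm away
```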

Bolt

The Bolt engine works in the following way. It has two components: the Bolt server (on a high-performance cloud server) and Bolt client (on the user computer/laptop). The steps done on the server are marked as [Server] and the steps done on the client are marked as [Client].

  1.

    [Server] During initialization, the server reads a protein FASTA file along with the reverse database. It currently holds a combination of human protein sequences, bovine protein sequences, and contaminant sequences. As shown in Table 1, we used a combined database that contains 909,583 protein sequences, the majority of which come from the protein mutation database XMAn [20]. Approximately 8% of the protein sequences from the XMAn database are not considered in the current version of Bolt as those could not be mapped to the appropriate non-mutated protein sequence. We decided to use the entire Bovin database for this study of the cultured human HeLa cell line as we found more than 7800 peptide sequences present in the Bovin database that are also present in the Trembl and SwissProt-Isoform database but not present in the SwissProt or Contaminant database. If we do not include the Bovin database, there is the risk of misidentifying a potential Bovin peptide as an isoform or a non-reviewed Trembl human peptide.

  2.

    [Server] The second step of initialization is to create a large in-memory indexed database of all possible peptides and their product ions. As most current software is designed for desktop processing, its indexed databases are stored on hard disk. Cloud technology has made it possible to utilize high-memory servers (e.g., 250 GB RAM), thereby allowing the creation of a persistent in-memory indexed database. This provides huge speed advantages, as reading from memory is orders of magnitude faster than reading from hard disk. We utilize servers from Microsoft’s Azure Virtual Machine for this task.

  3.

    [Server] Using commonly used proteomics search parameters for human samples, an in-memory indexed database is created which can work for most users (this is a one-time step and the indexed database does not need to be recreated as long as the server is running). The standard digestion configuration is a tryptic digestion with three missed cleavage sites, partial tryptic search on both the N and C terminus, and carboxymethyl or carbamidomethyl as the alkylation agent, along with the following 41 post-translational modifications (chosen based on the most commonly reported PTMs from Uniprot): oxidation (M), oxidation (K), oxidation (P), oxidation (W), oxidation (C), oxidation (H), phosphorylation (S), phosphorylation (T), phosphorylation (Y), methylation (K), methylation (R), –H2O (E), pyroglutamate (N), pyroglutamate (Q), pyroglutamate (C), deamidation (N), deamidation (Q), formylation (S), formylation (T), formylation (K), formylation (N-term), dimethylation (K), dimethylation (R), dimethylation (N-term), trimethylation (K), trimethylation (R), double oxidation (M), double oxidation (W), acetylation (N-term), acetylation (K), carbamylation (K), carbamylation (C), carbamylation (R), carbamylation (N-term), propionylation (K), propionylation (N-term), Sulf (Y), GlyGly (K), lipoylation (K), Hexose (K), Hexose (S).

  4.

    [Server] Steps 2 and 3 are repeated for the reverse database.

  5.

    [Client] The user provides one or more vendor RAW files to the Pinnacle GUI. This can be on any user computer (e.g., user laptop or acquisition computer). Bolt’s client-side processing (inside Pinnacle) reads the vendor RAW file in the native format (for Thermo Scientific files, MSFilereader libraries are used) and extracts all MS/MS spectra along with the corresponding precursor monoisotopic m/z values and precursor charge states. It then performs signal/noise reduction on the MS/MS spectra [21, 22], corrects the precursor m/z value if needed, and creates a local file that stores all these PSMs. This file is compressed to ensure a small file size for upload to the cloud server.

  6.

    [Client] The client-side utility then starts a unique session with the Bolt server and automatically uploads this local file to the Bolt server. As this file is small, this upload completes in 10–20 s on a regular internet connection. The client also uploads the user preferences (e.g., mass tolerance, alkylation). The client then waits for the response from the server.

  7.

    [Server] The Bolt server receives this file and user parameters, and then invokes a massively parallel algorithm that takes the PSMs and searches them against the in-memory indexed database (using the user-provided mass tolerance). This is again where the power of cloud servers strongly outweighs the local desktops. The Azure VM we are using has 32 vCPUs to search in parallel.

  8.

    [Server] For each potential match, the server calculates a vector of scores as follows: (1) peptide sequence length, (2) total number of theoretical ions matched, (3) maximum number of continuous theoretical ions matched, (4) total intensity of ions matched, (5) number of missed cleavage sites, (6) cross-correlation score between matched ion intensities and theoretical spectrum with all intensities = 1, (7) uncommon peptide sequence. The last score is a Boolean variable with value 1 for semi-tryptic matches, non-canonical database peptides, and peptides with less common PTMs; otherwise, its value is 0. In the current version of Bolt, oxidation, N-term acetylation, and phosphorylation are set as common PTMs and the remaining PTMs are set as uncommon; this can be user-defined. This score in particular helps control FDR in this vast search space. All the scores are calculated for both forward and reverse databases, as the in-memory database contains both forward and reverse protein sequences.

  9.

    [Server] The Bolt server then runs Percolator (v2.08.01, November 2016) on these score vectors along with the knowledge of whether each vector came from the forward database or the reverse database. Percolator trains its model and reports a q value for all the forward database matches. Bolt then filters these results to identify the peptide with the best q value for each PSM. While Percolator also reports protein FDR, it is not used in the current implementation. If the q value and matched ion evidence are exactly the same, some simple rules are followed to further aid peptide ranking: a single PTM is preferred over two PTMs, an unmodified peptide is preferred over a peptide with a PTM, and a peptide from the canonical database is preferred over a peptide from non-canonical sequences (e.g., isoforms, mutations). Our rationale for these decisions was to pick the form that conventional wisdom would define as more probable. In future improvements of Bolt, we can report all forms and build a protein inference-based model to select the peptide form.

  10.

    [Server] Once the high confidence target peptides are identified (user-defined q value ≤ 0.01), they are transferred back to the client for further processing. The amino acid sequence, PTMs, assigned protein sequence, and the Percolator q value for each peptide are sent back to the client.

  11.

    [Client] On receiving a file containing these high confidence matches, the Bolt client stores them in a Pinnacle database file. Pinnacle then extracts a monoisotopic chromatogram for each peptide around the identified retention time (with a user-defined m/z tolerance, e.g., ± 5 ppm), defines a peak using this chromatogram, and calculates the area under the curve. Based on user choice, the process can be repeated for more isotopes and/or charge states. This allows Pinnacle to perform quantitation for each peptide and then display the list to the user along with the MS/MS match and the q value.

  12.

    [Client] The client also runs a protein grouping algorithm to match the identified peptides to the appropriate protein groups. This algorithm starts by assigning unique peptides to their appropriate proteins. Then the protein with the maximum number of unassigned peptide identifications is assigned all of those peptides. If two proteins have the exact same set of unassigned peptides, then the canonical database protein is preferred over the non-canonical protein (e.g., Trembl/mutations). This process is repeated until all peptides are assigned to an appropriate protein.

  13.

    For any new RAW file from any client, only steps 5 through 12 are to be performed (as the in-memory database is already present).
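
The protein grouping described in step 12 is a greedy assignment, and a minimal sketch of it is given below. This is our reading of the description, not Bolt's actual implementation: the function name, the input layout (a peptide-to-proteins mapping plus a set of canonical protein identifiers), and the handling of ties are all assumptions. In particular, the sketch prefers a canonical protein whenever two proteins claim the same number of unassigned peptides, which is slightly broader than the stated rule about identical peptide sets.

```python
def group_proteins(peptide_to_proteins, canonical):
    """Greedily assign each peptide to one protein.

    peptide_to_proteins: dict mapping peptide -> set of candidate protein ids
    canonical: set of protein ids that come from the canonical database
    Returns a dict mapping peptide -> assigned protein id.
    """
    assignment = {}

    # Pass 1: a peptide that maps to exactly one protein is unique to it.
    for peptide, proteins in peptide_to_proteins.items():
        if len(proteins) == 1:
            assignment[peptide] = next(iter(proteins))

    # Peptides with no candidate protein are skipped so the loop terminates.
    unassigned = {p for p, prots in peptide_to_proteins.items()
                  if p not in assignment and prots}

    # Pass 2: repeatedly pick the protein claiming the most unassigned peptides.
    while unassigned:
        claims = {}
        for peptide in unassigned:
            for protein in peptide_to_proteins[peptide]:
                claims.setdefault(protein, set()).add(peptide)
        # Most peptides first; canonical wins ties; protein id keeps it deterministic.
        best = min(claims,
                   key=lambda pr: (-len(claims[pr]), pr not in canonical, pr))
        for peptide in claims[best]:
            assignment[peptide] = best
        unassigned -= claims[best]

    return assignment


# Hypothetical identifiers: P1 and P2 are canonical, T2 is a non-canonical entry.
example = {
    "PEPTIDEA": {"P1"},
    "PEPTIDEB": {"P2", "T2"},
    "PEPTIDEC": {"P2", "T2"},
}
grouping = group_proteins(example, canonical={"P1", "P2"})
print(grouping)
```

In the example, P2 and T2 share the same two peptides, so both peptides go to the canonical entry P2.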

Table 1 Various Databases Considered by Bolt to Combine into a Single Sequence Database

Some rules are followed during initialization of the Bolt server to reduce the search time. All of these rules are unique (compared with other search engines) and allow us to maintain a balance between a large set of PTMs and a combinatorial explosion of the search space. Even with these rules, the search space for just canonical peptides is two orders of magnitude larger than what is considered by SEQUEST or Mascot in our analysis. The same rules are followed for the forward and reverse databases to ensure that the search spaces are of equal size.

  1.

    The number of PTMs was limited to 1 per peptide, but oxidation (M) and N-terminal acetylation can still be considered as a second PTM. These two represent the most common PTMs found in a non-enriched sample.

  2.

    PTMs and semi-tryptic cleavage peptides are allowed only for canonical database peptides. The probability of a peptide having simultaneous mutations and PTMs is low, and considering both of these on a single peptide increases the search space significantly.

  3.

    PTMs are not allowed on a semi-tryptic peptide.

  4.

    Peptides that have the same sequence but are different by only Leucine vs. Isoleucine are combined into a single search.

  5.

    Aspartic acid is chosen over deamidation of asparagine if the rest of the sequence is the same.

We believe the above rules form reasonable assumptions for most non-enriched human samples being studied in mass spectrometry labs around the world. We certainly acknowledge that Bolt will miss some true positives that do not follow these rules, but for such peptides, the large combinatorial search space and the corresponding high score thresholds needed to ensure a good FDR will lead to only a handful of matches on a complex sample. We also acknowledge that the same rules may not apply to other sample types (e.g., samples following some enrichment protocol), and future improvements in Bolt will allow us to work on those. Furthermore, using advanced mass spectrometry and chromatography techniques, it may be possible to distinguish between Leucine and Isoleucine, and between Aspartic acid and deamidated asparagine, and future improvements in Bolt will consider that.
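
To make the effect of these rules on the search space concrete, the sketch below enumerates candidate peptides for a single protein sequence. It is an illustration under stated assumptions rather than Bolt's indexing code: cleavage is taken to occur after every K or R (no proline exception), the minimum peptide length of 7 is borrowed from the Results and may differ from the engine's own setting, and all function names are ours. It covers missed cleavages, semi-tryptic peptides for canonical proteins only (rule 2), the exclusion of PTMs on semi-tryptic and non-canonical peptides (rules 2 and 3), and the Leucine/Isoleucine merge (rule 4). The PTM combinations of rule 1 and the aspartic acid preference of rule 5 are not modelled.

```python
def fully_tryptic(seq, max_missed=3, min_len=7):
    """Peptides with both ends at tryptic sites and up to max_missed missed cleavages."""
    cuts = [0] + [i + 1 for i, aa in enumerate(seq[:-1]) if aa in "KR"] + [len(seq)]
    peptides = set()
    for i in range(len(cuts) - 1):
        last = min(i + 1 + max_missed, len(cuts) - 1)
        for j in range(i + 1, last + 1):
            pep = seq[cuts[i]:cuts[j]]
            if len(pep) >= min_len:
                peptides.add(pep)
    return peptides


def semi_tryptic(peptide, min_len=7):
    """Truncations of a tryptic peptide that keep exactly one of its two ends."""
    out = set()
    for k in range(1, len(peptide) - min_len + 1):
        out.add(peptide[k:])   # ragged N terminus, tryptic C terminus kept
        out.add(peptide[:-k])  # tryptic N terminus kept, ragged C terminus
    return out


def collapse_il(peptide):
    """Rule 4: Leucine and Isoleucine are treated as the same residue."""
    return peptide.replace("I", "L")


def ptm_allowed(kind, is_canonical):
    """Rules 2 and 3: PTMs only on fully tryptic peptides from canonical proteins."""
    return is_canonical and kind == "tryptic"


def candidates(seq, is_canonical, max_missed=3, min_len=7):
    """Return a set of (peptide, kind) pairs, kind being 'tryptic' or 'semi'."""
    full = fully_tryptic(seq, max_missed, min_len)
    result = {(collapse_il(p), "tryptic") for p in full}
    if is_canonical:  # Rule 2: semi-tryptic peptides only for canonical proteins
        for p in full:
            for s in semi_tryptic(p, min_len) - full:
                result.add((collapse_il(s), "semi"))
    return result


toy = "GGGGGGKSSSSSSRAAIAAAAK"  # toy sequence, not a real protein
print(len(candidates(toy, is_canonical=True)),
      len(candidates(toy, is_canonical=False)))
```

Even on a toy sequence, allowing semi-tryptic peptides multiplies the number of candidates, which is why the rules restrict them to the canonical database.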

Results

The HeLa standard analyzed with a 130-min gradient and a 60-min gradient resulted in 53,921 and 41,384 MS/MS spectra respectively. Bolt was able to complete the reading of spectra, the upload to the server, the search process, and the transfer of results back to the local Pinnacle file system in 10.5 min and 8 min, respectively (Pinnacle run on an i5 laptop). Approximately 80% of this time was spent on the searching step by the cloud server, and the remaining steps took about 20% of the time. The corresponding processing workflow run times for SEQUEST in Proteome Discoverer 2.3 were 48 min and 37 min, and for Mascot they were 11 min and 8 min (Mascot’s server had 4 times more threads). With the addition of the Proteome Discoverer consensus step, the total processing time exceeded an hour for SEQUEST. These run times do not include peptide quantitation times from any of the software as those are not relevant to this analysis.

Bolt reported 16,562 peptides (with q value ≤ 0.01), compared with the 13,456 peptides identified by SEQUEST and 13,954 peptides identified by Mascot with q values ≤ 0.01 for the 130-min standard gradient. Figure 1 shows the Venn diagram comparing these three results. Among these, 11,889 peptides were identified by all three software, and 3733, 573, and 375 peptides were uniquely identified by Bolt, Mascot, and SEQUEST respectively. These unique peptide identifications are later referred to as Uniq-B, Uniq-M, and Uniq-S. Considering only those Uniq-B identifications that are present in the Swissprot database and carry no PTMs outside those considered by SEQUEST or Mascot, we built a subset of identifications called Uniq-B-Swissprot. These are peptide sequences that are available in the Mascot and SEQUEST search space. For hela-130, Uniq-B-Swissprot had 1620 identifications.
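
The Common, Uniq-B, Uniq-S, and Uniq-M groups are plain set operations on the three peptide lists. The sketch below shows how such groups could be derived; the function name and the toy peptide labels are hypothetical, and how modified forms of the same sequence are keyed is an assumption left to the caller.

```python
def venn3(bolt, sequest, mascot):
    """Split three sets of peptide identifications into common and unique groups."""
    return {
        "common": bolt & sequest & mascot,
        "uniq_b": bolt - sequest - mascot,
        "uniq_s": sequest - bolt - mascot,
        "uniq_m": mascot - bolt - sequest,
    }


# Toy peptide labels standing in for real identifications.
groups = venn3(
    bolt={"A", "B", "C", "D"},
    sequest={"A", "B", "E"},
    mascot={"A", "C", "F"},
)
print({name: sorted(members) for name, members in groups.items()})
```

Peptides found by exactly two engines (here "B" and "C") fall into none of these four groups, which is why the group sizes do not add up to any engine's total.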

Figure 1

Venn diagram showing the overlap of the peptide tables from Bolt, Mascot, and SEQUEST. The left panel shows hela-130 and the right panel shows hela-60

While the above clearly shows the speed advantage of Bolt even on a 100× larger search space, we next wanted to assess whether the peptides identified uniquely by each software were correct. All the identified peptides were grouped into 4 classes by length: greater than 12, 11 to 12, 9 to 10, and 7 to 8. Peptides with length 6 and smaller are not considered as those may yield even fewer fragments. Figure 2 plots the distribution of the number of matched fragment ions for each of these peptide length classes. Larger purple bars denote identifications with more fragment ions matched. For peptides longer than 12 residues (top left panel), 83% of the peptides commonly identified by all three search engines contained 8 or more fragment ions annotated in the MS/MS spectrum. Uniq-B has approximately 61% peptides with 8 or more fragment peak annotations. Comparatively, the Uniq-M and the Uniq-S groups contained only 6% and 13% peptides with 8 or more fragment peaks annotated. Even Uniq-B-Swissprot contained 58% peptides with 8 or more fragment peaks annotated. Common, Uniq-B, Uniq-M, and Uniq-S peptides contained 6 or more annotated fragment peaks in the matched spectrum in 95%, 84%, 11%, and 28% of cases respectively. In addition, 36% of the peptides in the Uniq-S group and 62% of the peptides in the Uniq-M group had only 3 or fewer annotated peaks in the MS/MS spectrum. The same trend was observed for peptide lengths 11 to 12 (top right panel) and peptide lengths 9 to 10 (bottom left panel), where Common and Uniq-B had a higher proportion of peptides with more ions annotated in the MS/MS spectrum compared with Uniq-S and Uniq-M. For the shortest peptide group (length 7 to 8, bottom right panel), most peptides from all search engines had fewer ions annotated in the MS/MS spectrum. Even here, Uniq-S and Uniq-M had a lower percentage (13% for each) of peptides with ≥ 6 annotated ions when compared with the Common (36%) and Uniq-B (35%) groups. The number of peptides that belong in each group is provided in Supplementary Table 1.
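
The percentages quoted above are, for each peptide length class, the fraction of identifications with at least a given number of annotated fragment ions. A minimal sketch of that tabulation follows; the input layout (pairs of plain peptide sequence and matched ion count) and the function names are assumptions, and the toy data are not taken from the study.

```python
from collections import defaultdict


def length_class(peptide):
    """Map a plain peptide sequence to one of the four length classes, or None."""
    n = len(peptide)
    if n > 12:
        return ">12"
    if n >= 11:
        return "11-12"
    if n >= 9:
        return "9-10"
    if n >= 7:
        return "7-8"
    return None  # peptides of length 6 or less are not considered


def fraction_with_min_ions(matches, min_ions=8):
    """matches: iterable of (peptide, n_matched_ions); returns {class: fraction}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for peptide, n_ions in matches:
        cls = length_class(peptide)
        if cls is None:
            continue
        totals[cls] += 1
        if n_ions >= min_ions:
            hits[cls] += 1
    return {cls: hits[cls] / totals[cls] for cls in totals}


toy_matches = [("A" * 13, 9), ("A" * 14, 5), ("A" * 8, 6), ("A" * 6, 2)]
print(fraction_with_min_ions(toy_matches, min_ions=8))
```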

Figure 2

Distribution of number of assigned fragment ions in the spectrum matched to a peptide with q value < 0.01 for hela-130 raw file. Larger purple bars denote identifications with more fragment ions matched. Top left panel shows the distribution for assigned ions for peptides that are longer than 12 amino acids. Distribution is shown for peptides that are identified by Bolt, SEQUEST, and Mascot (Common); identified uniquely by Bolt (Uniq-B); identified uniquely by Bolt from the Swissprot database (Uniq-B-Swissprot); identified uniquely by Mascot (Uniq-M); and identified uniquely by SEQUEST (Uniq-S). Unique matches by SEQUEST and Mascot have significantly fewer ions explained compared with the unique matches by Bolt. Top right panel, bottom left, and bottom right show similar distributions for peptide lengths 11 to 12, peptide lengths 9 to 10, and peptide lengths 7 to 8 respectively

On the HeLa 60-min gradient sample, Bolt reported 12,648 peptides, SEQUEST reported 10,836 peptides, and Mascot reported 11,385 peptides with q values ≤ 0.01. The right panel of Figure 1 shows the Venn diagram for these results. A total of 9514 peptides were identified by all three software, and Uniq-B, Uniq-M, and Uniq-S contained 2333, 492, and 258 peptides respectively. Considering only those Uniq-B peptides that are present in Swissprot, Uniq-B-Swissprot contained 1407 peptides. Figure 3 plots the distribution of the number of matched fragment ions for each of these peptide length groups. For peptides longer than 12 residues, the Common set contained approximately 71% peptides with 8 or more fragment ions annotated in the MS/MS spectrum, Uniq-B had 56%, Uniq-B-Swissprot had 53%, Uniq-M had 6%, and Uniq-S had 19%. For Uniq-S, 43% of peptides longer than 12 residues had three or fewer ions annotated, and for Uniq-M this was 75%, whereas this was observed for only 1% of the Common peptides and 2% of the Uniq-B peptides. Similar trends were observed for peptides of other lengths as well, where Uniq-S and Uniq-M had a significantly lower number of annotated ions compared with Common, Uniq-B, and even Uniq-B-Swissprot. The number of peptides that belong in each group is provided in Supplementary Table 1.

Figure 3

Distribution of number of assigned fragment ions in the spectrum matched to a peptide with q value < 0.01 for hela-60 raw file. Larger purple bars denote identifications with more fragment ions matched. Top left panel shows the distribution for assigned ions for peptides that are longer than 12 amino acids. Distribution is shown for peptides that are identified by Bolt, SEQUEST, and Mascot (Common); identified uniquely by Bolt (Uniq-B); identified uniquely by Bolt from the Swissprot database (Uniq-B-Swissprot); identified uniquely by Mascot (Uniq-M); and identified uniquely by SEQUEST (Uniq-S). Unique matches by SEQUEST and Mascot have significantly fewer ions explained compared with the unique matches by Bolt. Top right panel, bottom left, and bottom right show similar distributions for peptide lengths 11 to 12, peptide lengths 9 to 10, and peptide lengths 7 to 8 respectively

One critical reason for running the different gradients was to observe how many of the peptide IDs from the shorter gradient can be observed in the longer gradient (thereby adding confidence in their assignment). We obviously do not expect to see 100% coverage of peptides from shorter gradient to longer gradient as even in otherwise identical experiment configurations, this is dependent on stochastic sampling of ions for MS/MS trigger. Thus, we expect about 70 to 80% reproducibility in the resulting peptide list. In the hela-60 raw file, Common group had 9514 peptide IDs. From these, 7515 (79%) were identified in the hela-130 raw file by all three software. In the hela-60 raw file, Uniq-B-Swissprot had 1407 peptides, out of which 429 (30%) were identified by all three software in hela-130. Thus, many of the identifications that were unique to Bolt in the hela-60 raw file were identified by all three software in hela-130, adding even more confidence to the unique results from Bolt. In hela-60, Uniq-S had 258 peptides, out of which 53 (20%) were identified by all three software in hela-130 and Uniq-M had 492 peptides out of which 132 (27%) were identified by all three software in hela-130.
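
The reproducibility percentages follow directly from the counts quoted in this paragraph. The short calculation below recomputes them to two decimals; the results agree with the reported whole-number values to within about one percentage point.

```python
# (found by all three software in hela-130, group size in hela-60), from the text above
counts = {
    "Common": (7515, 9514),
    "Uniq-B-Swissprot": (429, 1407),
    "Uniq-S": (53, 258),
    "Uniq-M": (132, 492),
}


def percent(found, total):
    """Percentage of a hela-60 group that was recovered in hela-130."""
    return 100.0 * found / total


for group, (found, total) in counts.items():
    print(f"{group}: {found}/{total} = {percent(found, total):.2f}%")
```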

One interesting aspect of the search engine is the capability to go through hundreds of thousands of published mutation sites in a matter of minutes. The hela-60 and hela-130 analyses reported 23 and 35 such mutated peptides, respectively, and six of these peptides were common to both analyses. For the protein GLRX3_HUMAN (O76003, Glutaredoxin-3), it is reported that residue 123 Proline is found in the mutated form as Serine. Thus, the resulting tryptic peptide is HASSGSFLSSANEHLK (bold site shows site of mutation). Bolt found this peptide in both hela-60 and hela-130 raw files, whereas neither SEQUEST nor Mascot assigned this MS/MS spectrum a confident peptide ID. Figure 4 shows the rich fragmentation pattern in the hela-130 raw file, with most major ions assigned. The retention time observed in hela-60 is 11.7 min and in hela-130 is 12.1 min.
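
A single amino acid substitution moves the peptide mass far outside any ppm-level precursor tolerance, which is why a search restricted to canonical sequences cannot match such a spectrum to the variant peptide. The sketch below computes the monoisotopic mass and m/z of the variant peptide and the mass shift of a Proline to Serine substitution from standard monoisotopic residue masses. The function names are ours, and the charge states shown are illustrative, since the observed charge state is not given in the text.

```python
# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def mz(peptide, charge):
    """m/z of the protonated peptide at the given charge state."""
    return (monoisotopic_mass(peptide) + charge * PROTON) / charge


def substitution_shift(original, replacement):
    """Mass change (Da) when one residue is replaced by another."""
    return RESIDUE_MASS[replacement] - RESIDUE_MASS[original]


variant = "HASSGSFLSSANEHLK"
print(round(monoisotopic_mass(variant), 3))
print(round(mz(variant, 2), 3), round(mz(variant, 3), 3))
print(round(substitution_shift("P", "S"), 4))  # Proline -> Serine
```

The substitution makes the peptide roughly 10 Da lighter than its canonical counterpart, several orders of magnitude more than a 10-ppm precursor window at this mass.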

Figure 4

Fragmentation for the mutated peptide HASSGSFLSSANEHLK observed in hela-130 raw file as reported by Bolt

Figure 5 shows the distribution of types of the identifications from Uniq-B vs. the reported q value. The top three panels show the different q value bins for the different classes of peptides from hela-130 that are uniquely identified by Bolt and would not be present in SEQUEST’s and Mascot’s search space. The bottom three panels show the same information for the hela-60 raw file. Bovin, Mutation, and Isoforms refer to peptides identified from the other sequence databases that were not provided to SEQUEST or Mascot (Bovin database, XMAn database, and SwissProt-Isoform + Trembl database respectively). N-term refers to SwissProt peptides having a tryptic cleavage site only on the C terminus, and C-term refers to SwissProt peptides having a tryptic cleavage site only on the N terminus. Non-M oxidation refers to peptides having oxidation on residues besides methionine (W, C, P, H, and K). UncommonPTM refers to peptides having the uncommon PTMs that are considered by Bolt search (Lipoyl/triMethyl/Sulf/Hexose/GlyGly). There is a significant number of peptides even in the high confidence q value region (top left panel) for almost all classes of peptides, which suggests the need for all search engines to consider these classes of peptides. Almost 3% of the identified peptides are partially tryptic, which emphasizes the amount of information lost by ignoring partial tryptic search. Among the Bovin group peptides, almost 85% are present in the Contaminant database. While this may suggest that we do not need the entire Bovin database, as only 20–40 additional peptides from Bovin are found that do not exist in the common contaminant database, we have chosen here to err on the side of caution for a cell line grown in culture, rather than to report a Bovin peptide as a human peptide from the Trembl or XMAn mutation database. Alternatively, this also shows that a more complete contaminant database should be applied to current searches. 
The number of peptides for each type is provided in Supplementary Table 2.
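
The N-term and C-term classes described above can be made concrete with a short sketch. The Python snippet below is illustrative only and is not Bolt's implementation: the function names, the plain K/R cleavage rule (without the proline exception), and the example protein sequence are all assumptions introduced here.

```python
def is_tryptic_site(protein: str, index: int) -> bool:
    """Return True if the boundary just before protein[index] is tryptic.

    Protein termini (index 0 or len(protein)) are accepted as boundaries.
    Assumption: trypsin cleaves after K or R, with no proline exception.
    """
    if index == 0 or index == len(protein):
        return True
    return protein[index - 1] in "KR"


def cleavage_class(protein: str, peptide: str) -> str:
    """Label a peptide using the class names from the text.

    "N-term": tryptic site only on the C terminus.
    "C-term": tryptic site only on the N terminus.
    Only the first occurrence of the peptide in the protein is considered.
    """
    start = protein.find(peptide)
    if start < 0:
        raise ValueError("peptide not found in protein")
    end = start + len(peptide)
    n_ok = is_tryptic_site(protein, start)  # boundary before the peptide
    c_ok = is_tryptic_site(protein, end)    # boundary after the peptide
    if n_ok and c_ok:
        return "fully tryptic"
    if c_ok:
        return "N-term"
    if n_ok:
        return "C-term"
    return "non-tryptic"


if __name__ == "__main__":
    protein = "AAKDEFGHRSTVWYKLL"  # hypothetical sequence, for illustration
    for pep in ("DEFGHR", "EFGHR", "DEFG", "EFG"):
        print(pep, cleavage_class(protein, pep))
```

A search that allows only fully tryptic peptides would never generate the second and third candidates, which is the information loss referred to above.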

Figure 5

Distribution of types of peptides identified uniquely by Bolt (Uniq-B group) vs. the reported q value. The top three panels show the q value bins for the hela-130 raw file, and the bottom three panels show the q value bins for the hela-60 raw file. Bovin, Mutation, and Isoforms refer to peptides identified from the other sequence databases that were not provided to SEQUEST or Mascot. N-term/C-term refers to peptides having a tryptic cleavage site on only one terminus. Non-M oxidation refers to peptides having oxidation on residues other than methionine (W, C, P, H, and K). UncommonPTM refers to peptides having some uncommon PTMs that are considered by the Bolt search

We also investigated some spectral matches that were conflicting (i.e., the same MS/MS spectrum being assigned different sequences by Bolt and by SEQUEST or Mascot). This was the case for fewer than 5% of identifications, and these conflicts appear to arise from chimeric spectra (i.e., spectra containing fragments from more than one peptide). Furthermore, most of these identifications are from the SwissProt database and are thus present in the search space of all three software packages. Evaluating them in more detail, we find that the top four annotated ions by one search engine are different from the top four annotated ions by the other search engine for more than 60% of such spectra. This further suggests that these could truly be chimeric spectra where both search engines are reporting correct matches but have different scoring and ranking algorithms. Figure 6 shows two such examples for the hela-130 raw file. The top panel shows MS/MS spectrum #11131, assigned to peptide sequence VEEVGPYTYR by SEQUEST and to peptide SM[Oxid]QDVVEDFK by Bolt. The bottom panel shows MS/MS spectrum #8659, assigned to peptide sequence VDSPTVTTTLK by SEQUEST and to peptide LGNTTVICGVK by Bolt. The marked annotated ions (almost complete y-ion series) are all different for the two assignments, clearly indicating the presence of chimeric spectra. All four peptides are from the SwissProt database and are therefore available to all three search engines.
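
The top-four comparison used above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: "different" is interpreted here as the two top-four sets sharing no peak within a mass tolerance, and the function names, the 0.02 m/z tolerance, and the peak lists are hypothetical.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def top_n(peaks: List[Peak], n: int = 4) -> List[Peak]:
    """Return the n most intense annotated peaks."""
    return sorted(peaks, key=lambda p: p[1], reverse=True)[:n]


def shares_peak(a: List[Peak], b: List[Peak], tol: float = 0.02) -> bool:
    """True if any peak in a lies within tol (m/z units) of a peak in b."""
    return any(abs(x[0] - y[0]) <= tol for x in a for y in b)


def top_ions_differ(annot_a: List[Peak], annot_b: List[Peak],
                    n: int = 4, tol: float = 0.02) -> bool:
    """True if the top-n annotated ions of two assignments do not overlap.

    annot_a and annot_b are the peaks of the SAME spectrum that each
    search engine annotated for its own peptide assignment.
    """
    return not shares_peak(top_n(annot_a, n), top_n(annot_b, n), tol)


if __name__ == "__main__":
    # Hypothetical annotated peaks for one spectrum, two assignments.
    engine_a = [(175.12, 900.0), (262.15, 800.0), (375.23, 700.0),
                (488.32, 600.0), (601.40, 100.0)]
    engine_b = [(147.11, 950.0), (246.18, 850.0), (349.19, 750.0),
                (462.27, 650.0), (601.40, 120.0)]
    print(top_ions_differ(engine_a, engine_b))
```

A spectrum flagged this way is only a candidate: disjoint top ions are consistent with a chimeric spectrum but do not by themselves show that both assignments are correct.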

Figure 6

Two examples of chimeric spectra in the hela-130 raw file. Top panel: MS/MS spectrum 11131 is assigned to peptide sequence VEEVGPYTYR by SEQUEST and to peptide SM[Oxid]QDVVEDFK by Bolt. Bottom panel: MS/MS spectrum 8659 is assigned to peptide sequence VDSPTVTTTLK by SEQUEST and to peptide LGNTTVICGVK by Bolt. All sequences are from the canonical SwissProt database. The annotated ions are all different for the two assignments, clearly indicating the presence of chimeric spectra

Discussion

We have presented Bolt, a new search engine that is capable of searching over nine hundred thousand protein sequences, tens of dynamic PTMs, and partially cleaved peptides in a matter of minutes on a standard configuration computer, without any observable loss of specificity. Selecting even one of these options in SEQUEST or Mascot will make the processing take more than an hour on a high-performance server and hours or even days on a more typical configuration computer [12,13,14]. If the goal of studying large proteomics cohorts is to sequence as deeply as possible, this makes Bolt a much more viable search engine than SEQUEST or Mascot, especially given the clinical relevance of mutations, PTMs, and protein isoforms. For this search, Bolt uses a high-performance server on the Azure cloud, which would be available to the entire scientific community without the need for expensive IT infrastructure.

We acknowledge that there are other search engines, e.g., Byonic [23], Andromeda, and X!Tandem [24], that are used by many research groups, although SEQUEST and Mascot are still among the most widely used search engines. Their peptide results are often used as a benchmark for comparison. We did attempt to search the entire 909,583 protein database along with 41 PTMs with Andromeda, Mascot, and MS-Amanda, and none of them could even begin processing the data. Thus, all these search engines have similar performance bottlenecks: for the search result to be reported in a reasonable time, the user must limit the database and the list of dynamic post-translational modifications, and must perform a search assuming near-complete protease cleavage. There are other database engines that could process a larger set of PTMs (e.g., PEAKS [25], MSFragger [15], ProteinPilot [26]), but as none of them can handle an extensive mutation database, they are not compared. A more extensive comparison with other search engines is in the works and will be reported in a follow-up study. We hypothesize that the unique and conflicting groups of peptides will be different for each search engine, but we expect Bolt to still have a significant performance advantage over all other search engines.

While peptides with mutations may be the most interesting biologically, there are other classes of peptides that are also very important. Peptides from the Trembl and Isoform databases help reveal protein variants that might have been ignored in previous studies. Peptides from Bovin can help improve the contaminant databases. Partially cleaved peptides can help identify protein fragments that are not present in the public databases, and these may be especially interesting. Peptides with other common and uncommon PTMs all help analyze various biological pathways. Thus, all these classes help enhance and complete the biological analysis in one form or another. While we chose SwissProt and Trembl as the main databases, ENSEMBL and RefSeq could also be used. We found both of these databases to have a large number of uncharacterized peptides (proteins) that are not part of our currently used databases. A recent study [27] compared these various databases with proteogenomics, but searching time and FDR were the primary bottlenecks. Bolt is a robust solution for both of these challenges (search time and FDR). We plan to incorporate other proteomics databases (as well as variant databases) in future extensions of Bolt. Our choice of PTMs for this study was derived from common human modifications that were most relevant to other studies we currently have in the works, and it served as a proof of principle. We wanted to show that Bolt is capable of handling a large number of PTMs efficiently without having to compromise on the false discovery rate.

The results in Figures 2 and 3 demonstrate that identifications by Bolt are of higher confidence than those by SEQUEST or Mascot and may include fewer false positives in these samples. We have used the number of fragments as our primary representative measure, as it is one of the visual metrics most commonly used by scientists, but other criteria can be used, such as the percentage of the top few MS/MS ions explained. However, most of these criteria are based on the assumption that the peptide produces a rich fragmentation pattern. For many classes of peptides, this may not be true (e.g., peptides containing proline). We fully acknowledge that Bolt is still in the early stages of development. Search engines like SEQUEST have been developed and refined over many iterations and many years, and have been optimized for downstream analysis with tools such as Percolator. It is quite possible that some of the identifications in the Uniq-S group are correct and that Bolt is missing them. When we study some of these manually, some do appear to be correct. But without further investigative studies, and possibly synthesizing those peptides, there is no way of knowing.
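
The two criteria mentioned above (number of matched fragments and the percentage of the top few MS/MS ions explained) can be illustrated with a small sketch. It is not the scoring used by Bolt, SEQUEST, or Mascot. It assumes unmodified residues, singly charged b and y ions only, and a hypothetical 0.02 m/z matching tolerance; the function names and the example peak list are introduced here for illustration.

```python
from typing import List, Tuple

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

Peak = Tuple[float, float]  # (m/z, intensity)


def by_ions(peptide: str) -> List[float]:
    """Singly charged b and y ion m/z values for an unmodified peptide."""
    masses = [MONO[aa] for aa in peptide]
    ions = []
    for i in range(1, len(peptide)):
        ions.append(sum(masses[:i]) + PROTON)          # b ion, first i residues
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y ion, remaining residues
    return ions


def matched_fragments(peptide: str, peaks: List[Peak], tol: float = 0.02) -> int:
    """Count theoretical b/y ions that have an observed peak within tol."""
    observed = [mz for mz, _ in peaks]
    return sum(any(abs(ion - mz) <= tol for mz in observed)
               for ion in by_ions(peptide))


def top_n_explained(peptide: str, peaks: List[Peak],
                    n: int = 10, tol: float = 0.02) -> float:
    """Fraction of the n most intense peaks explained by a b or y ion."""
    ions = by_ions(peptide)
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:n]
    if not top:
        return 0.0
    hits = sum(any(abs(mz - ion) <= tol for ion in ions) for mz, _ in top)
    return hits / len(top)


if __name__ == "__main__":
    # Hypothetical spectrum for the toy peptide GAK.
    peaks = [(147.113, 1000.0), (218.150, 500.0), (300.0, 400.0), (129.066, 50.0)]
    print(matched_fragments("GAK", peaks), top_n_explained("GAK", peaks, n=3))
```

Both numbers drop for peptides that fragment poorly, which is why a low value on either metric does not by itself show that an identification is wrong.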

As described in the “Material and Methods” section, the Bolt server is currently initialized to the most common parameters used in proteomics analysis of human samples. We plan to extend it to other species and to allow users to modify some of the parameters on the fly. With cloud technology, it is easy to deploy multiple servers as needed: either servers for different species, or multiple servers with human-specific parameters if the demand is high.

Another interesting area of study that we plan to investigate is chimeric spectra. Figure 6 showed two examples of chimeric spectra where both identifications seem correct. Some research groups have added capabilities to their search engines to process every spectrum as a possible chimeric spectrum [28] and yield multiple identifications per MS/MS spectrum. However, this often requires a second complete search and consequently a large increase in search time. Moreover, the comparison of results between search engines needs to be studied in greater depth when multiple matches per MS/MS spectrum are allowed.

Bolt provides a truly modern option for setting up proteomics searches. In a very reasonable time, it can search through non-reviewed and mutation databases without loss of sensitivity. While we have shown and compared results on a peptide level, Bolt produces a protein list that is also longer than the one from SEQUEST’s or Mascot’s output. This is expected, but it requires a careful evaluation of protein grouping algorithms for conclusive analysis, especially when there is a high degree of homology in the database (as is the case for Bolt). Bolt represents, to the best of our knowledge, the first fully scalable, cloud-based quantitative proteomic solution that can be operated within a user-friendly GUI interface.

Software Availability

A demo copy of the Bolt software is available by emailing amol.prakash@optystech.com or by registering on the website http://www.optystech.com

References

1. Hebert, A.S., Richards, A.L., Bailey, D.J., et al.: The one hour yeast proteome. Mol. Cell. Proteomics. 13(1), 339–347 (2013)
2. Shishkova, E., Hebert, A.S., Coon, J.J.: Now, more than ever, proteomics needs better chromatography. Cell Syst. 3(4), 321–324 (2016)
3. Zhang, Y., Fonslow, B.R., Shan, B., Baek, M.C., Yates, J.R.: Protein analysis by shotgun/bottom-up proteomics. Chem. Rev. 113(4), 2343–2394 (2013)
4. Scheltema, R.A., Hauschild, J.-P., Lange, O., Hornburg, D., Denisov, E., Damoc, E., Kuehn, A., Makarov, A., Mann, M.: The Q Exactive HF, a benchtop mass spectrometer with a pre-filter, high-performance quadrupole and an ultra-high-field Orbitrap analyzer. Mol. Cell. Proteomics. 13(12), 3698–3708 (2014)
5. Doerr, A.: DIA mass spectrometry. Nat. Methods. 12, 35 (2014)
6. Prakash, A., Peterman, S., Ahmad, S., Sarracino, D., Frewen, B., Vogelsang, M., Byram, G., Krastins, B., Vadali, G., Lopez, M.: Hybrid data acquisition and processing strategies with increased throughput and selectivity: PSMART analysis for global qualitative and quantitative analysis. J. Proteome Res. 13(12), 5415–5430 (2014)
7. Meier, F., Geyer, P.E., Virreira Winter, S., Cox, J., Mann, M.: BoxCar acquisition method enables single-shot proteomics at a depth of 10,000 proteins in 100 minutes. Nat. Methods. 15(6), 440–448 (2018)
8. Yates, J.R., Eng, J.K., McCormack, A.L., Schieltz, D.: Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. Anal. Chem. 67(8), 1426–1436 (1995)
9. Perkins, D.N., Pappin, D.J., Creasy, D.M., Cottrell, J.S.: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 20(18), 3551–3567 (1999)
10. Cox, J., Neuhauser, N., Michalski, A., Scheltema, R.A., Olsen, J.V., Mann, M.: Andromeda: a peptide search engine integrated into the MaxQuant environment. J. Proteome Res. 10(4), 1794–1805 (2011)
11. Dorfer, V., Pichler, P., Stranzl, T., Stadlmann, J., Taus, T., Winkler, S., Mechtler, K.: MS Amanda, a universal identification algorithm optimized for high accuracy tandem mass spectra. J. Proteome Res. 13(8), 3679–3684 (2014)
12. Williamson, N.A.: Operational experience of an open-access, subscription-based mass spectrometry and proteomics facility. J. Am. Soc. Mass Spectrom. 29(3), 439–446 (2018)
13. Friedman, D.B., Andacht, T.M., Bunger, M.K., Chien, A.S., Hawke, D.H., Krijgsveld, J., Lane, W.S., Lilley, K.S., Maccoss, M.J., Moritz, R.L., et al.: The ABRF proteomics research group studies: educational exercises for qualitative and quantitative proteomic analyses. Proteomics. 11(8), 1371–1381 (2011)
14. Bekker-Jensen, D.B., Kelstrup, C.D., Batth, T.S., Larsen, S.C., Haldrup, C., Bramsen, J.B., Sorensen, K.D., Hoyer, S., Orntoft, T.F., Andersen, C.L., et al.: An optimized shotgun strategy for the rapid generation of comprehensive human proteomes. Cell Syst. 4(6), 587–599 (2017)
15. Kong, A.T., Leprevost, F.V., Avtonomov, D.M., Mellacheruvu, D., Nesvizhskii, A.I.: MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods. 14(5), 513–520 (2017)
16. Solntsev, S.K., Shortreed, M.R., Frey, B.L., Smith, L.M.: Enhanced global post-translational modification discovery with MetaMorpheus. J. Proteome Res. 17(5), 1844–1851 (2018)
17. Millikin, R.J., Solntsev, S.K., Shortreed, M.R., Smith, L.M.: Ultrafast peptide label-free quantification with FlashLFQ. J. Proteome Res. 17(1), 386–391 (2018)
18. Perez-Riverol, Y., Csordas, A., Bai, J., et al.: The PRIDE database and related tools and resources in 2019: improving support for quantification data. Nucleic Acids Res. 47(D1), D442–D450 (2018)
19. The, M., MacCoss, M.J., Noble, W.S., Käll, L.: Fast and accurate protein false discovery rates on large-scale proteomics data sets with percolator 3.0. J. Am. Soc. Mass Spectrom. 27(11), 1719–1727 (2016)
20. Yang, X., Lazar, I.M.: XMAn: a Homo sapiens mutated-peptide database for the MS analysis of cancerous cell states. J. Proteome Res. 13(12), 5486–5495 (2014)
21. Liu, X., Inbar, Y., Dorrestein, P.C., et al.: Deconvolution and database search of complex tandem mass spectra of intact proteins: a combinatorial approach. Mol. Cell. Proteomics. 9(12), 2772–2782 (2010)
22. Awan, M.G., Saeed, F.: MS-REDUCE: an ultrafast technique for reduction of big mass spectrometry data for high-throughput processing. Bioinformatics. 32(10), 1518–1526 (2016)
23. Bern, M., Kil, Y.J., Becker, C.: Byonic: advanced peptide and protein identification software. Curr. Protoc. Bioinforma. 13, Unit 13.20 (2012)
24. Craig, R., Beavis, R.C.: TANDEM: matching proteins with tandem mass spectra. Bioinformatics. 20(9), 1466–1467 (2004)
25. Ma, B., Zhang, K., Hendrie, C., et al.: PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 17(20), 2337–2342 (2003)
26. Shilov, I.V., Seymour, S.L., Patel, A.A., et al.: The Paragon Algorithm, a next generation search engine that uses sequence temperature values and feature probabilities to identify peptides from tandem mass spectra. Mol. Cell. Proteomics. 6(9), 1638–1655 (2007)
27. Nesvizhskii, A.I.: Proteogenomics: concepts, applications and computational strategies. Nat. Methods. 11(11), 1114–1125 (2014)
28. Dorfer, V., Maltsev, S., Winkler, S., Mechtler, K.: CharmeRT: boosting peptide identifications by chimeric spectra identification and retention time prediction. J. Proteome Res. 17(8), 2581–2589 (2018)

Acknowledgements

We would like to acknowledge Simion Kreimer, Ph.D. (Johns Hopkins University) and Dragana Lagundzin, Ph.D. (University of Nebraska) for their help with the Mascot analysis.

Corresponding author

Correspondence to Amol Prakash.

Cite this article

Prakash, A., Ahmad, S., Majumder, S. et al. Bolt: a New Age Peptide Search Engine for Comprehensive MS/MS Sequencing Through Vast Protein Databases in Minutes. J. Am. Soc. Mass Spectrom. 30, 2408–2418 (2019). https://doi.org/10.1007/s13361-019-02306-3

Keywords

  • Mass spectrometry
  • Proteomics
  • Peptide
  • Mutations
  • Search engine
  • MS/MS
  • Sequencing
  • Variants
  • Cloud
  • Bolt