Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework
For shotgun mass-spectrometry-based proteomics, the most computationally expensive step is matching the measured spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Solutions for improving our ability to perform these searches are therefore needed.
We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed.
The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources.
Keywords: Peptide, Multiple Search, Peptide Database, Hadoop Framework, Software Development Cost
The identification of peptide sequences from spectra is a computationally expensive and rate-limiting task. This expense arises because the identification process typically involves matching each measured spectrum against the likely (artificially generated) spectra of every possible peptide ion within a range of mass-to-charge ratios. The problem is compounded because the number of candidate peptides is huge, and the search space grows geometrically with the number of amino acid modifications considered. This paper introduces a highly scalable strategy for overcoming these limitations, purpose-built to take full advantage of a highly distributed computation framework.
The shotgun proteomics workflow has become the most widely used technique for identifying and quantifying the proteins present in a biological sample in a high-throughput manner. Although many variations exist, the basic approach begins with extracting proteins from a sample and digesting them into peptides with a proteolytic enzyme such as trypsin. The peptides are then separated using liquid chromatography and analyzed by mass spectrometry (MS). The mass-to-charge (m/z) ratios of the peptide precursor ions are measured, and the precursors are fragmented into a series of ions and measured in tandem mass spectrometry (MS/MS) mode. The resulting MS/MS spectra of these peptide ions, together with the precursor m/z, are searched against a database of possible peptides to determine the best match.
Many strategies have been proposed to accelerate such searches. Naive, embarrassingly parallel approaches are the most widely used (e.g. for X!Tandem, SEQUEST, OMSSA). To increase the speed and scalability of searching, advances in high-performance computing hardware have been used to offload searching from general-purpose to dedicated hardware (including GPUs and FPGAs), and improvements to search algorithms have been implemented [7, 8]. Some algorithms have been ported to parallel computing architectures such as MPI. Previous parallel implementations generally scale with the number of spectra; however, the scalability issues relating to the size of the database being searched are not addressed.
In this paper we introduce Hydra, which is designed to scale both in terms of the number of spectra and the size of the search database. Hydra makes use of Hadoop, a common and well-supported framework for distributed computation. The Hadoop framework manages a cluster of generic machines: it takes a pair of tasks, map and reduce (see Methods for details), and creates the task instances required to process a given data set. Tasks are dispatched by the framework to specific machines, and output is collected by the framework and passed to the next task. The framework also handles details such as failure and retry, provision of a distributed file system, and storage of and access to intermediate results.
Using Hadoop for mass spectrometry based searching has been suggested before, where an existing MPI search engine (X!!Tandem) was wrapped to work with Hadoop. However, such MPI-based code does not use critical features of the MapReduce model (e.g. the ability to sort data in ways that guarantee that searches are performed optimally). The algorithm introduced in this paper is a complete bottom-up rewrite of the X!Tandem code, designed to take full advantage of MapReduce. Within Hydra, we implement the K-score, one of the pluggable scoring algorithms packaged with the Trans-Proteomic Pipeline [15, 16] distribution of X!Tandem. The K-score is an implementation of the COMET scoring function, which produces a dot product of matched peak intensities after spectrum normalization. The K-score normalization and scoring routines account for noise and unmatched peaks within the input spectra.
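The core of such a dot-product score can be sketched in a few lines. This is a simplified illustration only, not Hydra's actual K-score code; the bin width and max-peak normalization used here are assumptions for the sketch:

```python
def bin_peaks(peaks, bin_width=1.0):
    """Bin (m/z, intensity) peaks; keep the strongest intensity per bin."""
    binned = {}
    for mz, intensity in peaks:
        b = int(mz / bin_width)
        binned[b] = max(binned.get(b, 0.0), intensity)
    return binned

def normalize(binned):
    """Scale intensities so the strongest peak is 1.0."""
    top = max(binned.values())
    return {b: i / top for b, i in binned.items()}

def dot_product_score(measured, theoretical, bin_width=1.0):
    """Dot product of matched peak intensities after normalization."""
    m = normalize(bin_peaks(measured, bin_width))
    t = normalize(bin_peaks(theoretical, bin_width))
    return sum(m[b] * t[b] for b in m.keys() & t.keys())
```

Only peaks that fall into the same m/z bin contribute to the score, which is why unmatched peaks drop out of the product entirely.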
The advantages of the approach introduced in this paper are threefold. Scalability: Hadoop has been shown to handle massive data sets in the petabyte range, and has been used to undertake complex search and analysis operations on data sets of this size; as shown in this paper, such a framework is therefore highly suited to the growing data sets that are increasingly being searched against in MS/MS analyses. Flexibility: the system can be used on a variety of commodity hardware configurations, including heterogeneous machines where performance scales with additional hardware, and within cloud environments (e.g. Amazon EC2) that support Hadoop. Reliability: Hadoop provides the necessary libraries to handle the critical issues of task distribution, monitoring and failure recovery (as the numbers of tasks, nodes and data sets increase, so too does the probability of task failures, an important issue in high-throughput experiments).
In this paper we present the design for decomposition of mass spectrometry search into a collection of map/reduce tasks, and show an implementation of our proteomic search engine Hydra. Hydra is specifically designed from the ground up to work within the MapReduce algorithm. By comparison to X!Tandem we show where and how the system scales, and discuss the advantages of our approach. The software is made publicly available under an open source license.
The strategy advocated in this paper for ensuring scalable searches of mass spectrometry data uses distributed parallel processing. This type of solution has become popular over the last five years as it allows scalability using commodity hardware, reducing both the hardware and software development costs traditionally associated with high-performance computing. This style of architecture allows parallel solutions to be developed across clusters of heterogeneous hardware.
With highly distributed frameworks, the times required for execution scale inversely with the number of processors allocated and directly with the size of the data set. These frameworks are most efficient with large data sets, as there is always an overhead with initial setup. These frameworks coupled with MapReduce based designs provide a means to partition the analysis tasks effectively.
Hydra was designed specifically to take full advantage of the MapReduce architecture and its native capabilities. The MapReduce model parallelizes a broad class of problems. Computation is performed in two phases. The first phase, the mapper, receives data in parallel at a series of nodes and emits data as a series of key-value pairs. The unit of data received and emitted is application dependent; in our case the first mapper receives information representing a single spectrum and emits the same.
We use the Hadoop framework, the de facto standard open-source implementation of the MapReduce architecture, supported by Apache and available on a wide range of platforms. Our code is single-threaded and uses the Hadoop framework to achieve parallelism across separate nodes in a cluster. One important feature of the approach is to precompute all candidate peptide sequences, including any mass modifications, and to group them by m/z. This allows each sequence and its corresponding theoretical spectrum to be generated only once.
Our implementation divides proteomic search into two tasks: generation of a peptide database and the subsequent scoring of a collection of spectra against the database. This separation allows the set of peptides with appropriate modifications to be generated in advance of searches and, as long as the set of proteins and modifications remains unchanged, to be reused in multiple searches. In addition, as discussed below, knowledge of the number of peptide ions to be scored allows tuning to control memory requirements.
Generation of the peptide database
The peptide database is designed to answer a single query: given a specific m/z ratio, return the list of all peptides and modified peptides with that m/z to be scored. The task of generating peptides has two parts: the first finds the list of all unique peptides in a protein database after applying a specified digestive enzyme (e.g., trypsin allowing one missed cleavage) and a collection of modifications; the second groups the peptides by m/z ratio and stores them.
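As an illustration of the first part, a minimal in-silico tryptic digest, cleaving after K or R except before P and allowing missed cleavages, might look like this. This is a sketch only; the actual implementation also handles modifications and other enzymes:

```python
import re

def tryptic_digest(protein, missed_cleavages=1):
    """Return the set of unique peptides from cleaving after K/R (not before P)."""
    # Zero-width split after each K or R that is not followed by P.
    fragments = [f for f in re.split(r'(?<=[KR])(?!P)', protein) if f]
    peptides = set()
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` additional fragments.
        for j in range(i, min(i + missed_cleavages + 1, len(fragments))):
            peptides.add(''.join(fragments[i:j + 1]))
    return peptides
```

Allowing one missed cleavage roughly doubles the number of candidate peptides per protein, which is one reason the search space grows so quickly.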
Our MapReduce implementation (see Figure 1) has the following steps: for every protein, the mapper generates all peptides and emits the sequence (with modifications) as the key and the id of the protein as the value. Every key delivered to the reducer represents a unique sequence (duplicate keys are collapsed automatically), and the list of values represents the proteins containing that sequence. The reducer emits the sequence as the value with the m/z ratio as the key.
The next reducer receives all sequences assigned to a specific m/z ratio and stores them for later analysis. Also generated is a list of m/z values and the corresponding number of stored sequences.
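The map/reduce steps above can be simulated sequentially in a few lines. This is a sketch of the data flow only: in Hydra these stages are Hadoop mapper/reducer classes, the residue masses below cover just a few amino acids, and digestion is assumed to have been done already:

```python
from collections import defaultdict

# Illustrative monoisotopic residue masses (Da) for a few amino acids.
RESIDUE_MASS = {'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'K': 128.09496}
WATER, PROTON = 18.01056, 1.00728

def peptide_mz(sequence, charge=1):
    """m/z of a peptide ion: residue masses + water, plus protons, over charge."""
    mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (mass + charge * PROTON) / charge

def build_peptide_index(proteins):
    """proteins: {protein_id: [peptide, ...]}.
    Map: emit (peptide, protein_id); the shuffle collapses duplicate peptides.
    Reduce: emit (integer m/z bin, peptide)."""
    peptide_to_proteins = defaultdict(set)            # shuffle/sort phase
    for protein_id, peptides in proteins.items():     # map phase
        for pep in peptides:
            peptide_to_proteins[pep].add(protein_id)
    index = defaultdict(list)                         # reduce phase
    for pep in peptide_to_proteins:
        index[int(peptide_mz(pep))].append(pep)
    return index
```

The key property mirrored here is that a peptide shared by many proteins is emitted once per protein but stored only once, keyed by its m/z bin.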
The algorithm for scoring spectra uses three MapReduce jobs (Figure 2). The first job sends spectra to be scored against a set of peptide ions having a theoretical m/z ratio within the specified tolerance of the measured one. The second job combines all scoring of a specific spectrum, choosing the best scoring peptides. The third job combines the scoring of all spectra into a single output file.
These algorithms score spectra against a range of m/z values within a specific delta of the precursor m/z ratio. Our implementation assigns candidate peptides to a specific m/z group (1 Dalton wide for z = 1). Spectra are assigned to score against the set of groups close to the precursor ion m/z. Every group is scored on a separate reducer. When the number of peptides in a group exceeds a known limit, the group is split and scored on multiple machines. For each spectrum, the top scoring peptides in the group, as well as scoring statistics, are gathered as an object, serialized as XML and written with the scan id as the key.
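The mapping from a spectrum's precursor m/z to its set of candidate peptide groups can be sketched as follows, assuming 1 Dalton bins and a symmetric tolerance; the group-size limit and charge handling described above are omitted:

```python
def candidate_bins(precursor_mz, tolerance=0.5, bin_width=1.0):
    """Return every m/z bin overlapping [mz - tolerance, mz + tolerance]."""
    low = int((precursor_mz - tolerance) / bin_width)
    high = int((precursor_mz + tolerance) / bin_width)
    return list(range(low, high + 1))
```

A precursor near a bin boundary is sent to both neighboring groups, so no candidate within tolerance is missed.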
The second MapReduce job gathers the scoring for each spectrum and collates the results. Peptide-spectrum matches (PSMs) from all groups are combined into a single list sorted by score. Again, the top scoring peptides and overall statistics are packed into a single object and serialized as XML with the scan id as the key.
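The collation step amounts to merging the per-group hit lists for a scan and truncating to the top N by score. This is a sketch; in Hydra the result is packed into an object and serialized as XML keyed by scan id:

```python
import heapq

def combine_psms(per_group_hits, top_n=5):
    """Merge per-m/z-group (score, peptide) lists into one list,
    best score first, truncated to top_n entries."""
    merged = [hit for hits in per_group_hits for hit in hits]
    return heapq.nlargest(top_n, merged)
```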
The consolidation task receives these overall score objects for each scan. The data is reformatted into the standard X!Tandem output and written to a single file. Following the completion of the consolidation task, its output, a file in X!Tandem format, is copied from the cluster to the user's machine.
Results and discussion
The program was tested against progressively larger searches to validate scalability and the ability to handle large search spaces. We ran the program against three databases: a large non-redundant (nr) sample (~16.4 million proteins), a subset with half the original nr data, and one with a quarter of the original set. Sample data sets ranged from 4,000 to 500,000 spectra. We used a cluster of 4 nodes with 8 cores per node, each node having a one-terabyte disk. All nodes were in the same rack, connected with standard Ethernet.
To generate the original database, we used NCBI's non-redundant list of protein sequences for building the database of peptides. This file contains all available protein sequences, across all organisms, that differ by at least one amino acid residue. The smaller databases were generated by selecting subsets of the larger database. For searching against the human proteome, the February 2012 UniProt reference proteome (65,835 proteins) was used.
The scalability problems with proteomic searching come down to three factors. The first is the size of the peptide database: the number and size of the proteins considered; whether tryptic-only or semi-tryptic peptides are searched; and the number of modifications and missed cleavages. The second is the size of the experiment, i.e. the number of spectra scored. The third is the amount of computing resources that can be made available. Our algorithm is designed to scale across all three factors, so that there are no inherent limits on the number of searched peptides, the number of scored spectra, or the number of compute nodes that can be used. Adding proteins or spectra increases execution time for a given set of resources, but it does not prevent the algorithm from arriving at a solution.
Comparison of search times for standard X!Tandem and Hydra
Hydra and X!Tandem do differ in the data they generate: X!Tandem returns a single peptide, whereas Hydra returns a list of the top candidate peptides. When comparing the outputs of X!Tandem and Hydra on artificial spectra, they are greater than 99.9% identical in terms of the best peptide found. When running the systems on experimental data there are differences due to algorithmic details (e.g. integer rounding): the top peptide and score match 75% of the time, and 95% of the time both return peptides with similar scores. It is interesting to note that X!Tandem itself does not always generate exactly the same results, possibly due to race conditions in the code base. Because we can afford to perform a slightly wider search, Hydra can find peptides that X!Tandem does not (in nearly all cases where the systems do not match, Hydra finds a peptide with a substantially higher score than X!Tandem).
Scalability of search operations in mass spectrometry is a critical concern. The size of proteome databases continues to grow, especially as more modifications, post-translational changes, isotopic labeling, and semi- and non-tryptic peptides are searched. Isotopic labeling and modifications geometrically expand the search space of peptide masses. The wider availability of faster mass spectrometers, and the common practice of scoring data with multiple search algorithms, are having huge effects on the computational requirements for such operations. The software discussed in this paper is designed to have few limits on scalability.
Search is an inherently parallel operation, and existing algorithms have already been adapted to work in parallel. In this paper we present the use of a distributed framework to develop a new generation of search algorithms, as these frameworks lend themselves conveniently to sequence-spectra matching. The advantage of using such a framework is that much of the infrastructure for managing parallel jobs is built into the framework. Hadoop has been demonstrated to scale up to thousands of processors. The details of handling a cluster, distributing work, gathering results and, critically for large systems, handling failure and retry are built into the framework.
The genomics field has already demonstrated the power of using cloud or highly distributed frameworks for computationally intensive tasks. As the computational burden in tandem mass spectrometry proteomics searching is large and increasing, exploration of such distributed frameworks is necessary. The growth of the search space will continue to be a rate-limiting step, and this paper discusses one strategy that can be used to overcome these limitations.
The full source code, documentation and test code are available at http://code.google.com/p/hydra-proteomics/.
This project is supported by Award Number R01GM087221 from NIGMS and R01CA137442 from NCI; EU FP7 grant 'ProteomeXchange' [grant number 260558]; major research instrumentation grant 0923536 (to RM); in part by American Recovery and Reinvestment Act (ARRA) funds through grant number R01 HG005805 from the National Institutes of Health National Human Genome Research Institute; and the National Institute of General Medical Sciences, under grant no. 2P50 GM076547/Center for Systems Biology. We also thank the Luxembourg Centre for Systems Biomedicine and the University of Luxembourg for support. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
- 7. Oh JH, Gao J: Peptide identification by tandem mass spectra: an efficient parallel searching. In Fifth IEEE Symposium on Bioinformatics and Bioengineering; 2005:161–168.
- 11. White T: Hadoop: the definitive guide. Sebastopol: O'Reilly Media; 2009.
- 12. Dean J, Ghemawat S: MapReduce: simplified data processing on large clusters. Commun ACM 2008, 51(1).
- 17. The Global Proteome Machine. Available from: http://www.thegpm.org/gpm/faq.html
- 18. UniProt reference proteome, 2012. Available from: http://www.uniprot.org/uniprot/?query=organism%3a9606+keyword%3a1185&format=%2A
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.