An improved peptide-spectral matching algorithm through distributed search over multiple cores and multiple CPUs
A real-time peptide-spectrum matching (RT-PSM) algorithm is a database search method to interpret tandem mass spectra (MS/MS) under strict time constraints. Restricted by the hardware and architecture of individual workstations, previous RT-PSM algorithms are either not fast enough to satisfy all real-time system requirements or must sacrifice the level of inference accuracy to provide the required processing speed.
We develop two parallelized algorithms for MS/MS data analysis: a multi-core RT-PSM (MC RT-PSM) algorithm, which works on individual workstations, and a distributed computing RT-PSM (DC RT-PSM) algorithm, which works on a computer cluster. Two data sets are employed to evaluate the performance of our proposed algorithms. The simulation results show that, when 240 logical cores are employed, our proposed algorithms reach approximately 216.9-fold speedup on a sub-task process (the similarity scoring module) and 84.78-fold speedup on the overall process compared with a single-thread process of the RT-PSM algorithm.
The improved RT-PSM algorithms can achieve the processing speed requirement without sacrificing the level of inference accuracy. With some configuration adjustments, the proposed algorithm can support many peptide identification programs, such as X!Tandem, CUDA version RT-PSM, etc.
Keywords: Computer Cluster; Head Node; Work Node; Candidate Peptide; Peptide Database
RT-PSM: Real-time peptide-spectrum matching
MS/MS: Tandem mass spectra
MC RT-PSM: Multi-core RT-PSM
DC RT-PSM: Distributed computing RT-PSM
PVM: Parallel Virtual Machine
MPI: Message Passing Interface
NNS: Nearest neighbor search
HT Technology
Tandem mass spectrometry (MS/MS) has been widely used in the early detection of diseases, in chemical analysis and in the pharmaceutical industry. It can efficiently identify and characterize the protein components of complex biological mixtures. Interpreting MS/MS spectra requires peptide-spectrum matches (PSMs), which are obtained by searching experimental MS/MS spectra against a protein sequence database.
To improve the efficiency and accuracy of MS/MS experiments, a mass spectrometry system needs a real-time peptide identification procedure that analyzes peptides and performs the PSMs within each peptide identification life cycle. Wu et al. have proposed a fast procedure called real-time PSM (RT-PSM). Its key component is “identifying peptides”, which is performed by a software application. However, this RT-PSM procedure does not include any external software controlling features. Although the method is fast, further experiments indicate that the program still cannot satisfy all real-time system requirements, since it is a single-thread program that runs on a single workstation.
In a real-time system, the time window of each peptide identification procedure is limited by the spectrum acquisition time of the mass spectrometer, which ranges from 0.05 to 0.5 seconds depending on the instrument. To fit in this narrow time window, using parallel computation to improve the speed of PSMs is becoming a trend. Duncan et al. developed a program called Parallel Tandem, which processes MS/MS spectra in parallel by running X!Tandem on a computer cluster with Parallel Virtual Machine (PVM) or Message Passing Interface (MPI). Sadygov et al. developed a parallel version of SEQUEST, which is also based on PVM on a computer cluster. Diament et al. developed a faster SEQUEST, called Tide, which runs up to 170 times faster than SEQUEST. Zhang et al. [5, 6] used SIMD instructions on a single workstation to improve the speed of peptide identification procedures. Graumann et al. recently developed an intelligent agent framework, termed MaxQuant Real-Time, which is implemented in the MaxQuant computational proteomics environment. The framework is especially useful for new instrument types, such as the quadrupole-Orbitrap.
Whether a computer cluster or a single workstation is used, the principle of parallel computing is the same: divide a large sequential process into several independent sub-processes and execute the sub-processes concurrently to reduce the execution time. However, the previous parallel computing methods [1, 2, 3, 4, 5, 6, 7] still leave room for improvement. In terms of processing time, the parallel forms of X!Tandem and SEQUEST spend more time than the RT-PSM algorithm of Wu et al. when analyzing individual spectra. Although Tide is already very fast, its speed can still be improved. In terms of computing environments, SIMD-based methods are restricted by the CPU L2 cache [9, 10] and often need to sacrifice the level of inference accuracy to meet the time limit of a real-time system, whereas a computer cluster circumvents this problem. Moreover, instead of designing a specific program, we aim to develop a general platform that can support many peptide identification programs.
In this paper, we develop an improved peptide identification procedure on a computer cluster based on the RT-PSM algorithm proposed by Wu et al. Two parallel algorithms are developed in this study: a multi-core RT-PSM algorithm (MC RT-PSM), which works on an individual workstation in the form of a multi-thread program, and a distributed computing RT-PSM algorithm (DC RT-PSM), which works on a computer cluster in the form of a distributed computing program. The DC RT-PSM is built on the parallelized MC RT-PSM procedure and allocates and manages computing resources for tasks through a head node in the distributed computing procedure. Source code of the DC RT-PSM algorithm and sample data are available in Additional file 1. The improved algorithms can achieve processing speed requirements without sacrificing the level of inference accuracy.
Results and discussion
Experimental environment and data sources
The experimental computer cluster consists of one head node and 32 worker nodes, which are connected by 1 Gigabit Ethernet. Each node has 8 logical CPU cores.
Two datasets are employed to test the improved algorithms in this study. Dataset A is the one used in the RT-PSM package: the experimental MS/MS data source includes 2058 groups of spectrum data, and the protein database is a subset of the UniRef100 human protein database containing over 2200 entries (over 180000 peptide sequences). Dataset B includes 16463 groups of experimental spectrum data and over 3300 entries. It is also generated from the UniRef100 human protein database.
The level of inference accuracy for the improved algorithms
The purpose of parallel computing in our improved algorithms is to reduce the peptide identification time. The proposed algorithms do not gain performance by decreasing the level of inference accuracy, so their results should be identical to those of the original RT-PSM. We randomly choose 100 groups of experimental data from the results of our improved algorithms and the original RT-PSM. The identification results of the original RT-PSM program and our improved MC RT-PSM and DC RT-PSM programs are in excellent agreement.
The time speedup of the 2-Dimensional peptide database search method
Table: Comparisons between the binary search method and the 2-dimensional peptide database search method (rows: Binary search method; 2-Dimensional peptide database search method)
The time speedup of the MC RT-PSM procedure
Table: The detailed information of the experimental hardware environments (row: Worker node of cluster)
The time speedup of the DC RT-PSM procedure
The performance of the DC RT-PSM is compared with the single-thread MC RT-PSM procedure. Two tasks are designed for the comparison. Task 1, conducted on Dataset A, searches 2058 spectra against 2200 protein entries. Task 2, conducted on Dataset B, searches 16463 spectra against 3300 protein entries.
In this paper, we have proposed an MC RT-PSM algorithm, which works on an individual workstation, and a DC RT-PSM algorithm, which works on a computer cluster, for interpreting MS/MS spectra. The MC RT-PSM algorithm is an extension of the single-thread RT-PSM algorithm proposed by Wu et al., while the DC RT-PSM algorithm is a distributed parallel computing algorithm that allocates and manages cluster worker nodes to perform the MS/MS spectrum analysis.
One advantage of our proposed method is that it is a general platform for parallel computing, whereas many current parallel algorithms are either not fast enough for all real-time MS/MS systems or restricted to specific computing environments. The distributed computing algorithm is designed not only for this RT-PSM algorithm but also for other similar algorithms. With some configuration adjustments, it can support many other peptide identification programs, such as X!Tandem, SEQUEST, SIMD version RT-PSM, etc. The other advantage of our method is that it reduces the search time. The proposed DC RT-PSM algorithm can meet the real-time constraints of most MS/MS systems without sacrificing the level of inference accuracy.
The performance of the RT-PSM program
The similarity-scoring module is the most time-consuming part of the RT-PSM program. It consumes over 95% of the CPU time in profiling experiments. This is because each spectrum has to be compared with the whole set of candidate peptides, which can easily contain thousands of peptide sequences. Hence, reducing the computing time of the similarity-scoring module is critical to satisfying the time constraint of a real-time system.
In this paper, we develop both a multi-core computing algorithm and a distributed algorithm to speed up the RT-PSM program. The comparison is made between our algorithm and the original RT-PSM algorithm. The definitions of sensitivity and specificity used in the RT-PSM work are taken from a textbook and differ from the currently widely accepted formulas. We therefore set aside the names sensitivity and specificity but employ the same evaluation formulas to carry out our comparisons.
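The weight of the similarity-scoring module in the overall run time can be illustrated with a short calculation. The sketch below assumes an Amdahl's-law model in which only the scoring module is accelerated and everything else runs at its original speed; this model and the function names are assumptions made for illustration, not part of the RT-PSM program. The speedup figures 216.9 and 84.78 are the ones reported in the abstract.

```python
def amdahl_speedup(parallel_fraction, module_speedup):
    """Overall speedup when a fraction of the run time is sped up by module_speedup."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / module_speedup)


def implied_parallel_fraction(overall_speedup, module_speedup):
    """Invert Amdahl's law: which run-time fraction explains the observed overall speedup?"""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / module_speedup)


if __name__ == "__main__":
    # If scoring took exactly 95% of the time, a 216.9-fold module speedup
    # would be capped well below the module speedup itself.
    print(amdahl_speedup(0.95, 216.9))
    # Fraction of run time that scoring would need to occupy to yield 84.78-fold overall.
    print(implied_parallel_fraction(84.78, 216.9))
```

Under this simplified model, a share of exactly 95% would limit the overall speedup to roughly 18-fold, while the reported 84.78-fold overall speedup corresponds to a scoring share of about 99%, which is consistent with the "over 95%" figure.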
The similarity-scoring module of the RT-PSM program
Types of fragment ions and their m/z values in the RT-PSM program
Generally, it is not necessary to search the whole database to find the best-matched candidate peptide, because the mass difference between an experimental peptide and its matched candidate peptide is often very small. A nearest neighbor search (NNS) is therefore employed in the RT-PSM algorithm. Suppose Mm is the mass of an experimental peptide and t is the tolerance range of the NNS. Only those candidate peptides with masses between Mm - t and Mm + t need to be considered. The RT-PSM program proposed by Wu et al. employs the common binary search method to perform the NNS. The time complexity of the binary search is O(log n), so the time spent on the peptide search depends on the size of the peptide database.
With this improved data structure of the peptide database, the peptide search consists of two steps. The first step checks whether the integer part X of the target peptide mass, with tolerance value t, is indexed by the peptide database (X ± t). If the value is found, the first record in the indexed sub-array is the matched peptide; the time complexity of this step is O(2t). If the first step cannot find a matched peptide and the database also contains a subset with index (X-1), the second step uses the binary search method to check whether this subset contains the matched peptide. The time complexity of the second step is O(log(subset length)). The pseudo code of the 2-dimensional peptide database search method is shown in Algorithm 1.
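The mass-window lookup described above can be sketched with Python's standard `bisect` module. This is a minimal illustration that assumes the candidate masses are held in a sorted list; the function name and the sample masses are hypothetical and are not taken from the RT-PSM source code.

```python
from bisect import bisect_left, bisect_right


def candidates_in_window(sorted_masses, target_mass, tolerance):
    """Return the half-open index range [lo, hi) of masses within target_mass +/- tolerance.

    Both boundaries are found by binary search, so each query costs O(log n).
    """
    lo = bisect_left(sorted_masses, target_mass - tolerance)
    hi = bisect_right(sorted_masses, target_mass + tolerance)
    return lo, hi


if __name__ == "__main__":
    masses = [500.1, 800.25, 800.3, 800.9, 1200.0]  # hypothetical, already sorted
    lo, hi = candidates_in_window(masses, target_mass=800.3, tolerance=0.1)
    print(masses[lo:hi])
```

Only the candidates in `masses[lo:hi]` would then be passed to the similarity-scoring module.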
where N is the number of peptides in the group.
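The idea of indexing the peptide database by the integer part of the mass can be sketched as follows. This is not the authors' Algorithm 1: it is a simplified variant, written under the assumption that the database is a mapping from integer mass to a sorted sub-array, which visits every integer bucket overlapping the tolerance window and runs a binary search inside each one. All names and sample values are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_index(masses):
    """Group masses by their integer part; each sub-array stays sorted."""
    index = defaultdict(list)
    for mass in sorted(masses):
        index[math.floor(mass)].append(mass)
    return dict(index)


def search(index, target_mass, tolerance):
    """Collect all masses within target_mass +/- tolerance.

    Only the integer buckets overlapping the window are visited (about 2t + 1
    of them), and each bucket is searched in O(log(bucket length)).
    """
    low, high = target_mass - tolerance, target_mass + tolerance
    hits = []
    for key in range(math.floor(low), math.floor(high) + 1):
        bucket = index.get(key)
        if bucket is None:
            continue  # no peptide with this integer mass
        i = bisect_left(bucket, low)
        j = bisect_right(bucket, high)
        hits.extend(bucket[i:j])
    return hits


if __name__ == "__main__":
    idx = build_index([799.95, 800.25, 800.3, 800.9, 1200.0])
    print(search(idx, 800.0, 0.1))
```

Because the outer lookup depends on the tolerance rather than on the database size, the cost of a query in this sketch grows with the length of the touched sub-arrays and not with the total number of peptides.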
Multi-core Computing and Distributed Computing
The similarity-scoring module in the RT-PSM program is a typical CPU-bound computation function, which means the computing time of the function is determined principally by the speed of CPU. Normally, one processor can only execute one function at one time. In order to reduce the time consumed for the similarity-scoring module, we propose a parallel algorithm that combines the advantages of multi-core computing and distributed computing to achieve the maximum performance.
The multi-core RT-PSM (MC RT-PSM) algorithm
The maximum number of threads that can be used in the MC RT-PSM is determined by the number of logical processors. The pseudo code of the MC RT-PSM algorithm is shown in Algorithm 2.
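The partitioning of the scoring work over a pool sized by the number of logical processors can be sketched as below. This is an illustration only and not the authors' Algorithm 2: the scoring function is a toy shared-peak count standing in for the RT-PSM similarity score, and the candidate names are hypothetical. Note also that threads in CPython do not run CPU-bound code in parallel, so the sketch shows the structure of the decomposition rather than the speedup a native multi-thread program would obtain.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def shared_peak_count(spectrum_peaks, theoretical_peaks, tol=0.5):
    """Toy similarity score: number of theoretical peaks matched within tol."""
    return sum(
        any(abs(p - q) <= tol for q in spectrum_peaks) for p in theoretical_peaks
    )


def best_candidate(spectrum_peaks, candidates, max_workers=None):
    """Score (name, peaks) candidates on a pool of worker threads; return the best pair."""
    if not candidates:
        return None
    # Never use more workers than logical processors or than candidates.
    workers = min(max_workers or os.cpu_count() or 1, len(candidates))
    chunks = [candidates[i::workers] for i in range(workers)]

    def score_chunk(chunk):
        return [(name, shared_peak_count(spectrum_peaks, peaks)) for name, peaks in chunk]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        scored = [pair for part in pool.map(score_chunk, chunks) for pair in part]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    spectrum = [100.0, 200.0, 300.0]
    candidates = [
        ("candidate_1", [100.2, 250.0]),
        ("candidate_2", [100.1, 200.3, 299.8]),
        ("candidate_3", [50.0]),
    ]
    print(best_candidate(spectrum, candidates))
```

Each worker scores its own slice of the candidate list independently, and the partial results are merged only at the end, which mirrors the shared-memory decomposition described for the MC RT-PSM.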
The distributed computing RT-PSM (DC RT-PSM) algorithm
Similar to the MC RT-PSM algorithm, the DC RT-PSM algorithm separates a large task into several sub-tasks and executes them concurrently. However, the two differ in the following aspects. Firstly, the DC RT-PSM algorithm is designed to run on a distributed computer, such as a computer cluster, rather than on a single workstation. The cluster is a computer system whose processing elements are connected as a network. The Windows HPC SDK package provides a stable and user-friendly development environment for the DC RT-PSM program. Secondly, each processor has its own memory in the DC RT-PSM program, while all processors access a shared memory in the MC RT-PSM program.
In the DC RT-PSM algorithm, the whole identification procedure is divided into several sub RT-PSM tasks, and each sub-task runs on an individual worker node of the cluster. The results of those computations are combined by a head node. To achieve the minimum execution time, the head node creates, distributes, synchronizes and monitors the tasks on each worker node. The pseudo code of the distributed task management algorithm for the head node is shown in Algorithm 3.
where to is the task initial time.
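The division of labour between the head node and the worker nodes can be sketched as a single-process simulation. This is not the authors' Algorithm 3 and it does not use the Windows HPC SDK: the round-robin dispatch stands in for "assign the next sub-task to an idle worker", and the worker names, the `run_subtask` callback and the sub-task size are all hypothetical.

```python
from collections import deque


def split_into_subtasks(spectrum_ids, subtask_size):
    """Cut the full list of spectra into consecutive sub-tasks."""
    return [
        spectrum_ids[i:i + subtask_size]
        for i in range(0, len(spectrum_ids), subtask_size)
    ]


def run_on_head_node(spectrum_ids, worker_names, subtask_size, run_subtask):
    """Create sub-tasks, hand them to workers in turn, and combine the results.

    run_subtask(worker, ids) must return a dict mapping spectrum id -> result.
    """
    pending = deque(split_into_subtasks(spectrum_ids, subtask_size))
    combined = {}
    assignments = {worker: 0 for worker in worker_names}
    while pending:
        for worker in worker_names:  # round-robin stand-in for "next idle worker"
            if not pending:
                break
            subtask = pending.popleft()
            combined.update(run_subtask(worker, subtask))  # head node merges results
            assignments[worker] += 1
    return combined, assignments


if __name__ == "__main__":
    def fake_rt_psm(worker, ids):
        # Placeholder for running an RT-PSM sub-task on a worker node.
        return {i: f"match-for-{i}-on-{worker}" for i in ids}

    results, load = run_on_head_node(
        list(range(10)), ["node01", "node02", "node03"], 3, fake_rt_psm
    )
    print(len(results), load)
```

In a real cluster the calls to `run_subtask` would be dispatched over the network and executed concurrently, and the head node would also track the start time of each task in order to monitor and resynchronize workers.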
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canada Foundation for Innovation (CFI).
- 6. Zhang J, McQuillan I, Wu FX: Parallelizing peptide spectrum scoring using modern graphics processing units. In Proceedings of 2011 IEEE 1st International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2011): 3–5 Feb 2011; Orlando. Edited by: Mandoiu I, Miyano S, Przytycka T, Rajasekaran S. Washington DC: IEEE Computer Society; 2011:208–213.
- 8. Almasi GS, Gottlieb A: Highly parallel computing. Redwood City: The Benjamin/Cummings Publishing Company, Inc; 1989.
- 11. Baldi P, Brunak S: Bioinformatics: The machine learning approach (2nd edn). Cambridge: MIT Press; 2002.
- 14. Ben-Ari M: Principles of concurrent and distributed programming. New York: Prentice Hall; 1990.
- 16. Yamazaki K, Ando S: A case-based parallel programming system. In Proceedings of International Symposium on Software Engineering for Parallel and Distributed Systems: 20–21 Apr 1998; Kyoto. Edited by: Krämer B, Uchihira N, Croll P, Pusso S. Los Alamitos: IEEE Computer Society; 1998:238–245.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.