Cloud-based solution to identify statistically significant MS peaks differentiating sample categories
Mass spectrometry (MS) has become the primary high-throughput tool for proteomics-based biomarker discovery. Multiple challenges in protein MS data analysis remain: management of large and complex data sets; MS peak identification and indexing; and high-dimensional differential analysis of peaks, with false discovery rate (FDR) control over the concurrent statistical tests. “Turnkey” solutions are needed so that biomarker investigations can rapidly process MS data sets and identify statistically significant peaks for subsequent validation.
Here we present an efficient and effective solution, which provides experimental biologists easy access to “cloud” computing capabilities to analyze MS data. The web portal can be accessed at http://transmed.stanford.edu/ssa/.
The web application presented here supports online uploading and analysis of large-scale MS data through a simple user interface. This bioinformatic tool will facilitate the discovery of potential protein biomarkers using MS.
Keywords: Mass Spectrometry Data; Simple Object Access Protocol; Common Gateway Interface; Local False Discovery Rate; Protein Mass Spectrometry
Discovering potential protein biomarkers using mass spectrometry (MS) is both promising and challenging in high-throughput biology. Among proteomic profiling techniques, differential comparison of protein abundance across samples is an important one. Most previous MS-based peak detection approaches do not integrate tools for selecting discriminating features (MS peaks) with methods to determine which proteins are appropriate for validation analyses. Moreover, it has been difficult for experimental biologists to configure MS analytic tools, which usually expose many user-defined parameters. We have developed simultaneous spectrum analysis (SSA) for effective and efficient MS peak selection. SSA uses only two key parameters: one locates peaks in the MS spectra, and the other sets the quality threshold used to select robust features. Compared with other existing methods, SSA improves both the number and the quality of low-signal-intensity peaks. In addition, SSA is less likely to introduce systematic bias when normalizing spectra. Subsequent to feature selection, false discovery rate analyses (global FDR, gFDR; local FDR, lFDR [4, 5]), which can simultaneously analyze a vast number of features, need to be applied for high-dimensional peak differential analysis. Therefore, integrating SSA with FDR methods can be an effective approach for comparing MS peaks between sample populations.
However, there are practical difficulties in integrating the SSA and FDR methods. Each method is computationally demanding. In addition, a typical MS analysis task involves hundreds of spectrum files and easily exceeds 100 megabytes of data. To address these issues, a cloud-based web portal was developed that allows users to upload large MS data sets easily, select reproducible features (peaks), identify statistically significant peaks, and correct for multiple hypothesis testing, so that the differentially expressed proteins worth pursuing in subsequent biomarker validation can be determined quickly. The portal also dynamically generates graphic output for meaningful interpretation.
Because mass spectrometry data sets are large, any online MS analytical tool must support uploading multiple files in parallel to be practical. The individual files can also be very large, which makes them hard to upload over plain HTTP. To handle these problems, the Uploadify module (http://www.uploadify.com/) was integrated to allow large MS raw data sets to be uploaded efficiently. Uploadify is a jQuery (http://www.jquery.com/) plugin that adds multiple-file upload functionality to a website. It has HTML5 and Flash versions, which together support all commonly used web browsers. The algorithms are implemented in R and Perl and exposed as web services for web-based applications. The SSA-based peak finding, alignment and indexing method was integrated with FDR analysis for differential feature discovery. The analysis results are summarized as graphs, Excel tables, and text files that can readily be downloaded from the server as a zipped package.
To demonstrate the efficiency and effectiveness of the web portal, an MS data set of 135 megabytes, available online for download, is included as a demonstration example. The data set includes 202 spectra, each recorded in a comma-separated file, and one metadata file describing the spectra. The analysis was run from a laptop with 4 gigabytes of memory and an Intel Core i5-2467M 1.6 GHz CPU. Uploading all files in the MS data set took only 21 seconds. An additional 40 seconds were required to complete the subsequent calculation and to generate the report with graphs and tables.
Results and discussion
The web site mainly includes two applications: (i) common peak discovery across spectra using SSA; (ii) differential analysis of indexed peaks and FDR correction. Detailed instructions can be found on the web site. The applications can be applied to high-throughput MS data analysis with large sample sizes. The flowchart of the application is shown in Figure 1B. Both the spectra and the metadata are uploaded as MS raw data. After data upload, the MS data are processed with SSA for peak detection.
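To make the idea of common peak discovery concrete, the following is a minimal generic sketch, not the SSA algorithm itself. It assumes each spectrum is a pair of equal-length m/z and intensity lists, picks strict local maxima, and keeps peak groups that recur in a minimum fraction of spectra. The function names and the parameters half_window, tolerance and min_fraction are hypothetical and chosen only for illustration.

```python
def local_maxima(mz, intensity, half_window):
    """Return m/z values whose intensity is the unique maximum
    within +/- half_window neighbouring points."""
    peaks = []
    n = len(intensity)
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        window = intensity[lo:hi]
        # Require a unique maximum so flat baseline regions yield no peaks.
        if intensity[i] == max(window) and window.count(intensity[i]) == 1:
            peaks.append(mz[i])
    return peaks


def common_peaks(peak_lists, tolerance, min_fraction):
    """Group peaks from many spectra by m/z and keep the groups that
    appear in at least min_fraction of the spectra.

    Peaks are chained into one group while consecutive sorted m/z
    values differ by no more than `tolerance` (an assumed, simple rule).
    """
    tagged = sorted((p, s) for s, peaks in enumerate(peak_lists) for p in peaks)
    groups, current = [], []
    for p, s in tagged:
        if current and p - current[-1][0] > tolerance:
            groups.append(current)
            current = []
        current.append((p, s))
    if current:
        groups.append(current)

    indexed = []
    for group in groups:
        spectra_seen = {s for _, s in group}
        if len(spectra_seen) / len(peak_lists) >= min_fraction:
            # Index the common peak by the mean m/z of its members.
            indexed.append(sum(p for p, _ in group) / len(group))
    return indexed
```

In this sketch, half_window plays the role of a peak-locating parameter and min_fraction the role of a quality threshold; how SSA actually defines its two parameters is described in the SSA publication, not here.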
The resulting peak list is then subjected to differential analysis with a t or U test, which assigns each MS peak a p-value for the comparison between groups. Multiple hypothesis testing of features (protein MS peaks) is addressed by the subsequent FDR analysis. The total number of discoveries is the count of features whose Student’s t-test or Mann-Whitney U-test p-value falls below a predefined threshold. A range of single-feature p-value thresholds can be surveyed to obtain the total and false discoveries at each threshold, from which gFDR is calculated. To rank the MS peaks, lFDR assigns a significance measure to each feature. The analysis results are summarized in graph (gFDR, Figure 2B) and table (lFDR, Figure 2C) forms. Users can obtain the FDR at different levels by manipulating the downloadable Excel files.
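The threshold survey can be illustrated with a short sketch that uses only the Python standard library. It is not the portal's R/Perl implementation. It assumes a two-sided Mann-Whitney U test under the normal approximation (midranks for ties, no tie correction of the variance), and it assumes that the number of false discoveries at threshold t is estimated as m × t for m tested peaks, which treats every null p-value as uniform. Both function names are hypothetical.

```python
import math


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value via the normal approximation."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        i = j
    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2.0))


def gfdr_survey(p_values, thresholds):
    """For each threshold t, return (t, total discoveries,
    expected false discoveries, gFDR estimate).

    Assumption: expected false discoveries = m * t.
    gFDR is None when nothing passes the threshold.
    """
    m = len(p_values)
    rows = []
    for t in thresholds:
        total = sum(1 for p in p_values if p < t)
        expected_false = m * t
        gfdr = min(1.0, expected_false / total) if total else None
        rows.append((t, total, expected_false, gfdr))
    return rows
```

Calling mann_whitney_p once per indexed peak, with the intensities of that peak in the two sample groups, yields the p-value list that gfdr_survey then scans over a grid of thresholds. The normal approximation is coarse for very small groups, where an exact test is preferable.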
Our MS analysis portal is mainly designed for experimental biologists who need to process MS data sets and compare protein abundance to discover biomarkers. The analysis begins with a raw data upload and ends with a set of data sheets and graphs for easy data presentation and visualization. Because all functions are provided by Stanford University’s computing cloud, end users need no informatics knowledge, and all analysis results, including graphs, Excel files and text files, can be downloaded from the web site. The website works best when the uploaded spectra represent the entire sample (e.g. unfractionated plasma, serum, CSF). For fractionated samples, the downloaded results files for each fraction can be compiled into a single file representing all the results and submitted to the FDR website. With the computational capacity provided by cloud-based computing, end users can access the server and obtain the computing results rapidly and accurately. This cloud-based integration of algorithms should be of general interest to those working in the field of high-throughput proteomics-based biomarker discovery.
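For fractionated samples, compiling the per-fraction results could look like the sketch below. The layout of the portal's result files is not described here, so the sketch assumes each fraction's results are available as CSV text with an identical header row. The function name, the added "fraction" column, and the example column names are hypothetical.

```python
import csv
import io


def compile_fractions(fraction_tables):
    """Merge per-fraction CSV texts into one CSV text.

    fraction_tables maps a fraction label to CSV text. All tables are
    assumed to share the same header; a 'fraction' column is prepended
    so each row can be traced back to its source.
    """
    out = io.StringIO()
    writer = None
    for name, text in fraction_tables.items():
        reader = csv.DictReader(io.StringIO(text))
        for row in reader:
            if writer is None:
                # Take the header from the first table that has rows.
                writer = csv.DictWriter(
                    out,
                    fieldnames=["fraction"] + list(reader.fieldnames),
                    lineterminator="\n",
                )
                writer.writeheader()
            writer.writerow({"fraction": name, **row})
    return out.getvalue()


# Hypothetical usage with two fractions and made-up column names.
combined = compile_fractions({
    "pH3": "mz,p_value\n3400.0,0.01\n",
    "pH12": "mz,p_value\n5000.5,0.2\n",
})
print(combined)
```

Keeping the fraction label as an explicit column means the compiled file can still be split by fraction after the FDR step.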
Availability and requirements
The website can be accessed using any major browser as follows:
Webserver: http://transmed.stanford.edu/ssa/
Sample MS data: http://transmed.stanford.edu/ssa/sample.data/3277_pH.12_CM10_1170005079_F_3400.0.csv
Sample meta data: http://transmed.stanford.edu/ssa/sample.data/metadata.txt
Site help: http://transmed.stanford.edu/ssa/ssa_analysis.ppt
XBL is supported by a Children’s Health Initiative grant. JJ is supported by China Scholarship Council grant (No. 2011632151).
6. Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodological). 1977, 39 (1): 1-38.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.