Statistical Methods in Proteomics

Yu, Weichuan; Wu, Baolin; Huang, Tao; Li, Xiaoye; Williams, Kenneth; Zhao, Hongyu

doi:10.1007/978-1-84628-288-1_34

Part of the book series: Springer Handbooks (SHB)

Abstract

Proteomics technologies are rapidly evolving and attracting great attention in the post-genome era. In this chapter, we review two key applications of proteomics techniques: disease biomarker discovery and protein/peptide identification. For each of the applications, we state the major issues related to statistical modeling and analysis, review related work, discuss their strengths and weaknesses, and point out unsolved problems for future research.

We organize this chapter as follows. Section 34.1 briefly introduces mass spectrometry (MS) and tandem MS/MS with a few sample plots showing the data format. Section 34.2 focuses on MS data preprocessing. We first review approaches in peak identification and then address the problem of peak alignment. After that, we point out unsolved problems and propose a few possible solutions.

Section 34.3 addresses the issue of feature selection. We start with a simple example showing the effect of a large number of features. Then we address the interaction of different features and discuss methods of reducing the influence of noise. We finish this section with some discussion on the application of machine learning methods in feature selection. Section 34.4 addresses the problem of sample classification. We describe the random forest method in detail in Sect. 34.5.

In Sect. 34.6 we address protein/peptide identification. We first review database searching methods in Sect. 34.6.1 and then focus on de novo MS/MS sequencing in Sect. 34.6.2. After reviewing major protein/peptide identification programs like SEQUEST and MASCOT in Sect. 34.6.3, we conclude the section by pointing out some major issues that need to be addressed in protein/peptide identification.

Proteomics technologies are considered the major player in the analysis and understanding of protein function and biological pathways. The development of statistical methods and software for proteomics data analysis will continue to be the focus of proteomics for years to come.

Abbreviations

CART: classification and regression tree
CID: collision-induced dissociation
CV: cross-validation
DP: dynamic programming
MS: mass spectrometry

Author information

Authors and Affiliations

Department of Molecular Biophysics and Biochemistry, Yale Center for Statistical Genomics and Proteomics, Yale University, 300 George Street, 06511, New Haven, CT, USA
Weichuan Yu
Division of Biostatistics, University of Minnesota, School of Public Health, A460 Mayo Building, MMC 303, 420 Delaware St SE, 55455, Minneapolis, MN, USA
Baolin Wu
Department of Epidemiology and Public Health, Yale University, School of Medicine, 60 College Street, 06520, New Haven, CT, USA
Tao Huang
Department of Applied Mathematics, Yale University, 300 George Street, 06511, New Haven, CT, USA
Xiaoye Li
Molecular Biophysics and Biochemistry, Yale University, 300 George Street, G005, 06520, New Haven, CT, USA
Kenneth Williams
Department of Epidemiology and Public Health, Yale University School of Medicine, 60 College Street, 06520-8034, New Haven, CT, USA
Hongyu Zhao

Corresponding authors

Correspondence to Weichuan Yu, Baolin Wu, Tao Huang, Xiaoye Li, Kenneth Williams or Hongyu Zhao.

Editor information

Editors and Affiliations

Department of Industrial and Systems Engineering, Rutgers the State University of New Jersey, 96 Frelinghuysen Road, 08854, Piscataway, NJ, USA
Hoang Pham Prof.

Cite this entry

Yu, W., Wu, B., Huang, T., Li, X., Williams, K., Zhao, H. (2006). Statistical Methods in Proteomics. In: Pham, H. (ed.) Springer Handbook of Engineering Statistics. Springer Handbooks. Springer, London. https://doi.org/10.1007/978-1-84628-288-1_34

DOI: https://doi.org/10.1007/978-1-84628-288-1_34
Publisher Name: Springer, London
Print ISBN: 978-1-85233-806-0
Online ISBN: 978-1-84628-288-1
eBook Packages: Engineering, Engineering (R0)
