Abstract
Proteomics technologies are evolving rapidly and attracting great attention in the post-genome era. In this chapter, we review two key applications of proteomics techniques: disease biomarker discovery and protein/peptide identification. For each application, we state the major issues in statistical modeling and analysis, review related work and discuss its strengths and weaknesses, and point out unsolved problems for future research.
We organize this chapter as follows. Section 34.1 briefly introduces mass spectrometry (MS) and tandem mass spectrometry (MS/MS), with a few sample plots showing the data format. Section 34.2 focuses on MS data preprocessing: we first review approaches to peak identification, then address the problem of peak alignment, and finally point out unsolved problems and propose a few possible solutions.
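To make the peak identification step concrete, the following is a minimal sketch and not one of the methods reviewed in the chapter. It assumes a spectrum stored as a plain list of intensities and flags a point as a peak when it is the unique local maximum in a small window and lies above a robust noise threshold (median plus a multiple of the median absolute deviation). The function name `find_peaks`, the parameters `window` and `snr`, and the synthetic spectrum are all illustrative assumptions.

```python
import statistics


def find_peaks(intensity, window=3, snr=3.0):
    """Return indices of simple peaks in a list of intensities.

    A point counts as a peak if it is the unique maximum within
    +/- `window` points and exceeds median + snr * MAD.
    (Illustrative rule only; real pipelines also smooth and
    subtract the baseline first.)
    """
    med = statistics.median(intensity)
    # Median absolute deviation as a robust noise estimate;
    # fall back to a tiny value if the signal is constant.
    mad = statistics.median(abs(x - med) for x in intensity) or 1e-12
    threshold = med + snr * mad

    peaks = []
    for i, x in enumerate(intensity):
        lo = max(0, i - window)
        hi = min(len(intensity), i + window + 1)
        neighbourhood = intensity[lo:hi]
        if x >= threshold and x == max(neighbourhood) and neighbourhood.count(x) == 1:
            peaks.append(i)
    return peaks


# Synthetic spectrum (hypothetical data): a rippled baseline with two spikes.
spectrum = [1.0 + 0.1 * ((i * 7) % 5) for i in range(50)]
spectrum[10] = 9.0
spectrum[30] = 6.0

print(find_peaks(spectrum))  # [10, 30]
```

The window width and threshold multiplier trade sensitivity against false peaks, which is one reason peak identification and the subsequent alignment across spectra need careful statistical treatment.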
Section 34.3 addresses feature selection. We start with a simple example showing the effect of a large number of features, then address interactions among features and discuss methods for reducing the influence of noise. We finish the section with a discussion of machine learning methods in feature selection. Section 34.4 addresses sample classification, and Sect. 34.5 describes the random forest method in detail.
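The effect of a large number of features can be illustrated with a small simulation. This is a sketch under stated assumptions, not the example used in the chapter: every feature is pure Gaussian noise, the two groups have 10 samples each, and each feature is scored by an absolute Welch t-statistic. The helper names `abs_t_stat` and `best_noise_score` are hypothetical.

```python
import random
import statistics


def abs_t_stat(a, b):
    """Absolute Welch two-sample t-statistic for two lists of numbers."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    std_err = (var_a / len(a) + var_b / len(b)) ** 0.5
    return abs(statistics.mean(a) - statistics.mean(b)) / std_err


def best_noise_score(n_features, n_per_group=10, seed=0):
    """Largest |t| among `n_features` features that carry no signal.

    Both groups are drawn from the same N(0, 1) distribution, so any
    large score arises purely by chance. With a fixed seed, the first
    k features are identical for every call, so the result cannot
    decrease as n_features grows.
    """
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_features):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        best = max(best, abs_t_stat(group_a, group_b))
    return best


for p in (10, 100, 1000):
    print(p, round(best_noise_score(p), 2))
```

Because the best score over more noise features can only go up, a feature picked from thousands of candidates may look highly discriminative without carrying any signal. This is why selection has to be validated on data not used for the selection itself, for example inside each cross-validation (CV) fold.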
In Sect. 34.6 we address protein/peptide identification. We first review database searching methods in Sect. 34.6.1 and then focus on de novo MS/MS sequencing in Sect. 34.6.2. After reviewing major protein/peptide identification programs such as SEQUEST and MASCOT in Sect. 34.6.3, we conclude the section by pointing out some major issues that remain to be addressed in protein/peptide identification.
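The idea behind database searching can be sketched with a toy mass-matching example. This is not the scoring used by SEQUEST or MASCOT. It assumes a simplified trypsin rule (cleave after K or R unless the next residue is P, no missed cleavages), standard monoisotopic residue masses, unmodified neutral peptides, and a 10 ppm mass tolerance. The two-protein database and all function names are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the residue sum


def tryptic_peptides(protein):
    """In-silico digest: cleave after K/R unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def search(observed_mass, database, tol_ppm=10.0):
    """Return (protein, peptide) pairs whose mass is within tol_ppm."""
    hits = []
    for name, sequence in database.items():
        for pep in tryptic_peptides(sequence):
            mass = peptide_mass(pep)
            if abs(mass - observed_mass) / mass * 1e6 <= tol_ppm:
                hits.append((name, pep))
    return hits


# Hypothetical two-protein database and an "observed" peptide mass.
toy_db = {"protA": "GASPKVTCR", "protB": "MKLLNDR"}
observed = peptide_mass("LLNDR")
print(search(observed, toy_db))  # [('protB', 'LLNDR')]
```

Real search engines go further by comparing observed fragment ions with theoretical fragment spectra and attaching a score or probability to each candidate. That is the point at which the statistical issues discussed in this section arise.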
Proteomics technologies are considered major players in the analysis and understanding of protein function and biological pathways. The development of statistical methods and software for proteomics data analysis will continue to be a focus of proteomics research for years to come.
Abbreviations
- CART: classification and regression tree
- CID: collision-induced dissociation
- CV: cross-validation
- DP: dynamic programming
- MS: mass spectrometry
Copyright information
© 2006 Springer-Verlag
About this entry
Cite this entry
Yu, W., Wu, B., Huang, T., Li, X., Williams, K., Zhao, H. (2006). Statistical Methods in Proteomics. In: Pham, H. (ed.) Springer Handbook of Engineering Statistics. Springer Handbooks. Springer, London. https://doi.org/10.1007/978-1-84628-288-1_34
DOI: https://doi.org/10.1007/978-1-84628-288-1_34
Publisher Name: Springer, London
Print ISBN: 978-1-85233-806-0
Online ISBN: 978-1-84628-288-1
eBook Packages: Engineering, Engineering (R0)