Statistical Methods in Proteomics

  • Reference work entry
Springer Handbook of Engineering Statistics

Part of the book series: Springer Handbooks (SHB)

Abstract

Proteomics technologies are rapidly evolving and attracting great attention in the post-genome era. In this chapter, we review two key applications of proteomics techniques: disease biomarker discovery and protein/peptide identification. For each application, we state the major issues related to statistical modeling and analysis, review related work and discuss its strengths and weaknesses, and point out unsolved problems for future research.

We organize this chapter as follows. Section 34.1 briefly introduces mass spectrometry (MS) and tandem mass spectrometry (MS/MS), with a few sample plots showing the data format. Section 34.2 focuses on MS data preprocessing. We first review approaches to peak identification and then address the problem of peak alignment. After that, we point out unsolved problems and propose a few possible solutions.

Section 34.3 addresses the issue of feature selection. We start with a simple example showing the effect of a large number of features. Then we address the interaction of different features and discuss methods of reducing the influence of noise. We finish this section with some discussion on the application of machine learning methods in feature selection. Section 34.4 addresses the problem of sample classification. We describe the random forest method in detail in Sect. 34.5.

In Sect. 34.6 we address protein/peptide identification. We first review database searching methods in Sect. 34.6.1 and then focus on de novo MS/MS sequencing in Sect. 34.6.2. After reviewing major protein/peptide identification programs such as SEQUEST and MASCOT in Sect. 34.6.3, we conclude the section by pointing out some major issues that need to be addressed in protein/peptide identification.

Proteomics technologies are considered the major player in the analysis and understanding of protein function and biological pathways. The development of statistical methods and software for proteomics data analysis will continue to be the focus of proteomics for years to come.

Abbreviations

CART: classification and regression tree
CID: collision-induced dissociation
CV: cross-validation
DP: dynamic programming
MS: mass spectrometry

Author information

Corresponding authors

Correspondence to Weichuan Yu, Baolin Wu, Tao Huang, Xiaoye Li, Kenneth Williams or Hongyu Zhao.

Copyright information

© 2006 Springer-Verlag

About this entry

Cite this entry

Yu, W., Wu, B., Huang, T., Li, X., Williams, K., Zhao, H. (2006). Statistical Methods in Proteomics. In: Pham, H. (ed.) Springer Handbook of Engineering Statistics. Springer Handbooks. Springer, London. https://doi.org/10.1007/978-1-84628-288-1_34

  • DOI: https://doi.org/10.1007/978-1-84628-288-1_34

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-85233-806-0

  • Online ISBN: 978-1-84628-288-1

  • eBook Packages: Engineering (R0)
