Protein Identification by Tandem Mass Spectrometry and Sequence Database Searching

  • Alexey I. Nesvizhskii
Part of the Methods in Molecular Biology book series (MIMB, volume 367)


The shotgun proteomics strategy, based on digesting proteins into peptides and sequencing them using tandem mass spectrometry (MS/MS), has become widely adopted. The identification of peptides from acquired MS/MS spectra is most often performed using the database search approach. We provide a detailed description of the peptide identification process and review the most commonly used database search programs. Appropriate choices of the search parameters and the sequence database are important for the successful application of this method, and we provide general guidelines for carrying out efficient analysis of MS/MS data. We also discuss various reasons why database search tools fail to assign the correct sequence to many MS/MS spectra, and draw attention to the problem of false-positive identifications, which can significantly diminish the value of published data. To assist in the evaluation of peptide assignments to MS/MS spectra, we review the scoring schemes implemented in the most frequently used database search tools. We also describe statistical approaches and computational tools for validating peptide assignments to MS/MS spectra, including the concept of expectation values, reversed database searching, and the empirical Bayesian analysis of PeptideProphet. Finally, the process of inferring the identities of the sample proteins given the list of peptide identifications is outlined, and the limitations of shotgun proteomics with regard to discrimination between protein isoforms are discussed.
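
As an illustration of the reversed database searching idea mentioned above, the following is a minimal sketch and not the chapter's own implementation. It assumes a hypothetical list of (score, is_decoy) peptide-spectrum matches obtained by searching target sequences concatenated with their reversed copies, and it estimates the false discovery rate above a score threshold as the number of decoy matches divided by the number of target matches. The function names, the "REV_" prefix, and the simple decoy/target ratio are assumptions made for illustration only.

```python
from typing import Dict, Iterable, Tuple


def make_reversed_decoys(proteins: Dict[str, str], prefix: str = "REV_") -> Dict[str, str]:
    """Return the target proteins plus one reversed-sequence decoy per target.

    The prefix used to label decoys is a hypothetical naming convention.
    """
    combined = dict(proteins)
    for name, seq in proteins.items():
        combined[prefix + name] = seq[::-1]
    return combined


def estimate_fdr(psms: Iterable[Tuple[float, bool]], threshold: float) -> float:
    """Estimate the FDR among target matches scoring at or above threshold.

    psms: (score, is_decoy) pairs, standing in for real search output.
    Assumes incorrect matches hit target and decoy sequences equally often,
    so the decoy count approximates the number of false target matches.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in psms:
        if score >= threshold:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)
```

In this sketch, raising the threshold until the estimated rate falls below a chosen level (for example 0.01) would give a score cutoff for accepting peptide assignments; the equal-likelihood assumption is what makes the decoy count usable as an error estimate.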

Key Words

Tandem mass spectrometry; proteomics; algorithms; database; protein identification; statistical models; bioinformatics



Copyright information

© Humana Press Inc., Totowa, NJ 2007

Authors and Affiliations

  • Alexey I. Nesvizhskii
    Department of Pathology, University of Michigan, Ann Arbor
