
Choosing an Optimal Database for Protein Identification from Tandem Mass Spectrometry Data

  • Protocol
Proteome Bioinformatics

Part of the book series: Methods in Molecular Biology ((MIMB,volume 1549))

Abstract

Database searching is the preferred method for identifying proteins from the mass-to-charge (m/z) spectra that mass spectrometers record for protein samples. The search database is one of the major factors influencing which proteins are discovered in a sample, and thus the biological conclusions drawn from it. In most cases, the choice of search database is arbitrary. Here we describe the search databases commonly used in proteomic studies and their impact on the final list of identified proteins. We also elaborate on factors, such as the composition and size of the search database, that can influence the protein identification process. In conclusion, we suggest that the choice of database depends on the type of inferences to be derived from the proteomics data. However, making the additional effort to build a compact and concise database for a targeted question should generally be rewarding in achieving confident protein identifications.
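To make the point about database size concrete, the sketch below counts the unique tryptic peptides in a FASTA database as a rough proxy for the size of the search space. It is an illustration under stated assumptions, not the chapter's method: the function names (`read_fasta`, `tryptic_peptides`, `search_space`), the peptide length limits, and the toy sequences are all hypothetical, and the digestion rule is the common simplification of cleaving after K or R unless followed by P, with no missed cleavages.

```python
import re


def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def tryptic_peptides(sequence, min_len=6, max_len=30):
    """Split after K or R unless the next residue is P (no missed cleavages).

    The length limits are illustrative assumptions, not recommended settings.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if min_len <= len(p) <= max_len]


def search_space(fasta_text):
    """Count unique candidate peptides across every protein in the database."""
    peptides = set()
    for sequence in read_fasta(fasta_text).values():
        peptides.update(tryptic_peptides(sequence))
    return len(peptides)


# Toy databases (hypothetical sequences): the larger one adds a second protein.
compact_db = ">p1\nMKAAAAAAKPGGGGGGRTTTTTTT\n"
larger_db = compact_db + ">p2\nAAAAAAKPGGGGGGRCCCCCCCK\n"

print("compact:", search_space(compact_db))
print("larger: ", search_space(larger_db))
```

Every additional candidate peptide is another chance for a spectrum to match by coincidence, which is one reason a compact database built for a targeted question may yield more confident identifications than a larger, less specific one.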



Author information

Corresponding author

Correspondence to Debasis Dash.

Copyright information

© 2017 Springer Science+Business Media LLC

About this protocol

Cite this protocol

Kumar, D., Yadav, A.K., Dash, D. (2017). Choosing an Optimal Database for Protein Identification from Tandem Mass Spectrometry Data. In: Keerthikumar, S., Mathivanan, S. (eds) Proteome Bioinformatics. Methods in Molecular Biology, vol 1549. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-6740-7_3


  • DOI: https://doi.org/10.1007/978-1-4939-6740-7_3


  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-6738-4

  • Online ISBN: 978-1-4939-6740-7

  • eBook Packages: Springer Protocols
