Choosing an Optimal Database for Protein Identification from Tandem Mass Spectrometry Data

Kumar, Dhirendra; Yadav, Amit Kumar; Dash, Debasis

doi:10.1007/978-1-4939-6740-7_3

Part of the book series: Methods in Molecular Biology (MIMB, volume 1549)

Abstract

Database searching is the preferred method for identifying proteins from tandem mass spectra, the digital records of mass-to-charge ratios (m/z) that a mass spectrometer detects for a protein sample. The search database is one of the major factors influencing which proteins are discovered in the sample and, consequently, the biological conclusions drawn. In most cases, however, the choice of search database is arbitrary. Here we describe the search databases commonly used in proteomic studies and their impact on the final list of identified proteins. We also elaborate on factors such as the composition and size of the search database that can influence protein identification. We conclude that the choice of database depends on the type of inference to be drawn from the proteomics data. Nevertheless, the additional effort of building a compact, concise database for a targeted question is generally rewarded with more confident protein identifications.
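As a rough illustration of what "size of the search database" can mean in practice, the sketch below counts protein entries, residues, and distinct tryptic peptides in a FASTA-formatted database. It is not taken from the chapter: the function names, the toy sequences, the peptide length limits (6 to 30 residues), and the simplified cleavage rule (after K or R, not before P, no missed cleavages) are all assumptions chosen for illustration.

```python
import re

# Toy FASTA text; both sequences are made up for illustration only.
TOY_FASTA = """>P1 hypothetical protein
MKTAYIAKQRQISFVK
>P2 hypothetical protein
MKTAYIAKGGGGGGR
"""

def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records

def tryptic_peptides(sequence, min_len=6, max_len=30):
    """Split after K or R unless the next residue is P (no missed cleavages).

    Splitting on a zero-width pattern requires Python 3.7 or later.
    The length limits are assumed values, not recommendations.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if min_len <= len(p) <= max_len]

def database_stats(fasta_text):
    """Summarise a database by entry count, residue count, and distinct peptides."""
    records = parse_fasta(fasta_text)
    peptides = set()
    for seq in records.values():
        peptides.update(tryptic_peptides(seq))
    return {
        "proteins": len(records),
        "residues": sum(len(s) for s in records.values()),
        "distinct_peptides": len(peptides),
    }

if __name__ == "__main__":
    print(database_stats(TOY_FASTA))
```

Comparing these counts between two candidate databases gives a first, coarse sense of how much larger one search space is than the other. It says nothing about identification confidence, which depends on the search engine and the statistical validation used.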

Author information

Authors and Affiliations

G.N. Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, Mathura Road, Delhi, 110025, India
Dhirendra Kumar, Amit Kumar Yadav & Debasis Dash

Corresponding author

Correspondence to Debasis Dash.

Editor information

Editors and Affiliations

Department of Biochemistry and Genetics, La Trobe Institute for Molecular Science, La Trobe University, Melbourne, Victoria, Australia
Shivakumar Keerthikumar
Department of Biochemistry and Genetics, La Trobe Institute for Molecular Science, La Trobe University, Melbourne, Victoria, Australia
Suresh Mathivanan

About this protocol

Cite this protocol

Kumar, D., Yadav, A.K., Dash, D. (2017). Choosing an Optimal Database for Protein Identification from Tandem Mass Spectrometry Data. In: Keerthikumar, S., Mathivanan, S. (eds) Proteome Bioinformatics. Methods in Molecular Biology, vol 1549. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-6740-7_3

DOI: https://doi.org/10.1007/978-1-4939-6740-7_3
Published: 15 December 2016
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-6738-4
Online ISBN: 978-1-4939-6740-7
eBook Packages: Springer Protocols
