Sequence Database Compression for Peptide Identification from Tandem Mass Spectra

Edwards, Nathan; Lippert, Ross

doi:10.1007/978-3-540-30219-3_20


Conference paper


Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3240)

Abstract

The identification of peptides from tandem mass spectra is an important part of many high-throughput proteomics pipelines. In the high-throughput setting, the spectra are typically identified using software that matches tandem mass spectra with putative peptides from amino-acid sequence databases. The effectiveness of these search engines depends heavily on the completeness of the amino-acid sequence database used, but suitably complete amino-acid sequence databases are large, and the sequence database search engines typically have search times that are proportional to the size of the sequence database.

We demonstrate that the peptide content of an amino-acid sequence database can be represented by a reformulated amino-acid sequence database containing fewer amino-acid symbols than the original. In some cases, where the original amino-acid sequence database contains many redundant peptides, we have been able to reduce the amino-acid sequence database to almost half of its original size. We develop a lower bound for achievable compression and demonstrate empirically that, regardless of the peptide redundancy of the original amino-acid sequence database, we can compress the sequence to within 15–25% of this lower bound. We believe this may provide a principled way to combine amino-acid sequence data from many sources without unduly bloating the resulting sequence database with redundant peptide sequences.
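As a rough illustration of the idea (not the authors' algorithm), the sketch below treats the "peptide content" of a database as its set of distinct k-mers for a fixed k, which is an assumption made here for simplicity. The function names `kmer_set`, `compress`, and `naive_lower_bound` are hypothetical. The greedy pass keeps only the stretches of each sequence that contribute k-mers not yet seen, and the bound is the elementary counting bound N + k - 1 for N distinct k-mers, which is not necessarily the bound developed in the paper.

```python
from typing import Iterable, List, Set


def kmer_set(db: Iterable[str], k: int) -> Set[str]:
    """Distinct k-mers of all sequences (our stand-in for 'peptide content')."""
    return {s[i:i + k] for s in db for i in range(len(s) - k + 1)}


def compress(db: Iterable[str], k: int) -> List[str]:
    """Greedy sketch: emit only maximal runs of not-yet-seen k-mers."""
    seen: Set[str] = set()
    out: List[str] = []
    for s in db:
        start = None  # index where the current run of new k-mers began
        for i in range(len(s) - k + 1):
            km = s[i:i + k]
            if km not in seen:
                seen.add(km)
                if start is None:
                    start = i
            elif start is not None:
                # The run covers k-mers at start..i-1, so it ends at i+k-2.
                out.append(s[start:i + k - 1])
                start = None
        if start is not None:
            out.append(s[start:])
    return out


def naive_lower_bound(db: Iterable[str], k: int) -> int:
    """A sequence of length L holds L-k+1 k-mers, so N k-mers need >= N+k-1 symbols."""
    n = len(kmer_set(db, k))
    return n + k - 1 if n else 0


if __name__ == "__main__":
    toy_db = ["ACDEFGHIK", "CDEFGHIKLM", "ACDEFGHIK"]  # made-up toy sequences
    small = compress(toy_db, 3)
    print(small, sum(map(len, small)), naive_lower_bound(toy_db, 3))
```

On the toy input, the 28 original symbols shrink to 13 while the 3-mer set is unchanged, against a naive bound of 11. Each emitted fragment costs k - 1 symbols of overhead, so an approach that merges fragments through their overlaps would be expected to come closer to the bound than this greedy pass.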



Author information

Authors and Affiliations

Informatics Research, Advanced Research and Technology, Applied Biosystems, 45 W. Gude Drive, Rockville, MD, 20850
Nathan Edwards & Ross Lippert


Editor information

Editors and Affiliations

Department of Informatics and Computational Biology Unit, HIB, University of Bergen, 5020, Bergen, Norway
Inge Jonassen
Department of Biology, Penn Center for Bioinformatics, Penn Genomics Institute, 415 S. University Ave., Philadelphia, PA 19104, USA
Junhyong Kim


About this paper

Cite this paper

Edwards, N., Lippert, R. (2004). Sequence Database Compression for Peptide Identification from Tandem Mass Spectra. In: Jonassen, I., Kim, J. (eds) Algorithms in Bioinformatics. WABI 2004. Lecture Notes in Computer Science, vol 3240. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30219-3_20

DOI: https://doi.org/10.1007/978-3-540-30219-3_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23018-2
Online ISBN: 978-3-540-30219-3
eBook Packages: Springer Book Archive
