Sequence Database Compression for Peptide Identification from Tandem Mass Spectra

  • Nathan Edwards
  • Ross Lippert
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3240)


The identification of peptides from tandem mass spectra is an important part of many high-throughput proteomics pipelines. In the high-throughput setting, the spectra are typically identified using software that matches tandem mass spectra with putative peptides from amino-acid sequence databases. The effectiveness of these search engines depends heavily on the completeness of the amino-acid sequence database used, but suitably complete amino-acid sequence databases are large, and the sequence database search engines typically have search times that are proportional to the size of the sequence database.

We demonstrate that the peptide content of an amino-acid sequence database can be represented by a reformulated amino-acid sequence database containing fewer amino-acid symbols than the original. In some cases, where the original amino-acid sequence database contains many redundant peptides, we have been able to reduce the database to almost half of its original size. We develop a lower bound for achievable compression and demonstrate empirically that, regardless of the peptide redundancy of the original amino-acid sequence database, we can compress the sequence to within 15-25% of this lower bound. We believe this may provide a principled way to combine amino-acid sequence data from many sources without unduly bloating the resulting sequence database with redundant peptide sequences.
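The idea of preserving peptide content with fewer symbols can be illustrated with a toy sketch. It rests on assumptions that are not taken from the paper: peptide content is modelled as the set of all length-k substrings (real peptide candidates vary in length), the compression is a simple greedy pass rather than the authors' method, and the bound shown is only an elementary counting bound (N distinct k-mers need at least N + k - 1 symbols), not necessarily the lower bound the paper develops. The names `kmers`, `greedy_compress`, and `counting_lower_bound` are hypothetical.

```python
def kmers(seqs, k):
    """Set of all length-k substrings: a toy stand-in for 'peptide content'."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def greedy_compress(seqs, k):
    """Keep only the stretches of each sequence that contribute unseen k-mers.

    Every emitted fragment is a substring of an input sequence, so no k-mer
    is invented, and every k-mer is emitted the first time it is seen.
    """
    seen = set()
    out = []
    for s in seqs:
        start = end = None  # bounds of the current run of new k-mers
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                if start is None:
                    start = i
                end = i + k
            elif start is not None:
                # A previously seen k-mer ends the current run.
                out.append(s[start:end])
                start = None
        if start is not None:
            out.append(s[start:end])
    return out


def counting_lower_bound(seqs, k):
    """Elementary bound (an assumption here): N distinct k-mers need N + k - 1 symbols."""
    return len(kmers(seqs, k)) + k - 1


# Two overlapping toy sequences; the second mostly repeats the first.
database = ["MKTAYIAKQR", "TAYIAKQRLE"]
compressed = greedy_compress(database, 4)

original_size = sum(len(s) for s in database)
compressed_size = sum(len(s) for s in compressed)
bound = counting_lower_bound(database, 4)
```

In this toy case the compressed database keeps the same set of 4-mers while dropping the repeated stretch of the second sequence. The keywords on this page mention Eulerian tours, which suggests the paper's own approach is graph-based; the greedy pass above is only a simpler stand-in for illustration.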


Sequence Database; Tandem Mass Spectrum; Peptide Candidate; Eulerian Tour; International Protein Index





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Nathan Edwards (1)
  • Ross Lippert (1)
  1. Informatics Research, Advanced Research and Technology, Applied Biosystems, Rockville
