Blocked Pattern Matching Problem and Its Applications in Proteomics

Ng, Julio; Amir, Amihood; Pevzner, Pavel A.

doi:10.1007/978-3-642-20036-6_27

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6577)

Abstract

Matching a mass spectrum against a text (a key computational task in proteomics) is slow because existing text indexing algorithms (with search time independent of the text size) are not applicable in the domain of mass spectrometry. As a result, many important applications (e.g., searches for mutated peptides) are prohibitively time-consuming, and even the standard search for non-mutated peptides is becoming too slow given recent advances in high-throughput genomics and proteomics technologies. We introduce a new paradigm, the Blocked Pattern Matching (BPM) Problem, that models peptide identification. BPM corresponds to matching a pattern against a text (over the alphabet of integers) under the assumption that each symbol a in the pattern can match a block of consecutive symbols in the text with total sum a.

BPM opens a new, still unexplored, direction in combinatorial pattern matching and leads to the Mutated BPM (modeling identification of mutated peptides) and Fused BPM (modeling identification of fused peptides in tumor genomes). We illustrate how BPM algorithms solve problems that are beyond the reach of existing proteomics tools.
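
To make the BPM definition concrete, the following is a minimal brute-force sketch, not the algorithm of the paper. It assumes all integers are positive, so the block matched by each pattern symbol is uniquely determined once its start is fixed. The function name `blocked_match_positions` and the sample values are hypothetical and chosen only for illustration.

```python
from typing import List


def blocked_match_positions(text: List[int], pattern: List[int]) -> List[int]:
    """Return every start index at which `pattern` block-matches `text`.

    Each pattern symbol a must cover a block of consecutive text symbols
    whose sum is exactly a, and the blocks must be adjacent.
    Assumption (not from the source): all integers are positive.
    """
    n = len(text)
    hits = []
    for start in range(n):
        pos = start
        matched = True
        for a in pattern:
            total = 0
            # Extend the current block until its sum reaches or exceeds a.
            while pos < n and total < a:
                total += text[pos]
                pos += 1
            if total != a:
                # Overshot a, or ran out of text before reaching a.
                matched = False
                break
        if matched:
            hits.append(start)
    return hits


# Illustrative text of small integers (hypothetical data).
text = [57, 71, 87, 57, 71]
print(blocked_match_positions(text, [128, 87]))
print(blocked_match_positions(text, [128]))
```

This scan takes time proportional to the text length times the matched span for each query, which is the kind of text-size-dependent cost the abstract says indexing would avoid.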

Author information

Authors and Affiliations

Bioinformatics and Systems Biology Program, University of California, San Diego, USA
Julio Ng
Department of Computer Science, Bar-Ilan University, Israel
Amihood Amir
Department of Computer Science, Johns Hopkins University, USA
Amihood Amir
Department of Computer Science and Eng., University of California, San Diego, USA
Pavel A. Pevzner

Editor information

Editors and Affiliations

EBU3b, University of California San Diego, #4218, 9500 Gilman Drive, 92093-0404, La Jolla, CA, USA
Vineet Bafna
School of Computing Science, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby, BC, Canada
S. Cenk Sahinalp

About this paper

Cite this paper

Ng, J., Amir, A., Pevzner, P.A. (2011). Blocked Pattern Matching Problem and Its Applications in Proteomics. In: Bafna, V., Sahinalp, S.C. (eds) Research in Computational Molecular Biology. RECOMB 2011. Lecture Notes in Computer Science, vol 6577. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20036-6_27

DOI: https://doi.org/10.1007/978-3-642-20036-6_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20035-9
Online ISBN: 978-3-642-20036-6
eBook Packages: Computer Science, Computer Science (R0)
