Genome analysis: Pattern search in biological macromolecules

  • H. W. Mewes
  • K. Heumann
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 937)

Abstract

Biological sequence data analysis has become an indispensable tool in macromolecular biology and is key to any detailed understanding of the living cell. A brief survey of the biological macromolecules and their functions is given. Sequence data analysis is introduced as a basic tool for the experimental bench biologist. So far, most queries for such analyses have been issued against flat files and static indices. We discuss position tree structures and their potential in sequence data analysis. The hash position tree is introduced as a persistent, dynamic data structure for pattern searches in large biological sequence databases.
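
To make the idea of a position tree concrete, the following is a minimal in-memory sketch in Python. It is not the hash position tree of the paper, which is described as persistent and dynamic; the class name `PositionTree`, the `$` terminator, and the naive insertion strategy are all assumptions chosen for illustration. Each sequence position is stored as a leaf at the end of the shortest substring that identifies it uniquely, and a pattern search walks the tree one character at a time.

```python
class _Node:
    """Trie node; `position` is set on leaves only."""
    __slots__ = ("children", "position")

    def __init__(self):
        self.children = {}
        self.position = None


class PositionTree:
    """Illustrative position tree (hypothetical helper, not the paper's structure)."""

    def __init__(self, sequence, terminator="$"):
        # Assumption: the terminator does not occur in the sequence, so every
        # position has a unique identifying substring.
        self.text = sequence + terminator
        self.root = _Node()
        for i in range(len(self.text)):
            self._insert(i)

    def _insert(self, i):
        node, depth = self.root, 0
        while True:
            ch = self.text[i + depth]
            child = node.children.get(ch)
            if child is None:
                # First position whose identifier takes this branch.
                leaf = _Node()
                leaf.position = i
                node.children[ch] = leaf
                return
            if child.position is not None:
                # An existing leaf shares this prefix: push it one level down.
                j = child.position
                child.position = None
                moved = _Node()
                moved.position = j
                child.children[self.text[j + depth + 1]] = moved
            node, depth = child, depth + 1

    def find(self, pattern):
        """Return the sorted start positions of `pattern` in the sequence."""
        node = self.root
        for ch in pattern:
            node = node.children.get(ch)
            if node is None:
                return []
            if node.position is not None:
                # Only one candidate remains; check the rest against the text.
                i = node.position
                return [i] if self.text.startswith(pattern, i) else []
        # Pattern ended at an internal node: every leaf below it is a match.
        hits, stack = [], [node]
        while stack:
            current = stack.pop()
            if current.position is not None:
                hits.append(current.position)
            stack.extend(current.children.values())
        return sorted(hits)


tree = PositionTree("GATTACAGATTA")
print(tree.find("GATTA"))  # [0, 7]
print(tree.find("TTAC"))   # [2]
```

This naive construction can take quadratic time on highly repetitive sequences and keeps everything in memory; it is meant only to show how a search descends the tree and falls back to the sequence itself once a single candidate position is left.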

Keywords

Position Tree, Biological Sequence, Suffix Tree, Disk Access, Sequence Data Analysis
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 1995

Authors and Affiliations

  • H. W. Mewes (1)
  • K. Heumann (1)
  1. Max-Planck-Inst. f. Biochemie, Martinsried, Germany
