Speeding up Parsing of Biological Context-Free Grammars

  • Daniel Fredouille
  • Christopher H. Bryant
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3537)


Grammars have been shown to be a very useful way to model families of biological sequences. As both the quantity of biological sequences and the complexity of biological grammars increase, generic and efficient parsing methods are needed. We consider two parsers for context-free grammars, a depth-first top-down parser and a chart parser, and we analyse and compare them, both theoretically and empirically, with respect to biological data. The theoretical comparison is based on a common feature of biological grammars: the gap, an element of the grammar designed to match any subsequence of the parsed string. The empirical comparison is based on grammars and sequences used by the bioinformatics community. Our conclusions are that (1) the chart parsing algorithm is significantly faster than the depth-first top-down algorithm, (2) designing special treatments in the algorithms for managing gaps is useful, and (3) when using parsers not optimised for managing gaps, the way the grammar encodes gaps has to be chosen carefully to prevent large increases in running time.
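
To make the notion of a gap concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a hypothetical toy grammar, S -> Gap "a" Gap "b", in which the gap is encoded by the right-recursive rules Gap -> (empty) | Any Gap. The names GRAMMAR, Any and recognise are illustrative only. The plain variant explores the grammar top-down and recomputes the same gap matches many times. The memoised variant stores the result for each (symbol, start position) pair, which is only loosely analogous to the way a chart parser reuses completed edges.

```python
from functools import lru_cache

# Hypothetical toy grammar (an assumption for illustration, not from the paper).
# "Any" is a built-in wildcard matching exactly one arbitrary symbol.
GRAMMAR = {
    "S": (("Gap", "a", "Gap", "b"),),
    "Gap": ((), ("Any", "Gap")),  # gap encoded as: empty | Any Gap
}


def recognise(text, memoise):
    """Return (accepted, calls) for the toy grammar on text.

    calls counts how many times a symbol was actually expanded,
    so cache hits in the memoised variant are not counted.
    """
    calls = 0

    def match_symbol(symbol, start):
        # Returns the set of positions where `symbol` can end
        # when it starts at position `start`.
        nonlocal calls
        calls += 1
        if symbol == "Any":
            return frozenset({start + 1}) if start < len(text) else frozenset()
        if symbol not in GRAMMAR:  # terminal symbol
            hit = text[start:start + 1] == symbol
            return frozenset({start + 1}) if hit else frozenset()
        ends = set()
        for rhs in GRAMMAR[symbol]:
            positions = {start}
            for sym in rhs:
                positions = {e for p in positions for e in match_symbol(sym, p)}
            ends |= positions
        return frozenset(ends)

    if memoise:
        # Rebinding the name makes the recursive calls go through the cache.
        match_symbol = lru_cache(maxsize=None)(match_symbol)

    accepted = len(text) in match_symbol("S", 0)
    return accepted, calls


if __name__ == "__main__":
    sequence = "a" * 20 + "b"
    print("plain:   ", recognise(sequence, memoise=False))
    print("memoised:", recognise(sequence, memoise=True))
```

Both variants accept the same strings, but on inputs where "a" matches at many positions the plain variant expands the second gap once per match, whereas the memoised variant expands each (symbol, position) pair only once. The sketch keeps inputs short because the right-recursive gap rule recurses once per symbol of the sequence.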


Keywords: Left Part, Recursive Call, Biological Sequence, Parsing Algorithm, Prosite Pattern
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Daniel Fredouille (1)
  • Christopher H. Bryant (1)
  1. The Robert Gordon University, Aberdeen, UK
