Matching and comparing sequences in molecular biology

  • Plenary Survey Lectures
  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 959)

Abstract

The primary structure of a deoxyribonucleic acid (DNA) molecule is a sequence over four letters, A, C, G, and T, each of which stands for a nucleotide. The length of such a DNA sequence ranges from several thousand letters for a simple virus to three billion letters for a human. We all know that these long and mysterious sequences encode Life as well as genetic diseases, but decoding them is perhaps one of the most challenging tasks in the world. The ultimate goal of molecular biology is to understand which segments of a DNA sequence are responsible for a biological function, such as eye color, or for a genetic disease, such as cancer, and how these segments are formed and how they work. These functionally meaningful segments are usually called genes. To find the genes responsible for some biological function, a biologist often compares a set of DNA sequences that share the same function and tries to identify regions that are “conserved” in all of these sequences. A biologist may also infer the “closeness” of two organisms by comparing their DNA sequences and computing their degree of similarity. Such “closeness” information is useful in reconstructing evolutionary histories.

In this talk, we survey various frameworks for comparing a pair or a set of sequences, including approximate string matching, string edit, (pairwise) sequence alignment, local sequence alignment, and multiple sequence alignment. For the comparison of a pair of sequences, we describe some standard algorithms and several techniques for improving time and space efficiency, such as preprocessing and divide-and-conquer. Multiple sequence alignment has been identified as one of the most challenging problems in computational molecular biology. In the rest of the talk, we concentrate on two important variants of multiple sequence alignment: multiple alignment with sum-of-all-pairs (SP) score and multiple alignment with tree score. Since both are NP-hard, we discuss polynomial-time approximation algorithms for these problems with guaranteed performance bounds. If time allows, we will also mention some popular heuristics for multiple alignment that seem to work well in practice but carry no performance guarantee.
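To make two of the frameworks named above concrete, the following Python sketch may help. It is an illustration only, not the algorithms presented in the talk: it assumes unit costs (every insertion, deletion, and substitution costs 1), and the function names `edit_distance` and `sp_score` are hypothetical. The first function is the textbook dynamic program for string edit, kept to two rows of the table as a simple space saving; the second evaluates the SP cost of an alignment that is already given, which is the quantity a multiple alignment under SP score tries to minimize.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Unit-cost string edit distance (assumed cost model, for illustration).

    Only two rows of the dynamic-programming table are kept, so the space
    used is proportional to the shorter of the two strings.
    """
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string as the row
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # aligning a[:i] with the empty prefix of b costs i
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            curr.append(min(prev[j] + 1,      # ca aligned with a gap
                            curr[j - 1] + 1,  # cb aligned with a gap
                            substitution))    # ca aligned with cb
        prev = curr
    return prev[-1]


def sp_score(alignment, gap="-"):
    """Sum-of-all-pairs cost of a given multiple alignment.

    `alignment` is a list of equal-length rows over the alphabet plus `gap`.
    Each pair of rows contributes 1 for every column where the two symbols
    differ; a column where both rows hold a gap contributes nothing.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for row_r, row_s in combinations(alignment, 2):
        for x, y in zip(row_r, row_s):
            if x == gap and y == gap:
                continue
            if x != y:
                total += 1
    return total


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(sp_score(["AC-T", "A-GT", "ACGT"]))
```

Scoring a given alignment in this way is easy; the hard part, as the abstract notes, is finding an alignment that optimizes the score.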

Computational molecular biology is emerging as a fast-growing interdisciplinary field involving biology, computer science, statistics, and applied mathematics, among others, and it is providing many interesting algorithmic questions for theoretical computer scientists to solve. We hope that this brief survey will serve as an introduction to one (important) aspect of the field. Most of the results mentioned above, or pointers to them, can be found in the following literature.

Research supported in part by NSERC Research Grant OGP0046613 and MRC/NSERC CGAT Grant GO-12278.

References

  1. V. Bafna, E. Lawler and P. Pevzner, Approximation algorithms for multiple sequence alignment, Proc. 5th Combinatorial Pattern Matching Conference, 1994, Asilomar, California.

  2. S. Chan, A. Wong and D. Chiu, A survey of multiple sequence comparison methods, Bulletin of Mathematical Biology 54(4), 563–598, 1992.

  3. T. Jiang and M. Li, Optimization problems in molecular biology, in Advances in Optimization and Approximation, D.Z. Du and J. Sun (eds.), Kluwer Academic Publishers, MA, 195–216, 1994.

  4. T. Jiang, E. Lawler and L. Wang, Aligning sequences via an evolutionary tree: complexity and approximation, Proc. 26th ACM Symposium on Theory of Computing, 1994, Montreal, Canada; final version to appear in Algorithmica.

  5. E. Lander, R. Langridge and D. Saccocio, Mapping and interpreting biological information, Communications of the ACM 34(11), 33–39, 1991.

  6. R. Lipton, T. Mar and J. Welsh, Computational approaches to discovering semantics in molecular biology, Proceedings of the IEEE 77(7), 1056–1060, 1989.

  7. E. Myers, An overview of sequence comparison algorithms, Technical Report 91-29, Dept. of Computer Science, University of Arizona, 1991.

  8. D. Sankoff and J. Kruskal (eds.), Time Warps, String Edits, and Macromolecules: the Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, MA, 1983.

  9. M. Waterman, Sequence alignments, in Mathematical Methods for DNA Sequences, M.S. Waterman (ed.), CRC, Boca Raton, FL, 53–92, 1989.

Editor information

Ding-Zhu Du, Ming Li

Copyright information

© 1995 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Jiang, T. (1995). Matching and comparing sequences in molecular biology. In: Du, DZ., Li, M. (eds) Computing and Combinatorics. COCOON 1995. Lecture Notes in Computer Science, vol 959. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0030889

  • DOI: https://doi.org/10.1007/BFb0030889

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-60216-3

  • Online ISBN: 978-3-540-44733-7

  • eBook Packages: Springer Book Archive
